Chapter 5: Probability

Overview

Probability quantifies uncertainty. Every time a medical test returns a result, probability determines how much you should trust it. In this walkthrough, we explore COVID testing data to see how sensitivity, specificity, and Bayes’ theorem work in practice — and why a “positive” test does not always mean what you think it means. This walkthrough accompanies Chapter 5 of Margin of Error.

Setup

# Loads tidyverse, book color palette, and theme_moe()
# Download _common.R from the Datasets page if running locally
source("_common.R")

Load and Explore the Data

The dataset contains 2,000 COVID test results. Each row is a patient with a true infection status and a test result.

covid <- read_csv("data/covid-testing.csv")

Rows: 2000 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): truly_infected, test_result, age_group, risk_category
dbl (1): patient_id

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(covid)

Rows: 2,000
Columns: 5
$ patient_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ truly_infected <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
$ test_result    <chr> "Negative", "Negative", "Negative", "Negative", "Negati…
$ age_group      <chr> "60-74", "30-44", "30-44", "18-29", "30-44", "18-29", "…
$ risk_category  <chr> "Low", "Medium", "Low", "Medium", "Low", "High", "Low",…

How prevalent is COVID in this sample? The prevalence — the proportion of people who are truly infected — is the starting point for every probability calculation that follows.

prevalence <- mean(covid$truly_infected == "Yes")
cat("Prevalence (proportion truly infected):", round(prevalence, 4), "\n")

Prevalence (proportion truly infected): 0.05

Two-Way Table: Test Result vs. True Status

A two-way table (also called a confusion matrix) cross-tabulates test results against the truth. This single table contains everything we need to compute sensitivity, specificity, and predictive values.

conf_table <- table(
  Test = covid$test_result,
  Truth = covid$truly_infected
)
conf_table

          Truth
Test         No  Yes
  Negative 1810    9
  Positive   90   91

Add marginal totals to see the full picture.

addmargins(conf_table)

          Truth
Test         No  Yes  Sum
  Negative 1810    9 1819
  Positive   90   91  181
  Sum      1900  100 2000

Sensitivity and Specificity

Sensitivity is the probability of testing positive given that you are truly infected: P(Positive | Infected). A test with high sensitivity rarely misses true cases.

Specificity is the probability of testing negative given that you are truly not infected: P(Negative | Not Infected). A test with high specificity rarely raises false alarms.

# True Positives, False Negatives, True Negatives, False Positives
TP <- conf_table["Positive", "Yes"]
FN <- conf_table["Negative", "Yes"]
TN <- conf_table["Negative", "No"]
FP <- conf_table["Positive", "No"]

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)

cat("Sensitivity: P(Positive | Infected) =", round(sensitivity, 4), "\n")

Sensitivity: P(Positive | Infected) = 0.91

cat("Specificity: P(Negative | Not Infected) =", round(specificity, 4), "\n")

Specificity: P(Negative | Not Infected) = 0.9526

Positive Predictive Value (PPV)

The question patients actually care about is: given that I tested positive, what is the probability I am truly infected? That is the PPV — P(Infected | Positive).

ppv <- TP / (TP + FP)
cat("PPV: P(Infected | Positive) =", round(ppv, 4), "\n")

PPV: P(Infected | Positive) = 0.5028

Notice that PPV depends heavily on prevalence. Even a highly specific test produces many false positives when the disease is rare.

Bayes’ Theorem by Hand

Bayes’ theorem lets us calculate PPV from sensitivity, specificity, and prevalence without needing the raw table:

\[ PPV = \frac{\text{Sensitivity} \times \text{Prevalence}}{\text{Sensitivity} \times \text{Prevalence} + (1 - \text{Specificity}) \times (1 - \text{Prevalence})} \]

ppv_bayes <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

cat("PPV via Bayes' theorem:", round(ppv_bayes, 4), "\n")

PPV via Bayes' theorem: 0.5028

The two PPV values should match (within rounding).

Law of Large Numbers: Simulating Coin Flips

Probability also governs random processes like coin flips. The Law of Large Numbers says that as the number of trials grows, the observed proportion converges to the true probability.

set.seed(42)
n_flips <- 1000
flips <- sample(0:1, n_flips, replace = TRUE)
running_prop <- cumsum(flips) / 1:n_flips

lln_data <- tibble(
  flip_number = 1:n_flips,
  running_proportion = running_prop
)

ggplot(lln_data, aes(x = flip_number, y = running_proportion)) +
  geom_line(color = moe_colors$teal, linewidth = 0.6) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = moe_colors$coral) +
  labs(
    title = "Law of Large Numbers in Action",
    subtitle = "Running proportion of heads stabilizes near 0.5",
    x = "Number of flips",
    y = "Proportion of heads"
  ) +
  theme_moe()

Figure 1: Running proportion of heads converges to 0.5

Binomial Distribution: Positive Tests in a Batch

If we test 20 patients and each has a fixed probability of testing positive, the number of positives follows a binomial distribution. We use dbinom() to compute and plot the exact probabilities.

# Use the observed positive rate as our probability
p_positive <- mean(covid$test_result == "Positive")

binom_data <- tibble(
  k = 0:20,
  probability = dbinom(0:20, size = 20, prob = p_positive)
)

ggplot(binom_data, aes(x = k, y = probability)) +
  geom_col(fill = moe_colors$navy, width = 0.7) +
  labs(
    title = "Binomial Distribution of Positive Tests",
    subtitle = paste0("n = 20 patients, P(positive) = ", round(p_positive, 3)),
    x = "Number of positive tests",
    y = "Probability"
  ) +
  scale_x_continuous(breaks = 0:20) +
  theme_moe()

Figure 2: Binomial distribution: number of positive tests out of 20

Visualization: Confusion Matrix as a Grouped Bar Chart

A grouped bar chart makes the confusion matrix easy to read at a glance: how do test results break down by true infection status?

covid |>
  count(truly_infected, test_result) |>
  ggplot(aes(x = truly_infected, y = n, fill = test_result)) +
  geom_col(position = "dodge", width = 0.7) +
  scale_fill_moe() +
  labs(
    title = "Confusion Matrix as a Grouped Bar Chart",
    subtitle = "Correct classifications dominate, but errors exist in both directions",
    x = "Truly Infected?",
    y = "Count",
    fill = "Test Result"
  ) +
  theme_moe()

Figure 3: Test results by true infection status

Try It Yourself

What happens to PPV if prevalence drops to 1%? Using the sensitivity and specificity you calculated above, apply Bayes’ theorem with prevalence = 0.01. How does the PPV change? What does this imply about mass testing in low-prevalence populations?
Simulate 10,000 coin flips. Repeat the Law of Large Numbers simulation with n_flips <- 10000 and plot the running proportion. How quickly does it stabilize compared to 1,000 flips?