R Companion — Chapter 5

R Companion — Chapter 5: Probability

This walkthrough computes the chapter’s central quantities — prevalence, sensitivity, specificity, and PPV — and shows how the same numbers can be reached two ways: by counting the contingency table, and by Bayes’ theorem.

Setup

source("_common.R")

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

Warning: package 'stringr' was built under R version 4.5.2

Warning: package 'forcats' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).

Load the COVID testing data

covid <- read_csv("../datasets/covid-testing.csv")
glimpse(covid)

Rows: 2,000
Columns: 5
$ patient_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, …
$ truly_infected <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No", "…
$ test_result    <chr> "Negative", "Negative", "Negative", "Negative", "Negati…
$ age_group      <chr> "60-74", "30-44", "30-44", "18-29", "30-44", "18-29", "…
$ risk_category  <chr> "Low", "Medium", "Low", "Medium", "Low", "High", "Low",…

2,000 simulated test results, each with the patient’s true infection status (the unobservable ground truth) and the test result (what the test said). The dataset is calibrated to published rapid-antigen test characteristics: ~91% sensitivity, ~95% specificity, with a 5% community prevalence.

Prevalence

prevalence <- mean(covid$truly_infected == "Yes")
prevalence

[1] 0.05

About 5% of patients are truly infected. This is the base rate the chapter keeps coming back to — and the number that AI tools routinely ignore when reasoning about test results.

The contingency table

conf_table <- table(
  Test  = covid$test_result,
  Truth = covid$truly_infected
)
conf_table

          Truth
Test         No  Yes
  Negative 1810    9
  Positive   90   91

addmargins(conf_table)

          Truth
Test         No  Yes  Sum
  Negative 1810    9 1819
  Positive   90   91  181
  Sum      1900  100 2000

This 2×2 table contains everything you need to compute every test characteristic that follows.

Sensitivity, specificity, PPV, NPV

TP <- conf_table["Positive", "Yes"]
FN <- conf_table["Negative", "Yes"]
TN <- conf_table["Negative", "No"]
FP <- conf_table["Positive", "No"]

sensitivity <- TP / (TP + FN)   # P(Positive | Infected)
specificity <- TN / (TN + FP)   # P(Negative | Not infected)
ppv         <- TP / (TP + FP)   # P(Infected | Positive)
npv         <- TN / (TN + FN)   # P(Not infected | Negative)

tibble(
  metric = c("Sensitivity", "Specificity", "PPV", "NPV"),
  value  = round(c(sensitivity, specificity, ppv, npv), 4)
)

# A tibble: 4 × 2
  metric      value
  <chr>       <dbl>
1 Sensitivity 0.91 
2 Specificity 0.953
3 PPV         0.503
4 NPV         0.995

The chapter’s central insight: even with high sensitivity and high specificity, the PPV — the probability you are actually infected given a positive test — can be far lower than people expect when prevalence is low. The base rate dominates.

Bayes’ theorem from first principles

The same PPV, computed from prevalence + sensitivity + specificity:

ppv_via_bayes <- (sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))

round(ppv_via_bayes, 4)

[1] 0.5028

round(ppv, 4)

[1] 0.5028

The two values match (within rounding). That is the connection the chapter draws: Bayes’ theorem and a contingency table are two ways of seeing the same arithmetic.

What happens to PPV when prevalence drops?

This is the chapter’s punchline made visible:

prevalence_grid <- seq(0.001, 0.20, by = 0.005)
ppv_grid <- (sensitivity * prevalence_grid) /
  (sensitivity * prevalence_grid +
     (1 - specificity) * (1 - prevalence_grid))

tibble(prevalence = prevalence_grid, ppv = ppv_grid) |>
  ggplot(aes(x = prevalence, y = ppv)) +
  geom_line(color = moe_colors$teal, linewidth = 1.2) +
  geom_vline(xintercept = prevalence, linetype = "dashed",
             color = moe_colors$coral) +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(
    x = "Disease prevalence in the tested population",
    y = "PPV: P(infected | positive test)",
    title = "PPV collapses when prevalence is low",
    subtitle = "Sensitivity 91%, specificity 95% (rapid antigen ballpark)"
  ) +
  theme_moe()

Figure 21.1: PPV as a function of prevalence, holding sensitivity and specificity fixed at the test’s values.

At 5% prevalence (the dashed line), PPV is around 50%. At 1% prevalence, PPV drops below 20%. At 0.1%, PPV is in the single digits. The test has not changed; the prior probability has. Mass-screening healthy populations with even a “good” test produces lots of false positives because the base rate is low.

The binomial distribution

The chapter introduces the binomial in the context of “if I test 20 patients and the population positive-rate is p, what is the distribution of the number of positives?”

p_pos <- mean(covid$test_result == "Positive")

tibble(
  k = 0:20,
  prob = dbinom(0:20, size = 20, prob = p_pos)
) |>
  ggplot(aes(x = k, y = prob)) +
  geom_col(fill = moe_colors$navy, width = 0.7) +
  scale_x_continuous(breaks = 0:20) +
  labs(
    x = "Number of positives in 20 tests",
    y = "Probability",
    title = paste0("Binomial(20, ", round(p_pos, 3), ")")
  ) +
  theme_moe()

Figure 21.2: Binomial distribution: number of positive tests in a batch of 20, given the observed positive rate.

dbinom() returns the probability mass for each value of k. pbinom() returns the cumulative probability if you need “probability of at least 5 positives.” rbinom() simulates draws.

Try it yourself

Drop prevalence to 1%. Recompute PPV via Bayes with prevalence = 0.01. What does that imply about mass-screening healthy populations?
Improve specificity. Bump specificity from 95% to 99% (a more accurate test). How much does PPV improve at 5% prevalence? Is the change bigger or smaller than improving sensitivity from 91% to 99%?