source("_common.R")R Companion — Chapter 1
R Companion — Chapter 1: Why Statistics Matters Now
This walkthrough reproduces every analysis the chapter narrative discusses, using the same Flint Water Study data the chapter is built on. The aim is twofold: show you exactly how to compute the chapter’s results in R, and build the verification habits that turn a CSV into a defensible analysis.
Setup
The _common.R file loads tidyverse and defines the book’s color palette and theme_moe() ggplot theme. If you are running this on your own machine, download _common.R and place it next to your script, or replace the source() line with library(tidyverse).
Load the Flint dataset
The dataset contains first-draw lead measurements from 271 Flint homes during the Virginia Tech citizen-science sampling effort in summer 2015.
flint <- read_csv("../datasets/flint-water-lead.csv")
glimpse(flint)Rows: 271
Columns: 5
$ sample_id <dbl> 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 15, 16, 17, 18, 19, 20, 21, …
$ zip_code <dbl> 48504, 48507, 48504, 48507, 48505, 48507, 48507, 48503, 4850…
$ ward <dbl> 6, 9, 1, 8, 3, 9, 9, 5, 9, 3, 9, 5, 2, 7, 9, 9, 5, 6, 2, 6, …
$ lead_ppb <dbl> 0.344, 8.133, 1.111, 8.007, 1.951, 7.200, 40.630, 1.100, 10.…
$ notes <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
Five columns: a sample identifier, ZIP code, ward (geographic district), the lead concentration in parts per billion, and a notes field used to flag a handful of homes that were sampled twice.
Always summary first
Before any analysis, look at the data.
summary(flint$lead_ppb) Min. 1st Qu. Median Mean 3rd Qu. Max.
0.344 1.578 3.521 10.646 9.050 158.000
Median 3.5 ppb, mean 10.65 ppb. The mean is roughly three times the median — a clear signal of right skew. A small number of homes with high readings are pulling the average up, while most homes test below the median.
Verify the chapter’s headline claim
Chapter 1 reports that 16.6% of homes exceeded the EPA action level of 15 ppb and that seven came in above 100 ppb, with a maximum first-draw reading of 158 ppb. Reproduce those numbers:
flint |>
summarize(
n_total = n(),
n_above_15 = sum(lead_ppb > 15),
pct_above_15 = mean(lead_ppb > 15) * 100,
n_above_100 = sum(lead_ppb > 100),
max_lead_ppb = max(lead_ppb)
)# A tibble: 1 × 5
n_total n_above_15 pct_above_15 n_above_100 max_lead_ppb
<int> <int> <dbl> <int> <dbl>
1 271 45 16.6 7 158
The output should reproduce the chapter’s numbers: 271 samples total, 45 above 15 ppb (16.6%), 7 above 100 ppb, maximum 158 ppb. If your numbers differ, something went wrong at the read step — re-run from read_csv() and recheck.
Reproduce Figure 1.1: distribution of lead levels
The chapter’s first figure is a right-skewed histogram with a vertical line at the EPA action level.
ggplot(flint, aes(x = lead_ppb)) +
geom_histogram(binwidth = 5, fill = moe_colors$teal, color = "white") +
geom_vline(xintercept = 15, linetype = "dashed",
color = moe_colors$coral, linewidth = 1) +
annotate("text", x = 22, y = Inf, label = "EPA action level (15 ppb)",
hjust = 0, vjust = 2, color = moe_colors$coral,
fontface = "bold", size = 3.5) +
labs(
x = "Lead concentration (ppb)",
y = "Number of homes",
title = "Lead Levels in Flint Drinking Water"
) +
theme_moe()
The shape — heavy weight near zero, a long thin tail to the right — is exactly the kind of distribution where the mean is a misleading summary. The median (3.5 ppb) describes a typical home; the mean (10.65 ppb) describes the average pull of the whole sample, including the long-tail homes.
Geographic breakdown by ward
The chapter mentions that some Flint wards were more affected than others. A grouped summary surfaces the geography:
ward_summary <- flint |>
group_by(ward) |>
summarize(
n_homes = n(),
median_lead = round(median(lead_ppb), 1),
mean_lead = round(mean(lead_ppb), 1),
max_lead = round(max(lead_ppb), 1),
pct_above_15 = round(mean(lead_ppb > 15) * 100, 1),
.groups = "drop"
) |>
arrange(desc(median_lead))
ward_summary# A tibble: 10 × 6
ward n_homes median_lead mean_lead max_lead pct_above_15
<dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 8 29 5.4 14.8 158 20.7
2 7 39 5.2 11 105. 23.1
3 6 32 4.4 19.7 120 28.1
4 9 41 3.8 6.6 40.6 9.8
5 3 22 3.4 12.5 118. 13.6
6 2 24 3 9.8 75.8 16.7
7 1 30 2.4 4.2 23.9 6.7
8 5 18 2.4 10.3 66.2 22.2
9 4 35 2 8.6 139. 11.4
10 0 1 0.7 0.7 0.7 0
Wards with the highest median lead are not always the wards with the highest mean — a single very high reading can flip the rankings. Use the median-based ranking when the question is “where do typical homes have the highest lead”; use the mean-based ranking when the question is “where is the total exposure burden highest.”
ward_summary |>
ggplot(aes(x = reorder(factor(ward), mean_lead), y = mean_lead)) +
geom_col(fill = moe_colors$navy) +
geom_hline(yintercept = 15, linetype = "dashed", color = moe_colors$coral) +
coord_flip() +
labs(
x = "Ward",
y = "Mean lead (ppb)",
title = "Mean Lead by Ward",
subtitle = "Dashed line: EPA action level (15 ppb)"
) +
theme_moe()
What this walkthrough taught you
- Always run
summary()andglimpse()immediately afterread_csv(). These two calls catch most data-import problems before they propagate. - Verify the headline numbers against your source. The chapter cites specific percentages; the walkthrough reproduces them. If they did not match, that is a problem to fix before any further analysis.
- When data is skewed, mean and median answer different questions. Report both, and choose the one that fits what your reader needs to know.
- Grouped summaries (
group_by()+summarize()) are the workhorse of comparative analysis. You will use this pattern in nearly every chapter.
Try it yourself
- Above 100 ppb. The chapter mentions seven homes above 100 ppb. Filter the data to find them and look at their wards. Are they concentrated in a few wards, or spread across many?
- Compare the median and mean by ward. For each ward, compute both. Which wards show the biggest gap between the two? What does a large mean-median gap tell you about that ward’s distribution?