R Companion — Chapter 1

R Companion — Chapter 1: Why Statistics Matters Now

This walkthrough reproduces every analysis the chapter narrative discusses, using the same Flint Water Study data the chapter is built on. The aim is twofold: show you exactly how to compute the chapter’s results in R, and build the verification habits that turn a CSV into a defensible analysis.

Setup

source("_common.R")

The _common.R file loads tidyverse and defines the book’s color palette and theme_moe() ggplot theme. If you are running this on your own machine, download _common.R and place it next to your script, or replace the source() line with library(tidyverse).

Load the Flint dataset

The dataset contains first-draw lead measurements from 271 Flint homes during the Virginia Tech citizen-science sampling effort in summer 2015.

flint <- read_csv("../datasets/flint-water-lead.csv")
glimpse(flint)
Rows: 271
Columns: 5
$ sample_id <dbl> 1, 2, 4, 5, 6, 7, 8, 9, 12, 13, 15, 16, 17, 18, 19, 20, 21, …
$ zip_code  <dbl> 48504, 48507, 48504, 48507, 48505, 48507, 48507, 48503, 4850…
$ ward      <dbl> 6, 9, 1, 8, 3, 9, 9, 5, 9, 3, 9, 5, 2, 7, 9, 9, 5, 6, 2, 6, …
$ lead_ppb  <dbl> 0.344, 8.133, 1.111, 8.007, 1.951, 7.200, 40.630, 1.100, 10.…
$ notes     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …

Five columns: a sample identifier, ZIP code, ward (geographic district), the lead concentration in parts per billion, and a notes field used to flag a handful of homes that were sampled twice.

Always summary first

Before any analysis, look at the data.

summary(flint$lead_ppb)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.344   1.578   3.521  10.646   9.050 158.000 

Median 3.5 ppb, mean 10.65 ppb. The mean is roughly three times the median — a clear signal of right skew. A small number of homes with high readings are pulling the average up, while most homes test below the median.

Verify the chapter’s headline claim

Chapter 1 reports that 16.6% of homes exceeded the EPA action level of 15 ppb and that seven came in above 100 ppb, with a maximum first-draw reading of 158 ppb. Reproduce those numbers:

flint |>
  summarize(
    n_total       = n(),
    n_above_15    = sum(lead_ppb > 15),
    pct_above_15  = mean(lead_ppb > 15) * 100,
    n_above_100   = sum(lead_ppb > 100),
    max_lead_ppb  = max(lead_ppb)
  )
# A tibble: 1 × 5
  n_total n_above_15 pct_above_15 n_above_100 max_lead_ppb
    <int>      <int>        <dbl>       <int>        <dbl>
1     271         45         16.6           7          158

The output should reproduce the chapter’s numbers: 271 samples total, 45 above 15 ppb (16.6%), 7 above 100 ppb, maximum 158 ppb. If your numbers differ, something went wrong at the read step — re-run from read_csv() and recheck.

Reproduce Figure 1.1: distribution of lead levels

The chapter’s first figure is a right-skewed histogram with a vertical line at the EPA action level.

ggplot(flint, aes(x = lead_ppb)) +
  geom_histogram(binwidth = 5, fill = moe_colors$teal, color = "white") +
  geom_vline(xintercept = 15, linetype = "dashed",
             color = moe_colors$coral, linewidth = 1) +
  annotate("text", x = 22, y = Inf, label = "EPA action level (15 ppb)",
           hjust = 0, vjust = 2, color = moe_colors$coral,
           fontface = "bold", size = 3.5) +
  labs(
    x = "Lead concentration (ppb)",
    y = "Number of homes",
    title = "Lead Levels in Flint Drinking Water"
  ) +
  theme_moe()
Figure 17.1: Distribution of lead levels in Flint first-draw samples. The dashed line marks the EPA action level of 15 ppb.

The shape — heavy weight near zero, a long thin tail to the right — is exactly the kind of distribution where the mean is a misleading summary. The median (3.5 ppb) describes a typical home; the mean (10.65 ppb) describes the average pull of the whole sample, including the long-tail homes.

Geographic breakdown by ward

The chapter mentions that some Flint wards were more affected than others. A grouped summary surfaces the geography:

ward_summary <- flint |>
  group_by(ward) |>
  summarize(
    n_homes      = n(),
    median_lead  = round(median(lead_ppb), 1),
    mean_lead    = round(mean(lead_ppb), 1),
    max_lead     = round(max(lead_ppb), 1),
    pct_above_15 = round(mean(lead_ppb > 15) * 100, 1),
    .groups = "drop"
  ) |>
  arrange(desc(median_lead))

ward_summary
# A tibble: 10 × 6
    ward n_homes median_lead mean_lead max_lead pct_above_15
   <dbl>   <int>       <dbl>     <dbl>    <dbl>        <dbl>
 1     8      29         5.4      14.8    158           20.7
 2     7      39         5.2      11      105.          23.1
 3     6      32         4.4      19.7    120           28.1
 4     9      41         3.8       6.6     40.6          9.8
 5     3      22         3.4      12.5    118.          13.6
 6     2      24         3         9.8     75.8         16.7
 7     1      30         2.4       4.2     23.9          6.7
 8     5      18         2.4      10.3     66.2         22.2
 9     4      35         2         8.6    139.          11.4
10     0       1         0.7       0.7      0.7          0  

Wards with the highest median lead are not always the wards with the highest mean — a single very high reading can flip the rankings. Use the median-based ranking when the question is “where do typical homes have the highest lead”; use the mean-based ranking when the question is “where is the total exposure burden highest.”

ward_summary |>
  ggplot(aes(x = reorder(factor(ward), mean_lead), y = mean_lead)) +
  geom_col(fill = moe_colors$navy) +
  geom_hline(yintercept = 15, linetype = "dashed", color = moe_colors$coral) +
  coord_flip() +
  labs(
    x = "Ward",
    y = "Mean lead (ppb)",
    title = "Mean Lead by Ward",
    subtitle = "Dashed line: EPA action level (15 ppb)"
  ) +
  theme_moe()
Figure 17.2: Mean lead by Flint ward, sorted from highest to lowest. Dashed line marks the EPA action level.

What this walkthrough taught you

  1. Always run summary() and glimpse() immediately after read_csv(). These two calls catch most data-import problems before they propagate.
  2. Verify the headline numbers against your source. The chapter cites specific percentages; the walkthrough reproduces them. If they did not match, that is a problem to fix before any further analysis.
  3. When data is skewed, mean and median answer different questions. Report both, and choose the one that fits what your reader needs to know.
  4. Grouped summaries (group_by() + summarize()) are the workhorse of comparative analysis. You will use this pattern in nearly every chapter.

Try it yourself

  1. Above 100 ppb. The chapter mentions seven homes above 100 ppb. Filter the data to find them and look at their wards. Are they concentrated in a few wards, or spread across many?
  2. Compare the median and mean by ward. For each ward, compute both. Which wards show the biggest gap between the two? What does a large mean-median gap tell you about that ward’s distribution?