Chapter 1: Why Statistics Matters Now

Overview

Statistics helps us separate signal from noise — and sometimes the stakes are life and death. In this walkthrough, we use data from the Flint, Michigan water crisis to see why data skills matter. Lead-contaminated drinking water affected thousands of homes, and the data tell a story that numbers alone can make visible. This walkthrough accompanies Chapter 1 of Margin of Error.

Setup

# Loads tidyverse, book color palette, and theme_moe()
# Download _common.R from the Datasets page if running locally
source("_common.R")

Load and Explore the Data

The dataset contains lead measurements (in parts per billion) from 271 homes in Flint, Michigan. Each row represents a single water test from a home.

flint <- read_csv("data/flint-water-lead.csv")
Rows: 271 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): home_id, draw_type
dbl (3): ward, zip_code, lead_ppb

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(flint)
Rows: 271
Columns: 5
$ home_id   <chr> "H001", "H002", "H004", "H005", "H006", "H007", "H008", "H00…
$ ward      <dbl> 6, 9, 1, 8, 3, 9, 9, 5, 9, 3, 9, 5, 2, 7, 9, 9, 5, 6, 2, 6, …
$ zip_code  <dbl> 48504, 48507, 48504, 48507, 48505, 48507, 48507, 48503, 4850…
$ lead_ppb  <dbl> 0.344, 8.133, 1.111, 8.007, 1.951, 7.200, 40.630, 1.100, 10.…
$ draw_type <chr> "first", "first", "first", "first", "first", "first", "first…

The glimpse() output tells us how many rows and columns we have, along with each variable’s type. Let’s get a statistical summary next.

summary(flint)
   home_id               ward          zip_code        lead_ppb      
 Length:271         Min.   :0.000   Min.   :48502   Min.   :  0.344  
 Class :character   1st Qu.:3.000   1st Qu.:48503   1st Qu.:  1.578  
 Mode  :character   Median :6.000   Median :48505   Median :  3.521  
                    Mean   :5.314   Mean   :48505   Mean   : 10.646  
                    3rd Qu.:8.000   3rd Qu.:48506   3rd Qu.:  9.050  
                    Max.   :9.000   Max.   :48532   Max.   :158.000  
  draw_type        
 Length:271        
 Class :character  
 Mode  :character  
                   
                   
                   

Identify Variable Types

Before any analysis, classify your variables. This determines which summaries and charts are appropriate.

str(flint)
spc_tbl_ [271 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ home_id  : chr [1:271] "H001" "H002" "H004" "H005" ...
 $ ward     : num [1:271] 6 9 1 8 3 9 9 5 9 3 ...
 $ zip_code : num [1:271] 48504 48507 48504 48507 48505 ...
 $ lead_ppb : num [1:271] 0.344 8.133 1.111 8.007 1.951 ...
 $ draw_type: chr [1:271] "first" "first" "first" "first" ...
 - attr(*, "spec")=
  .. cols(
  ..   home_id = col_character(),
  ..   ward = col_double(),
  ..   zip_code = col_double(),
  ..   lead_ppb = col_double(),
  ..   draw_type = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
  • home_id — identifier (nominal)
  • ward — categorical/nominal (geographic grouping)
  • zip_code — categorical/nominal (looks numeric, but codes a location)
  • lead_ppb — continuous (the measurement we care about)
  • draw_type — categorical/nominal (first draw, flushed, etc.)

Check for Missing Values

Missing data can distort every calculation that follows. Always check before you analyze.

colSums(is.na(flint))
  home_id      ward  zip_code  lead_ppb draw_type 
        0         0         0         0         0 
cat("Total missing values:", sum(is.na(flint)), "\n")
Total missing values: 0 

Visualize Lead Levels

The EPA action level for lead in drinking water is 15 parts per billion (ppb). A histogram shows how lead readings are distributed across homes and makes exceedances immediately visible.

ggplot(flint, aes(x = lead_ppb)) +
  geom_histogram(binwidth = 5, fill = moe_colors$teal, color = "white") +
  geom_vline(xintercept = 15, linetype = "dashed", linewidth = 1,
             color = moe_colors$coral) +
  annotate("text", x = 17, y = Inf, label = "EPA action level (15 ppb)",
           hjust = 0, vjust = 2, color = moe_colors$coral, fontface = "bold",
           size = 3.5) +
  labs(
    title = "Lead Levels in Flint Drinking Water",
    subtitle = "Many homes exceed the EPA action level",
    x = "Lead (parts per billion)",
    y = "Number of homes"
  ) +
  theme_moe()
Figure 1: Distribution of lead levels in Flint water samples

Homes Above the EPA Limit

What proportion of tested homes exceeded the 15 ppb threshold?

n_above <- sum(flint$lead_ppb > 15, na.rm = TRUE)
pct_above <- mean(flint$lead_ppb > 15, na.rm = TRUE) * 100

cat("Homes above 15 ppb:", n_above, "\n")
Homes above 15 ppb: 45 
cat("Percentage above 15 ppb:", round(pct_above, 1), "%\n")
Percentage above 15 ppb: 16.6 %

Summarize Lead by Ward

Different neighborhoods may be affected differently. Grouping by ward reveals geographic patterns.

ward_summary <- flint |>
  group_by(ward) |>
  summarize(
    n_homes      = n(),
    mean_lead    = round(mean(lead_ppb, na.rm = TRUE), 1),
    median_lead  = round(median(lead_ppb, na.rm = TRUE), 1),
    max_lead     = max(lead_ppb, na.rm = TRUE),
    pct_above_15 = round(mean(lead_ppb > 15, na.rm = TRUE) * 100, 1)
  ) |>
  arrange(desc(mean_lead))

ward_summary
# A tibble: 10 × 6
    ward n_homes mean_lead median_lead max_lead pct_above_15
   <dbl>   <int>     <dbl>       <dbl>    <dbl>        <dbl>
 1     6      32      19.7         4.4  120             28.1
 2     8      29      14.8         5.4  158             20.7
 3     3      22      12.5         3.4  118.            13.6
 4     7      39      11           5.2  105.            23.1
 5     5      18      10.3         2.4   66.2           22.2
 6     2      24       9.8         3     75.8           16.7
 7     4      35       8.6         2    139.            11.4
 8     9      41       6.6         3.8   40.6            9.8
 9     1      30       4.2         2.4   23.8            6.7
10     0       1       0.7         0.7    0.739          0  
ggplot(ward_summary, aes(x = reorder(factor(ward), mean_lead), y = mean_lead)) +
  geom_col(fill = moe_colors$navy) +
  geom_hline(yintercept = 15, linetype = "dashed", color = moe_colors$coral) +
  coord_flip() +
  labs(
    title = "Mean Lead Level by Ward",
    subtitle = "Dashed line = EPA action level (15 ppb)",
    x = "Ward",
    y = "Mean lead (ppb)"
  ) +
  theme_moe()
Figure 2: Mean lead levels by ward in Flint

Try It Yourself

  1. Mean vs. median. Calculate the overall median lead level. How does it compare to the mean? If the mean is substantially larger than the median, what does that tell you about the shape of the distribution?

  2. Worst-hit ward. Using the ward_summary table above (or computing your own), which ward has the highest proportion of homes above 15 ppb? What might explain geographic differences in contamination?