Datasets

All datasets used in Margin of Error are available for download as CSV files. See Appendix B in the book for complete variable descriptions and source documentation.

Download

Dataset | Chapter | Rows | Description | Download
flint-water-lead.csv | 1 | 271 | Lead levels in Flint, Michigan water samples | Download
sampling-demo.csv | 2 | 200 | Simulated student population for sampling methods | Download
income-inequality.csv | 3 | 500 | Household income data by education, region, and household size | Download
housing-redlining.csv | 4 | 551 | HOLC redlining grades linked to modern demographics | Download
spurious-correlations.csv | 4 | 10 | Pairs of variables with high correlation but no causal link | Download
covid-testing.csv | 5 | 2,000 | Simulated COVID test results with true infection status | Download
birth-weights.csv | 6 | 944 | Birth weights and maternal characteristics | Download
polling-data.csv | 7 | 50 | Real U.S. election polls with margins of error | Download
energy-reports.csv | 8 | 200 | Home energy report experiment (treatment vs control) | Download
resume-callbacks.csv | 9 | 4,870 | Resume audit study on racial discrimination in hiring | Download
education-spending.csv | 10 | 50 | State-level education spending and test scores | Download
wage-gap.csv | 11 | 534 | Wage data with gender, experience, education, and occupation | Download
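
After downloading a file, one quick sanity check is to compare its row count with the table above. The following is a minimal sketch, not part of the book's code. The name check_rows() is hypothetical, and the sketch assumes the Rows column counts data rows and excludes the header line.

```r
# Expected row counts, copied from the dataset table above.
expected_rows <- c(
  "flint-water-lead.csv"      = 271,
  "sampling-demo.csv"         = 200,
  "income-inequality.csv"     = 500,
  "housing-redlining.csv"     = 551,
  "spurious-correlations.csv" = 10,
  "covid-testing.csv"         = 2000,
  "birth-weights.csv"         = 944,
  "polling-data.csv"          = 50,
  "energy-reports.csv"        = 200,
  "resume-callbacks.csv"      = 4870,
  "education-spending.csv"    = 50,
  "wage-gap.csv"              = 534
)

# Hypothetical helper: TRUE if a loaded data frame has the listed row count.
check_rows <- function(data, file) {
  if (!file %in% names(expected_rows)) {
    stop("No expected row count recorded for ", file)
  }
  nrow(data) == expected_rows[[file]]
}

# Usage, assuming the CSV sits in the working directory:
# flint <- read.csv("flint-water-lead.csv")
# check_rows(flint, "flint-water-lead.csv")
```

A FALSE result usually points to an incomplete download or a file that was edited after saving.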

R Setup File

The R Walkthroughs on this site use a shared setup file called _common.R. It loads the tidyverse, defines the book’s color palette, and creates a custom ggplot theme (theme_moe()) so all plots have a consistent look.

If you want to run the walkthrough code on your own machine, download this file and place it in the same folder as your R script or R Markdown document.

Download _common.R

The file defines:

  • moe_colors — a named list of colors used throughout the book (navy, teal, amber, coral, etc.)
  • moe_palette — a discrete color scale for ggplot
  • theme_moe() — a clean ggplot theme with book-consistent styling
  • scale_color_moe() and scale_fill_moe() — convenience wrappers
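
To make the roles of these objects concrete, here is a hypothetical stand-in with the same names. It is not the contents of the real _common.R: the hex values are placeholders, treating moe_palette as a plain character vector is an assumption, and the theme settings are guesses at what "clean" might mean. It loads only ggplot2 rather than the full tidyverse.

```r
library(ggplot2)

# Placeholder hex values; the real file defines the book's actual colors.
moe_colors <- list(
  navy  = "#1B2A49",
  teal  = "#1B998B",
  amber = "#F2A541",
  coral = "#F26B5E"
)

# Assumption: the palette is the color list flattened to a character vector.
moe_palette <- unname(unlist(moe_colors))

# Assumed styling: a minimal base theme with bold titles and no minor grid.
theme_moe <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      panel.grid.minor = element_blank()
    )
}

# Convenience wrappers that apply the palette to discrete color and fill scales.
scale_color_moe <- function(...) scale_color_manual(values = moe_palette, ...)
scale_fill_moe <- function(...) scale_fill_manual(values = moe_palette, ...)

# Usage with a built-in dataset:
# ggplot(mtcars, aes(wt, mpg, color = factor(cyl))) +
#   geom_point() + scale_color_moe() + theme_moe()
```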

Each walkthrough begins with source("_common.R"). If you prefer not to use it, replace that line with library(tidyverse). The code will still work, but the plots will use default ggplot styling instead of the book’s theme.
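
If you want one script that runs either way, the choice can be made at run time. This is a minimal sketch under the assumption that _common.R, when present, sits in the working directory. The helper name load_walkthrough_setup() is hypothetical and does not appear in the book's code.

```r
# Hypothetical helper: use the book's setup file if present, else plain tidyverse.
load_walkthrough_setup <- function(path = "_common.R") {
  if (file.exists(path)) {
    # source() evaluates in the global environment by default,
    # so the objects the file defines are visible to the rest of the script.
    source(path)
    "book theme"
  } else {
    library(tidyverse)
    "default ggplot styling"
  }
}

# Usage at the top of a script:
# styling <- load_walkthrough_setup()
# message("Plots will use: ", styling)
```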

Loading Data in R

# Load the shared theme (optional; download _common.R from the R Setup File section above)
source("_common.R")

# Or just load tidyverse if you don't need the book's theme
# library(tidyverse)

# Option 1: Download the CSV and load from your local machine
flint <- read_csv("flint-water-lead.csv")

# Option 2: Load directly from the companion site
flint <- read_csv("https://stats.marginoferrormedia.com/data/flint-water-lead.csv")
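
The two options can be combined by downloading a file once and reusing the local copy afterwards. The sketch below uses base R only. The helper names are hypothetical, and it assumes every dataset is served under the same /data/ path as the Flint file shown above.

```r
# Base URL taken from the read_csv() example above.
# Assumption: the other dataset files live under the same path.
moe_base_url <- "https://stats.marginoferrormedia.com/data/"

# Hypothetical helper: build the assumed URL for a dataset file name.
moe_data_url <- function(file) paste0(moe_base_url, file)

# Hypothetical helper: download into `dir` only if the file is not
# already there, then return the local path.
fetch_moe_data <- function(file, dir = ".") {
  path <- file.path(dir, file)
  if (!file.exists(path)) {
    download.file(moe_data_url(file), destfile = path, mode = "wb")
  }
  path
}

# Usage (needs a network connection on the first call, tidyverse loaded):
# flint <- read_csv(fetch_moe_data("flint-water-lead.csv"))
```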

Data Sources

Datasets are a mix of real public data and carefully simulated data calibrated to published research. Real datasets include the Flint water study, FiveThirtyEight’s redlining analysis, the Bertrand & Mullainathan resume audit experiment, NCES education spending data, and CPS wage data. Simulated datasets reproduce key statistical features from published studies while making individual-level data freely available for teaching.

See Appendix B in the book for full source citations, variable codebooks, and notes on simulation methodology.