Datasets

All datasets used in Margin of Error are available for download as CSV files. See Appendix B of the book for complete variable descriptions, source documentation, and notes on simulation methodology.

Download

| Dataset | Chapter | Rows | Description |
|---|---|---|---|
| flint-water-lead.csv | 1 | 271 | Lead levels from the Virginia Tech Flint Water Study (2015) |
| sampling-demo.csv | 2 | 200 | Simulated student population for sampling-method demos |
| acs-household-income.csv | 3 | 1,200 | U.S. household income from the 2022 American Community Survey 1-Year PUMS |
| housing-redlining.csv | 4 | 551 | HOLC redlining grades linked to modern demographics |
| spurious-correlations.csv | 4 | 10 | Pairs of variables with high correlation but no causal link |
| covid-testing.csv | 5 | 2,000 | Simulated COVID test results with true infection status |
| birth-weights.csv | 6 | 944 | Birth weights and maternal characteristics (OpenIntro births14) |
| polling-data.csv | 7 | 50 | Real U.S. election polls with margins of error (FiveThirtyEight) |
| energy-reports.csv | 8 | 200 | Home energy report experiment (treatment vs. control) |
| resume-callbacks.csv | 9 | 4,870 | Bertrand & Mullainathan (2004) résumé audit study |
| education-spending.csv | 10 | 50 | State-level education spending and test scores |
| wage-gap.csv | 11 | 534 | CPS 1985 wage data with gender, experience, education, and occupation |

Loading Data in R

library(tidyverse)

# Option 1: Download the CSV and read from your local working directory
flint <- read_csv("flint-water-lead.csv")

# Option 2: Read directly from this site
flint <- read_csv("https://stats.marginoferrormedia.com/datasets/flint-water-lead.csv")
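
The two options above can be combined into one small helper that prefers a local copy and falls back to the site. This is a minimal sketch, not part of the book's code: dataset_path() and load_dataset() are hypothetical names, and it assumes every file in the table is served from the same base URL as flint-water-lead.csv. The optional row check compares the import against the Rows column of the table.

```r
library(readr)

# Assumption: all datasets share the base URL shown for flint-water-lead.csv.
base_url <- "https://stats.marginoferrormedia.com/datasets/"

# Hypothetical helper: return the local path if the file has been downloaded
# into `dir`, otherwise the URL on the site.
dataset_path <- function(file_name, dir = ".") {
  local_copy <- file.path(dir, file_name)
  if (file.exists(local_copy)) local_copy else paste0(base_url, file_name)
}

# Hypothetical helper: read a dataset and, if `expected_rows` is given,
# warn when the row count differs from the number listed in the table.
load_dataset <- function(file_name, expected_rows = NULL, dir = ".") {
  data <- read_csv(dataset_path(file_name, dir))
  if (!is.null(expected_rows) && nrow(data) != expected_rows) {
    warning("Expected ", expected_rows, " rows in ", file_name,
            " but read ", nrow(data))
  }
  data
}

# Usage (needs a local copy or an internet connection):
# flint <- load_dataset("flint-water-lead.csv", expected_rows = 271)
```

A warning rather than an error is used for a row mismatch so that a partially downloaded or updated file can still be inspected.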

The _common.R setup file

Each R Companion walkthrough begins with source("_common.R"). That file loads tidyverse and defines the book’s color palette and the theme_moe() ggplot theme, so the figures match the book’s look. To run a walkthrough on your own machine, download _common.R and place it in the same folder as your script. If you would rather skip it, replace the source() line with library(tidyverse). The analyses still run; only the styling changes.
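
One way to make a script work with or without the setup file is to check for it first. This is a sketch under the assumptions stated in the paragraph above (that _common.R sits next to the script and that library(tidyverse) is an adequate substitute); it is not taken from the walkthroughs themselves.

```r
# Use the book's setup file when it is in the working directory;
# otherwise fall back to plain tidyverse, as described above.
if (file.exists("_common.R")) {
  source("_common.R")   # loads tidyverse, the color palette, and theme_moe()
} else {
  library(tidyverse)    # analyses still run; plots use default ggplot2 styling
}
```

Because file.exists() looks in the current working directory, the check only finds _common.R if the script is run from the folder that contains it.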

Data Sources

The datasets are a mix of real public data and carefully simulated data calibrated to published research. Real datasets include the Virginia Tech Flint Water Study, FiveThirtyEight’s redlining and polling collections, the Bertrand & Mullainathan résumé audit, OpenIntro’s births14, and the CPS 1985 wage data from the AER package. Simulated datasets reproduce key statistical features from published studies while keeping individual-level data freely available for teaching.

See Appendix B for full source citations, variable codebooks, and simulation notes.