Datasets
All datasets used in Margin of Error are available for download as CSV files. See Appendix B of the book for complete variable descriptions, source documentation, and notes on simulation methodology.
Download
| Dataset | Chapter | Rows | Description | Download |
|---|---|---|---|---|
| flint-water-lead.csv | 1 | 271 | Lead levels from the Virginia Tech Flint Water Study (2015) | Download |
| sampling-demo.csv | 2 | 200 | Simulated student population for sampling-method demos | Download |
| acs-household-income.csv | 3 | 1,200 | U.S. household income from the 2022 American Community Survey 1-Year PUMS | Download |
| housing-redlining.csv | 4 | 551 | HOLC redlining grades linked to modern demographics | Download |
| spurious-correlations.csv | 4 | 10 | Pairs of variables with high correlation but no causal link | Download |
| covid-testing.csv | 5 | 2,000 | Simulated COVID test results with true infection status | Download |
| birth-weights.csv | 6 | 944 | Birth weights and maternal characteristics (OpenIntro births14) | Download |
| polling-data.csv | 7 | 50 | Real U.S. election polls with margins of error (FiveThirtyEight) | Download |
| energy-reports.csv | 8 | 200 | Home energy report experiment (treatment vs. control) | Download |
| resume-callbacks.csv | 9 | 4,870 | Bertrand & Mullainathan (2004) résumé audit study | Download |
| education-spending.csv | 10 | 50 | State-level education spending and test scores | Download |
| wage-gap.csv | 11 | 534 | CPS 1985 wage data with gender, experience, education, and occupation | Download |
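The row counts listed above can double as a quick sanity check once a file has been loaded in R. The sketch below is one way to do that; `expected_rows` and `check_rows()` are hypothetical names introduced here for illustration and are not part of the book's code. The counts themselves are copied from the table.

```r
# Expected row counts, copied from the dataset table on this page.
expected_rows <- c(
  "flint-water-lead.csv"      = 271L,
  "sampling-demo.csv"         = 200L,
  "acs-household-income.csv"  = 1200L,
  "housing-redlining.csv"     = 551L,
  "spurious-correlations.csv" = 10L,
  "covid-testing.csv"         = 2000L,
  "birth-weights.csv"         = 944L,
  "polling-data.csv"          = 50L,
  "energy-reports.csv"        = 200L,
  "resume-callbacks.csv"      = 4870L,
  "education-spending.csv"    = 50L,
  "wage-gap.csv"              = 534L
)

# check_rows() is a hypothetical helper (an assumption, not the book's API).
# It compares nrow(data) with the count listed for `file`, warns on a
# mismatch, and invisibly returns TRUE or FALSE.
check_rows <- function(data, file) {
  if (!file %in% names(expected_rows)) {
    stop("Unknown dataset: ", file)
  }
  expected <- expected_rows[[file]]
  actual <- nrow(data)
  if (actual != expected) {
    warning(sprintf("%s has %d rows; the table lists %d", file, actual, expected))
  }
  invisible(actual == expected)
}
```

After reading a file, for example with `flint <- read_csv("flint-water-lead.csv")`, a call such as `check_rows(flint, "flint-water-lead.csv")` would flag a truncated or partial download.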
Loading Data in R
library(tidyverse)
# Option 1: Download the CSV and read from your local working directory
flint <- read_csv("flint-water-lead.csv")
# Option 2: Read directly from this site
flint <- read_csv("https://stats.marginoferrormedia.com/datasets/flint-water-lead.csv")
The _common.R setup file
The R Companion walkthroughs each begin with source("_common.R"). That file loads tidyverse and defines the book’s color palette and the theme_moe() ggplot theme, so the figures match the book’s look. To run a walkthrough on your own machine, download _common.R and place it in the same folder as your script. If you would rather skip it, replace the source() line with library(tidyverse); the analyses still run, and only the styling changes.
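If a script should work whether or not the setup file has been downloaded, the two options can be combined. This is a minimal sketch, not something the book prescribes; `load_setup()` is a hypothetical helper name, and it assumes tidyverse is installed for the fallback path.

```r
# load_setup() is a hypothetical helper (an assumption, not the book's code).
# It sources the setup file when it exists and otherwise falls back to
# loading tidyverse. It invisibly returns which path was taken.
load_setup <- function(path = "_common.R") {
  if (file.exists(path)) {
    source(path)        # setup file found: evaluate it
    invisible("common")
  } else {
    library(tidyverse)  # fallback: assumes tidyverse is installed
    invisible("tidyverse")
  }
}
```

Calling `load_setup()` at the top of a walkthrough script would then stand in for the `source("_common.R")` line.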
Data Sources
The datasets are a mix of real public data and carefully simulated data calibrated to published research. Real datasets include the Virginia Tech Flint Water Study, FiveThirtyEight’s redlining and polling collections, the Bertrand & Mullainathan résumé audit, OpenIntro’s births14, and the CPS 1985 wage data from the AER package. Simulated datasets reproduce key statistical features from published studies while keeping individual-level data freely available for teaching.
See Appendix B for full source citations, variable codebooks, and simulation notes.