Appendix B: Dataset Descriptions

All datasets used in the book are available for download from the companion website. Each is provided as a CSV file. Where publicly available, we use real data from the original studies or authoritative government sources. Where real microdata is unavailable or the study design requires it, we use simulated data calibrated to published parameters. Each dataset’s status is clearly marked below.

Chapter 1: Why Statistics Matters Now

flint-water-lead.csv

Status: Real data
Source: Virginia Tech Flint Water Study. Data collected by Marc Edwards’s research team during the Flint water crisis and made publicly available through the study’s GitHub repository.
Rows: 271 | Variables: 5

Variable	Type	Description
sample_id	Nominal	Unique identifier for each sample
zip_code	Nominal	ZIP code of the home
ward	Nominal	Ward (geographic district) within Flint (0–9)
lead_ppb	Continuous	Lead concentration in parts per billion
notes	Nominal	Flags for special cases (e.g., homes sampled twice)

No missing values. The dataset contains first-draw water samples from 271 Flint homes across 10 wards. Lead levels range from 0.34 to 158 ppb, with a median of 3.5 ppb and a mean of 10.65 ppb. Approximately 16.6% of homes exceeded the EPA action level of 15 ppb, consistent with the crisis-level contamination documented by the Virginia Tech team.
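
The exceedance share quoted above is a simple proportion: the number of samples over the action level divided by the number of samples. A minimal Python sketch follows. The function name and the short list of values are illustrative assumptions, not taken from flint-water-lead.csv.

```python
EPA_ACTION_LEVEL_PPB = 15.0  # action level cited in the text

def share_exceeding(lead_ppb, threshold=EPA_ACTION_LEVEL_PPB):
    """Return the fraction of samples strictly above the threshold."""
    if not lead_ppb:
        raise ValueError("need at least one sample")
    return sum(1 for x in lead_ppb if x > threshold) / len(lead_ppb)

# Made-up values for illustration only (not from the dataset).
toy_samples = [0.5, 2.0, 3.5, 11.0, 16.0, 40.0]
print(f"{share_exceeding(toy_samples):.1%}")

# Sanity check on the figures in the text: 16.6% of 271 homes is about 45 homes.
homes_over = round(0.166 * 271)
print(homes_over, f"{homes_over / 271:.1%}")
```

Applied to the lead_ppb column, the same function should return roughly 0.166 if the summary above holds.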

Chapter 2: Research Design

sampling-demo.csv

Status: Simulated (seed = 202)
Based on: Fictional small university with realistic demographic and academic distributions. Designed for practicing sampling methods (simple random, stratified, cluster, systematic).
Rows: 200 | Variables: 7

Variable	Type	Description
student_id	Nominal	Unique identifier (S001–S200)
major	Nominal	Academic major (Business/Engineering/Liberal Arts/Nursing/Sciences)
gpa	Continuous	Grade point average (0–4 scale)
year	Ordinal	Class year (Freshman/Sophomore/Junior/Senior)
commute_distance_miles	Continuous	Distance of commute in miles (right-skewed)
housing	Nominal	Housing type (On-campus/Off-campus)
works_part_time	Nominal	Whether the student works part-time (Yes/No)

Why simulated: The scenario is fictional by design. No specific real university study is being replicated; the dataset is constructed to provide a realistic population for demonstrating sampling techniques.
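
The four sampling designs this dataset is meant for can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the population is a hypothetical stand-in built in code rather than read from sampling-demo.csv, the function names are illustrative, and majors serve as clusters only for convenience (a real cluster sample would typically use natural groupings such as residence halls or class sections).

```python
import random

random.seed(202)  # arbitrary seed so the illustration is repeatable

MAJORS = ["Business", "Engineering", "Liberal Arts", "Nursing", "Sciences"]
# Hypothetical stand-in population: 200 (student_id, major) pairs, 40 per major.
population = [(f"S{i:03d}", MAJORS[i % 5]) for i in range(1, 201)]

def simple_random(pop, n):
    """Every subset of size n is equally likely."""
    return random.sample(pop, n)

def systematic(pop, n):
    """Random start, then every k-th unit, where k = len(pop) // n."""
    step = len(pop) // n
    start = random.randrange(step)
    return pop[start::step][:n]

def group_by(pop, key):
    groups = {}
    for unit in pop:
        groups.setdefault(key(unit), []).append(unit)
    return groups

def stratified(pop, n_per_stratum, key):
    """Draw a simple random sample inside every stratum."""
    strata = group_by(pop, key)
    return [u for g in strata.values() for u in random.sample(g, n_per_stratum)]

def cluster(pop, n_clusters, key):
    """Pick whole clusters at random and keep every unit in them."""
    clusters = group_by(pop, key)
    chosen = random.sample(sorted(clusters), n_clusters)
    return [u for c in chosen for u in clusters[c]]

by_major = lambda unit: unit[1]
print(len(simple_random(population, 20)), len(systematic(population, 20)),
      len(stratified(population, 4, by_major)), len(cluster(population, 2, by_major)))
```

Note the design difference: stratified sampling takes a few units from every group, while cluster sampling takes every unit from a few groups.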

Chapter 3: Summarizing Data with Numbers

acs-household-income.csv

Status: Real data
Source: U.S. Census Bureau, 2022 American Community Survey 1-Year Public Use Microdata Sample. Sampled to represent the national distribution of income and educational attainment across all 50 states.
Rows: 1,200 | Variables: 6

Variable	Type	Description
household_id	Nominal	Unique identifier
state	Nominal	U.S. state
region	Nominal	Geographic region (Northeast/Southeast/Midwest/Southwest/West)
household_income	Continuous	Annual household income in 2022 dollars
education_level	Ordinal	Highest education level attained by head of household (High School or Less, Some College or Associate, Bachelor, Master or Professional, Doctorate)
household_size	Discrete	Number of people in the household

The dataset contains 1,200 household records drawn from the American Community Survey and covers all 50 states. The income distribution exhibits the characteristic right skew of U.S. household income, with education-income gradients and regional variation consistent with published Census summaries.

Chapter 4: Summarizing Data with Pictures

housing-redlining.csv

Status: Real data
Source: FiveThirtyEight analysis of Home Owners’ Loan Corporation (HOLC) neighborhood grades linked to modern U.S. Census demographic data. Originally published as part of FiveThirtyEight’s reporting on the lasting effects of redlining.
Rows: 551 | Variables: 9

Variable	Type	Description
neighborhood_id	Nominal	Unique identifier
metro_area	Nominal	Metropolitan area name
holc_grade	Ordinal	Historical HOLC grade (A = “Best”, B = “Still Desirable”, C = “Definitely Declining”, D = “Hazardous”)
total_population	Discrete	Total population of the neighborhood/tract
pct_white	Continuous	Percentage of white residents
pct_black	Continuous	Percentage of Black residents
pct_hispanic	Continuous	Percentage of Hispanic residents
pct_asian	Continuous	Percentage of Asian residents
pct_minority	Continuous	Percentage of minority (non-white) residents

The data cover 551 neighborhoods across 138 metropolitan areas. The demographic gradient by HOLC grade is clear: neighborhoods graded “A” average 73.8% white, while those graded “D” average only 39.4% white. This pattern illustrates the persistent legacy of 1930s redlining in neighborhood racial composition.

spurious-correlations.csv

Status: Mixed (two pairs from verified public data, one pair simulated)
Inspired by: Tyler Vigen’s Spurious Correlations project (tylervigen.com).
Rows: 10 | Variables: 7

Variable	Type	Description
year	Discrete	Year (2000–2009)
nicolas_cage_films	Discrete	Number of Nicolas Cage film appearances
pool_drownings	Discrete	Number of pool drownings in the U.S.
cheese_consumption_lbs	Continuous	Per capita cheese consumption (lbs)
bedsheet_deaths	Discrete	Deaths by becoming tangled in bedsheets
margarine_consumption_lbs	Continuous	Per capita margarine consumption (lbs)
maine_divorce_rate	Continuous	Maine divorce rate per 1,000 population

Three pairs of variables with high correlations but no causal relationship: Nicolas Cage films vs. pool drownings (r = 0.66), cheese consumption vs. bedsheet deaths (r = 0.95), and margarine consumption vs. Maine divorce rate (r = 0.99). The cheese/bedsheet and margarine/divorce pairs use verified data from USDA and CDC sources. The Cage/drownings pair is simulated to approximate the correlation reported by Vigen, as the original source data is no longer publicly accessible.
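
The r values above are Pearson correlation coefficients. As a minimal sketch, the coefficient can be computed directly from its definition. The function name and the toy series below are illustrative assumptions, not values from spurious-correlations.csv.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up series for illustration: two quantities that both happen to rise over time.
toy_a = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_b = [2.1, 3.9, 6.2, 8.0, 9.8]
print(round(pearson_r(toy_a, toy_b), 3))
```

With only 10 yearly observations per variable, two unrelated series that both trend in the same direction can easily produce a large r, which is the point of the dataset.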

Chapter 5: Probability

covid-testing.csv

Status: Simulated (seed = 50003)
Based on: Published sensitivity and specificity estimates for rapid antigen and PCR tests during the COVID-19 pandemic.
Rows: 2,000 | Variables: 5

Variable	Type	Description
patient_id	Nominal	Unique identifier
truly_infected	Nominal	True infection status (Yes/No, ~5% prevalence)
test_result	Nominal	Test result (Positive/Negative)
age_group	Ordinal	Age group category (18–29/30–44/45–59/60–74/75+)
risk_category	Nominal	Risk category (Low/Medium/High)

Why simulated: Individual-level COVID testing data with verified true infection status is not publicly available. The dataset is generated with sensitivity of approximately 91% and specificity of approximately 95%, consistent with published performance estimates for rapid antigen tests. The 5% prevalence reflects a moderate community transmission scenario.
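
The stated parameters imply a positive predictive value that can be worked out with Bayes’ rule. The sketch below uses only the three approximate figures given above (5% prevalence, 91% sensitivity, 95% specificity); the function name is an assumption, and the simulated file will differ somewhat from these expected values because of sampling noise.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(infected | positive test) via Bayes' rule."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)

ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.91, specificity=0.95)
print(f"P(infected | positive) = {ppv:.3f}")  # 91 / 186, about 0.489

# The same calculation as expected counts among 2,000 patients:
# 100 infected -> 91 true positives; 1,900 not infected -> 95 false positives.
```

Even with a fairly accurate test, fewer than half of the positive results are expected to be true infections at this prevalence.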

Chapter 6: The Normal Distribution and CLT

birth-weights.csv

Status: Real data
Source: OpenIntro births14 dataset, a sample from the 2014 CDC National Vital Statistics natality data.
Rows: 944 | Variables: 6

Variable	Type	Description
baby_id	Nominal	Unique identifier
birth_weight_g	Continuous	Birth weight in grams (mean = 3,266, SD = 593)
gestational_weeks	Continuous	Gestational age at birth in weeks
mother_age	Continuous	Mother’s age in years
prenatal_visits	Discrete	Number of prenatal care visits
smoking_status	Nominal	Whether the mother smoked during pregnancy (Yes/No)

The dataset includes both full-term and preterm births, which is why the standard deviation (593 g) is larger than the approximately 450 g typically reported for full-term-only samples. The left skew (skewness = −1.0) reflects the inclusion of preterm, low-birth-weight infants. Among full-term births only (gestational weeks ≥ 37, n = 829), the distribution is more symmetric, with mean = 3,373 g and SD = 464 g.

Chapter 7: Confidence Intervals

polling-data.csv

Status: Real data
Source: FiveThirtyEight pollster ratings dataset (raw_polls). Polls are from real U.S. election polling conducted by established polling organizations.
Rows: 50 | Variables: 8

Variable	Type	Description
poll_id	Nominal	Unique identifier
pollster	Nominal	Polling organization name (31 unique pollsters)
date	Date	Date the poll was conducted
sample_size	Discrete	Number of respondents (range: 200–22,170)
candidate_a_pct	Continuous	Percentage supporting Candidate A
candidate_b_pct	Continuous	Percentage supporting Candidate B
margin_of_error	Continuous	Margin of error in percentage points (computed from sample_size)
confidence_level	Continuous	Confidence level used (all 95%)

The margins of error in this dataset are computed from the actual sample sizes using the standard formula for a 95% confidence interval for a proportion, allowing students to verify the relationship between sample size and margin of error.
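
As a minimal sketch, the usual large-sample margin of error for a proportion at 95% confidence is z × sqrt(p(1 − p)/n) with z ≈ 1.96. The text does not say whether the dataset plugs in each poll’s observed proportion or the conservative p = 0.5, so the choice of p = 0.5 below is an assumption, as is the function name.

```python
import math

Z_95 = 1.96  # critical value for a 95% confidence level

def margin_of_error_pct(p_hat, n, z=Z_95):
    """Margin of error for a proportion, in percentage points."""
    if n <= 0 or not 0 <= p_hat <= 1:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    return 100 * z * math.sqrt(p_hat * (1 - p_hat) / n)

# Smallest and largest sample sizes in the dataset, plus a typical poll size.
for n in (200, 1000, 22170):
    print(n, round(margin_of_error_pct(0.5, n), 2))
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.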

Chapter 8: Hypothesis Testing

energy-reports.csv

Status: Simulated (seed = 808)
Based on: Allcott, H. (2011). “Social Norms and Energy Conservation.” Journal of Public Economics, 95(9–10), 1082–1095.
Rows: 200 | Variables: 7

Variable	Type	Description
household_id	Nominal	Unique identifier
group	Nominal	Experimental group (Treatment/Control, 100 each)
energy_kwh_before	Continuous	Monthly energy usage before intervention (kWh)
energy_kwh_after	Continuous	Monthly energy usage after intervention (kWh)
home_size_sqft	Continuous	Home size in square feet
num_occupants	Discrete	Number of household occupants
region	Nominal	Geographic region (Northeast/Southeast/Midwest/Southwest/West)

Why simulated: The Allcott (2011) study involved over 600,000 households across multiple utility companies, and the individual-level data is not publicly available. This simulated dataset reproduces the study’s key finding: treatment households (who received home energy reports comparing their usage to neighbors) reduced consumption by approximately 1.6%, while control households showed a slight increase. The treatment effect is statistically significant (p < 0.001).

Chapter 9: Comparing Groups

resume-callbacks.csv

Status: Real data
Source: Bertrand, M. & Mullainathan, S. (2004). “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, 94(4), 991–1013. Data accessed via the OpenIntro R package.
Rows: 4,870 | Variables: 7

Variable	Type	Description
resume_id	Nominal	Unique identifier
name_type	Nominal	Name category (White-sounding/Black-sounding, 2,435 each)
gender	Nominal	Gender (Male/Female)
resume_quality	Ordinal	Résumé quality level (Low/High)
years_experience	Discrete	Years of work experience
education	Nominal	Education level (Bachelor/High School)
callback	Nominal	Whether the applicant received a callback (Yes/No)

This is the complete dataset from the Bertrand and Mullainathan field experiment. The researchers sent 4,870 fictitious résumés to real job postings in Boston and Chicago, randomly assigning either White-sounding or Black-sounding names. The callback rate for White-sounding names (9.65%) was approximately 50% higher than for Black-sounding names (6.45%), providing direct experimental evidence of racial discrimination in hiring.
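
The “approximately 50% higher” figure and the strength of the evidence can be checked from the numbers above. The sketch below reconstructs callback counts from the stated rates and group sizes, then runs a pooled two-proportion z-test. That test is one common choice and an assumption here, not necessarily the method used in the chapter; the variable names are illustrative.

```python
import math

n_white = n_black = 2435                   # group sizes from the variable table
callbacks_white = round(0.0965 * n_white)  # counts implied by the stated rates
callbacks_black = round(0.0645 * n_black)

p_white = callbacks_white / n_white
p_black = callbacks_black / n_black
relative_gap = p_white / p_black - 1       # how much higher the White-sounding rate is

# Pooled two-proportion z-test of H0: equal callback rates.
pooled = (callbacks_white + callbacks_black) / (n_white + n_black)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_white + 1 / n_black))
z = (p_white - p_black) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided, normal approximation

print(f"relative gap = {relative_gap:.1%}, z = {z:.2f}, p = {p_value:.1e}")
```

Because names were randomly assigned to résumés, a difference this large relative to its standard error is hard to attribute to anything other than the name.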

Chapter 10: Simple Linear Regression

education-spending.csv

Status: Simulated, calibrated to patterns from real sources.
Sources of calibration: National Center for Education Statistics (NCES) Digest of Education Statistics (spending, test scores, student-teacher ratios, free lunch eligibility) and U.S. Census Bureau Small Area Income and Poverty Estimates (SAIPE) for median household income and poverty rates. The simulated rows preserve the orders of magnitude and approximate correlations of these real sources but do not reproduce any individual state’s reported figures.
Rows: 50 | Variables: 7

Variable	Type	Description
state	Nominal	U.S. state (all 50 states)
spending_per_student	Continuous	Annual spending per student in constant dollars
avg_score	Continuous	Average NAEP (National Assessment of Educational Progress) mathematics score
median_household_income	Continuous	State median household income
pct_poverty	Continuous	State poverty rate (percentage)
student_teacher_ratio	Continuous	Students per teacher
pct_free_lunch	Continuous	Percentage of students eligible for free/reduced-price lunch

Each row represents one U.S. state. The correlation between per-student spending and average NAEP math score is r = 0.43, indicating a moderate positive association. Additional variables allow students to explore confounders and build toward multiple regression concepts in Chapter 11.

Chapter 11: Multiple Regression

wage-gap.csv

Status: Real data
Source: Current Population Survey (CPS) May 1985, distributed in the AER (Applied Econometrics with R) package as CPS1985, originally from Berndt’s The Practice of Econometrics (1991). This is a widely used teaching dataset in labor economics.
Rows: 534 | Variables: 10

Variable	Type	Description
employee_id	Nominal	Unique identifier
salary	Continuous	Full-time-equivalent annual salary in 1985 dollars (computed as hourly wage × 2,000 hours/year)
gender	Nominal	Gender (Male/Female) — recorded as binary in the original CPS instrument
years_experience	Continuous	Years of work experience
education_years	Continuous	Years of education
occupation	Nominal	Occupation category (Worker/Technical/Services/Office/Sales/Management)
region	Nominal	Geographic region (South/Other)
union_member	Nominal	Union membership status (Yes/No)
married	Nominal	Marital status (Yes/No)
ethnicity	Nominal	Ethnicity (cauc / hispanic / other, as coded in the source)

Note on construction: The CPS1985 source records wage as an hourly rate. To support the chapter’s narrative in dollar-per-year terms (so coefficients read as “dollars more per year”), the CSV reports salary = wage × 2,000, a common full-time benchmark of 50 weeks × 40 hours. All chapter coefficients reconcile exactly with this transformation: the raw Female coefficient is approximately −$4,233/year (R² ≈ 0.042), narrowing to −$3,930/year after controlling for experience, education, and occupation (R² ≈ 0.302). The progression from a simple bivariate regression to a multiple regression that accounts for confounders is the pedagogical core of the chapter, illustrating both controlling-for logic and the limits of “controlling away” discrimination.
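
Multiplying the outcome by a constant multiplies every regression coefficient by the same constant and leaves R² unchanged, so the annual coefficients above convert back to hourly terms by dividing by 2,000. A minimal sketch of that conversion follows; the function names are illustrative assumptions.

```python
HOURS_PER_YEAR = 50 * 40  # 2,000 hours, the benchmark used to build the salary column

def annual_salary(hourly_wage):
    """Convert an hourly wage to the full-time-equivalent salary used in the CSV."""
    return hourly_wage * HOURS_PER_YEAR

def hourly_coefficient(annual_coefficient):
    """Rescale a dollars-per-year coefficient back to dollars per hour."""
    return annual_coefficient / HOURS_PER_YEAR

print(annual_salary(10.0))        # a $10/hour wage becomes $20,000/year
print(hourly_coefficient(-4233))  # raw gap in hourly terms
print(hourly_coefficient(-3930))  # adjusted gap in hourly terms
```

In hourly terms, the raw gap of roughly $2.12 narrows to roughly $1.97 after the controls are added.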