Appendix B: Dataset Descriptions

All datasets used in the book are available for download from the companion website. Each is provided as a CSV file. Where publicly available, we use real data from the original studies or authoritative government sources. Where real microdata is unavailable or the study design requires it, we use simulated data calibrated to published parameters. Each dataset’s status is clearly marked below.

Chapter 1: Why Statistics Matters Now

flint-water-lead.csv

Status: Real data
Source: Virginia Tech Flint Water Study. Data collected by Marc Edwards’s research team during the Flint water crisis and made publicly available through the study’s GitHub repository.
Rows: 271 | Variables: 5

Variable | Type | Description
sample_id | Nominal | Unique identifier for each sample
zip_code | Nominal | ZIP code of the home
ward | Nominal | Ward (geographic district) within Flint (0–9)
lead_ppb | Continuous | Lead concentration in parts per billion
notes | Nominal | Flags for special cases (e.g., homes sampled twice)

No missing values. The dataset contains first-draw water samples from 271 Flint homes across 10 wards. Lead levels range from 0.34 to 158 ppb, with a median of 3.5 ppb and a mean of 10.65 ppb. Approximately 16.6% of homes exceeded the EPA action level of 15 ppb, consistent with the crisis-level contamination documented by the Virginia Tech team.

Chapter 2: Research Design

sampling-demo.csv

Status: Simulated (seed = 202)
Based on: Fictional small university with realistic demographic and academic distributions. Designed for practicing sampling methods (simple random, stratified, cluster, systematic).
Rows: 200 | Variables: 7

Variable | Type | Description
student_id | Nominal | Unique identifier (S001–S200)
major | Nominal | Academic major (Business/Engineering/Liberal Arts/Nursing/Sciences)
gpa | Continuous | Grade point average (0–4 scale)
year | Ordinal | Class year (Freshman/Sophomore/Junior/Senior)
commute_distance_miles | Continuous | Distance of commute in miles (right-skewed)
housing | Nominal | Housing type (On-campus/Off-campus)
works_part_time | Nominal | Whether the student works part-time (Yes/No)

Why simulated: The scenario is fictional by design. No specific real university study is being replicated; the dataset is constructed to provide a realistic population for demonstrating sampling techniques.
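The four sampling methods this dataset is designed for can be sketched with Python's standard library. This is a minimal illustration rather than the book's own code: the population is a hypothetical stand-in for the student_id and major columns (majors are assigned in a fixed rotation, which the real file does not do), and the seed, sample sizes, and function names are assumptions chosen for the example.

```python
import random

MAJORS = ["Business", "Engineering", "Liberal Arts", "Nursing", "Sciences"]

# Hypothetical stand-in population: 200 students, majors assigned in rotation.
population = [
    {"student_id": f"S{i:03d}", "major": MAJORS[i % len(MAJORS)]}
    for i in range(1, 201)
]

rng = random.Random(0)  # arbitrary seed for reproducibility (an assumption)


def simple_random(pop, n):
    """Every student has the same chance of being selected."""
    return rng.sample(pop, n)


def systematic(pop, n):
    """Pick a random starting point, then take every k-th student."""
    k = len(pop) // n
    start = rng.randrange(k)
    return pop[start::k][:n]


def stratified(pop, per_stratum):
    """Draw a simple random sample separately within each major."""
    sample = []
    for major in MAJORS:
        stratum = [s for s in pop if s["major"] == major]
        sample.extend(rng.sample(stratum, per_stratum))
    return sample


def cluster(pop, n_clusters):
    """Randomly choose whole majors and keep every student in them."""
    chosen = rng.sample(MAJORS, n_clusters)
    return [s for s in pop if s["major"] in chosen]


srs = simple_random(population, 20)
sys_sample = systematic(population, 20)
strat = stratified(population, 4)
clus = cluster(population, 2)
```

Treating majors as clusters is only a convenience here; in practice clusters are usually naturally occurring groups such as residence halls or class sections.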

Chapter 3: Summarizing Data with Numbers

acs-household-income.csv

Status: Real data
Source: U.S. Census Bureau, 2022 American Community Survey 1-Year Public Use Microdata Sample. Sampled to represent the national distribution of income and educational attainment across all 50 states.
Rows: 1,200 | Variables: 6

Variable | Type | Description
household_id | Nominal | Unique identifier
state | Nominal | U.S. state
region | Nominal | Geographic region (Northeast/Southeast/Midwest/Southwest/West)
household_income | Continuous | Annual household income in 2022 dollars
education_level | Ordinal | Highest education level attained by head of household (High School or Less, Some College or Associate, Bachelor, Master or Professional, Doctorate)
household_size | Discrete | Number of people in the household

The dataset contains 1,200 household records drawn from the American Community Survey, representing all 50 states. The income distribution exhibits the characteristic right skew of U.S. household income, with education-income gradients and regional variation consistent with published Census summaries.

Chapter 4: Summarizing Data with Pictures

housing-redlining.csv

Status: Real data
Source: FiveThirtyEight analysis of Home Owners’ Loan Corporation (HOLC) neighborhood grades linked to modern U.S. Census demographic data. Originally published as part of FiveThirtyEight’s reporting on the lasting effects of redlining.
Rows: 551 | Variables: 9

Variable | Type | Description
neighborhood_id | Nominal | Unique identifier
metro_area | Nominal | Metropolitan area name
holc_grade | Ordinal | Historical HOLC grade (A = “Best”, B = “Still Desirable”, C = “Definitely Declining”, D = “Hazardous”)
total_population | Discrete | Total population of the neighborhood/tract
pct_white | Continuous | Percentage of white residents
pct_black | Continuous | Percentage of Black residents
pct_hispanic | Continuous | Percentage of Hispanic residents
pct_asian | Continuous | Percentage of Asian residents
pct_minority | Continuous | Percentage of minority (non-white) residents

The data covers 551 neighborhoods across 138 metropolitan areas. The clear demographic gradient by HOLC grade—neighborhoods graded “A” average 73.8% white, while those graded “D” average only 39.4% white—demonstrates the persistent legacy of 1930s redlining on neighborhood racial composition.

spurious-correlations.csv

Status: Mixed (two pairs from verified public data, one pair simulated)
Inspired by: Tyler Vigen’s Spurious Correlations project (tylervigen.com).
Rows: 10 | Variables: 7

Variable | Type | Description
year | Discrete | Year (2000–2009)
nicolas_cage_films | Discrete | Number of Nicolas Cage film appearances
pool_drownings | Discrete | Number of pool drownings in the U.S.
cheese_consumption_lbs | Continuous | Per capita cheese consumption (lbs)
bedsheet_deaths | Discrete | Deaths by becoming tangled in bedsheets
margarine_consumption_lbs | Continuous | Per capita margarine consumption (lbs)
maine_divorce_rate | Continuous | Maine divorce rate per 1,000 population

The dataset contains three pairs of variables that are highly correlated but have no causal relationship: Nicolas Cage films vs. pool drownings (r = 0.66), cheese consumption vs. bedsheet deaths (r = 0.95), and margarine consumption vs. Maine divorce rate (r = 0.99). The cheese/bedsheet and margarine/divorce pairs use verified data from USDA and CDC sources. The Cage/drownings pair is simulated to approximate the correlation reported by Vigen, because the original source data is no longer publicly accessible.

Chapter 5: Probability

covid-testing.csv

Status: Simulated (seed = 50003)
Based on: Published sensitivity and specificity estimates for rapid antigen and PCR tests during the COVID-19 pandemic.
Rows: 2,000 | Variables: 5

Variable | Type | Description
patient_id | Nominal | Unique identifier
truly_infected | Nominal | True infection status (Yes/No, ~5% prevalence)
test_result | Nominal | Test result (Positive/Negative)
age_group | Ordinal | Age group category (18–29/30–44/45–59/60–74/75+)
risk_category | Nominal | Risk category (Low/Medium/High)

Why simulated: Individual-level COVID testing data with verified true infection status is not publicly available. The dataset is generated with sensitivity of approximately 91% and specificity of approximately 95%, consistent with published performance estimates for rapid antigen tests. The 5% prevalence reflects a moderate community transmission scenario.
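The three generating parameters above determine how much a positive result should be trusted. The sketch below applies Bayes' theorem to the stated design values (5% prevalence, 91% sensitivity, 95% specificity); the function name is our own, and rates computed from the simulated file itself will differ slightly because of sampling variation.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(truly infected | positive test), computed with Bayes' theorem."""
    true_positive = prevalence * sensitivity
    false_positive = (1 - prevalence) * (1 - specificity)
    return true_positive / (true_positive + false_positive)


# Design parameters quoted in the dataset description.
ppv = positive_predictive_value(prevalence=0.05, sensitivity=0.91, specificity=0.95)
print(f"PPV = {ppv:.3f}")
```

With these inputs, slightly fewer than half of positive results come from truly infected patients, even though the test is right more than 90% of the time in each direction.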

Chapter 6: The Normal Distribution and CLT

birth-weights.csv

Status: Real data
Source: OpenIntro births14 dataset, a sample from the 2014 CDC National Vital Statistics natality data.
Rows: 944 | Variables: 6

Variable | Type | Description
baby_id | Nominal | Unique identifier
birth_weight_g | Continuous | Birth weight in grams (mean = 3,266, SD = 593)
gestational_weeks | Continuous | Gestational age at birth in weeks
mother_age | Continuous | Mother’s age in years
prenatal_visits | Discrete | Number of prenatal care visits
smoking_status | Nominal | Whether the mother smoked during pregnancy (Yes/No)

The dataset includes both full-term and preterm births, which is why the standard deviation (593 g) is larger than the approximately 450 g typically reported for full-term-only samples. The left skew (skewness = −1.0) reflects the inclusion of preterm, low-birth-weight infants. Among full-term births only (gestational weeks ≥ 37, n = 829), the distribution is more symmetric, with mean = 3,373 g and SD = 464 g.
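As a rough illustration of how the full-term summary statistics feed a normal model, the sketch below uses Python's statistics.NormalDist with the mean and SD quoted above. The 2,500 g cutoff is the conventional low-birth-weight threshold and is our assumption; it does not appear in the dataset documentation. The result is a model-based approximation, not a count from the file.

```python
from statistics import NormalDist

# Full-term mean and SD quoted in the dataset description.
full_term = NormalDist(mu=3373, sigma=464)

# Conventional low-birth-weight cutoff in grams (an assumption for this example).
LOW_BIRTH_WEIGHT_G = 2500

# Standardize the cutoff, then find the share of the model below it.
z = (LOW_BIRTH_WEIGHT_G - full_term.mean) / full_term.stdev
p_below = full_term.cdf(LOW_BIRTH_WEIGHT_G)

print(f"z = {z:.2f}, P(weight < {LOW_BIRTH_WEIGHT_G} g) = {p_below:.3f}")
```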

Chapter 7: Confidence Intervals

polling-data.csv

Status: Real data
Source: FiveThirtyEight pollster ratings dataset (raw_polls). Polls are from real U.S. election polling conducted by established polling organizations.
Rows: 50 | Variables: 8

Variable | Type | Description
poll_id | Nominal | Unique identifier
pollster | Nominal | Polling organization name (31 unique pollsters)
date | Date | Date the poll was conducted
sample_size | Discrete | Number of respondents (range: 200–22,170)
candidate_a_pct | Continuous | Percentage supporting Candidate A
candidate_b_pct | Continuous | Percentage supporting Candidate B
margin_of_error | Continuous | Reported margin of error (percentage points)
confidence_level | Continuous | Confidence level used (all 95%)

The margins of error in this dataset are computed from the actual sample sizes using the standard formula for a 95% confidence interval for a proportion, allowing students to verify the relationship between sample size and margin of error.
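The standard formula is MOE = z × sqrt(p(1 − p)/n), with z ≈ 1.96 for 95% confidence. The sketch below is one way to check the relationship. It assumes the conservative choice p = 0.5, because the description does not say whether the file uses 0.5 or each poll's observed share; pass a different p if the file's values do not match. The sample sizes 200 and 22,170 are the endpoints of the range listed in the table, and 1,000 is an arbitrary value in between.

```python
import math

Z_95 = 1.96  # approximate critical value for 95% confidence


def margin_of_error(n, p=0.5, z=Z_95):
    """Margin of error for a sample proportion, in percentage points."""
    return 100 * z * math.sqrt(p * (1 - p) / n)


# Smallest and largest sample sizes in the dataset, plus one in between.
for n in (200, 1000, 22170):
    print(f"n = {n:>6}: +/- {margin_of_error(n):.2f} points")
```

Because n sits under a square root, quadrupling the sample size only halves the margin of error.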

Chapter 8: Hypothesis Testing

energy-reports.csv

Status: Simulated (seed = 808)
Based on: Allcott, H. (2011). “Social Norms and Energy Conservation.” Journal of Public Economics, 95(9–10), 1082–1095.
Rows: 200 | Variables: 7

Variable | Type | Description
household_id | Nominal | Unique identifier
group | Nominal | Experimental group (Treatment/Control, 100 each)
energy_kwh_before | Continuous | Monthly energy usage before intervention (kWh)
energy_kwh_after | Continuous | Monthly energy usage after intervention (kWh)
home_size_sqft | Continuous | Home size in square feet
num_occupants | Discrete | Number of household occupants
region | Nominal | Geographic region (Northeast/Southeast/Midwest/Southwest/West)

Why simulated: The Allcott (2011) study involved over 600,000 households across multiple utility companies, and the individual-level data is not publicly available. This simulated dataset reproduces the study’s key finding: treatment households (who received home energy reports comparing their usage to neighbors) reduced consumption by approximately 1.6%, while control households showed a slight increase. The treatment effect is statistically significant (p < 0.001).

Chapter 9: Comparing Groups

resume-callbacks.csv

Status: Real data
Source: Bertrand, M. & Mullainathan, S. (2004). “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, 94(4), 991–1013. Data accessed via the OpenIntro R package.
Rows: 4,870 | Variables: 7

Variable | Type | Description
resume_id | Nominal | Unique identifier
name_type | Nominal | Name category (White-sounding/Black-sounding, 2,435 each)
gender | Nominal | Gender (Male/Female)
resume_quality | Ordinal | Résumé quality level (Low/High)
years_experience | Discrete | Years of work experience
education | Nominal | Education level (Bachelor/High School)
callback | Nominal | Whether the applicant received a callback (Yes/No)

This is the complete dataset from the Bertrand and Mullainathan field experiment. The researchers sent 4,870 fictitious résumés to real job postings in Boston and Chicago, randomly assigning either White-sounding or Black-sounding names. The callback rate for White-sounding names (9.65%) was approximately 50% higher than for Black-sounding names (6.45%), providing direct experimental evidence of racial discrimination in hiring.
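The comparison above can be checked from the summary figures alone. The sketch below uses the rounded callback rates and group sizes quoted in this entry, so its results are approximate. The pooled two-proportion z-test is one standard way to compare the groups and is shown as an illustration, not necessarily the procedure used in the chapter.

```python
import math
from statistics import NormalDist

n_white = n_black = 2435            # résumés per name type (from the variable table)
p_white, p_black = 0.0965, 0.0645   # rounded callback rates quoted above

diff = p_white - p_black            # absolute gap in callback rates
ratio = p_white / p_black           # relative gap ("about 50% higher")

# Pooled callback rate under the null hypothesis of equal rates.
pooled = (p_white * n_white + p_black * n_black) / (n_white + n_black)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_white + 1 / n_black))

z = diff / se
p_value = 2 * (1 - NormalDist().cdf(z))  # two-sided p-value

print(f"difference = {diff:.3f}, ratio = {ratio:.2f}, z = {z:.2f}, p = {p_value:.5f}")
```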

Chapter 10: Simple Linear Regression

education-spending.csv

Status: Simulated, calibrated to patterns from real sources.
Sources of calibration: National Center for Education Statistics (NCES) Digest of Education Statistics (spending, test scores, student-teacher ratios, free lunch eligibility) and U.S. Census Bureau Small Area Income and Poverty Estimates (SAIPE) for median household income and poverty rates. The simulated rows preserve the orders of magnitude and approximate correlations of these real sources but do not reproduce any individual state’s reported figures.
Rows: 50 | Variables: 7

Variable | Type | Description
state | Nominal | U.S. state (all 50 states)
spending_per_student | Continuous | Annual spending per student in constant dollars
avg_score | Continuous | Average NAEP (National Assessment of Educational Progress) mathematics score
median_household_income | Continuous | State median household income
pct_poverty | Continuous | State poverty rate (percentage)
student_teacher_ratio | Continuous | Students per teacher
pct_free_lunch | Continuous | Percentage of students eligible for free/reduced-price lunch

Each row represents one U.S. state. The correlation between per-student spending and average NAEP math score is r = 0.43, indicating a moderate positive association. Additional variables allow students to explore confounders and build toward multiple regression concepts in Chapter 11.

Chapter 11: Multiple Regression

wage-gap.csv

Status: Real data
Source: Current Population Survey (CPS) May 1985, distributed in the AER (Applied Econometrics with R) package as CPS1985, originally from Berndt’s The Practice of Econometrics (1991). This is a widely used teaching dataset in labor economics.
Rows: 534 | Variables: 10

Variable | Type | Description
employee_id | Nominal | Unique identifier
salary | Continuous | Full-time-equivalent annual salary in 1985 dollars (computed as hourly wage × 2,000 hours/year)
gender | Nominal | Gender (Male/Female); recorded as binary in the original CPS instrument
years_experience | Continuous | Years of work experience
education_years | Continuous | Years of education
occupation | Nominal | Occupation category (Worker/Technical/Services/Office/Sales/Management)
region | Nominal | Geographic region (South/Other)
union_member | Nominal | Union membership status (Yes/No)
married | Nominal | Marital status (Yes/No)
ethnicity | Nominal | Ethnicity (cauc / hispanic / other, as coded in the source)

Note on construction: The CPS1985 source records wage as an hourly rate. To support the chapter’s narrative in dollar-per-year terms (so coefficients read as “dollars more per year”), the CSV reports salary = wage × 2,000, the standard 50-week × 40-hour full-time benchmark. All chapter coefficients reconcile exactly with this transformation: the raw Female coefficient is approximately −$4,233/year (R² ≈ 0.042), narrowing to −$3,930/year after controlling for experience, education, and occupation (R² ≈ 0.302). The progression — from a simple bivariate regression to a multiple regression that accounts for confounders — is the pedagogical core of the chapter, illustrating both controlling-for logic and the limits of “controlling away” discrimination.
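The effect of the wage-to-salary transformation on regression output can be seen in a small sketch. The eight wages below are hypothetical values chosen for illustration, not rows from CPS1985, and simple_ols is our own helper. Multiplying the response by 2,000 multiplies the slope by 2,000 and leaves R² unchanged.

```python
from statistics import mean

HOURS_PER_YEAR = 50 * 40  # the 2,000-hour full-time benchmark described above

# Hypothetical hourly wages (NOT values from CPS1985) with a 0/1 female indicator.
wage = [12.0, 9.5, 10.0, 7.0, 8.5, 6.0, 11.0, 7.5]
female = [0, 0, 0, 1, 1, 1, 0, 1]


def simple_ols(x, y):
    """Return (slope, r_squared) for a least-squares fit of y on a single x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sxx, sxy ** 2 / (sxx * syy)


salary = [w * HOURS_PER_YEAR for w in wage]

slope_wage, r2_wage = simple_ols(female, wage)        # gap in dollars per hour
slope_salary, r2_salary = simple_ols(female, salary)  # gap in dollars per year

print(f"hourly gap = {slope_wage:.3f}, annual gap = {slope_salary:.0f}")
print(f"R-squared: {r2_wage:.3f} (hourly) vs {r2_salary:.3f} (annual)")
```

By the same logic, the raw coefficient of about −$4,233 per year corresponds to about −$2.12 per hour on the original wage scale.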