Appendix B: Dataset Descriptions
All datasets used in the book are available for download from the companion website. Each is provided as a CSV file. Where publicly available, we use real data from the original studies or authoritative government sources. Where real microdata is unavailable or the study design requires it, we use simulated data calibrated to published parameters. Each dataset’s status is clearly marked below.
Chapter 1: Why Statistics Matters Now
flint-water-lead.csv
Status: Real data
Source: Virginia Tech Flint Water Study. Data collected by Marc Edwards’s research team during the Flint water crisis and made publicly available through the study’s GitHub repository.
Rows: 271 | Variables: 5
| Variable | Type | Description |
|---|---|---|
| sample_id | Nominal | Unique identifier for each sample |
| zip_code | Nominal | ZIP code of the home |
| ward | Nominal | Ward (geographic district) within Flint (0–9) |
| lead_ppb | Continuous | Lead concentration in parts per billion |
| notes | Nominal | Flags for special cases (e.g., homes sampled twice) |
No missing values. The dataset contains first-draw water samples from 271 Flint homes across 10 wards. Lead levels range from 0.34 to 158 ppb, with a median of 3.5 ppb and a mean of 10.65 ppb. Approximately 16.6% of homes exceeded the EPA action level of 15 ppb, consistent with the crisis-level contamination documented by the Virginia Tech team.
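The summary figures above (median, mean, and share of homes above the action level) can be recomputed from the `lead_ppb` column. The following is a minimal sketch using only the Python standard library; the readings in `example_readings` are hypothetical values chosen for illustration, not rows from the file, and the function name `summarize_lead` is our own.

```python
from statistics import mean, median

EPA_ACTION_LEVEL_PPB = 15.0  # EPA lead action level cited in the text


def summarize_lead(lead_ppb):
    """Return median, mean, and the share of samples above the action level."""
    over = sum(1 for value in lead_ppb if value > EPA_ACTION_LEVEL_PPB)
    return {
        "median": median(lead_ppb),
        "mean": mean(lead_ppb),
        "share_over_action_level": over / len(lead_ppb),
    }


# Hypothetical illustrative readings, NOT rows from flint-water-lead.csv.
example_readings = [0.5, 2.0, 3.5, 6.0, 40.0]
summary = summarize_lead(example_readings)
print(summary)
```

Note how a single large reading pulls the mean well above the median, the same pattern the real data shows.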
Chapter 2: Research Design
sampling-demo.csv
Status: Simulated (seed = 202)
Based on: Fictional small university with realistic demographic and academic distributions. Designed for practicing sampling methods (simple random, stratified, cluster, systematic).
Rows: 200 | Variables: 7
| Variable | Type | Description |
|---|---|---|
| student_id | Nominal | Unique identifier (S001–S200) |
| major | Nominal | Academic major (Business/Engineering/Liberal Arts/Nursing/Sciences) |
| gpa | Continuous | Grade point average (0–4 scale) |
| year | Ordinal | Class year (Freshman/Sophomore/Junior/Senior) |
| commute_distance_miles | Continuous | Distance of commute in miles (right-skewed) |
| housing | Nominal | Housing type (On-campus/Off-campus) |
| works_part_time | Nominal | Whether the student works part-time (Yes/No) |
Why simulated: The scenario is fictional by design. No specific real university study is being replicated; the dataset is constructed to provide a realistic population for demonstrating sampling techniques.
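The four sampling methods named above can be sketched in a few lines each. This is an illustrative sketch, not the code used to build the dataset: the `population` list is a hypothetical stand-in with 40 students per major, the seed is arbitrary, and treating majors as clusters is an assumption made only to keep the example short.

```python
import random

random.seed(202)  # arbitrary seed so this sketch is reproducible

# Hypothetical stand-in population: (student_id, major) pairs.
majors = ["Business", "Engineering", "Liberal Arts", "Nursing", "Sciences"]
population = [(f"S{i:03d}", majors[i % 5]) for i in range(1, 201)]


def simple_random_sample(units, n):
    """Every subset of size n is equally likely."""
    return random.sample(units, n)


def systematic_sample(units, n):
    """Pick a random start, then take every k-th unit."""
    k = len(units) // n
    start = random.randrange(k)
    return units[start::k][:n]


def stratified_sample(units, per_stratum):
    """Draw a simple random sample within each major (the strata)."""
    sample = []
    for major in majors:
        stratum = [u for u in units if u[1] == major]
        sample.extend(random.sample(stratum, per_stratum))
    return sample


def cluster_sample(units, n_clusters):
    """Randomly choose whole majors and keep every student in them."""
    chosen = set(random.sample(majors, n_clusters))
    return [u for u in units if u[1] in chosen]
```

Stratified sampling fixes the number drawn from every major, while cluster sampling keeps or drops each major as a whole, so the two designs give very different sample compositions.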
Chapter 3: Summarizing Data with Numbers
acs-household-income.csv
Status: Real data
Source: U.S. Census Bureau, 2022 American Community Survey 1-Year Public Use Microdata Sample. Sampled to represent the national distribution of income and educational attainment across all 50 states.
Rows: 1,200 | Variables: 6
| Variable | Type | Description |
|---|---|---|
| household_id | Nominal | Unique identifier |
| state | Nominal | U.S. state |
| region | Nominal | Geographic region (Northeast/Southeast/Midwest/Southwest/West) |
| household_income | Continuous | Annual household income in 2022 dollars |
| education_level | Ordinal | Highest education level attained by head of household (High School or Less, Some College or Associate, Bachelor, Master or Professional, Doctorate) |
| household_size | Discrete | Number of people in the household |
The dataset contains 1,200 household records drawn from the American Community Survey, covering all 50 states. The income distribution exhibits the characteristic right skew of U.S. household income, with education-income gradients and regional variation consistent with published Census summaries.
Chapter 4: Summarizing Data with Pictures
housing-redlining.csv
Status: Real data
Source: FiveThirtyEight analysis of Home Owners’ Loan Corporation (HOLC) neighborhood grades linked to modern U.S. Census demographic data. Originally published as part of FiveThirtyEight’s reporting on the lasting effects of redlining.
Rows: 551 | Variables: 9
| Variable | Type | Description |
|---|---|---|
| neighborhood_id | Nominal | Unique identifier |
| metro_area | Nominal | Metropolitan area name |
| holc_grade | Ordinal | Historical HOLC grade (A = “Best”, B = “Still Desirable”, C = “Definitely Declining”, D = “Hazardous”) |
| total_population | Discrete | Total population of the neighborhood/tract |
| pct_white | Continuous | Percentage of white residents |
| pct_black | Continuous | Percentage of Black residents |
| pct_hispanic | Continuous | Percentage of Hispanic residents |
| pct_asian | Continuous | Percentage of Asian residents |
| pct_minority | Continuous | Percentage of minority (non-white) residents |
The data covers 551 neighborhoods across 138 metropolitan areas. The clear demographic gradient by HOLC grade—neighborhoods graded “A” average 73.8% white, while those graded “D” average only 39.4% white—demonstrates the persistent legacy of 1930s redlining on neighborhood racial composition.
spurious-correlations.csv
Status: Mixed (two pairs from verified public data, one pair simulated)
Inspired by: Tyler Vigen’s Spurious Correlations project (tylervigen.com).
Rows: 10 | Variables: 7
| Variable | Type | Description |
|---|---|---|
| year | Discrete | Year (2000–2009) |
| nicolas_cage_films | Discrete | Number of Nicolas Cage film appearances |
| pool_drownings | Discrete | Number of pool drownings in the U.S. |
| cheese_consumption_lbs | Continuous | Per capita cheese consumption (lbs) |
| bedsheet_deaths | Discrete | Deaths by becoming tangled in bedsheets |
| margarine_consumption_lbs | Continuous | Per capita margarine consumption (lbs) |
| maine_divorce_rate | Continuous | Maine divorce rate per 1,000 population |
The dataset contains three pairs of variables that are correlated but have no causal relationship: Nicolas Cage films vs. pool drownings (r = 0.66), cheese consumption vs. bedsheet deaths (r = 0.95), and margarine consumption vs. Maine divorce rate (r = 0.99). The cheese/bedsheet and margarine/divorce pairs use verified data from USDA and CDC sources. The Cage/drownings pair is simulated to approximate the correlation reported by Vigen, as the original source data is no longer publicly accessible.
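The r values quoted above are Pearson correlation coefficients. A minimal sketch of that calculation follows; the two short series are hypothetical values for illustration, not columns from the CSV, and `pearson_r` is our own helper name.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values for illustration only, NOT taken from the CSV.
series_a = [1, 2, 3, 4, 5]
series_b = [2, 4, 5, 4, 5]
r = pearson_r(series_a, series_b)
print(round(r, 3))
```

With only 10 yearly observations per variable, two unrelated series that both trend over the decade can easily produce a large r, which is the point of the dataset.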
Chapter 5: Probability
covid-testing.csv
Status: Simulated (seed = 50003)
Based on: Published sensitivity and specificity estimates for rapid antigen and PCR tests during the COVID-19 pandemic.
Rows: 2,000 | Variables: 5
| Variable | Type | Description |
|---|---|---|
| patient_id | Nominal | Unique identifier |
| truly_infected | Nominal | True infection status (Yes/No, ~5% prevalence) |
| test_result | Nominal | Test result (Positive/Negative) |
| age_group | Ordinal | Age group category (18–29/30–44/45–59/60–74/75+) |
| risk_category | Nominal | Risk category (Low/Medium/High) |
Why simulated: Individual-level COVID testing data with verified true infection status is not publicly available. The dataset is generated with sensitivity of approximately 91% and specificity of approximately 95%, consistent with published performance estimates for rapid antigen tests. The 5% prevalence reflects a moderate community transmission scenario.
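The three design parameters above (sensitivity, specificity, prevalence) determine the probability that a positive test reflects a true infection. The sketch below applies Bayes' rule to the approximate values stated in the text; the function name is our own, and the observed proportion in the simulated file will differ somewhat because of sampling variation.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(infected | positive test) via Bayes' rule."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Approximate design values stated in the dataset description.
ppv = positive_predictive_value(sensitivity=0.91, specificity=0.95, prevalence=0.05)
print(f"P(infected | positive) = {ppv:.3f}")
```

Under these parameters, roughly half of all positive results are false positives, even though the test is correct for the large majority of individuals.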
Chapter 6: The Normal Distribution and CLT
birth-weights.csv
Status: Real data
Source: OpenIntro births14 dataset, a sample from the 2014 CDC National Vital Statistics natality data.
Rows: 944 | Variables: 6
| Variable | Type | Description |
|---|---|---|
| baby_id | Nominal | Unique identifier |
| birth_weight_g | Continuous | Birth weight in grams (mean = 3,266, SD = 593) |
| gestational_weeks | Continuous | Gestational age at birth in weeks |
| mother_age | Continuous | Mother’s age in years |
| prenatal_visits | Discrete | Number of prenatal care visits |
| smoking_status | Nominal | Whether the mother smoked during pregnancy (Yes/No) |
The dataset includes both full-term and preterm births, which is why the standard deviation (593 g) is larger than the approximately 450 g typically reported for full-term-only samples. The left skew (skewness = -1.0) reflects the inclusion of preterm, low-birth-weight infants. Among full-term births only (gestational weeks ≥ 37, n = 829), the distribution is more symmetric, with mean = 3,373 g and SD = 464 g.
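To see why a tail of low birth weights produces negative skewness, it helps to compute the statistic on a toy example. The sketch below uses the moment-based (population) form of skewness; this is an assumption, since the estimator used for the figure quoted above is not stated and other software may apply a small-sample correction. Both weight lists are hypothetical.

```python
from statistics import mean, pstdev


def sample_skewness(values):
    """Moment-based skewness: the average cubed z-score (population form)."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Hypothetical birth weights in grams, for illustration only.
symmetric = [2800, 3100, 3400, 3700, 4000]
with_low_tail = [900, 3100, 3400, 3700, 4000]  # one very low weight

print(sample_skewness(symmetric))      # zero for a symmetric list
print(sample_skewness(with_low_tail))  # negative: the tail points left
```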
Chapter 7: Confidence Intervals
polling-data.csv
Status: Real data
Source: FiveThirtyEight pollster ratings dataset (raw_polls). Polls are from real U.S. election polling conducted by established polling organizations.
Rows: 50 | Variables: 8
| Variable | Type | Description |
|---|---|---|
| poll_id | Nominal | Unique identifier |
| pollster | Nominal | Polling organization name (31 unique pollsters) |
| date | Date | Date the poll was conducted |
| sample_size | Discrete | Number of respondents (range: 200–22,170) |
| candidate_a_pct | Continuous | Percentage supporting Candidate A |
| candidate_b_pct | Continuous | Percentage supporting Candidate B |
| margin_of_error | Continuous | Reported margin of error (percentage points) |
| confidence_level | Continuous | Confidence level used (all 95%) |
The margins of error in this dataset are computed from the actual sample sizes using the standard formula for a 95% confidence interval for a proportion, allowing students to verify the relationship between sample size and margin of error.
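The relationship between sample size and margin of error can be checked with a short function. This sketch assumes the conservative form of the formula, with the proportion set to 0.5 and z = 1.96; whether the file's `margin_of_error` column uses 0.5 or each poll's observed proportion is not stated here, so treat that default as an assumption.

```python
from math import sqrt

Z_95 = 1.96  # critical value for a 95% confidence level


def margin_of_error(sample_size, proportion=0.5, z=Z_95):
    """Margin of error, in percentage points, for a sample proportion."""
    return 100 * z * sqrt(proportion * (1 - proportion) / sample_size)


# Smallest and largest sample sizes in the dataset, plus a typical poll size.
for n in (200, 1000, 22170):
    print(n, round(margin_of_error(n), 2))
```

Because the sample size sits under a square root, quadrupling the number of respondents only halves the margin of error.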
Chapter 8: Hypothesis Testing
energy-reports.csv
Status: Simulated (seed = 808)
Based on: Allcott, H. (2011). “Social Norms and Energy Conservation.” Journal of Public Economics, 95(9–10), 1082–1095.
Rows: 200 | Variables: 7
| Variable | Type | Description |
|---|---|---|
| household_id | Nominal | Unique identifier |
| group | Nominal | Experimental group (Treatment/Control, 100 each) |
| energy_kwh_before | Continuous | Monthly energy usage before intervention (kWh) |
| energy_kwh_after | Continuous | Monthly energy usage after intervention (kWh) |
| home_size_sqft | Continuous | Home size in square feet |
| num_occupants | Discrete | Number of household occupants |
| region | Nominal | Geographic region (Northeast/Southeast/Midwest/Southwest/West) |
Why simulated: The Allcott (2011) study involved over 600,000 households across multiple utility companies, and the individual-level data is not publicly available. This simulated dataset reproduces the study’s key finding: treatment households (who received home energy reports comparing their usage to neighbors) reduced consumption by approximately 1.6%, while control households showed a slight increase. The treatment effect is statistically significant (p < 0.001).
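One way to test the treatment effect described above is to compute each household's percent change in usage and compare the two groups with a two-sample t statistic. The sketch below is one possible approach, not necessarily the test used in the chapter: it uses the Welch form of the statistic and a normal approximation for the p-value, which is reasonable for groups of 100 but not for the tiny hypothetical lists used here to exercise the code.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def percent_change(before, after):
    """Per-household percent change in usage (negative means a reduction)."""
    return [100 * (a - b) / b for b, a in zip(before, after)]


def welch_t(sample_1, sample_2):
    """Welch two-sample t statistic (does not assume equal variances)."""
    se = sqrt(variance(sample_1) / len(sample_1)
              + variance(sample_2) / len(sample_2))
    return (mean(sample_1) - mean(sample_2)) / se


def two_sided_p_normal(t_stat):
    """Two-sided p-value from a normal approximation (assumes large groups)."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))


# Hypothetical kWh readings for illustration only, NOT rows from the CSV.
treat_change = percent_change([1000, 1200, 900, 1100], [980, 1185, 885, 1080])
control_change = percent_change([1000, 1150, 950, 1050], [1005, 1150, 955, 1060])
t_stat = welch_t(treat_change, control_change)
p_value = two_sided_p_normal(t_stat)
```

A negative `t_stat` indicates that usage fell more in the treatment group than in the control group.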
Chapter 9: Comparing Groups
resume-callbacks.csv
Status: Real data
Source: Bertrand, M. & Mullainathan, S. (2004). “Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination.” American Economic Review, 94(4), 991–1013. Data accessed via the OpenIntro R package.
Rows: 4,870 | Variables: 7
| Variable | Type | Description |
|---|---|---|
| resume_id | Nominal | Unique identifier |
| name_type | Nominal | Name category (White-sounding/Black-sounding, 2,435 each) |
| gender | Nominal | Gender (Male/Female) |
| resume_quality | Ordinal | Résumé quality level (Low/High) |
| years_experience | Discrete | Years of work experience |
| education | Nominal | Education level (Bachelor/High School) |
| callback | Nominal | Whether the applicant received a callback (Yes/No) |
This is the complete dataset from the Bertrand and Mullainathan field experiment. The researchers sent 4,870 fictitious résumés to real job postings in Boston and Chicago, randomly assigning either White-sounding or Black-sounding names. The callback rate for White-sounding names (9.65%) was approximately 50% higher than for Black-sounding names (6.45%), providing direct experimental evidence of racial discrimination in hiring.
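The "approximately 50% higher" comparison, and a two-proportion z statistic for the gap, follow directly from the figures above. In this sketch the callback counts are recovered by multiplying the rounded rates by the group size of 2,435, which is an assumption of the example; computing them from the `callback` column is the more direct route. The pooled z statistic is one standard choice, not necessarily the chapter's exact procedure.

```python
from math import sqrt

N_PER_GROUP = 2435  # résumés per name type, from the variable table

# Counts recovered from the rounded rates in the text (an assumption of this sketch).
white_callbacks = round(0.0965 * N_PER_GROUP)
black_callbacks = round(0.0645 * N_PER_GROUP)

rate_white = white_callbacks / N_PER_GROUP
rate_black = black_callbacks / N_PER_GROUP
relative_rate = rate_white / rate_black  # about 1.5, i.e. roughly 50% higher

# Two-proportion z statistic using the pooled callback rate.
pooled = (white_callbacks + black_callbacks) / (2 * N_PER_GROUP)
standard_error = sqrt(pooled * (1 - pooled) * (2 / N_PER_GROUP))
z_stat = (rate_white - rate_black) / standard_error

print(round(relative_rate, 2), round(z_stat, 2))
```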
Chapter 10: Simple Linear Regression
education-spending.csv
Status: Simulated, calibrated to patterns from real sources.
Sources of calibration: National Center for Education Statistics (NCES) Digest of Education Statistics (spending, test scores, student-teacher ratios, free lunch eligibility) and U.S. Census Bureau Small Area Income and Poverty Estimates (SAIPE) for median household income and poverty rates. The simulated rows preserve the orders of magnitude and approximate correlations of these real sources but do not reproduce any individual state’s reported figures.
Rows: 50 | Variables: 7
| Variable | Type | Description |
|---|---|---|
| state | Nominal | U.S. state (all 50 states) |
| spending_per_student | Continuous | Annual spending per student in constant dollars |
| avg_score | Continuous | Average NAEP (National Assessment of Educational Progress) mathematics score |
| median_household_income | Continuous | State median household income |
| pct_poverty | Continuous | State poverty rate (percentage) |
| student_teacher_ratio | Continuous | Students per teacher |
| pct_free_lunch | Continuous | Percentage of students eligible for free/reduced-price lunch |
Each row represents one U.S. state. The correlation between per-student spending and average NAEP math score is r = 0.43, indicating a moderate positive association. Additional variables allow students to explore confounders and build toward multiple regression concepts in Chapter 11.
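A simple linear regression of `avg_score` on `spending_per_student` reduces to two sums of deviations. The sketch below fits a least-squares line by hand; the five data points are hypothetical (spending in thousands of dollars, scores on a NAEP-like scale) and do not come from the file, and `fit_line` is our own helper name.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical values for illustration only, NOT rows from the CSV.
spending_thousands = [8, 10, 12, 14, 16]
scores = [270, 276, 274, 282, 283]
slope, intercept = fit_line(spending_thousands, scores)
print(slope, intercept)
```

The slope reads as the predicted change in average score per additional thousand dollars of spending, which is a statement about association, not a causal effect.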
Chapter 11: Multiple Regression
wage-gap.csv
Status: Real data
Source: Current Population Survey (CPS) May 1985, distributed in the AER (Applied Econometrics with R) package as CPS1985, originally from Berndt’s The Practice of Econometrics (1991). This is a widely used teaching dataset in labor economics.
Rows: 534 | Variables: 10
| Variable | Type | Description |
|---|---|---|
| employee_id | Nominal | Unique identifier |
| salary | Continuous | Full-time-equivalent annual salary in 1985 dollars (computed as hourly wage × 2,000 hours/year) |
| gender | Nominal | Gender (Male/Female) — recorded as binary in the original CPS instrument |
| years_experience | Continuous | Years of work experience |
| education_years | Continuous | Years of education |
| occupation | Nominal | Occupation category (Worker/Technical/Services/Office/Sales/Management) |
| region | Nominal | Geographic region (South/Other) |
| union_member | Nominal | Union membership status (Yes/No) |
| married | Nominal | Marital status (Yes/No) |
| ethnicity | Nominal | Ethnicity (cauc / hispanic / other, as coded in the source) |
Note on construction: The CPS1985 source records wage as an hourly rate. To support the chapter’s narrative in dollar-per-year terms (so coefficients read as “dollars more per year”), the CSV reports salary = wage × 2,000, the standard 50-week × 40-hour full-time benchmark. All chapter coefficients reconcile exactly with this transformation: the raw Female coefficient is approximately −$4,233/year (R² ≈ 0.042), narrowing to −$3,930/year after controlling for experience, education, and occupation (R² ≈ 0.302). The progression — from a simple bivariate regression to a multiple regression that accounts for confounders — is the pedagogical core of the chapter, illustrating both controlling-for logic and the limits of “controlling away” discrimination.
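Two facts make the salary transformation harmless for the analysis: multiplying the outcome by a constant multiplies every regression coefficient by that constant, and in a regression on a single Female indicator the coefficient equals the difference in group means. The sketch below illustrates both with hypothetical hourly wages (not rows from the file); `annualize` and `mean_gap` are our own helper names.

```python
HOURS_PER_YEAR = 50 * 40  # the 2,000-hour full-time benchmark from the text


def annualize(hourly_wage):
    """Convert an hourly wage to a full-time-equivalent annual salary."""
    return hourly_wage * HOURS_PER_YEAR


def mean_gap(values, is_female):
    """Mean for women minus mean for men.

    This equals the slope in a regression of the values on a Female dummy.
    """
    women = [v for v, f in zip(values, is_female) if f]
    men = [v for v, f in zip(values, is_female) if not f]
    return sum(women) / len(women) - sum(men) / len(men)


# Hypothetical hourly wages for illustration only, NOT rows from the CSV.
hourly_wages = [10.0, 12.0, 8.0, 9.0]
is_female = [False, False, True, True]

hourly_gap = mean_gap(hourly_wages, is_female)
annual_gap = mean_gap([annualize(w) for w in hourly_wages], is_female)
print(hourly_gap, annual_gap)
```

By the same arithmetic, the raw coefficient of about −$4,233/year quoted above corresponds to roughly −$2.12 per hour in the original CPS1985 units.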