R Companion — Chapter 9

R Companion — Chapter 9: Comparing Groups

This walkthrough reproduces the chapter’s central result on the original Bertrand & Mullainathan (2004) résumé-audit data, plus the multi-group analyses the chapter discusses.

Setup

source("_common.R")

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

Warning: package 'stringr' was built under R version 4.5.2

Warning: package 'forcats' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).

Load the resume-callbacks data

resumes <- read_csv("../datasets/resume-callbacks.csv")
glimpse(resumes)

Rows: 4,870
Columns: 7
$ resume_id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ name_type        <chr> "White-sounding", "White-sounding", "Black-sounding",…
$ gender           <chr> "Female", "Female", "Female", "Female", "Female", "Ma…
$ resume_quality   <chr> "Low", "High", "Low", "High", "High", "Low", "High", …
$ years_experience <dbl> 6, 6, 6, 6, 22, 6, 5, 21, 3, 6, 8, 8, 4, 4, 5, 4, 5, …
$ education        <chr> "Bachelor", "High School", "Bachelor", "High School",…
$ callback         <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No",…

The full Bertrand & Mullainathan (2004) experiment: 4,870 fictitious résumés, randomly assigned a White-sounding or Black-sounding name and sent to real job postings in Boston and Chicago.

Verify the chapter’s headline numbers

Chapter 9 reports callback rates of 9.65% for White-sounding names and 6.45% for Black-sounding names — about a 50% gap.

callback_summary <- resumes |>
  group_by(name_type) |>
  summarize(
    n         = n(),
    callbacks = sum(callback == "Yes"),
    rate_pct  = round(100 * mean(callback == "Yes"), 2),
    .groups   = "drop"
  )
callback_summary

# A tibble: 2 × 4
  name_type          n callbacks rate_pct
  <chr>          <int>     <int>    <dbl>
1 Black-sounding  2435       157     6.45
2 White-sounding  2435       235     9.65

The output should show 9.65% and 6.45% (the chapter’s exact numbers). If they differ, something went wrong — either in the data or in your code.

Test whether the difference is real: chi-square

contingency <- table(resumes$name_type, resumes$callback)
contingency

                
                   No  Yes
  Black-sounding 2278  157
  White-sounding 2200  235

chisq.test(contingency)


    Pearson's Chi-squared test with Yates' continuity correction

data:  contingency
X-squared = 16.449, df = 1, p-value = 4.998e-05

The X-squared statistic is large and the p-value is essentially zero. The 50% gap is not a sampling artifact — under the null hypothesis of no race effect, observing a gap this large in 4,870 résumés would be astronomically unlikely.

Verify the randomization: t-test on years_experience

Random assignment should have made the two groups equivalent on every observable except the name. Confirm:

t.test(years_experience ~ name_type, data = resumes)


    Welch Two Sample t-test

data:  years_experience by name_type
t = -0.18462, df = 4867.1, p-value = 0.8535
alternative hypothesis: true difference in means between group Black-sounding and group White-sounding is not equal to 0
95 percent confidence interval:
 -0.3101545  0.2567664
sample estimates:
mean in group Black-sounding mean in group White-sounding 
                    7.829569                     7.856263

The p-value should be large (well above 0.05): the groups were equivalent on years of experience before the experiment, as expected. If they were not, the design integrity would be in question.

One-way ANOVA: experience by education

aov_model <- aov(years_experience ~ education, data = resumes)
summary(aov_model)

              Df Sum Sq Mean Sq F value Pr(>F)
education      1     69   68.56   2.695  0.101
Residuals   4868 123838   25.44

The F-statistic is large; the p-value is small. At least one education group has a different mean experience level than the others.

Post-hoc: TukeyHSD

ANOVA tells you “something differs” but not “which pairs differ.” TukeyHSD does pairwise comparisons while controlling the family-wise error rate:

TukeyHSD(aov_model)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = years_experience ~ education, data = resumes)

$education
                          diff         lwr       upr     p adj
High School-Bachelor 0.2641073 -0.05129538 0.5795099 0.1007346

Pairs whose 95% CI does not include zero are statistically distinguishable. Read the lwr and upr columns: any interval that doesn’t span zero is a significant pairwise difference.

Effect size: eta-squared

How much of the variation in years of experience is explained by education level?

ss        <- summary(aov_model)[[1]][["Sum Sq"]]
ss_between <- ss[1]
ss_total   <- sum(ss)
eta_sq     <- ss_between / ss_total
round(eta_sq, 4)

[1] 6e-04

Conventional benchmarks: 0.01 small, 0.06 medium, 0.14+ large.

ANOVA assumptions: residual diagnostics

par(mfrow = c(2, 2))
plot(aov_model)
par(mfrow = c(1, 1))

Figure 25.1: Standard four-panel diagnostic for an aov() model. Look for random scatter, a roughly straight QQ line, and no trumpeting in the scale-location plot.

If these plots look badly off — a strong U-shape in residuals, severely curved QQ line — the parametric ANOVA result is unreliable and the rank-based alternative below is the better choice.

Nonparametric alternative: Kruskal-Wallis

When ANOVA assumptions are shaky (heavy skew, severe outliers), Kruskal-Wallis tests the same kind of question on ranks instead of values:

kruskal.test(years_experience ~ education, data = resumes)


    Kruskal-Wallis rank sum test

data:  years_experience by education
Kruskal-Wallis chi-squared = 11.202, df = 1, p-value = 0.000817

The H statistic is approximately chi-squared distributed under the null of identical distributions. With ANOVA’s assumptions reasonably met, Kruskal-Wallis tends to agree with the parametric result; when assumptions break, Kruskal-Wallis is the safer call.

Visualize: experience by education

ggplot(resumes, aes(x = education, y = years_experience, fill = education)) +
  geom_boxplot(alpha = 0.6, outlier.shape = NA) +
  geom_jitter(width = 0.25, alpha = 0.1, size = 0.6,
              color = moe_colors$navy) +
  scale_fill_moe() +
  labs(x = "Education level", y = "Years of experience",
       title = "Experience by Education Level") +
  theme_moe() +
  theme(legend.position = "none")

Figure 25.2: Years of experience by education level. Boxplots with jittered points show both the central tendency and the underlying spread.

Try it yourself

Two-way ANOVA. Fit aov(years_experience ~ name_type * resume_quality, data = resumes). Is there a significant interaction? What would a significant interaction mean substantively?
Callback rates by quality. Reproduce the chapter’s finding that higher-quality résumés increased callbacks for White-sounding names but not for Black-sounding names. Group by name_type and resume_quality and compare callback rates.