R Companion — Chapter 11

R Companion — Chapter 11: Multiple Regression

This walkthrough fits the chapter’s four nested regressions on the CPS 1985 wage data and reproduces the headline numbers — the gender coefficient narrowing from raw to fully-adjusted, R² climbing as controls are added, and VIF for the final model.

Setup

source("_common.R")
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).

Load the wage-gap data

wages <- read_csv("../datasets/wage-gap.csv")
glimpse(wages)
Rows: 534
Columns: 10
$ employee_id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ salary           <dbl> 10200, 9900, 13340, 8000, 15000, 26140, 8900, 38940, …
$ gender           <chr> "Female", "Female", "Male", "Male", "Male", "Male", "…
$ years_experience <dbl> 21, 42, 1, 4, 17, 9, 27, 9, 11, 9, 17, 19, 27, 30, 29…
$ education_years  <dbl> 8, 9, 12, 12, 12, 13, 10, 12, 16, 12, 12, 12, 8, 9, 9…
$ occupation       <chr> "Worker", "Worker", "Worker", "Worker", "Worker", "Wo…
$ region           <chr> "Other", "Other", "Other", "Other", "Other", "Other",…
$ union_member     <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "No"…
$ married          <chr> "Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "Y…
$ ethnicity        <chr> "hispanic", "cauc", "cauc", "cauc", "cauc", "cauc", "…

534 employees from the CPS May 1985 — the canonical labor-economics teaching dataset. Hourly wage has been multiplied by 2,000 (50 weeks × 40 hours) to give a full-time-equivalent annual salary, so coefficients read as “dollars per year.”

The raw wage gap

wages |>
  group_by(gender) |>
  summarize(
    n             = n(),
    mean_salary   = mean(salary),
    median_salary = median(salary),
    .groups       = "drop"
  )
# A tibble: 2 × 4
  gender     n mean_salary median_salary
  <chr>  <int>       <dbl>         <dbl>
1 Female   245      15758.         13600
2 Male     289      19990.         17860

The raw difference between men’s and women’s average salary in this dataset is the headline gap before any controls.

Model 1: gender only

model1 <- lm(salary ~ gender, data = wages)
summary(model1)

Call:
lm(formula = salary ~ gender, data = wages)

Residuals:
   Min     1Q Median     3Q    Max 
-17990  -7058  -2144   4787  73242 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15757.7      643.3   24.50  < 2e-16 ***
genderMale    4232.1      874.4    4.84  1.7e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10070 on 532 degrees of freedom
Multiple R-squared:  0.04218,   Adjusted R-squared:  0.04038 
F-statistic: 23.43 on 1 and 532 DF,  p-value: 1.703e-06

The chapter cites a raw Female-vs-Male gap of approximately −$4,233/year with R² ≈ 0.042. The negative sign means women earn less; the magnitude is the gap.

Model 2: add experience

Maybe men in this sample have more experience, and that explains the gap?

model2 <- lm(salary ~ gender + years_experience, data = wages)
summary(model2)

Call:
lm(formula = salary ~ gender + years_experience, data = wages)

Residuals:
   Min     1Q Median     3Q    Max 
-18592  -7222  -2183   4934  74769 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      14145.78     921.19  15.356  < 2e-16 ***
genderMale        4391.94     872.85   5.032 6.66e-07 ***
years_experience    85.59      35.17   2.434   0.0153 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10020 on 531 degrees of freedom
Multiple R-squared:  0.05275,   Adjusted R-squared:  0.04918 
F-statistic: 14.78 on 2 and 531 DF,  p-value: 5.649e-07

Watch what happens to the gender coefficient. If experience explained most of the gap, the coefficient would shrink toward zero. Read the new gender coefficient and compare.

Model 3: add education

model3 <- lm(salary ~ gender + years_experience + education_years, data = wages)
summary(model3)

Call:
lm(formula = salary ~ gender + years_experience + education_years, 
    data = wages)

Residuals:
   Min     1Q Median     3Q    Max 
-19143  -5492  -1306   3785  75448 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -13009.01    2419.71  -5.376 1.14e-07 ***
genderMale         4675.26     776.12   6.024 3.19e-09 ***
years_experience    226.60      33.42   6.781 3.19e-11 ***
education_years    1881.01     157.73  11.926  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8908 on 530 degrees of freedom
Multiple R-squared:  0.2532,    Adjusted R-squared:  0.2489 
F-statistic: 59.88 on 3 and 530 DF,  p-value: < 2.2e-16

Each additional year of education is associated with some increase in salary. Has this further reduced the unexplained gender gap?

Model 4: full model with occupation

The biggest control is occupation. Men and women in 1985 were distributed differently across occupations, and occupations themselves carry different pay scales.

model4 <- lm(salary ~ gender + years_experience + education_years + occupation,
             data = wages)
summary(model4)

Call:
lm(formula = salary ~ gender + years_experience + education_years + 
    occupation, data = wages)

Residuals:
   Min     1Q Median     3Q    Max 
-22428  -5388   -934   4125  70372 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         -1645.06    3382.59  -0.486  0.62694    
genderMale           3929.75     832.57   4.720 3.03e-06 ***
years_experience      205.96      33.02   6.238 9.11e-10 ***
education_years      1433.39     198.11   7.235 1.66e-12 ***
occupationOffice    -6526.25    1530.01  -4.265 2.37e-05 ***
occupationSales     -8076.70    1845.23  -4.377 1.45e-05 ***
occupationServices  -7679.52    1612.61  -4.762 2.48e-06 ***
occupationTechnical -2042.78    1457.83  -1.401  0.16173    
occupationWorker    -4825.42    1504.32  -3.208  0.00142 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8654 on 525 degrees of freedom
Multiple R-squared:  0.3017,    Adjusted R-squared:  0.291 
F-statistic: 28.35 on 8 and 525 DF,  p-value: < 2.2e-16

The chapter cites a controlled Female-vs-Male coefficient of approximately −$3,930/year with R² ≈ 0.302. The gap shrinks but does not vanish.

The interpretation hinges on what “controlling for occupation” means: if women are channeled into lower-paying occupations because of discrimination upstream, then controlling for occupation effectively removes part of the discrimination from the estimated gap. The chapter discusses this — controls are not always neutral.

Track the gender coefficient across models

gender_coefs <- tibble(
  model        = c("M1: Gender only",
                   "M2: + Experience",
                   "M3: + Education",
                   "M4: + Occupation"),
  coefficient  = c(coef(model1)["genderMale"],
                   coef(model2)["genderMale"],
                   coef(model3)["genderMale"],
                   coef(model4)["genderMale"]),
  r_squared    = c(summary(model1)$r.squared,
                   summary(model2)$r.squared,
                   summary(model3)$r.squared,
                   summary(model4)$r.squared)
)
gender_coefs |>
  mutate(coefficient = round(coefficient, 0),
         r_squared   = round(r_squared, 3))
# A tibble: 4 × 3
  model            coefficient r_squared
  <chr>                  <dbl>     <dbl>
1 M1: Gender only         4232     0.042
2 M2: + Experience        4392     0.053
3 M3: + Education         4675     0.253
4 M4: + Occupation        3930     0.302

The chapter argues this exact progression: the headline ~$4,233 gap shrinks as controls are added. R² climbs from 0.042 to 0.302 — the model now explains 30% of the variation in salaries.

Visualize the coefficient trajectory

gender_coefs |>
  mutate(model = factor(model, levels = model)) |>
  ggplot(aes(x = model, y = coefficient)) +
  geom_segment(aes(xend = model, y = 0, yend = coefficient),
               color = moe_colors$coral, linewidth = 1) +
  geom_point(size = 4, color = moe_colors$coral) +
  geom_hline(yintercept = 0, linetype = "dashed",
             color = moe_colors$slate) +
  coord_flip() +
  labs(x = NULL, y = "Female-vs-Male coefficient ($/year)",
       title = "Gender coefficient narrows but does not vanish") +
  theme_moe()
Figure 27.1: Gender coefficient across four nested models. Adding controls shrinks the gap but does not close it.

Multicollinearity check: VIF

When predictors are correlated with each other, coefficient estimates become unstable. The Variance Inflation Factor (VIF) quantifies how much each predictor’s standard error is inflated by collinearity with the others.

library(car)
Warning: package 'car' was built under R version 4.5.2
Warning: package 'carData' was built under R version 4.5.2
vif(model4)
                     GVIF Df GVIF^(1/(2*Df))
gender           1.227183  1        1.107783
years_experience 1.188824  1        1.090332
education_years  1.910504  1        1.382210
occupation       2.056254  5        1.074751

Conventional rules of thumb: VIF below 5 is fine, VIF above 10 is a problem. Categorical predictors (like occupation) get a “GVIF” with adjustment for degrees of freedom.

Coefficient plot for the full model

coef_df <- broom::tidy(model4, conf.int = TRUE) |>
  filter(term != "(Intercept)")

ggplot(coef_df, aes(x = estimate, y = reorder(term, estimate))) +
  geom_vline(xintercept = 0, linetype = "dashed", color = moe_colors$slate) +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high),
                 height = 0.25, color = moe_colors$navy) +
  geom_point(size = 3, color = moe_colors$coral) +
  scale_x_continuous(labels = scales::label_dollar()) +
  labs(x = "Coefficient ($/year)", y = NULL,
       title = "Model 4: Coefficient Estimates with 95% CIs") +
  theme_moe()
Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
`height` was translated to `width`.
Figure 27.2: Coefficients from Model 4 with 95% confidence intervals. Coefficients whose intervals exclude zero are statistically distinguishable from zero.

Try it yourself

  1. Add region. Fit Model 5 with region added to the formula. Does the gender coefficient move further? Does region explain meaningful additional variance?
  2. Interaction. Fit lm(salary ~ gender * years_experience + education_years + occupation, data = wages). The gender:years_experience interaction tests whether the experience-salary relationship differs by gender. What does the coefficient tell you?