R Companion — Chapter 10: Simple Linear Regression

This walkthrough fits the chapter’s regression of average NAEP score on per-student spending and reproduces every number the chapter narrative cites.

Setup

source("_common.R")

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).

Load the education-spending data

education <- read_csv("../datasets/education-spending.csv")
glimpse(education)
Rows: 50
Columns: 7
$ state                   <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "C…
$ spending_per_student    <dbl> 12903, 24479, 12363, 13959, 17178, 16331, 2662…
$ avg_score               <dbl> 268.5, 283.0, 277.3, 276.0, 270.4, 287.4, 288.…
$ median_household_income <dbl> 59703, 88072, 74355, 55505, 91517, 89096, 8818…
$ pct_poverty             <dbl> 16.2, 10.8, 12.5, 16.3, 12.2, 9.5, 9.8, 10.0, …
$ student_teacher_ratio   <dbl> 17.9, 18.3, 22.8, 12.7, 21.8, 16.3, 11.7, 14.2…
$ pct_free_lunch          <dbl> 60.2, 39.5, 50.3, 64.0, 58.2, 42.4, 41.9, 25.2…

The data has one row for each of the 50 states, with per-student spending, average NAEP math score, and several covariates (median household income, poverty rate, student-teacher ratio, free-lunch eligibility).

Plot first

ggplot(education, aes(x = spending_per_student, y = avg_score)) +
  geom_point(size = 3, color = moe_colors$teal, alpha = 0.8) +
  geom_smooth(method = "lm", se = TRUE,
              color = moe_colors$coral, fill = moe_colors$coral, alpha = 0.15) +
  scale_x_continuous(labels = scales::label_dollar()) +
  labs(x = "Spending per student",
       y = "Average NAEP score",
       title = "Education Spending and Test Scores") +
  theme_moe()
`geom_smooth()` using formula = 'y ~ x'
Figure 26.1: Per-student spending vs. average NAEP score, one point per state. The relationship is positive but modest.

The fitted line slopes upward, but only modestly. The shaded band is the 95% confidence interval for the regression line, that is, for the mean score at each spending level.

Correlation and R²

r  <- cor(education$spending_per_student, education$avg_score)
r2 <- r ^ 2
round(c(r = r, r_squared = r2), 3)
        r r_squared 
    0.433     0.188 

Chapter 10 cites a correlation around 0.43 and an R² around 0.19 — meaning roughly 19% of state-to-state variation in NAEP scores is explained by per-student spending alone. The other 81% comes from everything else (poverty, demographics, school quality, prior achievement, etc.).
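With a single predictor, the R² that lm() reports is the square of the Pearson correlation, which is why squaring r works here. The sketch below checks that identity on R's built-in cars dataset, used only as a stand-in so the block runs without the book's CSV; the object names (r_cars, fit_cars) are illustrative and not from the chapter.

```r
# Stand-in data: R's built-in `cars` (speed vs. stopping distance),
# NOT the book's education dataset.
r_cars   <- cor(cars$speed, cars$dist)
fit_cars <- lm(dist ~ speed, data = cars)

# With one predictor, lm()'s R-squared equals the squared correlation.
r2_from_cor <- r_cars^2
r2_from_lm  <- summary(fit_cars)$r.squared

all.equal(r2_from_cor, r2_from_lm)
```

The identity holds only for simple regression. With two or more predictors, R² is no longer the square of any single pairwise correlation.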

Fit the regression

model <- lm(avg_score ~ spending_per_student, data = education)
summary(model)

Call:
lm(formula = avg_score ~ spending_per_student, data = education)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.6741  -4.2677   0.8565   5.1683  11.4769 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.717e+02  3.534e+00  76.872  < 2e-16 ***
spending_per_student 6.352e-04  1.906e-04   3.332  0.00166 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.89 on 48 degrees of freedom
Multiple R-squared:  0.1879,    Adjusted R-squared:  0.171 
F-statistic:  11.1 on 1 and 48 DF,  p-value: 0.001665

The output has the chapter’s headline numbers:

  • Slope (spending_per_student coefficient) — chapter cites approximately 0.000635, meaning each additional dollar of per-student spending is associated with a 0.000635-point increase in average NAEP score, or about 0.64 points per additional $1,000
  • Standard error of that slope — chapter cites approximately 0.000191
  • t-statistic for the slope — chapter cites approximately 3.33
  • p-value — chapter cites approximately 0.0017
  • Intercept — chapter cites approximately 271.67

If your R output matches those numbers, or comes very close, you have reproduced the chapter's regression. Tiny rounding differences are normal.
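The t-statistic and p-value in the coefficient table follow directly from the slope and its standard error. As a rough check, the sketch below recomputes them from the rounded values printed in the summary output, so the results should agree with the table only up to rounding. The variable names are illustrative.

```r
# Values copied from the summary(model) output above, rounded as printed.
slope    <- 6.352e-04
se_slope <- 1.906e-04
df_resid <- 48          # residual degrees of freedom (50 states - 2 coefficients)

# t-statistic: the estimate divided by its standard error.
t_stat <- slope / se_slope

# Two-sided p-value from the t distribution with 48 degrees of freedom.
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# The same slope expressed per $1,000 of spending, which is easier to read.
slope_per_1000 <- slope * 1000

round(c(t = t_stat, p = p_value, per_1000 = slope_per_1000), 4)
```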

95% confidence interval for the slope

confint(model, level = 0.95)
                            2.5 %       97.5 %
(Intercept)          2.645644e+02 2.787758e+02
spending_per_student 2.519231e-04 1.018448e-03

The chapter reports a CI of roughly (0.000252, 0.001019) for the slope. The interval excludes zero (consistent with the small p-value), and its width tells you how precisely the slope is estimated.
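The interval that confint() returns for a coefficient is the estimate plus or minus a t critical value times the standard error. The sketch below rebuilds the slope interval by hand from the rounded summary values, so expect agreement with confint() only to about six decimal places. The variable names are illustrative.

```r
# Rounded values from the summary(model) output.
slope    <- 6.352e-04
se_slope <- 1.906e-04
df_resid <- 48

# Critical value for a two-sided 95% interval with 48 degrees of freedom.
t_crit <- qt(0.975, df = df_resid)

# Estimate plus or minus the margin of error.
ci_slope <- slope + c(-1, 1) * t_crit * se_slope
names(ci_slope) <- c("lower", "upper")

signif(ci_slope, 4)
```

Changing 0.975 to 0.995 would give the 99% interval, which is wider because the critical value is larger.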

Predict at a hypothetical spending level

The chapter discusses prediction at $14,000 per student.

new_data <- data.frame(spending_per_student = 14000)

predict(model, newdata = new_data, interval = "confidence", level = 0.95)
       fit      lwr      upr
1 280.5627 278.1167 283.0087

The chapter cites a predicted score of approximately 280.56 at $14,000 spending. The interval shown is a confidence interval for the mean score across states at that spending level. Setting interval = "prediction" instead would give an interval for an individual state's actual score at that spending level, which is always wider than the confidence interval for the mean.
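The sketch below does two things. First, it reproduces the point prediction by hand from the intercept and slope the chapter cites. Second, it compares the two interval types on R's built-in cars dataset, a stand-in chosen only so the block runs without the book's CSV. The object names and the speed value of 15 are illustrative assumptions.

```r
# Point prediction by hand, using the chapter's cited intercept and slope.
intercept   <- 271.67
slope       <- 6.352e-04
fit_by_hand <- intercept + slope * 14000
fit_by_hand

# Interval comparison on stand-in data (built-in `cars`, NOT the book's data).
fit_cars  <- lm(dist ~ speed, data = cars)
new_speed <- data.frame(speed = 15)

ci_mean <- predict(fit_cars, newdata = new_speed,
                   interval = "confidence", level = 0.95)
pi_obs  <- predict(fit_cars, newdata = new_speed,
                   interval = "prediction", level = 0.95)

# Both calls return the same fitted value; only the interval width differs.
width_ci <- unname(ci_mean[1, "upr"] - ci_mean[1, "lwr"])
width_pi <- unname(pi_obs[1, "upr"] - pi_obs[1, "lwr"])
c(confidence = width_ci, prediction = width_pi)
```

The prediction interval is wider because it adds the residual scatter of individual observations to the uncertainty about the line itself.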

Residual diagnostics

A regression’s coefficient estimates can be dead right while its inference (standard errors, p-values, CIs) is misleading if the residual assumptions are violated. Run the standard diagnostic plots:

par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
Figure 26.2: Standard four-panel residual diagnostic. Look for random scatter in the residuals-vs-fitted plot, a roughly straight QQ line, no trumpeting in the scale-location plot, and no points beyond the Cook's-distance lines in the residuals-vs-leverage plot.

What you want to see:

  1. Residuals vs Fitted — random scatter around zero. A curve suggests nonlinearity; a fan shape suggests heteroscedasticity.
  2. Normal QQ — points along the line. Strong departures suggest the residuals are not normal.
  3. Scale-Location — flat trend line. A rising trend suggests heteroscedasticity.
  4. Residuals vs Leverage — no points outside the dashed Cook’s-distance lines. Points outside have undue influence on the fit.
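The quantities behind those four panels can also be pulled out as numbers, which helps when a plot is ambiguous. The sketch below is a minimal example on R's built-in cars dataset (a stand-in, not the book's data). The 4/n cutoff for Cook's distance is a common rule of thumb assumed here for illustration; it is not taken from the chapter and is stricter than the lines drawn in the plot.

```r
# Stand-in model on built-in data so the block is self-contained.
fit_cars <- lm(dist ~ speed, data = cars)

# One row per observation: the ingredients of the diagnostic plots.
diag_tbl <- data.frame(
  fitted   = fitted(fit_cars),         # x-axis of Residuals vs Fitted
  std_res  = rstandard(fit_cars),      # standardized residuals (QQ, Scale-Location)
  leverage = hatvalues(fit_cars),      # x-axis of Residuals vs Leverage
  cooks_d  = cooks.distance(fit_cars)  # influence of each point on the fit
)

# Rule-of-thumb flag (an assumption, not a formal test): Cook's distance > 4/n.
cutoff  <- 4 / nrow(cars)
flagged <- diag_tbl[diag_tbl$cooks_d > cutoff, ]
flagged
```

Flagged rows are candidates for a closer look, not automatic deletions. Refitting without them and comparing the coefficients shows how much they actually matter.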

Try it yourself

  1. Try a different predictor. Fit lm(avg_score ~ pct_poverty, data = education). Compare the slope, R², and p-value to the spending model. Which predictor explains more variation in NAEP scores?
  2. Both predictors at once. Multiple regression preview: fit lm(avg_score ~ spending_per_student + pct_poverty, data = education). What happens to the spending coefficient once poverty is in the model? (This is the lurking-variable problem Chapter 11 dives into.)