R Companion — Chapter 4: Summarizing Data with Pictures

This walkthrough builds the chapter’s visualizations on the HOLC redlining dataset, then reproduces Anscombe’s quartet to drive home the chapter’s central message: identical summary statistics can hide very different distributions.

Setup

source("_common.R")

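Download _common.R to run this locally; it loads tidyverse and the book’s plot theme. You can replace the source() line with library(tidyverse) instead, but the plots below also call moe_colors, theme_moe(), scale_fill_moe(), and scale_color_moe(), which are not part of tidyverse, so you would need to define or remove those as well.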
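If _common.R is not at hand, the book-specific helpers that the plotting code calls can be stubbed out. The sketch below is only a stand-in: the hex colors, the Brewer palette, and the choice of theme_minimal() are assumptions made for illustration, not the book’s actual palette or theme.

```r
library(ggplot2)

# Hypothetical stand-ins for the helpers that _common.R provides.
# The colors and theme are placeholders, not the book's real styling.
moe_colors <- list(
  teal  = "#1B9E77",
  coral = "#E9806E",
  navy  = "#1F3A5F"
)

# Plain ggplot2 theme in place of the book theme.
theme_moe <- function(...) theme_minimal(...)

# Discrete color scales; four grades fit comfortably in a Brewer palette.
scale_fill_moe  <- function(...) scale_fill_brewer(palette = "Set2", ...)
scale_color_moe <- function(...) scale_color_brewer(palette = "Set2", ...)
```

With these definitions in place of source("_common.R"), plus library(tidyverse), the rest of the walkthrough should run, though the figures will not match the book’s styling.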

Load the redlining data

housing <- read_csv("../datasets/housing-redlining.csv")
glimpse(housing)
Rows: 551
Columns: 9
$ neighborhood_id  <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ metro_area       <chr> "Akron, OH", "Akron, OH", "Akron, OH", "Akron, OH", "…
$ holc_grade       <chr> "A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C"…
$ total_population <dbl> 36963, 67816, 112694, 15144, 23303, 45230, 101538, 49…
$ pct_white        <dbl> 66.8, 61.2, 64.9, 40.8, 72.9, 58.9, 56.0, 33.9, 66.6,…
$ pct_black        <dbl> 23.3, 24.3, 20.3, 45.7, 7.8, 15.7, 16.5, 39.4, 4.4, 6…
$ pct_hispanic     <dbl> 2.6, 3.3, 2.8, 3.8, 5.6, 9.6, 10.2, 13.5, 22.7, 27.6,…
$ pct_asian        <dbl> 1.9, 5.0, 5.6, 3.0, 8.6, 5.6, 6.3, 4.4, 1.3, 2.5, 2.1…
$ pct_minority     <dbl> 33.2, 38.8, 35.1, 59.2, 27.1, 41.1, 44.0, 66.1, 33.4,…

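551 neighborhoods across 138 metro areas. Each row has a HOLC grade (A “Best” through D “Hazardous”), total population, and percentages by race/ethnicity. Set the HOLC grade as an ordered factor so the plotting order is declared explicitly rather than left to alphabetical sorting. For A through D the two happen to coincide, but the ordered factor also records that the grades form a ranking, so they can be compared with < and >.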

housing <- housing |>
  mutate(holc_grade = factor(holc_grade, levels = c("A", "B", "C", "D"), ordered = TRUE))
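A small base R sketch of what the ordered factor buys you, using a toy vector of grades rather than the housing data:

```r
# Toy vector of grades, deliberately out of order.
grades <- factor(c("D", "A", "C", "B"),
                 levels = c("A", "B", "C", "D"),
                 ordered = TRUE)

levels(grades)   # "A" "B" "C" "D"

# Ordered factors support ranking comparisons against a level.
grades < "C"     # FALSE TRUE FALSE TRUE

# Sorting follows the declared level order, not the input order.
sorted_grades <- as.character(sort(grades))
sorted_grades    # "A" "B" "C" "D"
```

An unordered factor would still fix the plotting order, but the comparison with < would return NA with a warning.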

Histogram: minority percentage across all neighborhoods

ggplot(housing, aes(x = pct_minority)) +
  geom_histogram(binwidth = 5, fill = moe_colors$teal, color = "white") +
  labs(
    x = "Minority population (%)",
    y = "Number of neighborhoods",
    title = "Minority Population Share Across Neighborhoods"
  ) +
  theme_moe()
Figure 20.1: Distribution of minority percentage across HOLC-graded neighborhoods.

The bimodal shape is itself informative — many neighborhoods cluster at low minority percentages, many at high, with relatively few in the middle. That bimodality is exactly the residential segregation pattern HOLC grades helped enforce.

Density plot by HOLC grade

ggplot(housing, aes(x = pct_minority, fill = holc_grade)) +
  geom_density(alpha = 0.5) +
  scale_fill_moe() +
  labs(
    x = "Minority population (%)",
    y = "Density",
    fill = "HOLC grade",
    title = "Minority Population by HOLC Grade"
  ) +
  theme_moe()
Figure 20.2: Density of minority percentage by HOLC grade. Grade D neighborhoods skew much higher.

Grade A neighborhoods cluster at low minority percentages; Grade D at high. The overlap between B and C is substantial, but A and D are nearly disjoint — a 90-year-old set of grades is still strongly visible in the demographic data of today.

Boxplot by HOLC grade

ggplot(housing, aes(x = holc_grade, y = pct_minority, fill = holc_grade)) +
  geom_boxplot(show.legend = FALSE,
               outlier.color = moe_colors$coral, outlier.alpha = 0.6) +
  scale_fill_moe() +
  labs(
    x = "HOLC grade",
    y = "Minority population (%)",
    title = "Boxplot: Minority Population by HOLC Grade"
  ) +
  theme_moe()
Figure 20.3: Boxplot of minority percentage by HOLC grade. The shift is clean and monotonic.

The medians (the line in the middle of each box) climb steadily from A to D. The IQRs (the boxes) are wide enough that there is real overlap between adjacent grades — but the shift is unmistakable.
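The quantities a boxplot draws can also be computed directly. The sketch below defines a hypothetical helper, box_summary(), and runs it on the first eight rows shown in the glimpse() output, retyped by hand so the example works without the CSV. Eight rows are far too few to show the pattern in the figure; with the full data, pass housing instead.

```r
library(dplyr)

# First eight rows from the glimpse() output, retyped by hand.
first8 <- tibble(
  holc_grade   = factor(rep(c("A", "B", "C", "D"), times = 2),
                        levels = c("A", "B", "C", "D"), ordered = TRUE),
  pct_minority = c(33.2, 38.8, 35.1, 59.2, 27.1, 41.1, 44.0, 66.1)
)

# Hypothetical helper: quartiles, median, and IQR per grade.
box_summary <- function(df) {
  df |>
    group_by(holc_grade) |>
    summarize(
      q1     = quantile(pct_minority, 0.25),
      median = median(pct_minority),
      q3     = quantile(pct_minority, 0.75),
      iqr    = IQR(pct_minority),
      .groups = "drop"
    )
}

box_summary(first8)
```

Comparing the median and iqr columns across grades puts numbers on the overlap that the boxes show visually.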

Bar chart: mean minority percentage by grade

housing |>
  group_by(holc_grade) |>
  summarize(mean_minority = mean(pct_minority), .groups = "drop") |>
  ggplot(aes(x = holc_grade, y = mean_minority, fill = holc_grade)) +
  geom_col(show.legend = FALSE) +
  scale_fill_moe() +
  labs(
    x = "HOLC grade",
    y = "Mean minority population (%)",
    title = "Average Minority Population by HOLC Grade"
  ) +
  theme_moe()
Figure 20.4: Mean minority percentage by HOLC grade. Bars hide spread but make the gradient obvious.

Bar charts are great for monotonic comparisons across categories. The cost is that they hide the spread you saw in the boxplot — two A-graded neighborhoods can have very different minority percentages, but the bar collapses that information into a single height.

Scatter plot: white vs Black population share

ggplot(housing, aes(x = pct_white, y = pct_black, color = holc_grade)) +
  geom_point(alpha = 0.6, size = 2) +
  scale_color_moe() +
  labs(
    x = "White population (%)",
    y = "Black population (%)",
    color = "HOLC grade",
    title = "White vs Black Population Share by HOLC Grade"
  ) +
  theme_moe()
Figure 20.5: White vs Black population share by HOLC grade. The negative relationship is partly mechanical, but the HOLC grade clustering reveals which neighborhoods are at which extreme.

The negative slope is partly mechanical (the percentages compete for 100% of the population), but the coloring is what tells the story. Grade A clusters in the lower-right (high white, low Black). Grade D clusters in the upper-left.

Correlation

cor(housing$pct_white, housing$pct_black, use = "complete.obs") |> round(3)
[1] -0.658
cor(housing$pct_white, housing$pct_minority, use = "complete.obs") |> round(3)
[1] -1

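The two correlations are not equally informative. The -1 between pct_white and pct_minority is purely mechanical: in the rows shown by glimpse() the two columns sum to 100, so pct_minority appears to be defined as 100 minus pct_white and the correlation carries no new information. The -0.658 between white and Black share is partly mechanical (shares of the same population compete) and partly real. The interesting question there is not whether the two are correlated but how the strength varies across grades, which we could investigate by computing the correlation within each HOLC grade separately.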
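The correlation of exactly -1 can be checked against the rows visible in the glimpse() output. The sketch below retypes the first nine values of pct_white and pct_minority by hand; whether the identity holds for all 551 rows is an assumption, though a correlation that rounds to -1 strongly suggests it.

```r
# First nine values of each column, copied from the glimpse() output.
pct_white    <- c(66.8, 61.2, 64.9, 40.8, 72.9, 58.9, 56.0, 33.9, 66.6)
pct_minority <- c(33.2, 38.8, 35.1, 59.2, 27.1, 41.1, 44.0, 66.1, 33.4)

# The two shares add up to 100 in every row (up to floating point).
sums_to_100 <- isTRUE(all.equal(pct_white + pct_minority, rep(100, 9)))
sums_to_100

# One column is a decreasing linear function of the other,
# so the correlation is -1 by construction.
r <- cor(pct_white, pct_minority)
round(r, 3)
```

On the full data, something like all.equal(housing$pct_white + housing$pct_minority, rep(100, nrow(housing))) would confirm or refute the assumption, provided neither column has missing values.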

Anscombe’s quartet — why you always plot first

The chapter’s most important lesson is that summary statistics can be identical across very different datasets. R has Anscombe’s classic 1973 demonstration built in:

data(anscombe)

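anscombe_long <- anscombe |>
  # Columns are named x1..x4 and y1..y4. The first character becomes the
  # output column (x or y); the second becomes the set label (1 to 4).
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "(.)(.)") |>
  mutate(set = paste("Set", set))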

anscombe_long |>
  group_by(set) |>
  summarize(
    mean_x = round(mean(x), 2),
    mean_y = round(mean(y), 2),
    sd_x   = round(sd(x), 2),
    sd_y   = round(sd(y), 2),
    cor_xy = round(cor(x, y), 3),
    .groups = "drop"
  )
# A tibble: 4 × 6
  set   mean_x mean_y  sd_x  sd_y cor_xy
  <chr>  <dbl>  <dbl> <dbl> <dbl>  <dbl>
1 Set 1      9    7.5  3.32  2.03  0.816
2 Set 2      9    7.5  3.32  2.03  0.816
3 Set 3      9    7.5  3.32  2.03  0.816
4 Set 4      9    7.5  3.32  2.03  0.817

All four sets have nearly identical means, standard deviations, and correlations. Now plot them:

ggplot(anscombe_long, aes(x = x, y = y)) +
  geom_point(color = moe_colors$navy, size = 2) +
  geom_smooth(method = "lm", se = FALSE, color = moe_colors$coral) +
  facet_wrap(~ set) +
  labs(
    title = "Anscombe's Quartet",
    subtitle = "Identical means, SDs, and correlations — wildly different distributions"
  ) +
  theme_moe()
`geom_smooth()` using formula = 'y ~ x'
Figure 20.6: Anscombe’s quartet: four datasets with identical summary statistics, four very different shapes.

This is the chapter’s argument made undeniable. The same numerical summaries describe a roughly linear relationship with ordinary scatter, a smooth curve, a tight line pulled off course by a single outlier, and a vertical stack of points whose entire correlation comes from one extreme point. You cannot tell them apart from the numbers alone.
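The fitted lines in the four panels are nearly the same as well, which can be checked by fitting each regression directly. This sketch uses only base R and the built-in anscombe data frame; the object name fits is introduced here for illustration.

```r
# Fit y ~ x separately for each of the four sets and collect coefficients.
fits <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  coef(lm(y ~ x))
})
colnames(fits) <- paste("Set", 1:4)

# Each column rounds to an intercept of 3 and a slope of 0.5.
round(fits, 2)
```

So the regression line is one more summary that cannot distinguish the four datasets; only the plot does.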

Try it yourself

  1. Within-grade correlation. Use housing |> group_by(holc_grade) |> summarize(cor_white_black = cor(pct_white, pct_black, use = "complete.obs")) to compute the correlation between white and Black population share within each HOLC grade. In which grade is the correlation strongest? If it turns out weaker in grade C, why might that be?
  2. Pick a metro area. Filter to a single metro_area with several neighborhoods and replot the scatter colored by grade. The within-metro story is often even sharper than the national one.