R Companion — Chapter 3: Summarizing Data with Numbers

This walkthrough computes every summary the chapter discusses on the U.S. household income data, with the mean-versus-median gap as the running theme.

Setup

source("_common.R")

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).
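
If you take the library(tidyverse) route, note that the figure code later in this walkthrough calls moe_colors and theme_moe(), which tidyverse does not provide. The sketch below defines stand-ins so that code still runs. It assumes those two objects normally come from _common.R; the hex codes and the choice of theme_minimal() are arbitrary placeholders, not the book's actual palette or theme.

```r
library(ggplot2)

# Placeholder palette: arbitrary hex codes, NOT the book's real colors.
# Only defined if _common.R has not already created moe_colors.
if (!exists("moe_colors")) {
  moe_colors <- list(teal = "#2A9D8F", navy = "#1D3557", coral = "#E76F51")
}

# Placeholder theme: a stock ggplot2 theme standing in for theme_moe().
if (!exists("theme_moe")) {
  theme_moe <- function(...) theme_minimal(...)
}
```

With these in place the histogram will render with different styling than the book's figure, but the bars and the mean and median lines will be the same.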

Load and explore the ACS household income data

income <- read_csv("../datasets/acs-household-income.csv")
glimpse(income)
Rows: 1,200
Columns: 6
$ household_id     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ state            <chr> "Louisiana", "Oregon", "North Carolina", "New Mexico"…
$ region           <chr> "Southeast", "West", "Southeast", "Southwest", "North…
$ household_income <dbl> 44800, 93800, 109100, 63000, 29300, 43500, 214700, 99…
$ education_level  <chr> "High School or Less", "Some College or Associate", "…
$ household_size   <dbl> 1, 1, 2, 4, 1, 2, 4, 5, 2, 1, 4, 3, 1, 1, 1, 6, 2, 3,…

The dataset contains 1,200 households drawn from the U.S. Census Bureau’s 2022 American Community Survey 1-Year PUMS. Each row is one household, with its state, region, annual income, the education level of the head of household, and household size.

The headline: mean versus median

This is the chapter’s central argument: when distributions are skewed, the mean and the median answer different questions and the “typical” household depends on which one you choose.

income |>
  summarize(
    n            = n(),
    mean_income  = mean(household_income),
    median_income = median(household_income),
    gap          = mean(household_income) - median(household_income)
  )
# A tibble: 1 × 4
      n mean_income median_income    gap
  <int>       <dbl>         <dbl>  <dbl>
1  1200     100183.         73200 26983.

The mean exceeds the median by about $27,000, more than a third of the median itself. That is the signature of right skew: a long tail of high-income households pulls the mean up while the median stays anchored at the middle of the data.
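
To see the mechanism in isolation, here is a minimal sketch on five made-up incomes (illustrative values, not rows from the ACS file). Replacing the largest value with one extreme income moves the mean a long way and leaves the median untouched.

```r
# Five hypothetical household incomes, evenly spaced
incomes <- c(30000, 45000, 60000, 75000, 90000)
mean(incomes)    # 60000
median(incomes)  # 60000

# Same households, but the top earner now makes $1,000,000
incomes_tail <- c(30000, 45000, 60000, 75000, 1000000)
mean(incomes_tail)    # 242000: one value dragged the mean up
median(incomes_tail)  # 60000: the middle household has not changed
```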

The chapter reports U.S. Census Bureau figures of approximately $74,580 (median) and $105,000 (mean) for 2022. The estimates from this 1,200-household sample, $73,200 and about $100,200, are in the same ballpark. The exact numbers depend on the random draw underlying the dataset, but the gap between mean and median (about $27,000 here versus roughly $30,400 in the Census figures) is similar.

Spread

income |>
  summarize(
    sd_income    = sd(household_income),
    iqr_income   = IQR(household_income),
    min_income   = min(household_income),
    max_income   = max(household_income),
    range_income = max(household_income) - min(household_income)
  )
# A tibble: 1 × 5
  sd_income iqr_income min_income max_income range_income
      <dbl>      <dbl>      <dbl>      <dbl>        <dbl>
1   100084.      92325          0    1007900      1007900

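The standard deviation is heavily influenced by the high-income tail; the IQR, which depends only on the middle 50% of households, is not. Here the standard deviation is about as large as the mean itself. When data is skewed, the IQR is generally the more honest measure of “typical spread.”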
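
A small sketch of that contrast, using made-up incomes rather than the ACS data: pushing one value far into the tail inflates the standard deviation many times over, while the IQR does not move because the quartiles are unchanged.

```r
# Hypothetical incomes, with and without one extreme value
incomes      <- c(30000, 45000, 60000, 75000, 90000)
incomes_tail <- c(30000, 45000, 60000, 75000, 1000000)

# The standard deviation uses every value, so the extreme income inflates it
sd_ratio <- sd(incomes_tail) / sd(incomes)
sd_ratio

# The IQR uses only the 25th and 75th percentiles, which are 45000 and 75000
# in both vectors, so it is 30000 either way
IQR(incomes)
IQR(incomes_tail)
```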

The five-number summary

quantile(income$household_income)
     0%     25%     50%     75%    100% 
      0   38675   73200  131000 1007900 

The chapter cites approximate values of Q1 ≈ $45,000, median ≈ $74,580, and Q3 ≈ $133,000 from the full ACS. Sample estimates here will be in the same neighborhood. Whatever your numbers, what matters is how the two halves of the middle 50% compare: the gap between Q3 and the median is often larger than the gap between the median and Q1, which again signals right skew.
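
That comparison can be checked directly. The sketch below copies the quartiles from the quantile() output above rather than reading the CSV, so it runs on its own; with the data loaded you could pull the same three values from quantile(income$household_income).

```r
# Quartiles copied from the printed quantile() output
q <- c(q1 = 38675, median = 73200, q3 = 131000)

lower_gap <- unname(q["median"] - q["q1"])  # 34525
upper_gap <- unname(q["q3"] - q["median"])  # 57800

# A ratio above 1 means the upper half of the middle 50% is more stretched out
upper_gap / lower_gap
```

In this sample the upper gap is roughly 1.7 times the lower gap, consistent with right skew.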

Reproduce Figure 3.1: the income distribution with mean and median

mean_inc   <- mean(income$household_income)
median_inc <- median(income$household_income)

ggplot(income, aes(x = household_income)) +
  geom_histogram(binwidth = 10000, fill = moe_colors$teal, color = "white") +
  geom_vline(xintercept = median_inc, linewidth = 1,
             color = moe_colors$navy) +
  geom_vline(xintercept = mean_inc, linetype = "dashed", linewidth = 1,
             color = moe_colors$coral) +
  annotate("text", x = median_inc - 5000, y = Inf, label = "Median",
           hjust = 1, vjust = 2, fontface = "bold",
           color = moe_colors$navy, size = 3.5) +
  annotate("text", x = mean_inc + 5000, y = Inf, label = "Mean",
           hjust = 0, vjust = 2, fontface = "bold",
           color = moe_colors$coral, size = 3.5) +
  scale_x_continuous(labels = scales::label_dollar(scale = 1e-3, suffix = "K")) +
  labs(
    x = "Annual household income",
    y = "Number of households",
    title = "Distribution of U.S. Household Income (2022 ACS sample)",
    subtitle = "The mean (dashed) is pulled rightward by the high-income tail"
  ) +
  theme_moe()
Figure 19.1: U.S. household income distribution with the mean and median marked. The mean is pulled to the right by the long tail of high incomes.

Outlier detection by the IQR rule

The chapter introduces the standard 1.5 × IQR rule for flagging outliers.

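# quantile() returns a named vector: 0%, 25%, 50%, 75%, 100%
q   <- quantile(income$household_income)
Q1  <- q[["25%"]]  # first quartile, selected by name
Q3  <- q[["75%"]]  # third quartile, selected by name
iqr <- Q3 - Q1

# Fences sit 1.5 IQRs beyond the quartiles
lower_fence <- Q1 - 1.5 * iqr
upper_fence <- Q3 + 1.5 * iqr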

income |>
  summarize(
    Q1            = round(Q1, 0),
    Q3            = round(Q3, 0),
    upper_fence   = round(upper_fence, 0),
    n_above_upper = sum(household_income > upper_fence),
    pct_above     = round(100 * mean(household_income > upper_fence), 1)
  )
# A tibble: 1 × 5
     Q1     Q3 upper_fence n_above_upper pct_above
  <dbl>  <dbl>       <dbl>         <int>     <dbl>
1 38675 131000      269488            54       4.5

About 4.5% of households (54 of 1,200) are above the upper fence of roughly $269,500. They are not “errors”: they are real high-income households that the IQR rule flags because they sit far from the middle 50%. In income data, the upper-fence “outliers” are exactly the long tail that pulls the mean above the median.
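
The summary above reports only the upper fence. A quick sketch of the arithmetic for both fences, using the quartiles printed earlier instead of the CSV so it is self-contained, shows why the lower fence is not worth tabulating here.

```r
# Quartiles copied from the printed output
q1 <- 38675
q3 <- 131000

iqr <- q3 - q1                  # 92325, matching the IQR() result earlier
lower_fence <- q1 - 1.5 * iqr   # -99812.5
upper_fence <- q3 + 1.5 * iqr   # 269487.5

# The lower fence is negative, and the minimum income in the data is 0,
# so no household can fall below it
lower_fence < 0   # TRUE
```

For this dataset, then, the 1.5 × IQR rule can only ever flag households on the high side.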

Summarize by group: income across education levels

income |>
  group_by(education_level) |>
  summarize(
    n             = n(),
    median_income = median(household_income),
    mean_income   = mean(household_income),
    sd_income     = sd(household_income),
    .groups = "drop"
  ) |>
  arrange(desc(median_income))
# A tibble: 5 × 5
  education_level               n median_income mean_income sd_income
  <chr>                     <int>         <dbl>       <dbl>     <dbl>
1 Master or Professional       50        148500     201100    161767.
2 Doctorate                    50        131050     171922    145751.
3 Bachelor                     87        128800     159198.   141988.
4 Some College or Associate   649         78700     101258.    89996.
5 High School or Less         364         49400      60444.    54031.

The mean exceeds the median within every education tier, so the high-tail pattern is not confined to the overall distribution. This is the kind of finding that should make you skeptical of any analysis that reports “average income by education level” without also reporting the median.
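
To make the per-tier gap explicit, the sketch below rebuilds the table from the printed output and adds two columns. The means are typed in as printed, so they are rounded to the nearest dollar; with the data loaded, the same mutate() step could be appended to the grouped pipeline instead. The column names gap and mean_to_median are illustrative choices.

```r
library(tidyverse)

# Medians and means copied from the printed summary (means rounded as shown)
by_education <- tribble(
  ~education_level,            ~median_income, ~mean_income,
  "Master or Professional",    148500,         201100,
  "Doctorate",                 131050,         171922,
  "Bachelor",                  128800,         159198,
  "Some College or Associate",  78700,         101258,
  "High School or Less",        49400,          60444
)

# Absolute gap in dollars, plus the mean as a multiple of the median
gaps <- by_education |>
  mutate(
    gap            = mean_income - median_income,
    mean_to_median = round(mean_income / median_income, 2)
  )

gaps
```

Every gap is positive, and the mean runs somewhere between about 1.2 and 1.35 times the median depending on the tier.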

Try it yourself

  1. Region comparison. Group by region and compute median income for each. Which region has the highest median? Does the ranking match what you would expect from coastal-vs-inland income patterns?
  2. Custom percentiles. Use quantile(income$household_income, probs = c(0.05, 0.95)) to see the 5th and 95th percentiles. The ratio of these two numbers is sometimes used as a quick measure of inequality. What does it tell you about this sample?