R Companion — Chapter 2: Asking Good Questions

This walkthrough simulates the four sampling methods discussed in the chapter — simple random, stratified, cluster, and systematic — using a 200-student population where we already know the truth. Because we know the population mean, we can see which sampling methods recover it and which ones drift.

Setup

source("_common.R")
set.seed(42)

To run this locally, download _common.R, which loads tidyverse and the book’s plot theme, or replace the source() line with library(tidyverse).

The set.seed(42) call makes the random samples reproducible — change the seed and the samples shift, but the overall pattern across methods stays similar.
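As a small illustration of what the seed does, the sketch below draws five row numbers twice with the same seed and once with a different one. It uses base R's sample() rather than the chapter's slice_sample(), and the seed value 7 is arbitrary; the exact numbers drawn can also differ between R versions, so none are shown.

```r
# Re-seeding replays the same random number stream, so the draw repeats.
set.seed(42)
first_draw <- sample(1:200, 5)

set.seed(42)
second_draw <- sample(1:200, 5)

identical(first_draw, second_draw)  # TRUE: same seed, same sample

# A different (arbitrary) seed starts a different stream.
set.seed(7)
third_draw <- sample(1:200, 5)
```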

Load the population

population <- read_csv("../datasets/sampling-demo.csv")
glimpse(population)
Rows: 200
Columns: 7
$ student_id             <chr> "S001", "S002", "S003", "S004", "S005", "S006",…
$ major                  <chr> "Engineering", "Engineering", "Sciences", "Scie…
$ gpa                    <dbl> 3.57, 2.13, 3.05, 3.37, 2.79, 3.23, 2.79, 2.86,…
$ year                   <chr> "Senior", "Sophomore", "Sophomore", "Freshman",…
$ commute_distance_miles <dbl> 0.5, 2.7, 0.5, 0.8, 6.3, 2.5, 0.7, 7.1, 0.8, 5.…
$ housing                <chr> "On-campus", "On-campus", "On-campus", "Off-cam…
$ works_part_time        <chr> "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "…

200 students, with major, GPA, class year, commute distance, housing, and a part-time-work flag. The chapter treats this as a “small college” where we hypothetically know everything about every student. In real research we would not have this — the entire point of sampling is that we cannot afford the full census.

What the truth looks like

population |>
  summarize(
    n         = n(),
    mean_gpa  = round(mean(gpa), 3),
    sd_gpa    = round(sd(gpa), 3)
  )
# A tibble: 1 × 3
      n mean_gpa sd_gpa
  <int>    <dbl>  <dbl>
1   200     3.14  0.468
population |>
  count(major) |>
  mutate(pct = round(100 * n / sum(n), 1))
# A tibble: 5 × 3
  major            n   pct
  <chr>        <int> <dbl>
1 Business        38  19  
2 Engineering     53  26.5
3 Liberal Arts    37  18.5
4 Nursing         27  13.5
5 Sciences        45  22.5

The population mean GPA and the major breakdown are the targets. Each sampling method below tries to recover them.

Method 1: Simple random sample (SRS)

Every student has the same chance of selection (here 40 out of 200, or 20%), and every possible set of 40 students is equally likely.

srs <- population |> slice_sample(n = 40)

srs |>
  summarize(
    method   = "Simple Random",
    n        = n(),
    mean_gpa = round(mean(gpa), 3)
  )
# A tibble: 1 × 3
  method            n mean_gpa
  <chr>         <int>    <dbl>
1 Simple Random    40     3.23
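The "equal chance" property can be checked by brute force. The sketch below is a base-R illustration, not part of the chapter's script: it stands in row numbers 1 to 200 for the students, repeats a 40-out-of-200 draw many times, and tracks how often each row is picked. The replication count and seed are arbitrary choices.

```r
set.seed(1)
N    <- 200    # population size, as in the chapter
n    <- 40     # sample size, as in the chapter
reps <- 5000   # arbitrary number of repeated samples

hits <- integer(N)
for (i in seq_len(reps)) {
  picked <- sample.int(N, n)          # one SRS, without replacement
  hits[picked] <- hits[picked] + 1L   # count each selected row
}

inclusion_rate <- hits / reps
range(inclusion_rate)   # every row should sit close to n / N = 0.2
```

No row is favored: the rates scatter around 0.2 only because of simulation noise.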

Method 2: Stratified sample

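Sample equal numbers from each major. This guarantees every major is represented, but because the majors differ in size, equal allocation over-represents the small ones: Nursing is 13.5% of the population and 20% of this sample. The plain mean of the 40 sampled GPAs is therefore not guaranteed to be unbiased; weighting each major's mean by its population share corrects that.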

strat <- population |>
  group_by(major) |>
  slice_sample(n = 8) |>
  ungroup()

strat |>
  summarize(
    method   = "Stratified",
    n        = n(),
    mean_gpa = round(mean(gpa), 3)
  )
# A tibble: 1 × 3
  method         n mean_gpa
  <chr>      <int>    <dbl>
1 Stratified    40     3.08
strat |> count(major)
# A tibble: 5 × 2
  major            n
  <chr>        <int>
1 Business         8
2 Engineering      8
3 Liberal Arts     8
4 Nursing          8
5 Sciences         8

Each major contributes 8 students. With 5 majors, that is 40 students total — same sample size as SRS, but with guaranteed coverage.
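Because equal allocation samples small majors at a higher rate than large ones, the plain mean of the 40 GPAs and a size-weighted mean can differ. The sketch below shows the idea on a toy population: the major sizes are the ones from the count(major) table above, but the per-major GPA means and the spread are invented for illustration, so the numbers it prints say nothing about the real dataset.

```r
set.seed(1)

# Major sizes come from the count(major) table; GPA means are hypothetical.
sizes <- c(Business = 38, Engineering = 53, "Liberal Arts" = 37,
           Nursing = 27, Sciences = 45)
toy_means <- c(3.0, 3.4, 3.1, 2.8, 3.3)   # invented, one per major
toy <- data.frame(
  major = rep(names(sizes), times = sizes),
  gpa   = rnorm(sum(sizes), mean = rep(toy_means, times = sizes), sd = 0.3)
)

one_stratified_draw <- function() {
  by_major      <- split(toy$gpa, toy$major)
  stratum_means <- sapply(by_major, function(g) mean(sample(g, 8)))
  weights       <- sizes[names(stratum_means)] / sum(sizes)
  c(unweighted = mean(stratum_means),           # same as the mean of all 40
    weighted   = sum(weights * stratum_means))  # weights by population share
}

draws <- replicate(2000, one_stratified_draw())
rowMeans(draws)   # long-run average of each estimator
mean(toy$gpa)     # the toy population's true mean
```

In this toy setup the weighted estimator averages out to the true mean, while the unweighted one settles slightly away from it. How large the gap is in practice depends on how much the majors actually differ.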

Method 3: Cluster sample

Sample whole groups (here, whole majors) and include everyone in those groups.

selected_majors <- sample(unique(population$major), 2)
selected_majors
[1] "Business" "Sciences"
cluster <- population |> filter(major %in% selected_majors)

cluster |>
  summarize(
    method   = "Cluster",
    n        = n(),
    mean_gpa = round(mean(gpa), 3)
  )
# A tibble: 1 × 3
  method      n mean_gpa
  <chr>   <int>    <dbl>
1 Cluster    83     3.14

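The sample mean depends entirely on which majors got picked. If the two sampled majors happen to have unusual GPAs, the cluster estimate lands far from the population truth, which is exactly the risk the chapter warns about. In this run Business and Sciences happened to give 3.14, matching the population mean to two decimals, but that is luck. The sample size is not fixed either: these two majors total 83 students, more than twice the 40 used by the other methods.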
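With five majors there are only ten possible two-major cluster samples, so the whole sampling distribution can be listed rather than simulated. The sketch below does this on a toy population: the major sizes are taken from the count(major) table, while the per-major GPA means are invented, so only the sample sizes carry over to the real data.

```r
set.seed(1)

# Major sizes come from the count(major) table; GPA means are hypothetical.
sizes <- c(Business = 38, Engineering = 53, "Liberal Arts" = 37,
           Nursing = 27, Sciences = 45)
toy_means <- c(3.0, 3.4, 3.1, 2.8, 3.3)   # invented, one per major
toy <- data.frame(
  major = rep(names(sizes), times = sizes),
  gpa   = rnorm(sum(sizes), mean = rep(toy_means, times = sizes), sd = 0.3)
)

# Every possible pair of majors: choose(5, 2) = 10 cluster samples.
pairs <- combn(names(sizes), 2)
cluster_means <- apply(pairs, 2, function(p) mean(toy$gpa[toy$major %in% p]))
cluster_sizes <- apply(pairs, 2, function(p) sum(sizes[p]))

data.frame(
  majors   = apply(pairs, 2, paste, collapse = " + "),
  n        = cluster_sizes,
  mean_gpa = round(cluster_means, 2)
)
range(cluster_sizes)   # 64 (Liberal Arts + Nursing) to 98 (Engineering + Sciences)
```

Each row is one sample the cluster design could have produced, each with probability 1/10. The spread of the mean_gpa column is the estimator's variability, and the n column shows that the sample size changes from draw to draw.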

Method 4: Systematic sample

Pick a random starting point, then take every k-th student.

k     <- 5
start <- sample(1:k, 1)
sys   <- population |> slice(seq(start, n(), by = k))

sys |>
  summarize(
    method   = "Systematic",
    n        = n(),
    mean_gpa = round(mean(gpa), 3)
  )
# A tibble: 1 × 3
  method         n mean_gpa
  <chr>      <int>    <dbl>
1 Systematic    40     3.12

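With 200 students and k = 5, the sample has 40 students. Systematic sampling works fine here because the file order is unrelated to GPA. Ordering is not always harmful: if the file were sorted by GPA, or by major, taking every fifth row would spread the sample evenly across the sorted variable, which acts like an implicit stratification and usually makes the estimate more stable. The real hazard is hidden periodic ordering. If the list repeats in a cycle whose length matches k or divides it evenly, every selected row falls at the same point in the cycle, and the sample over-represents whatever sits at that position.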
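One ordering hazard worth seeing concretely is periodicity. The sketch below builds a hypothetical roster in which every fifth position (rows 1, 6, 11, ...) holds a student with a higher GPA, as might happen if a list were arranged in groups of five with a group leader first. The GPA values and the 0.5 bump are invented; the point is only how the five possible systematic samples behave.

```r
set.seed(1)
N <- 200
k <- 5

# Hypothetical roster: baseline GPAs, with positions 1, 6, 11, ... bumped up.
gpa <- rnorm(N, mean = 3.0, sd = 0.3)
bumped <- seq(1, N, by = k)
gpa[bumped] <- gpa[bumped] + 0.5

# The five possible systematic samples, one per starting point.
systematic_means <- sapply(1:k, function(start) mean(gpa[seq(start, N, by = k)]))
round(systematic_means, 2)   # start = 1 picks only the bumped rows
mean(gpa)                    # population mean of the toy roster

# Shuffling the roster removes the cycle, and the five samples look alike again.
shuffled <- sample(gpa)
shuffled_means <- sapply(1:k, function(start) mean(shuffled[seq(start, N, by = k)]))
round(shuffled_means, 2)
```

Averaged over the five starting points the estimator still hits the population mean, but any single sample is either far too high or somewhat too low, which is what makes periodic ordering dangerous.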

Comparison table

bind_rows(
  population |> summarize(method = "Population", n = n(),
                          mean_gpa = round(mean(gpa), 3)),
  srs        |> summarize(method = "Simple Random", n = n(),
                          mean_gpa = round(mean(gpa), 3)),
  strat      |> summarize(method = "Stratified", n = n(),
                          mean_gpa = round(mean(gpa), 3)),
  cluster    |> summarize(method = "Cluster", n = n(),
                          mean_gpa = round(mean(gpa), 3)),
  sys        |> summarize(method = "Systematic", n = n(),
                          mean_gpa = round(mean(gpa), 3))
)
# A tibble: 5 × 3
  method            n mean_gpa
  <chr>         <int>    <dbl>
1 Population      200     3.14
2 Simple Random    40     3.23
3 Stratified       40     3.08
4 Cluster          83     3.14
5 Systematic       40     3.12

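The point is not which method got closest in this single run; that is partly luck of the draw. The point is how each estimator behaves across repeated samples. The SRS mean is an unbiased estimator of the population mean. The stratified mean is unbiased only when strata are sampled proportionally or the stratum means are weighted by stratum size; the equal-allocation, unweighted version used here can be slightly off whenever majors differ in average GPA. Cluster sampling with only two of five majors can swing widely depending on which clusters get picked. Run this script with five different seeds and watch how the cluster estimate jumps around.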
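Trying five seeds by hand gives a feel for the variability; repeating the draw many times in a loop makes it measurable. The sketch below compares SRS and two-major cluster sampling on a toy population. The major sizes are from the count(major) table, but the per-major GPA means are invented, so the size of the gap between the two methods is illustrative only.

```r
set.seed(1)

# Major sizes come from the count(major) table; GPA means are hypothetical.
sizes <- c(Business = 38, Engineering = 53, "Liberal Arts" = 37,
           Nursing = 27, Sciences = 45)
toy_means <- c(3.0, 3.4, 3.1, 2.8, 3.3)   # invented, one per major
toy <- data.frame(
  major = rep(names(sizes), times = sizes),
  gpa   = rnorm(sum(sizes), mean = rep(toy_means, times = sizes), sd = 0.3)
)

one_run <- function() {
  srs_mean     <- mean(sample(toy$gpa, 40))          # SRS of 40 students
  picked       <- sample(names(sizes), 2)            # two whole majors
  cluster_mean <- mean(toy$gpa[toy$major %in% picked])
  c(srs = srs_mean, cluster = cluster_mean)
}

runs <- replicate(1000, one_run())
apply(runs, 1, sd)   # spread of each estimator across repeated samples
```

When majors differ in average GPA, as they do by construction here, the cluster estimator's spread is noticeably larger even though each cluster sample contains more students than the SRS.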

Sample sizes and stability

A short loop showing how sample size affects the variability of SRS estimates:

sample_sizes <- c(10, 25, 50, 100)

stability <- map_dfr(sample_sizes, function(size) {
  reps <- replicate(500, mean(slice_sample(population, n = size)$gpa))
  tibble(
    size = size,
    sd_of_means = round(sd(reps), 3),
    range_95    = paste0(round(quantile(reps, 0.025), 2),
                         " to ",
                         round(quantile(reps, 0.975), 2))
  )
})
stability
# A tibble: 4 × 3
   size sd_of_means range_95    
  <dbl>       <dbl> <chr>       
1    10       0.138 2.88 to 3.4 
2    25       0.086 2.99 to 3.31
3    50       0.06  3.02 to 3.25
4   100       0.034 3.07 to 3.2 

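Larger samples produce tighter sampling distributions. The chapter calls this out qualitatively; the simulation makes it visible. For sampling with replacement, or from a very large population, doubling the sample size shrinks the SD of the sample-mean distribution by a factor of √2 ≈ 1.41. Here the shrinkage is faster: slice_sample() draws without replacement from only 200 students, so the finite population correction matters. Going from 25 to 50 cuts the SD from 0.086 to 0.060 (a factor of about 1.4), and going from 50 to 100 cuts it from 0.060 to 0.034 (a factor of about 1.8), because a sample of 100 already covers half the population.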
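The simulated SDs can be compared with the standard formulas. For sampling without replacement from a finite population of size N, the standard error of the sample mean is S / sqrt(n) multiplied by the finite population correction sqrt(1 - n / N). The sketch below plugs in N = 200 and the population SD of 0.468 reported earlier (a rounded value, so the results are approximate) and sets them beside the simulated column from the table above.

```r
N <- 200      # population size
S <- 0.468    # population SD of GPA, as reported (rounded) in the summary above
n <- c(10, 25, 50, 100)

se_infinite <- S / sqrt(n)                   # ignores the finite population
se_finite   <- se_infinite * sqrt(1 - n / N) # with finite population correction

data.frame(
  size        = n,
  se_infinite = round(se_infinite, 3),   # 0.148 0.094 0.066 0.047
  se_finite   = round(se_finite, 3),     # 0.144 0.088 0.057 0.033
  simulated   = c(0.138, 0.086, 0.060, 0.034)  # copied from the stability table
)
```

The corrected column tracks the simulation closely, and the gap between the two formula columns widens as n approaches N, which is why the √2 rule understates the gain at the larger sample sizes.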

Try it yourself

  1. Re-run with different seeds. Change set.seed(42) to several other values. Which sampling method’s mean GPA estimate moves around the most across seeds? Why does that match the chapter’s claim about cluster sampling’s higher variability?
  2. Stratify by year instead of major. Replace group_by(major) with group_by(year) in the stratified sample. If the data have four class years, n = 8 per group gives only 32 students, so use n = 10 to keep the total at 40. Does the resulting mean GPA estimate get closer to the population mean than simple random sampling? Stratifying helps most when the variable you stratify on correlates with the outcome. Does year correlate with GPA in this dataset?