This walkthrough simulates the four sampling methods discussed in the chapter — simple random, stratified, cluster, and systematic — using a 200-student population where we already know the truth. Because we know the population mean, we can see which sampling methods recover it and which ones drift.
Setup
source("_common.R")
set.seed(42)
Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).
The set.seed(42) call makes the random samples reproducible — change the seed and the samples shift, but the overall pattern across methods stays similar.
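The reproducibility point can be checked directly. The snippet below is a minimal sketch in base R, separate from the walkthrough's own code; the variable names and the second seed (7) are illustrative assumptions.

```r
# Setting the same seed puts the random number generator in the same state,
# so the same call to sample() returns the same draw.
set.seed(42)
first_draw <- sample(1:200, 5)

set.seed(42)
second_draw <- sample(1:200, 5)

identical(first_draw, second_draw)  # TRUE

# A different seed (7 is an arbitrary choice) starts from a different state,
# so the draw will generally differ while being just as valid a sample.
set.seed(7)
third_draw <- sample(1:200, 5)
third_draw
```

Note that `set.seed()` fixes the whole sequence of random draws that follows it, so adding or reordering sampling steps later in a script changes every draw downstream.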
Load the population
population <- read_csv("../datasets/sampling-demo.csv")
glimpse(population)
200 students, with major, GPA, class year, commute distance, housing, and a part-time-work flag. The chapter treats this as a “small college” where we hypothetically know everything about every student. In real research we would not have this — the entire point of sampling is that we cannot afford the full census.
What the truth looks like
population |>
  summarize(
    n = n(),
    mean_gpa = round(mean(gpa), 3),
    sd_gpa = round(sd(gpa), 3),
    .groups = "drop"
  )
# A tibble: 1 × 3
      n mean_gpa sd_gpa
  <int>    <dbl>  <dbl>
1   200     3.14  0.468
population |>
  count(major) |>
  mutate(pct = round(100 * n / sum(n), 1))
# A tibble: 5 × 3
  major            n   pct
  <chr>        <int> <dbl>
1 Business        38  19
2 Engineering     53  26.5
3 Liberal Arts    37  18.5
4 Nursing         27  13.5
5 Sciences        45  22.5
The population mean GPA and the major breakdown are the targets. Each sampling method below tries to recover them.
# A tibble: 1 × 3
  method      n mean_gpa
  <chr>   <int>    <dbl>
1 Cluster    83     3.14
The sample mean depends entirely on which majors got picked. If the two sampled majors happen to have unusual GPAs, the cluster estimate is far from the population truth — exactly the risk the chapter warns about.
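To see how much the choice of clusters matters, one can list every pair of majors a two-cluster sample could select and compute the mean GPA each pair would give. The sketch below uses base R and a hypothetical stand-in population: the major sizes match the breakdown shown earlier, but the GPAs and the per-major means (`major_means`) are simulated assumptions, not the book's data.

```r
# Hypothetical stand-in population: major sizes match the breakdown above,
# but the GPAs are simulated, not read from sampling-demo.csv.
set.seed(1)
sizes <- c(Business = 38, Engineering = 53, "Liberal Arts" = 37,
           Nursing = 27, Sciences = 45)
# Assumed major-level mean GPAs, chosen only so that majors differ.
major_means <- c(Business = 3.0, Engineering = 3.3, "Liberal Arts" = 3.1,
                 Nursing = 3.4, Sciences = 2.9)
pop <- data.frame(major = rep(names(sizes), times = sizes))
pop$gpa <- rnorm(nrow(pop), mean = major_means[pop$major], sd = 0.4)

# Every possible pair of majors (10 pairs from 5 majors).
pairs <- combn(names(sizes), 2, simplify = FALSE)

# Sample size and mean GPA that each pair of clusters would produce.
cluster_means <- data.frame(
  majors   = sapply(pairs, paste, collapse = " + "),
  n        = sapply(pairs, function(p) sum(pop$major %in% p)),
  mean_gpa = sapply(pairs, function(p) round(mean(pop$gpa[pop$major %in% p]), 3))
)

cluster_means[order(cluster_means$mean_gpa), ]
round(mean(pop$gpa), 3)  # population mean, for comparison
```

Only one pair of majors adds up to 83 students (Business with 38 plus Sciences with 45), so that is the pair behind the cluster sample reported above. In the stand-in data the ten possible estimates spread out on both sides of the population mean, which is the dependence on "which majors got picked" in concrete form.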
Method 4: Systematic sample
Pick a random starting point, take every k-th student.
k <- 5
start <- sample(1:k, 1)
sys <- population |>
  slice(seq(start, n(), by = k))
sys |>
  summarize(
    method = "Systematic",
    n = n(),
    mean_gpa = round(mean(gpa), 3)
  )
# A tibble: 1 × 3
  method         n mean_gpa
  <chr>      <int>    <dbl>
1 Systematic    40     3.12
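With 200 students and k = 5, the sample has 40 students. It works fine here because the row order has no pattern tied to GPA. Watch out for hidden ordering, but be clear about which kind is the problem. A file sorted by GPA or by major is not: taking every 5th row from a sorted list spreads the sample evenly across the sort variable, which acts like implicit stratification and keeps each major near its population share. The real risk is a periodic pattern whose cycle length lines up with k. For example, if the roster listed students in groups of five with the highest GPA first in each group, every 5th row would land on the same position in every group and the estimate would be badly off.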
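The effect of an ordering that repeats with the same period as k can be simulated. This is a minimal base R sketch under stated assumptions: the roster is hypothetical, the GPAs are simulated, and the blocks-of-five arrangement is invented to create the periodic pattern.

```r
# Hypothetical roster of 200 students arranged in 40 blocks of five,
# with each block sorted from highest to lowest GPA (simulated values).
set.seed(1)
gpa <- round(pmin(4, pmax(0, rnorm(200, mean = 3.14, sd = 0.47))), 2)
blocks <- rep(1:40, each = 5)
roster <- unlist(lapply(split(gpa, blocks), sort, decreasing = TRUE),
                 use.names = FALSE)

k <- 5
# Mean GPA of the systematic sample for each of the k possible starting rows.
sys_means <- sapply(1:k, function(start) {
  mean(roster[seq(start, length(roster), by = k)])
})

round(sys_means, 3)     # start = 1 always picks the top student of each block
round(mean(roster), 3)  # population mean, for comparison
```

Starting at row 1 overestimates the mean and starting at row 5 underestimates it, because the sampling interval matches the period of the ordering. Shuffling the rows before sampling, or choosing a k that does not match the cycle, removes the problem.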
Comparison table
bind_rows(
  population |> summarize(method = "Population", n = n(), mean_gpa = round(mean(gpa), 3)),
  srs |> summarize(method = "Simple Random", n = n(), mean_gpa = round(mean(gpa), 3)),
  strat |> summarize(method = "Stratified", n = n(), mean_gpa = round(mean(gpa), 3)),
  cluster |> summarize(method = "Cluster", n = n(), mean_gpa = round(mean(gpa), 3)),
  sys |> summarize(method = "Systematic", n = n(), mean_gpa = round(mean(gpa), 3))
)
# A tibble: 5 × 3
  method            n mean_gpa
  <chr>         <int>    <dbl>
1 Population      200     3.14
2 Simple Random    40     3.23
3 Stratified       40     3.08
4 Cluster          83     3.14
5 Systematic       40     3.12
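The point is not which method got closest in this single run. That is partly luck of the draw: here the cluster sample happened to land on the population mean, while the simple random sample was the furthest off. The point is how each method behaves over many draws. SRS and stratified sampling center on the population mean (stratified exactly so when each stratum is weighted by its population share), while a cluster sample built from only two majors can swing widely depending on which clusters get picked. Run this script with five different seeds and watch how cluster's estimate jumps around.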
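The "many draws" comparison can be sketched without re-running the whole script by hand. The code below is an illustration in base R, not the walkthrough's own code: the population is a simulated stand-in with the same major sizes, and the 8-per-major allocation, the share-weighted stratified mean, and the 500 repetitions are assumptions made for the sketch.

```r
# Hypothetical stand-in population (simulated GPAs, real major sizes).
set.seed(1)
sizes <- c(Business = 38, Engineering = 53, "Liberal Arts" = 37,
           Nursing = 27, Sciences = 45)
major_means <- c(Business = 3.0, Engineering = 3.3, "Liberal Arts" = 3.1,
                 Nursing = 3.4, Sciences = 2.9)  # assumed, for illustration
pop <- data.frame(major = rep(names(sizes), times = sizes))
pop$gpa <- rnorm(nrow(pop), mean = major_means[pop$major], sd = 0.4)

one_run <- function() {
  # Simple random sample of 40 students.
  srs <- mean(sample(pop$gpa, 40))
  # Stratified: 8 students per major (assumed allocation), with each
  # stratum mean weighted by that major's share of the population.
  stratum_means <- sapply(names(sizes), function(m) {
    mean(sample(pop$gpa[pop$major == m], 8))
  })
  strat <- sum(stratum_means * sizes / sum(sizes))
  # Cluster: two whole majors chosen at random, every student included.
  picked <- sample(names(sizes), 2)
  cluster <- mean(pop$gpa[pop$major %in% picked])
  c(srs = srs, stratified = strat, cluster = cluster)
}

# Repeat the three sampling schemes 500 times each.
runs <- t(replicate(500, one_run()))

round(apply(runs, 2, sd), 3)  # spread of each method's estimate across draws
round(colMeans(runs), 3)      # average estimate of each method
round(mean(pop$gpa), 3)       # population mean, for comparison
```

In this setup the cluster estimate shows the largest spread even though it uses more students per sample, because its value is driven by which two majors are drawn rather than by how many students are measured.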
Sample sizes and stability
A short loop showing how sample size affects the variability of SRS estimates:
# A tibble: 4 × 3
   size sd_of_means range_95
  <dbl>       <dbl> <chr>
1    10       0.138 2.88 to 3.4
2    25       0.086 2.99 to 3.31
3    50       0.06  3.02 to 3.25
4   100       0.034 3.07 to 3.2
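Larger samples produce tighter sampling distributions. The chapter calls this out qualitatively; the simulation makes it visible. Doubling the sample size shrinks the SD of the sample-mean distribution by a factor of roughly √2 ≈ 1.41, the textbook result: going from 25 to 50 students takes it from 0.086 to 0.06. The drop from 50 to 100 is larger (0.06 to 0.034, a factor of about 1.8). That is consistent with sampling without replacement from a small population: a sample of 100 is half of the 200 students, and the finite population correction tightens the estimate further as the sample approaches a census.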
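The relationship between sample size and spread can be checked against the standard error formula for sampling without replacement, SE = s / √n × √((N − n) / N), where s is the population SD computed with the usual n − 1 denominator. The sketch below is base R on simulated stand-in GPAs; the 2,000 repetitions and the variable names are assumptions, and its numbers will not match the table above exactly.

```r
# Hypothetical stand-in for the 200 GPAs (simulated, not the book's data).
set.seed(1)
gpa <- rnorm(200, mean = 3.14, sd = 0.47)
N <- length(gpa)
sample_sizes <- c(10, 25, 50, 100)

# Empirical SD of the sample mean over 2,000 simple random samples per size.
# sample() draws without replacement by default.
sd_of_means <- sapply(sample_sizes, function(n) {
  sd(replicate(2000, mean(sample(gpa, n))))
})

# Standard error from the formula, including the finite population correction.
theory <- sd(gpa) / sqrt(sample_sizes) * sqrt((N - sample_sizes) / N)

data.frame(size      = sample_sizes,
           simulated = round(sd_of_means, 3),
           theory    = round(theory, 3))
```

Without the correction term the formula gives the plain s / √n rule, which is where the √2 factor per doubling comes from. The correction matters only when the sample is a large fraction of the population, as with 100 out of 200 here.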
Try it yourself
Re-run with different seeds. Change set.seed(42) to several other values. Which sampling method’s mean GPA estimate moves around the most across seeds? Why does that match the chapter’s claim about cluster sampling’s higher variability?
Stratify by year instead of major. Replace group_by(major) with group_by(year) in the stratified sample. Does the resulting mean GPA estimate get closer to the population mean than simple random sampling? Stratifying helps most when the variable you stratify on correlates with the outcome — does year correlate with GPA in this dataset?
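One way to judge whether a variable is worth stratifying on is to ask how much of the outcome's variance it explains. The sketch below is a hedged illustration in base R on simulated data, so it does not answer the exercise for the real dataset: the `year` labels and the `major_shift` values are hypothetical, and GPA is deliberately simulated to depend on major but not on year.

```r
# Hypothetical stand-in data: GPA depends on major but not on year.
set.seed(1)
pop <- data.frame(
  major = sample(c("Business", "Engineering", "Liberal Arts",
                   "Nursing", "Sciences"), 200, replace = TRUE),
  year  = sample(c("First-year", "Sophomore", "Junior", "Senior"),
                 200, replace = TRUE)
)
# Assumed GPA offsets per major, for illustration only.
major_shift <- c(Business = -0.3, Engineering = 0.3, "Liberal Arts" = 0,
                 Nursing = 0.4, Sciences = -0.4)
pop$gpa <- 3.1 + unname(major_shift[pop$major]) + rnorm(200, sd = 0.3)

# Share of GPA variance explained by each candidate stratifying variable
# (R-squared from a one-way linear model).
r2 <- c(
  major = summary(lm(gpa ~ major, data = pop))$r.squared,
  year  = summary(lm(gpa ~ year,  data = pop))$r.squared
)
round(r2, 3)
```

A variable with an R-squared near zero leaves the strata looking like random subsets of the population, so stratifying on it gains little over simple random sampling. Running the same two `lm()` calls on the real `population` data is one way to approach the question in the exercise.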