R Companion — Chapter 6

R Companion — Chapter 6: The Normal Distribution and CLT

This walkthrough covers R’s tools for the normal distribution and uses the births14 data to anchor it: real birth-weight data, with a familiar bell shape, a slight left skew from preterm births, and a chance to see the Central Limit Theorem in action on real measurements.

Setup

source("_common.R")
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tibble' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'dplyr' was built under R version 4.5.2
Warning: package 'stringr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).

R’s four-function family for any distribution

For the normal distribution:

# d: density at a point
dnorm(0)            # height of standard normal at z = 0
[1] 0.3989423
# p: cumulative probability up to a point
pnorm(1.96)         # P(Z < 1.96), the famous ~97.5%
[1] 0.9750021
# q: quantile (the inverse of p)
qnorm(0.975)        # the z-score that puts 97.5% below it
[1] 1.959964
# r: random draws
set.seed(42)
rnorm(5, mean = 0, sd = 1)
[1]  1.3709584 -0.5646982  0.3631284  0.6328626  0.4042683

The same letters apply to every distribution: dt/pt/qt/rt for t, dbinom/pbinom/etc for binomial, and so on.

Plot the standard normal curve

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(fun = dnorm, color = moe_colors$navy, linewidth = 1) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed",
             color = moe_colors$coral, alpha = 0.6) +
  geom_vline(xintercept = c(-2, 2), linetype = "dashed",
             color = moe_colors$amber, alpha = 0.6) +
  labs(
    x = "z",
    y = "Density",
    title = "Standard Normal Distribution",
    subtitle = "Coral: ±1 SD (~68% of area). Amber: ±2 SD (~95% of area)."
  ) +
  theme_moe()
Figure 22.1: The standard normal distribution N(0, 1). About 68% of the area lies within ±1 SD; about 95% within ±2 SD.
pnorm(1) - pnorm(-1)    # ~0.683
[1] 0.6826895
pnorm(2) - pnorm(-2)    # ~0.954
[1] 0.9544997
pnorm(3) - pnorm(-3)    # ~0.997
[1] 0.9973002

The 68-95-99.7 rule the chapter cites comes from these three calls.

Birth weights: a real-data example

births <- read_csv("../datasets/birth-weights.csv")
glimpse(births)
Rows: 944
Columns: 6
$ baby_id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ birth_weight_g    <dbl> 3157, 4019, 3406, 3062, 3035, 2781, 3057, 4055, 4137…
$ gestational_weeks <dbl> 37, 41, 37, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39, …
$ mother_age        <dbl> 34, 31, 36, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31, …
$ prenatal_visits   <dbl> 14, 12, 10, 12, 14, 10, 13, 15, 11, 14, 16, 20, 15, …
$ smoking_status    <chr> "No", "No", "No", "No", "No", "No", "No", "No", "No"…

944 babies. Birth weight in grams, gestational age, mother’s age, prenatal visits, smoking status. The full sample includes preterm births, which is why the distribution will have a slight left skew (preterm babies pull the lower tail).

Distribution of birth weights

births |>
  summarize(
    n             = n(),
    mean_weight_g = mean(birth_weight_g),
    sd_weight_g   = sd(birth_weight_g),
    median        = median(birth_weight_g)
  )
# A tibble: 1 × 4
      n mean_weight_g sd_weight_g median
  <int>         <dbl>       <dbl>  <dbl>
1   944         3266.        593.   3316
ggplot(births, aes(x = birth_weight_g)) +
  geom_histogram(binwidth = 200, fill = moe_colors$teal, color = "white") +
  labs(x = "Birth weight (g)", y = "Number of babies",
       title = "Distribution of Birth Weights") +
  theme_moe()
Figure 22.2: Birth weight distribution. Mostly bell-shaped, with a slight left skew from preterm babies.

QQ plot — visual check for normality

ggplot(births, aes(sample = birth_weight_g)) +
  stat_qq(color = moe_colors$navy, alpha = 0.5) +
  stat_qq_line(color = moe_colors$coral, linewidth = 1) +
  labs(x = "Theoretical quantiles (normal)",
       y = "Sample quantiles (birth weight, g)",
       title = "QQ Plot: Birth Weight") +
  theme_moe()
Figure 22.3: QQ plot of birth weights. Points hug the line in the middle and bend off at the lower end — the signature of left-skew.

The deviation at the lower tail is the preterm signature. Among full-term babies only (gestational_weeks >= 37), the distribution is much more symmetric:

births |>
  filter(gestational_weeks >= 37) |>
  ggplot(aes(sample = birth_weight_g)) +
  stat_qq(color = moe_colors$navy, alpha = 0.5) +
  stat_qq_line(color = moe_colors$coral, linewidth = 1) +
  labs(x = "Theoretical quantiles", y = "Sample quantiles",
       title = "QQ Plot: Full-term Babies Only") +
  theme_moe()
Figure 22.4: QQ plot restricted to full-term babies. The lower-tail deviation mostly disappears.

Shapiro-Wilk normality test

A formal test of normality. With a sample of 944 it is overpowered — small departures from normality become “significant” — but it is useful as a check on the QQ plot’s eye test.

shapiro.test(sample(births$birth_weight_g, 500))

    Shapiro-Wilk normality test

data:  sample(births$birth_weight_g, 500)
W = 0.96, p-value = 2.074e-10

(Shapiro-Wilk is limited to 5000 observations; we sample 500 for stability.)

Z-scores for individual babies

mean_w <- mean(births$birth_weight_g)
sd_w   <- sd(births$birth_weight_g)

# A 2,500g baby — common low-birth-weight cutoff
z_low <- (2500 - mean_w) / sd_w
z_low
[1] -1.291622
# What proportion of babies are below 2,500g?
mean(births$birth_weight_g < 2500)
[1] 0.08474576
# What does the normal approximation predict?
pnorm(2500, mean = mean_w, sd = sd_w)
[1] 0.09824408

The empirical proportion (real data) and the normal-approximation proportion (pnorm()) are close but not identical. The gap is the preterm-driven left skew the QQ plot showed.

The Central Limit Theorem in action

Birth weights are roughly normal already. To make the CLT more striking, simulate from an exponential population — heavily right-skewed — and watch sample means become bell-shaped.

set.seed(42)

sample_means <- replicate(1000, mean(rexp(30, rate = 0.1)))

ggplot(tibble(mean_x = sample_means), aes(x = mean_x)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 30, fill = moe_colors$teal, color = "white") +
  stat_function(fun = dnorm,
                args = list(mean = mean(sample_means),
                            sd   = sd(sample_means)),
                color = moe_colors$coral, linewidth = 1) +
  labs(x = "Sample mean", y = "Density",
       title = "CLT: Sample Means from an Exponential Population (n = 30)",
       subtitle = "The source distribution is heavily right-skewed; the sample-mean distribution is nearly normal") +
  theme_moe()
Figure 22.5: CLT in action: 1,000 sample means (n = 30) drawn from an exponential population. The sample-mean distribution is approximately normal even though the source is not.

That bell shape is what makes inference possible — the CLT is what lets us use t-tests, confidence intervals, and z-based methods even when the underlying data is not normal.

Try it yourself

  1. CLT with smaller n. Re-run the simulation with rexp(5, ...) instead of rexp(30, ...). Is the sample-mean distribution still approximately normal? At what sample size does the bell shape become visible?
  2. Smoking and birth weight. Compare birth_weight_g between smoking_status groups using group_by() and summarize(). Then plot the two distributions overlaid. Is the gap big enough to see in a histogram, or do you need a more formal test (Chapter 8 will cover that)?