R Companion — Chapter 8

R Companion — Chapter 8: Hypothesis Testing

This walkthrough runs the chapter’s hypothesis tests on the energy-reports dataset — a randomized experiment where treatment households received a home energy report comparing them to neighbors. The chapter discusses the Opower-style intervention; the data is calibrated to reproduce that pattern.

Setup

source("_common.R")

Warning: package 'tidyverse' was built under R version 4.5.2

Warning: package 'ggplot2' was built under R version 4.5.2

Warning: package 'tibble' was built under R version 4.5.2

Warning: package 'tidyr' was built under R version 4.5.2

Warning: package 'readr' was built under R version 4.5.2

Warning: package 'purrr' was built under R version 4.5.2

Warning: package 'dplyr' was built under R version 4.5.2

Warning: package 'stringr' was built under R version 4.5.2

Warning: package 'forcats' was built under R version 4.5.2

Warning: package 'lubridate' was built under R version 4.5.2

Download _common.R to run this locally — it loads tidyverse and the book’s plot theme. Or replace the source() line with library(tidyverse).

Load the energy-reports data

energy <- read_csv("../datasets/energy-reports.csv")
glimpse(energy)

Rows: 200
Columns: 7
$ household_id      <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ group             <chr> "Treatment", "Treatment", "Treatment", "Treatment", …
$ energy_kwh_before <dbl> 1370.3, 1403.3, 1529.0, 1240.0, 1483.5, 1312.2, 1621…
$ energy_kwh_after  <dbl> 1487.8, 1452.2, 1503.2, 1228.9, 1437.8, 1297.6, 1523…
$ home_size_sqft    <dbl> 1810, 2260, 2520, 1690, 1430, 1950, 1910, 1190, 1730…
$ num_occupants     <dbl> 4, 2, 2, 2, 3, 3, 2, 3, 3, 3, 2, 4, 3, 3, 3, 2, 4, 3…
$ region            <chr> "Northeast", "Southeast", "Midwest", "Southwest", "W…

200 households, half assigned to treatment (received the energy report), half to control. Pre-intervention and post-intervention monthly kWh usage for each. Random assignment was performed at the household level.

Verify the randomization worked

Before testing the treatment effect, check that the groups are balanced on observables. Random assignment makes them balanced in expectation, but small-sample randomness can still produce imbalance.

energy |>
  group_by(group) |>
  summarize(
    n            = n(),
    mean_before  = mean(energy_kwh_before),
    mean_size    = mean(home_size_sqft),
    mean_occ     = mean(num_occupants),
    .groups      = "drop"
  )

# A tibble: 2 × 5
  group         n mean_before mean_size mean_occ
  <chr>     <int>       <dbl>     <dbl>    <dbl>
1 Control     100       1442.     1746.     2.89
2 Treatment   100       1415.     1726.     2.72

t.test(energy_kwh_before ~ group, data = energy)$p.value

[1] 0.4345041

Pre-intervention means should be similar between treatment and control (large p-value). If they were not, randomization would have failed and any post-intervention difference might just reflect the imbalance rather than the treatment.

The change in usage (post − pre)

energy <- energy |>
  mutate(change = energy_kwh_after - energy_kwh_before)

energy |>
  group_by(group) |>
  summarize(
    mean_change = round(mean(change), 1),
    sd_change   = round(sd(change), 1),
    .groups     = "drop"
  )

# A tibble: 2 × 3
  group     mean_change sd_change
  <chr>           <dbl>     <dbl>
1 Control          12.4        51
2 Treatment       -23.2        50

Treatment households should show a small negative change (reduced usage); control households show no systematic change. The chapter cites a roughly 2% reduction as the Opower-style headline result.

Two-sample t-test on the change

ttest_change <- t.test(change ~ group, data = energy)
ttest_change


    Welch Two Sample t-test

data:  change by group
t = 4.9976, df = 197.92, p-value = 1.274e-06
alternative hypothesis: true difference in means between group Control and group Treatment is not equal to 0
95 percent confidence interval:
 21.59835 49.75365
sample estimates:
  mean in group Control mean in group Treatment 
                 12.445                 -23.231

The output tells you:

t statistic — how many standard errors apart the two group means are
df — degrees of freedom (Welch’s, allowing unequal variances)
p-value — under the null of zero treatment effect, the probability of seeing a difference at least this large
95% CI — the plausible range for the true effect

If the p-value is below your threshold (typically 0.05) and the CI excludes zero, the treatment had a statistically detectable effect.

Paired t-test (within-household change)

When each unit has a before and after measurement, a paired test is more powerful than the two-sample test on changes (which is mathematically equivalent here, but the paired framing is closer to how the experiment was conducted).

# Treatment-group only, paired before/after
treatment <- energy |> filter(group == "Treatment")

t.test(treatment$energy_kwh_after, treatment$energy_kwh_before,
       paired = TRUE)


    Paired t-test

data:  treatment$energy_kwh_after and treatment$energy_kwh_before
t = -4.6486, df = 99, p-value = 1.033e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
 -33.14703 -13.31497
sample estimates:
mean difference 
        -23.231

Effect size: Cohen’s d

The chapter argues that p-values measure detectability, not importance. Cohen’s d translates a difference in means into “how many standard deviations apart” the groups are.

mean_t <- mean(energy$change[energy$group == "Treatment"])
mean_c <- mean(energy$change[energy$group == "Control"])
sd_t   <- sd(energy$change[energy$group == "Treatment"])
sd_c   <- sd(energy$change[energy$group == "Control"])

# Pooled SD
n_t <- sum(energy$group == "Treatment")
n_c <- sum(energy$group == "Control")
pooled_sd <- sqrt(((n_t - 1) * sd_t^2 + (n_c - 1) * sd_c^2) / (n_t + n_c - 2))

cohens_d <- (mean_t - mean_c) / pooled_sd
round(cohens_d, 3)

[1] -0.707

Conventional benchmarks: ~0.2 = small, ~0.5 = medium, ~0.8 = large. Energy-report effects are typically small in these terms — but small effects across millions of households add up.

P-hacking simulation

If you run 20 unrelated tests at α = 0.05, you expect about one to come up “significant” by chance alone.

set.seed(42)
n_obs   <- 200
n_tests <- 20

outcome   <- sample(c("A", "B"), n_obs, replace = TRUE)
random_xs <- map(1:n_tests, ~ rnorm(n_obs))

p_values <- map_dbl(random_xs, ~ t.test(.x ~ outcome)$p.value)

tibble(
  test_num    = 1:n_tests,
  p_value     = p_values,
  significant = p_value < 0.05
) |>
  ggplot(aes(x = factor(test_num), y = p_value, fill = significant)) +
  geom_col(width = 0.7) +
  geom_hline(yintercept = 0.05, linetype = "dashed",
             color = moe_colors$coral) +
  scale_fill_manual(values = c("TRUE"  = moe_colors$coral,
                               "FALSE" = moe_colors$teal),
                    labels = c("TRUE" = "p < 0.05 (false positive)",
                               "FALSE" = "Not significant")) +
  labs(x = "Test number", y = "p-value", fill = NULL,
       title = "20 tests on random unrelated data") +
  theme_moe()

mean(p_values < 0.05)

[1] 0.1

Figure 24.1: Twenty t-tests on random unrelated data. About one in twenty crosses the α = 0.05 threshold by chance — exactly what the multiple-testing problem warns about.

Across many runs of this simulation, the false-positive rate hovers around 5% — exactly the α you set. The danger is when researchers run 20 tests and report only the “significant” ones, claiming a real finding from what is likely just one of these false positives.

Try it yourself

Test by region. Split the energy data by region and run the t-test on change within each region. Does the treatment effect look uniform across regions, or stronger in some?
Bonferroni correction. Re-examine the 20-test simulation: if you applied a Bonferroni correction (compare each p-value to 0.05 / 20 = 0.0025 instead of 0.05), how many of the “significant” results survive?