Appendix C: Glossary of Key Terms

Adjusted R-squared. A modified version of R-squared that adjusts for the number of predictors in a regression model, penalizing unnecessary complexity. Unlike R-squared, adjusted R-squared can decrease when a new predictor does not improve the model enough to justify the added complexity.

Alternative hypothesis (\(H_a\) or \(H_1\)). The claim that there is an effect, a difference, or a relationship. The hypothesis the researcher typically hopes to find evidence for.

ANOVA (Analysis of Variance). A statistical method for comparing means across three or more groups using the F-statistic. Tests whether at least one group mean differs from the others.
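
As a brief sketch, SciPy's `f_oneway` computes the one-way ANOVA F-statistic; the three groups below are hypothetical measurements.

```python
# One-way ANOVA with SciPy; the group values are hypothetical.
from scipy import stats

group_a = [4.1, 5.0, 4.8, 5.3]
group_b = [5.9, 6.2, 5.5, 6.0]
group_c = [4.9, 5.1, 5.4, 5.0]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```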

Bar chart. A visualization for categorical data where the height (or length) of each bar represents the count or proportion in each category.

Bayes’ theorem. A formula for updating the probability of an event based on new evidence. Relates conditional probabilities: \(P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}\).
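
A minimal worked example in plain Python; the disease-screening rates are hypothetical.

```python
# Posterior probability via Bayes' theorem; all rates are hypothetical.
p_disease = 0.01            # P(A): prior (base rate)
p_pos_given_disease = 0.95  # P(B|A): test sensitivity
p_pos_given_healthy = 0.05  # false positive rate

# P(B): total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # ~0.161: surprisingly low posterior
```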

Bias. Systematic error in sampling or measurement that causes results to deviate from the truth in a consistent direction.

Binomial distribution. The probability distribution for the number of successes in a fixed number of independent trials, each with the same probability of success.
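
A short sketch using SciPy's `binom`; n = 10 trials and p = 0.5 are illustrative choices.

```python
# Binomial probabilities with SciPy; n and p are illustrative.
from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(7, n, p))   # P(exactly 7 successes in 10 trials)
print(binom.cdf(7, n, p))   # P(7 or fewer successes)
```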

Bonferroni correction. A method for adjusting significance levels when performing multiple hypothesis tests simultaneously. The corrected significance level is alpha divided by the number of tests, reducing the risk of Type I errors from multiple comparisons.
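
The arithmetic is simple enough to show directly; the number of tests below is hypothetical.

```python
# Bonferroni correction: divide alpha by the number of tests.
alpha = 0.05
m = 6                     # hypothetical number of simultaneous tests
alpha_corrected = alpha / m
print(alpha_corrected)    # ~0.0083; each individual test is judged at this level
```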

Box plot. A visualization showing the five-number summary (minimum, Q1, median, Q3, maximum) of a distribution, with outliers plotted individually.

Categorical variable. A variable whose values represent group membership rather than measured quantities. Includes nominal and ordinal types.

Central Limit Theorem (CLT). The result that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution.

Chi-square test. A hypothesis test for categorical data. The goodness-of-fit test compares observed frequencies to expected frequencies. The test of independence assesses whether two categorical variables are associated.
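
A sketch of the test of independence with SciPy; the contingency table is hypothetical.

```python
# Chi-square test of independence on a hypothetical 2x3 contingency table.
from scipy.stats import chi2_contingency

observed = [[20, 30, 25],
            [30, 20, 25]]
chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```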

Coefficient of determination (\(R^2\)). The proportion of variability in the response variable that is explained by the regression model. Ranges from 0 to 1.

Cohen’s d. An effect size measure for the difference between two group means, expressed in standard deviation units. Conventional benchmarks: small = 0.2, medium = 0.5, large = 0.8.
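
A minimal computation from two samples using the pooled standard deviation; the data are hypothetical.

```python
# Cohen's d via the pooled standard deviation (samples are hypothetical).
import numpy as np

x = np.array([5.1, 4.9, 5.6, 5.2, 4.8])
y = np.array([5.9, 6.1, 5.7, 6.3, 6.0])

nx, ny = len(x), len(y)
pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                    / (nx + ny - 2))
d = (y.mean() - x.mean()) / pooled_sd
print(round(d, 2))   # difference in means, in pooled-SD units
```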

Conditional probability. The probability of an event occurring given that another event has already occurred. Written \(P(A|B)\).

Confidence interval. A range of plausible values for a population parameter, constructed from a sample. A 95% confidence interval means that 95% of intervals constructed this way would contain the true parameter.
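
A sketch of a t-based 95% interval for a mean using SciPy; the sample values are hypothetical.

```python
# 95% t-based confidence interval for a mean (sample values hypothetical).
import numpy as np
from scipy import stats

sample = np.array([12.1, 11.8, 12.6, 12.0, 11.5, 12.3])
mean = sample.mean()
se = stats.sem(sample)   # standard error of the mean
ci = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=se)
print(ci)                # (lower bound, upper bound)
```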

Confidence level. The long-run proportion of confidence intervals that capture the true population parameter when the sampling and construction process is repeated. Common levels are 90%, 95%, and 99%.

Confounding variable. A variable that influences both the explanatory variable and the response variable, creating a misleading association between them.

Continuous variable. A numerical variable that can take any value within a range, including decimals.

Correlation coefficient (Pearson’s r). A measure of the strength and direction of the linear relationship between two numerical variables. Ranges from -1 to +1.
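
SciPy's `pearsonr` returns both r and a p-value; the paired values below are illustrative.

```python
# Pearson correlation with SciPy (paired values are illustrative).
from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.0, 9.8]
r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```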

Cramér’s V. A measure of the strength of association between two categorical variables, based on the chi-square statistic. It ranges from 0 (no association) to 1 (perfect association) and is appropriate for tables larger than 2×2.

Critical value. The threshold value of a test statistic that determines the boundary of the rejection region.

Degrees of freedom (df). A parameter that depends on the sample size and determines the shape of certain distributions (t, chi-square, F). Generally related to the number of independent pieces of information in the data.

Density plot. A smoothed version of a histogram that estimates the probability density function of a continuous variable.

Dependent variable. See response variable.

Discrete variable. A numerical variable that takes on countable values, usually integers.

Dummy variable. A binary (0/1) variable used to represent a category in regression. A categorical variable with \(k\) categories requires \(k-1\) dummy variables.

Effect size. A quantitative measure of the magnitude of a phenomenon. Unlike p-values, which shrink as sample size grows for any real effect, effect sizes describe magnitude on a scale that does not depend on sample size.

Empirical Rule (68-95-99.7 rule). For a normal distribution, approximately 68% of observations fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3.
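
The percentages can be verified directly from the standard normal CDF:

```python
# Checking the 68-95-99.7 rule against the standard normal CDF.
from scipy.stats import norm

for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} SD: {coverage:.4f}")   # 0.6827, 0.9545, 0.9973
```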

Eta-squared (\(\eta^2\)). An effect size measure for ANOVA, representing the proportion of total variance explained by the grouping variable.

Expected value. The long-run average of a random variable. For a discrete random variable, \(E(X) = \sum x_i \cdot P(x_i)\).
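
For example, the expected value of a fair six-sided die:

```python
# E(X) = sum of x * P(x) for a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6
expected = sum(x * p for x, p in zip(values, probs))
print(expected)   # 3.5
```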

Experiment. A study in which the researcher deliberately assigns treatments to subjects and compares outcomes.

Explanatory variable. A variable used to explain or predict changes in the response variable. Also called independent variable or predictor.

Extrapolation. Using a regression model to predict outcomes for values of the explanatory variable outside the range of the data used to build the model. Generally unreliable.

F-statistic. The test statistic for ANOVA, computed as the ratio of between-group variance to within-group variance.

F-test. A statistical test that compares the variability between group means to the variability within groups. Used in ANOVA to determine whether at least one group mean differs significantly from the others.

Five-number summary. The minimum, first quartile (Q1), median, third quartile (Q3), and maximum of a dataset.

Frequency. The count of observations falling into a particular category or interval.

Histogram. A visualization for the distribution of a numerical variable, where bars represent the frequency of observations in each interval (bin).

Hypothesis test. A formal procedure for using sample data to decide between two competing claims (hypotheses) about a population parameter.

Independence. Two events are independent if the occurrence of one does not affect the probability of the other. \(P(A \cap B) = P(A) \cdot P(B)\).

Independent variable. See explanatory variable.

Interaction effect. A condition in regression where the effect of one predictor on the response depends on the value of another predictor. Modeled by including the product of the two predictors as an additional term.

Intercept (\(b_0\)). In regression, the predicted value of the response variable when all explanatory variables equal zero.

Interquartile range (IQR). The difference between the third quartile and the first quartile: \(IQR = Q3 - Q1\). A resistant measure of spread.

Kruskal-Wallis test. A nonparametric alternative to one-way ANOVA that compares three or more groups using ranks rather than raw values. Does not assume normality.

Least squares. The method for fitting a regression line by minimizing the sum of the squared residuals. Also called ordinary least squares (OLS).
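
A sketch with NumPy's `polyfit`, which minimizes the sum of squared residuals; the x and y values are illustrative.

```python
# Fitting a least-squares line with NumPy (x, y values illustrative).
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 4.1, 5.9, 8.3, 9.8])

slope, intercept = np.polyfit(x, y, deg=1)   # minimizes sum of squared residuals
residuals = y - (intercept + slope * x)
print(slope, intercept, np.sum(residuals**2))
```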

Lurking variable. See confounding variable.

Margin of error. The maximum expected difference between a sample statistic and the population parameter, given a specified confidence level. Equal to the critical value multiplied by the standard error.
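
A sketch of the computation for a mean at 95% confidence; the sample size and standard deviation are hypothetical.

```python
# Margin of error = critical value * standard error (n and s hypothetical).
import math
from scipy.stats import t

n, s = 25, 4.0
se = s / math.sqrt(n)
t_crit = t.ppf(0.975, df=n - 1)   # two-sided 95% critical value
print(t_crit * se)                # half-width of the 95% confidence interval
```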

Mean. The arithmetic average of a set of values. The sum divided by the count.

Median. The middle value when observations are arranged in order. Resistant to outliers.

Mediator variable. A variable that lies on the causal pathway between an independent variable and a dependent variable, helping to explain how or why the effect occurs. Contrasted with a confounding variable, which creates a spurious association.

Missing data. Values absent from a dataset. The pattern and mechanism of missingness affect how (and whether) the data can be analyzed validly.

Mode. The most frequently occurring value in a dataset.

Multicollinearity. A condition in multiple regression where two or more predictor variables are highly correlated, making it difficult to isolate the individual effect of each predictor.

Multiple regression. A regression model with two or more explanatory variables.

Nominal variable. A categorical variable with no natural ordering among categories.

Nonparametric test. A hypothesis test that does not assume the data come from a specific distribution. Rank-based tests (Wilcoxon signed-rank, Wilcoxon rank-sum / Mann-Whitney U, Kruskal-Wallis) are common nonparametric alternatives to t-tests and ANOVA.
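
The rank-based tests named above are all available in SciPy; the samples below are hypothetical, and a and c are treated as paired for the signed-rank test.

```python
# Common rank-based tests in SciPy (samples are hypothetical).
from scipy.stats import mannwhitneyu, wilcoxon, kruskal

a = [3.1, 4.2, 2.8, 5.0, 3.7]
b = [4.9, 5.3, 6.1, 4.4, 5.8]
c = [2.9, 3.5, 3.1, 4.0, 3.2]

print(mannwhitneyu(a, b))   # Wilcoxon rank-sum / Mann-Whitney U
print(wilcoxon(a, c))       # signed-rank test, treating a and c as paired
print(kruskal(a, b, c))     # Kruskal-Wallis across three groups
```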

Normal distribution. A symmetric, bell-shaped distribution completely described by its mean (\(\mu\)) and standard deviation (\(\sigma\)).

Null hypothesis (\(H_0\)). The claim of no effect, no difference, or no relationship. The default assumption that the hypothesis test seeks to find evidence against.

Numerical variable. A variable whose values represent measured quantities where arithmetic is meaningful.

Observational study. A study in which researchers observe subjects without intervening or assigning treatments.

Omitted variable bias. Bias in regression coefficients that occurs when an important confounding variable is not included in the model.

Ordinal variable. A categorical variable with a natural ordering, but where the distances between categories are not necessarily equal.

Ordinary Least Squares (OLS). The standard method for fitting a linear regression by minimizing the sum of squared residuals. Identical in spirit to “least squares” as introduced in Chapter 10 and extended to multiple predictors in Chapter 11.

Outlier. An observation that falls far from the rest of the data. The IQR method identifies observations below \(Q1 - 1.5 \cdot IQR\) or above \(Q3 + 1.5 \cdot IQR\) as outliers.
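
The IQR rule takes only a few lines of NumPy; the data are hypothetical.

```python
# IQR outlier fences computed with NumPy (data hypothetical).
import numpy as np

data = np.array([4, 5, 5, 6, 7, 7, 8, 9, 22])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # flags 22 as an outlier
```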

P-hacking. The practice of manipulating data analysis (choosing variables, subsets, or tests) until a statistically significant result is obtained, without accounting for the multiple tests performed.

P-value. The probability of obtaining a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. It is NOT the probability that the null hypothesis is true.

Paired data. Data where each observation in one group is naturally linked to a specific observation in another group (e.g., before-and-after measurements on the same subjects).

Percentile. The value below which a given percentage of observations fall. The 75th percentile means 75% of values are at or below that point.

Point estimate. A single value calculated from a sample and used to estimate a population parameter (e.g., the sample mean \(\bar{x}\) estimates the population mean \(\mu\)).

Population. The entire group of interest about which conclusions are to be drawn.

Post-hoc test. A follow-up analysis conducted after a significant ANOVA result to determine which specific group means differ from each other. Common methods include Tukey’s HSD and Bonferroni-corrected pairwise comparisons.

Power. The probability of correctly rejecting the null hypothesis when it is false. Equal to \(1 - \beta\), where \(\beta\) is the probability of a Type II error.

Prediction interval. An interval estimate for an individual future observation, wider than a confidence interval for the mean because it accounts for individual variability.

Proportion. The fraction of observations falling into a particular category. Ranges from 0 to 1.

QQ plot (quantile-quantile plot). A graphical tool for assessing whether data follow a particular distribution (often normal). Points falling along the reference line suggest the assumed distribution is appropriate.

Quartiles. Values that divide the data into four equal parts. Q1 (25th percentile), Q2 (median, 50th percentile), Q3 (75th percentile).

Random variable. A variable whose value is determined by the outcome of a random process.

Randomized controlled experiment. An experiment where subjects are randomly assigned to treatment and control groups.

Range. The difference between the maximum and minimum values in a dataset.

Reference category. In a regression model with categorical predictors, the category coded as zero on all dummy variables. Coefficients on the dummy variables represent comparisons against this baseline group.

Regression line. The line that best fits the relationship between an explanatory and response variable, determined by least squares.

Replication crisis. The finding that many published scientific results fail to reproduce when the original studies are repeated.

Residual. The difference between an observed value and the value predicted by a model. \(e_i = y_i - \hat{y}_i\).

Response variable. The variable being predicted or explained in a study. Also called dependent variable or outcome variable.

Sample. A subset of a population from which data is actually collected.

Sampling distribution. The probability distribution of a sample statistic (such as the sample mean) computed from repeated random samples of the same size from the same population.

Scatter plot. A visualization for the relationship between two numerical variables, where each point represents one observation.

Selection bias. Systematic error arising from how subjects are selected for inclusion in a study or sample.

Shapiro-Wilk test. A formal hypothesis test for normality. A small p-value suggests the data deviate from a normal distribution.
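
A sketch with SciPy's `shapiro` on simulated normal data:

```python
# Shapiro-Wilk normality test on simulated normal data.
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=50)
stat, p_value = shapiro(sample)
print(stat, p_value)   # large p-value: no evidence against normality
```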

Simple linear regression. A regression model with exactly one explanatory variable.

Simpson’s Paradox. A phenomenon where a trend that appears in aggregated data reverses or disappears when the data is separated into subgroups.

Slope (\(b_1\)). In simple regression, the predicted change in the response variable for a one-unit increase in the explanatory variable.

Standard deviation. A measure of the spread of data around the mean. The square root of the variance. For a sample: \(s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}\).

Standard error. The standard deviation of a sampling distribution. Measures the variability of a sample statistic from sample to sample. For the sample mean: \(SE = \frac{s}{\sqrt{n}}\).

Statistic. A numerical summary computed from a sample (e.g., sample mean, sample proportion).

Statistical significance. A result is statistically significant at level \(\alpha\) if the p-value is less than \(\alpha\). Does not necessarily imply practical importance.

Stratified sampling. A sampling method where the population is divided into strata (subgroups) and random samples are drawn from each stratum.

Survivorship bias. A form of selection bias that occurs when analysis is limited to subjects that survived a selection process, ignoring those that did not.

T-distribution. A symmetric, bell-shaped distribution similar to the normal but with heavier tails. Used for inference about means when the population standard deviation is unknown.

Test statistic. A numerical value calculated from sample data that is used to decide whether to reject the null hypothesis.

Tukey HSD (Honestly Significant Difference). A post-hoc test used after ANOVA to identify which specific group pairs differ, while controlling the family-wise error rate.

Type I error. Rejecting the null hypothesis when it is actually true (a false positive). The probability is denoted \(\alpha\).

Type II error. Failing to reject the null hypothesis when it is actually false (a false negative). The probability is denoted \(\beta\).

Variance. A measure of spread computed from the squared deviations about the mean (approximately their average). For a sample: \(s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}\).

Variance Inflation Factor (VIF). A diagnostic for multicollinearity in regression. VIF values above 5 or 10 suggest problematic collinearity.
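
A sketch using statsmodels' `variance_inflation_factor`; the two nearly collinear predictors are simulated purely for illustration.

```python
# VIFs for two deliberately collinear predictors (simulated data).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.1, size=100)   # nearly a copy of x1
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2}))

for i, name in enumerate(X.columns):
    print(name, round(variance_inflation_factor(X.values, i), 1))
```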

Welch’s t-test. A modification of the independent-samples t-test that does not assume equal variances in the two groups. The recommended default for two-sample comparisons.
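
In SciPy this is the independent-samples t-test with `equal_var=False`; the samples are hypothetical.

```python
# Welch's t-test: ttest_ind with equal_var=False (samples hypothetical).
from scipy.stats import ttest_ind

a = [5.1, 4.8, 5.6, 5.0, 4.9]
b = [6.0, 6.4, 5.8, 7.1, 6.6]
t_stat, p_value = ttest_ind(a, b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```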

Wilcoxon rank-sum test (Mann-Whitney U test). A rank-based alternative to the independent two-sample t-test, evaluating whether two groups tend to produce systematically larger or smaller values.

Wilcoxon signed-rank test. A rank-based alternative to the one-sample or paired t-test, evaluating whether the median (or median within-pair difference) differs from a hypothesized value.

Z-score. A standardized value indicating how many standard deviations an observation is above or below the mean: \(z = \frac{x - \mu}{\sigma}\).
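
For instance, computed directly (the values are illustrative):

```python
# Z-score: how many SDs an observation lies from the mean (values illustrative).
x, mu, sigma = 130, 100, 15
z = (x - mu) / sigma
print(z)   # 2.0: the observation sits two standard deviations above the mean
```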