Common Formulas Reference

Definitions, intuitions, and code for statistical quantities

Tip: How to Use This Reference

Each formula includes:

  • Definition — the mathematical formula
  • Intuition — what it means conceptually
  • Code — numpy and scipy to compute it

See also: Notation Reference | Glossary

import numpy as np
from scipy import stats as st

The Big Picture: Data = Model + Error

Every statistical analysis follows the same fundamental structure:

Note: The Fundamental Equation

\[\text{Data} = \text{Model} + \text{Error}\]

Or with symbols: \[y_i = \hat{y}_i + e_i\]

Intuition: We decompose each observation into:

  1. Model (\(\hat{y}\)): Our best prediction based on some systematic pattern
  2. Error (\(e\)): Everything we can’t explain — noise, unmeasured factors, randomness
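
A minimal numeric sketch of this decomposition, assuming a small made-up y and the simplest possible model (the mean):

# Hypothetical data; the "model" here is just the mean
y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.full_like(y, np.mean(y))  # Model: predict the mean for everyone
e = y - y_hat                        # Error: what the model misses
np.allclose(y, y_hat + e)            # True: Data = Model + Error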

All the formulas in this guide are variations on this theme:

  • Aggregation: The simplest model predicts the mean for everyone
  • Relationships: We improve predictions by using other variables
  • Model fit: How well does the model explain the data?
  • Inference: How confident are we in our conclusions?
Tip: Two Perspectives, One Framework

Whether you’re focused on explanation (understanding why) or prediction (forecasting what), the math is the same. The difference is in what you emphasize: coefficients (\(\hat{\beta}\)) for explanation, predictions (\(\hat{y}\)) for forecasting.

Aggregation & Summarization

These formulas reduce a collection of values to a single summary. The mean is the foundation — it’s the simplest possible “model.”

Mean

Note: Definition

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\]

Intuition: The mean is the best 1-parameter model for your data — if you had to guess a single value for everyone, the mean minimizes your total squared error.

np.mean(x)
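
A quick check of the claim that the mean minimizes total squared error, using a small hypothetical array (the numbers are arbitrary):

x = np.array([1.0, 2.0, 3.0, 10.0])  # hypothetical data
np.sum((x - np.mean(x))**2)          # 50.0: squared error when the guess is the mean
np.sum((x - np.median(x))**2)        # 59.0: any other single guess does worse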

Median

Note: Definition

\[\tilde{x} = \begin{cases} x_{(\frac{n+1}{2})} & \text{if } n \text{ is odd} \\[6pt] \frac{x_{(\frac{n}{2})} + x_{(\frac{n}{2}+1)}}{2} & \text{if } n \text{ is even} \end{cases}\]

where \(x_{(k)}\) denotes the \(k\)th value after sorting.

Intuition: The median is the middle value — it splits the data so 50% are above and 50% below. Unlike the mean, the median minimizes absolute error rather than squared error, making it robust to outliers.

np.median(x)
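
A companion check for the median, reusing the same hypothetical numbers: it wins on total absolute error even though it loses on squared error.

x = np.array([1.0, 2.0, 3.0, 10.0])  # hypothetical data with an outlier
np.sum(np.abs(x - np.median(x)))     # 10.0: absolute error when the guess is the median
np.sum(np.abs(x - np.mean(x)))       # 12.0: the outlier pulls the mean away, inflating error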

Mode

Note: Definition

\[\text{mode} = \arg\max_{x_i} P(X = x_i)\]

Intuition: The most frequent value — the one that occurs with highest probability. Most useful for categorical or discrete data. A distribution can have multiple modes (bimodal, multimodal).

st.mode(x)

Variance

Note: Definition

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\]

Intuition: Variance is the average squared prediction error when your model is just the mean. It measures how spread out the data are around the center.

Important: Same Formula, Different Lens

Variance can be written as: \[s^2 = \frac{SSE}{n - 1} = \frac{\sum(x_i - \bar{x})^2}{n - 1}\]

This is Mean Squared Error (MSE) where the “model” is \(\bar{x}\). The connection between variance and model error runs throughout statistics.

# ddof=1 gives sample variance (divides by n-1)
np.var(x, ddof=1)
Note: Why n-1? Parameters, not ‘degrees of freedom’

We divide by \(n-1\) rather than \(n\) because we estimated one parameter (the mean) from the data before calculating variance.

Think of it this way: to compute variance, we first need \(\bar{x}\). That’s one calculation that “uses up” information. We have \(n\) data points but only \(n-1\) independent pieces of information left for estimating spread.

Throughout this guide, when you see \(n - k\) in a denominator, \(k\) is the number of parameters estimated to make the calculation possible.
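
A small simulation illustrating the point; the population (standard normal, true variance 1), the sample size of 5, and the seed are arbitrary choices:

rng = np.random.default_rng(0)
samples = rng.normal(size=(100_000, 5))   # many samples of n=5 from a population with variance 1
np.var(samples, ddof=0, axis=1).mean()    # ≈ 0.8: dividing by n underestimates the spread
np.var(samples, ddof=1, axis=1).mean()    # ≈ 1.0: dividing by n-1 recovers it on average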

Standard Deviation

Note: Definition

\[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\]

Intuition: The square root of variance — puts spread back in the original units of the data.

np.std(x, ddof=1)

Standard Error of the Mean

Note: Definition

\[SE_{\bar{x}} = \frac{s}{\sqrt{n}}\]

Intuition: How much would the sample mean vary if we repeated the study? SE quantifies uncertainty in the mean, not spread in individual observations.

  • Larger \(n\) → smaller SE (more data = more precise estimate)
  • SE decreases with \(\sqrt{n}\), so 4× the sample size halves the SE
# numpy
np.std(x, ddof=1) / np.sqrt(len(x))

# scipy
st.sem(x)
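
A simulation sketch of what SE measures; the population SD of 2, the sample size of 25, and the seed are arbitrary:

rng = np.random.default_rng(0)
sd, n = 2.0, 25                                            # hypothetical population SD and sample size
means = rng.normal(0, sd, size=(50_000, n)).mean(axis=1)   # the mean of many repeated samples
means.std()                                                # ≈ 0.4: how much sample means actually vary
sd / np.sqrt(n)                                            # ≈ 0.4: the SE formula predicts exactly this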
Important: SD vs. SE: Different Questions

| Statistic | Question it answers |
|---|---|
| SD (\(s\)) | How spread out are individual observations? |
| SE (\(SE_{\bar{x}}\)) | How precise is our estimate of the mean? |

Z-Score (Standardization)

Note: Definition

\[z_i = \frac{x_i - \bar{x}}{s}\]

Intuition: How many standard deviations is \(x_i\) from the mean?

  • \(z = 0\): at the mean
  • \(z = 1\): one SD above the mean
  • \(z = -2\): two SDs below the mean

Z-scores put variables on a common scale, enabling comparison across different measurements.

# numpy
(x - np.mean(x)) / np.std(x, ddof=1)

# scipy
st.zscore(x, ddof=1)

Relationships & Similarity

These formulas describe how two variables relate. The core operation is the dot product — measuring how two vectors “point in the same direction.”

Dot Product

Note: Definition

\[\mathbf{x}^\top\mathbf{y} = \sum_{i=1}^n x_i y_i\]

Intuition: Multiply corresponding elements and sum. If \(x\) and \(y\) tend to be large together (or small together), the dot product is large. This is the foundation for covariance, correlation, and regression.

np.dot(x, y)

# Equivalent
np.sum(x * y)

Covariance

Note: Definition

\[\text{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\]

Intuition: The mean-centered dot product — do \(x\) and \(y\) move together around their respective means?

  • Positive: when \(x\) is above its mean, \(y\) tends to be too
  • Negative: when \(x\) is above its mean, \(y\) tends to be below
  • Zero: no linear relationship

The magnitude depends on the scales of \(x\) and \(y\), making covariance hard to interpret directly.

# numpy (returns 2x2 matrix; off-diagonal is covariance)
np.cov(x, y)[0, 1]

Pearson Correlation

Note: Definition

\[r = \frac{\text{Cov}(x, y)}{s_x \cdot s_y}\]

Intuition: Covariance scaled to [-1, 1]. Measures strength and direction of linear relationship.

  • \(r = 1\): perfect positive relationship
  • \(r = -1\): perfect negative relationship
  • \(r = 0\): no linear relationship
Important: Same Formula, Different Lens

Correlation is the average dot product of z-scores: \[r = \frac{1}{n-1} \sum_{i=1}^n z_{x_i} \cdot z_{y_i}\]

When both variables are standardized, their dot product (similarity) is their correlation.

# numpy
np.corrcoef(x, y)[0, 1]

# scipy (includes p-value)
st.pearsonr(x, y)
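
A quick check of the z-score identity in the callout above, assuming x and y are 1-D arrays of the same length:

zx, zy = st.zscore(x, ddof=1), st.zscore(y, ddof=1)
np.dot(zx, zy) / (len(x) - 1)   # matches np.corrcoef(x, y)[0, 1]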

The Linear Model

Now we move from describing single variables and relationships to building predictive models. The General Linear Model (GLM) is the workhorse of statistics.

The Model Equation

Note: Definition

Single observation: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i\]

Matrix form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

Intuition: Each observation equals a linear combination of predictors plus random error. The coefficients (\(\beta\)) tell us how much each predictor contributes.

# Design matrix: first column is 1s for intercept
X = np.column_stack([np.ones(n), x1, x2])

Predicted Values

Note: Definition

\[\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\]

Matrix form: \[\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\]

Intuition: The model’s best guess for each observation — the “Model” part of Data = Model + Error.

y_hat = X @ beta_hat

Residuals

Note: Definition

\[e_i = y_i - \hat{y}_i\]

Intuition: The “Error” part of Data = Model + Error. What’s left after the model makes its prediction.

  • Positive residual: model under-predicted
  • Negative residual: model over-predicted
  • Patterns in residuals suggest the model is missing something
residuals = y - y_hat

OLS Estimator

Note: Definition

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

Intuition: OLS (Ordinary Least Squares) finds coefficients that minimize total squared error: \[\min_{\boldsymbol{\beta}} \sum_i (y_i - \hat{y}_i)^2\]

Reading the formula:

  • \(\mathbf{X}^\top\mathbf{y}\): how each predictor relates to the outcome
  • \((\mathbf{X}^\top\mathbf{X})^{-1}\): adjusts for relationships among predictors
# Full formula
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# More stable
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

Residual Variance (Model Error)

Note: Definition

\[\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - p - 1} = \frac{SSE}{n - p - 1}\]

Intuition: The average squared prediction error of the model. This is the same structure as variance, but now:

  • The “model” is the regression line (not just the mean)
  • We’ve estimated \(p + 1\) parameters (intercept + \(p\) slopes)
  • So we divide by \(n - (p + 1) = n - p - 1\)
Important: Same Formula, Different Lens

| Statistic | Model | Parameters | Formula |
|---|---|---|---|
| Variance | Mean only | 1 | \(SSE / (n - 1)\) |
| Residual variance | Regression | \(p + 1\) | \(SSE / (n - p - 1)\) |

Both are “average squared error” — they differ only in what model generated the predictions.

n = len(y)
p = X.shape[1] - 1  # predictors (not counting intercept)
SSE = np.sum(residuals**2)
sigma2_hat = SSE / (n - p - 1)

Variance of Coefficients

Note: Definition

\[\text{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

Intuition: How uncertain are we about each coefficient?

  • Larger \(\hat{\sigma}^2\) (noisier data) → more uncertainty
  • Diagonal of \((\mathbf{X}^\top\mathbf{X})^{-1}\) reflects predictor spread and collinearity
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)

Standard Error of Coefficients

Note: Definition

\[SE(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)}\]

Intuition: Uncertainty in the same units as the coefficient. Used for t-tests and confidence intervals.

se_beta = np.sqrt(np.diag(var_beta))

Model Fit

How well does the model explain the data?

PRE: Proportional Reduction in Error

Note: Definition

\[PRE = \frac{SSE_C - SSE_A}{SSE_C}\]

where \(C\) = compact (simpler) model, \(A\) = augmented (complex) model

Intuition: What proportion of the simpler model’s error is eliminated by the complex model?

  • \(PRE = 0\): complex model is no better
  • \(PRE = 1\): complex model explains everything the simple model missed
  • \(PRE = 0.10\): complex model reduces error by 10%
PRE = (SSE_compact - SSE_augmented) / SSE_compact
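
A sketch of where those two SSEs might come from, assuming the compact model is the mean-only model and the augmented model is the regression fit from the linear-model section above (y and y_hat as defined there). With these particular choices, PRE is exactly R-squared, the next entry:

SSE_compact = np.sum((y - np.mean(y))**2)    # error of the mean-only model
SSE_augmented = np.sum((y - y_hat)**2)       # error of the regression model
PRE = (SSE_compact - SSE_augmented) / SSE_compact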

R-squared

Note: Definition

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = \frac{SS_{tot} - SS_{res}}{SS_{tot}}\]

Intuition: What proportion of variance does the model explain?

  • \(R^2 = 0\): model explains nothing
  • \(R^2 = 1\): model explains everything
  • \(R^2 = 0.3\): model explains 30% of variance
Important: Same Formula, Different Lens

\(R^2\) is PRE when:

  • Compact model = “predict the mean for everyone”
  • Augmented model = “predict using regression”

\[R^2 = \frac{SSE(\text{mean}) - SSE(\text{regression})}{SSE(\text{mean})} = PRE\]

Warning

\(R^2\) never decreases when you add predictors; even useless ones typically nudge it up. Use adjusted \(R^2\), AIC, or BIC for model selection.

SS_tot = np.sum((y - np.mean(y))**2)
SS_res = np.sum((y - y_hat)**2)
R2 = 1 - SS_res / SS_tot

Log-Likelihood

Note: Definition

\[\log L = \sum_{i=1}^n \log p(y_i \mid \hat{\theta})\]

Intuition: How probable is the observed data under the fitted model? Higher (less negative) is better.

  • Foundation for AIC, BIC, and likelihood ratio tests
  • Used for comparing models fit with Maximum Likelihood
# For a linear model with normal errors
log_lik = -n/2 * np.log(2 * np.pi * sigma2_hat) - SSE / (2 * sigma2_hat)

AIC (Akaike Information Criterion)

Note: Definition

\[AIC = -2 \log L + 2k\]

where \(k\) = number of parameters

Intuition: Balances fit against complexity. The \(2k\) term penalizes adding parameters.

  • Lower AIC = better model
  • Difference of 2+ is meaningful; <2 is negligible
k = p + 2  # slopes + intercept + sigma^2
AIC = -2 * log_lik + 2 * k

BIC (Bayesian Information Criterion)

Note: Definition

\[BIC = -2 \log L + k \log(n)\]

Intuition: Like AIC but with a stronger penalty for complexity (especially with large \(n\)).

  • Lower BIC = better model
  • Tends to select simpler models than AIC
BIC = -2 * log_lik + k * np.log(n)
Tip: AIC vs BIC

| Criterion | Penalty | Tends to select |
|---|---|---|
| AIC | \(2k\) | Better predictions (may overfit) |
| BIC | \(k \log(n)\) | Simpler models (more conservative) |

Neither is “correct” — they answer slightly different questions about model quality.

Statistical Inference

How confident are we in our conclusions? Statistical inference provides tools for testing hypotheses about parameters and comparing models.

Important: Two Sides of the Same Coin

| Test | What it tests | When to use |
|---|---|---|
| t-statistic | Is a single parameter different from 0? | Testing individual coefficients |
| F-statistic | Do multiple parameters together improve the model? | Comparing nested models |

Key insight: When testing a single parameter, \(F = t^2\). The F-test generalizes the t-test to multiple parameters simultaneously.
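
One way to see the \(F = t^2\) connection without fitting anything: the squared two-sided t cutoff equals the corresponding F cutoff with 1 numerator df (the 20 df below is an arbitrary choice):

t_crit = st.t.ppf(0.975, df=20)          # two-sided 5% cutoff for t with 20 df
f_crit = st.f.ppf(0.95, dfn=1, dfd=20)   # 5% cutoff for F with (1, 20) df
np.isclose(t_crit**2, f_crit)            # True: t^2 and F agree for a single parameter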

t-Statistic

Note: Definition

\[t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\]

Intuition: Signal-to-noise ratio for a single coefficient. How many standard errors is the estimate from zero?

  • \(|t| > 2\) roughly corresponds to \(p < 0.05\)
  • Larger \(|t|\) = stronger evidence the coefficient isn’t zero
t_stat = beta_hat / se_beta
p_values = 2 * (1 - st.t.cdf(np.abs(t_stat), n - p - 1))

F-Statistic

Note: Definition

\[F = \frac{(SSE_C - SSE_A) / (p_A - p_C)}{SSE_A / (n - p_A)}\]

Or equivalently: \[F = \frac{PRE / (p_A - p_C)}{(1 - PRE) / (n - p_A)}\]

Intuition: Is the error reduction “worth it”? The F-statistic is a ratio:

  • Numerator: Error reduction per added parameter
  • Denominator: Remaining error per observation (adjusted for parameters)

If the added parameters help, \(F \gg 1\).

# Comparing compact vs augmented model
df1 = p_A - p_C  # parameters added
df2 = n - p_A    # residual df for augmented model
F_stat = (PRE / df1) / ((1 - PRE) / df2)

# p-value
p_value = 1 - st.f.cdf(F_stat, df1, df2)

p-value

Note: Definition

\[p = P(\text{test statistic at least as extreme} \mid H_0 \text{ true})\]

Intuition: If there were truly no effect, how surprising would our result be?

  • Small \(p\) (e.g., < 0.05): result is unlikely under the null → reject \(H_0\)
  • Large \(p\): result is compatible with the null → fail to reject \(H_0\)
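
A Monte Carlo sketch of this definition, assuming a hypothetical observed t statistic of 2.3 with 20 residual df (both numbers are arbitrary):

rng = np.random.default_rng(0)
t_obs = 2.3                                      # hypothetical observed t statistic
null_ts = rng.standard_t(df=20, size=100_000)    # t statistics we would see if H0 were true
np.mean(np.abs(null_ts) >= abs(t_obs))           # ≈ 0.03, matching 2 * st.t.sf(2.3, df=20)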
Warning: What p-values are NOT
  • NOT the probability the null is true
  • NOT the probability of replication
  • NOT a measure of effect size

Confidence Interval

Note: Definition

\[\hat{\beta}_j \pm t_{\alpha/2} \cdot SE(\hat{\beta}_j)\]

Intuition: A range of plausible values for the true parameter. If we repeated the study many times, 95% of intervals would contain the true \(\beta\).

  • If the CI excludes 0, the coefficient is “significant” at that level
  • Wider CI = more uncertainty
t_crit = st.t.ppf(0.975, df=n - p - 1)
ci_lower = beta_hat - t_crit * se_beta
ci_upper = beta_hat + t_crit * se_beta

Summary

| Category | What it does | Key formulas |
|---|---|---|
| Aggregation | Summarize a variable | Mean, Median, Mode, Variance, SD, SE |
| Relationships | Measure association | Covariance, Correlation |
| Linear Model | Predict outcomes | OLS, Residuals, \(\hat{\sigma}^2\) |
| Model Fit | Evaluate model quality | PRE, \(R^2\), Log-lik, AIC, BIC |
| Inference | Test hypotheses | t-stat, F-stat, p-value, CI |

The workflow:

  1. Specify → What’s your model? (Design matrix \(\mathbf{X}\), outcome \(\mathbf{y}\))
  2. Estimate → Fit parameters (\(\hat{\boldsymbol{\beta}}\))
  3. Evaluate → How good is the fit? (\(R^2\), AIC, BIC)
  4. Compare → Is complexity worth it? (PRE, F-test)
  5. Infer → What can we conclude? (t-tests, CIs, p-values)
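
A compact end-to-end sketch of this workflow on simulated data; the true coefficients, noise level, sample size, and seed are all arbitrary illustration choices:

rng = np.random.default_rng(42)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)   # simulated outcome

# 1. Specify: design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1] - 1

# 2. Estimate: OLS coefficients, predictions, residuals
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
residuals = y - y_hat

# 3. Evaluate: residual variance and R^2
SSE = np.sum(residuals**2)
sigma2_hat = SSE / (n - p - 1)
R2 = 1 - SSE / np.sum((y - np.mean(y))**2)

# 4. Compare: regression vs. mean-only model
SSE_compact = np.sum((y - np.mean(y))**2)
PRE = (SSE_compact - SSE) / SSE_compact
F_stat = (PRE / p) / ((1 - PRE) / (n - p - 1))

# 5. Infer: t statistics, p-values, and 95% CIs for each coefficient
se_beta = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
t_stat = beta_hat / se_beta
p_values = 2 * st.t.sf(np.abs(t_stat), df=n - p - 1)
t_crit = st.t.ppf(0.975, df=n - p - 1)
ci_lower, ci_upper = beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta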