Common Formulas Reference

Definitions, intuitions, and code for statistical quantities

Tip: How to Use This Reference

Each formula includes:

  • Definition — the mathematical formula
  • Intuition — what it means conceptually
  • Code — numpy and scipy to compute it

See also: Notation Reference | Glossary

import numpy as np
from scipy import stats as st

The Big Picture: Data = Model + Error

Every statistical analysis follows the same fundamental structure:

Note: The Fundamental Equation

\[\text{Data} = \text{Model} + \text{Error}\]

Or with symbols: \[y_i = \hat{y}_i + e_i\]

Intuition: We decompose each observation into:

  1. Model (\(\hat{y}\)): Our best prediction based on some systematic pattern
  2. Error (\(e\)): Everything we can’t explain — noise, unmeasured factors, randomness
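
A minimal numeric sketch of this decomposition, assuming a small made-up y and the simplest possible model (the mean):

# Hypothetical data; the "model" here is just the mean
y = np.array([2.0, 4.0, 6.0, 8.0])
y_hat = np.full_like(y, np.mean(y))  # Model: predict the mean for everyone
e = y - y_hat                        # Error: what the model misses
np.allclose(y, y_hat + e)            # True: Data = Model + Error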

All the formulas in this guide are variations on this theme:

  • Aggregation: The simplest model predicts the mean for everyone
  • Relationships: We improve predictions by using other variables
  • Model fit: How well does the model explain the data?
  • Inference: How confident are we in our conclusions?
Tip: Two Perspectives, One Framework

Whether you’re focused on explanation (understanding why) or prediction (forecasting what), the math is the same. The difference is in what you emphasize: coefficients (\(\hat{\beta}\)) for explanation, predictions (\(\hat{y}\)) for forecasting.

Aggregation & Summarization

These formulas reduce a collection of values to a single summary. The mean is the foundation — it’s the simplest possible “model.”

Mean

Note: Definition

\[\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i\]

Intuition: The mean is the best 1-parameter model for your data — if you had to guess a single value for everyone, the mean minimizes your total squared error.

np.mean(x)
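
A quick check of the claim that the mean minimizes total squared error, using a small hypothetical array (the numbers are arbitrary):

x = np.array([1.0, 2.0, 3.0, 10.0])  # hypothetical data
np.sum((x - np.mean(x))**2)          # 50.0: squared error when the guess is the mean
np.sum((x - np.median(x))**2)        # 59.0: any other single guess does worse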

Median

Note: Definition

\[\tilde{x} = \begin{cases} x_{(\frac{n+1}{2})} & \text{if } n \text{ is odd} \\[6pt] \frac{x_{(\frac{n}{2})} + x_{(\frac{n}{2}+1)}}{2} & \text{if } n \text{ is even} \end{cases}\]

where \(x_{(k)}\) denotes the \(k\)th value after sorting.

Intuition: The median is the middle value — it splits the data so 50% are above and 50% below. Unlike the mean, the median minimizes absolute error rather than squared error, making it robust to outliers.

np.median(x)
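
A companion check for the median, reusing the same hypothetical numbers: it wins on total absolute error even though it loses on squared error.

x = np.array([1.0, 2.0, 3.0, 10.0])  # hypothetical data with an outlier
np.sum(np.abs(x - np.median(x)))     # 10.0: absolute error when the guess is the median
np.sum(np.abs(x - np.mean(x)))       # 12.0: the outlier pulls the mean away, inflating error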

Mode

Note: Definition

\[\text{mode} = \arg\max_{x_i} P(X = x_i)\]

Intuition: The most frequent value — the one that occurs with highest probability. Most useful for categorical or discrete data. A distribution can have multiple modes (bimodal, multimodal).

st.mode(x)

Variance

Note: Definition

\[s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2\]

Intuition: Variance is the average squared prediction error when your model is just the mean. It measures how spread out the data are around the center.

Important: Same Formula, Different Lens

Variance can be written as: \[s^2 = \frac{SSE}{n - 1} = \frac{\sum(x_i - \bar{x})^2}{n - 1}\]

This is Mean Squared Error (MSE) where the “model” is \(\bar{x}\). The connection between variance and model error runs throughout statistics.

# ddof=1 gives sample variance (divides by n-1)
np.var(x, ddof=1)
Note: Why n-1? Parameters, not ‘degrees of freedom’

We divide by \(n-1\) rather than \(n\) because we estimated one parameter (the mean) from the data before calculating variance.

Think of it this way: to compute variance, we first need \(\bar{x}\). That’s one calculation that “uses up” information. We have \(n\) data points but only \(n-1\) independent pieces of information left for estimating spread.

Throughout this guide, when you see \(n - k\) in a denominator, \(k\) is the number of parameters estimated to make the calculation possible.
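
A small simulation illustrating the point; the population (standard normal, true variance 1), the sample size of 5, and the seed are arbitrary choices:

rng = np.random.default_rng(0)
samples = rng.normal(size=(100_000, 5))   # many samples of n=5 from a population with variance 1
np.var(samples, ddof=0, axis=1).mean()    # ≈ 0.8: dividing by n underestimates the spread
np.var(samples, ddof=1, axis=1).mean()    # ≈ 1.0: dividing by n-1 recovers it on average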

Standard Deviation

Note: Definition

\[s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2}\]

Intuition: The square root of variance — puts spread back in the original units of the data.

np.std(x, ddof=1)

Standard Error of the Mean

Note: Definition

\[SE_{\bar{x}} = \frac{s}{\sqrt{n}}\]

Intuition: How much would the sample mean vary if we repeated the study? SE quantifies uncertainty in the mean, not spread in individual observations.

  • Larger \(n\) → smaller SE (more data = more precise estimate)
  • SE decreases with \(\sqrt{n}\), so 4× the sample size halves the SE
# numpy
np.std(x, ddof=1) / np.sqrt(len(x))

# scipy
st.sem(x)
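
A simulation sketch of what SE measures; the population SD of 2, the sample size of 25, and the seed are arbitrary:

rng = np.random.default_rng(0)
sd, n = 2.0, 25                                            # hypothetical population SD and sample size
means = rng.normal(0, sd, size=(50_000, n)).mean(axis=1)   # the mean of many repeated samples
means.std()                                                # ≈ 0.4: how much sample means actually vary
sd / np.sqrt(n)                                            # ≈ 0.4: the SE formula predicts exactly this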
Important: SD vs. SE: Different Questions

| Statistic | Question it answers |
|---|---|
| SD (\(s\)) | How spread out are individual observations? |
| SE (\(SE_{\bar{x}}\)) | How precise is our estimate of the mean? |

Z-Score (Standardization)

Note: Definition

\[z_i = \frac{x_i - \bar{x}}{s}\]

Intuition: How many standard deviations is \(x_i\) from the mean?

  • \(z = 0\): at the mean
  • \(z = 1\): one SD above the mean
  • \(z = -2\): two SDs below the mean

Z-scores put variables on a common scale, enabling comparison across different measurements.

# numpy
(x - np.mean(x)) / np.std(x, ddof=1)

# scipy
st.zscore(x, ddof=1)

Relationships & Similarity

These formulas describe how two variables relate. The core operation is the dot product — measuring how two vectors “point in the same direction.”

Dot Product

Note: Definition

\[\mathbf{x}^\top\mathbf{y} = \sum_{i=1}^n x_i y_i\]

Intuition: Multiply corresponding elements and sum. If \(x\) and \(y\) tend to be large together (or small together), the dot product is large. This is the foundation for covariance, correlation, and regression.

np.dot(x, y)

# Equivalent
np.sum(x * y)

Covariance

Note: Definition

\[\text{Cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})\]

Intuition: The mean-centered dot product — do \(x\) and \(y\) move together around their respective means?

  • Positive: when \(x\) is above its mean, \(y\) tends to be too
  • Negative: when \(x\) is above its mean, \(y\) tends to be below
  • Zero: no linear relationship

The magnitude depends on the scales of \(x\) and \(y\), making covariance hard to interpret directly.

# numpy (returns 2x2 matrix; off-diagonal is covariance)
np.cov(x, y)[0, 1]

Pearson Correlation

Note: Definition

\[r = \frac{\text{Cov}(x, y)}{s_x \cdot s_y}\]

Intuition: Covariance scaled to [-1, 1]. Measures strength and direction of linear relationship.

  • \(r = 1\): perfect positive relationship
  • \(r = -1\): perfect negative relationship
  • \(r = 0\): no linear relationship
Important: Same Formula, Different Lens

Correlation is the average dot product of z-scores: \[r = \frac{1}{n-1} \sum_{i=1}^n z_{x_i} \cdot z_{y_i}\]

When both variables are standardized, their dot product (similarity) is their correlation.

# numpy
np.corrcoef(x, y)[0, 1]

# scipy (includes p-value)
st.pearsonr(x, y)
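
A quick check of the z-score identity in the callout above, assuming x and y are 1-D arrays of the same length:

zx, zy = st.zscore(x, ddof=1), st.zscore(y, ddof=1)
np.dot(zx, zy) / (len(x) - 1)   # matches np.corrcoef(x, y)[0, 1]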

The Linear Model

Now we move from describing single variables and relationships to building predictive models. The General Linear Model (GLM) is the workhorse of statistics.

The Model Equation

Note: Definition

Single observation: \[y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i\]

Matrix form: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\]

Intuition: Each observation equals a linear combination of predictors plus random error. The coefficients (\(\beta\)) tell us how much each predictor contributes.

# Design matrix: first column is 1s for intercept
X = np.column_stack([np.ones(n), x1, x2])

Predicted Values

Note: Definition

\[\hat{y}_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}\]

Matrix form: \[\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\]

Intuition: The model’s best guess for each observation — the “Model” part of Data = Model + Error.

y_hat = X @ beta_hat

Residuals

Note: Definition

\[e_i = y_i - \hat{y}_i\]

Intuition: The “Error” part of Data = Model + Error. What’s left after the model makes its prediction.

  • Positive residual: model under-predicted
  • Negative residual: model over-predicted
  • Patterns in residuals suggest the model is missing something
residuals = y - y_hat

OLS Estimator

Note: Definition

\[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]

Intuition: OLS (Ordinary Least Squares) finds coefficients that minimize total squared error: \[\min_{\boldsymbol{\beta}} \sum_i (y_i - \hat{y}_i)^2\]

Reading the formula:

  • \(\mathbf{X}^\top\mathbf{y}\): how each predictor relates to the outcome
  • \((\mathbf{X}^\top\mathbf{X})^{-1}\): adjusts for relationships among predictors
# Full formula
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# More stable
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

Residual Variance (Model Error)

Note: Definition

\[\hat{\sigma}^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - p - 1} = \frac{SSE}{n - p - 1}\]

Intuition: The average squared prediction error of the model. This is the same structure as variance, but now:

  • The “model” is the regression line (not just the mean)
  • We’ve estimated \(p + 1\) parameters (intercept + \(p\) slopes)
  • So we divide by \(n - (p + 1) = n - p - 1\)
Important: Same Formula, Different Lens

| Statistic | Model | Parameters | Formula |
|---|---|---|---|
| Variance | Mean only | 1 | \(SSE / (n - 1)\) |
| Residual variance | Regression | \(p + 1\) | \(SSE / (n - p - 1)\) |

Both are “average squared error” — they differ only in what model generated the predictions.

n = len(y)
p = X.shape[1] - 1  # predictors (not counting intercept)
SSE = np.sum(residuals**2)
sigma2_hat = SSE / (n - p - 1)

Variance of Coefficients

Note: Definition

\[\text{Var}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}\]

Intuition: How uncertain are we about each coefficient?

  • Larger \(\hat{\sigma}^2\) (noisier data) → more uncertainty
  • Diagonal of \((\mathbf{X}^\top\mathbf{X})^{-1}\) reflects predictor spread and collinearity
var_beta = sigma2_hat * np.linalg.inv(X.T @ X)

Standard Error of Coefficients

Note: Definition

\[SE(\hat{\beta}_j) = \sqrt{\text{Var}(\hat{\beta}_j)}\]

Intuition: Uncertainty in the same units as the coefficient. Used for t-tests and confidence intervals.

se_beta = np.sqrt(np.diag(var_beta))

Model Fit

How well does the model explain the data?

PRE: Proportional Reduction in Error

Note: Definition

\[PRE = \frac{SSE_C - SSE_A}{SSE_C}\]

where \(C\) = compact (simpler) model, \(A\) = augmented (complex) model

Intuition: What proportion of the simpler model’s error is eliminated by the complex model?

  • \(PRE = 0\): complex model is no better
  • \(PRE = 1\): complex model explains everything the simple model missed
  • \(PRE = 0.10\): complex model reduces error by 10%
PRE = (SSE_compact - SSE_augmented) / SSE_compact
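
A sketch of where those two SSEs might come from, assuming the compact model is the mean-only model and the augmented model is the regression fit from the linear-model section above (y and y_hat as defined there). With these particular choices, PRE is exactly R-squared, the next entry:

SSE_compact = np.sum((y - np.mean(y))**2)    # error of the mean-only model
SSE_augmented = np.sum((y - y_hat)**2)       # error of the regression model
PRE = (SSE_compact - SSE_augmented) / SSE_compact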

R-squared

Note: Definition

\[R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = \frac{SS_{tot} - SS_{res}}{SS_{tot}}\]

Intuition: What proportion of variance does the model explain?

  • \(R^2 = 0\): model explains nothing
  • \(R^2 = 1\): model explains everything
  • \(R^2 = 0.3\): model explains 30% of variance
Important: Same Formula, Different Lens

\(R^2\) is PRE when:

  • Compact model = “predict the mean for everyone”
  • Augmented model = “predict using regression”

\[R^2 = \frac{SSE(\text{mean}) - SSE(\text{regression})}{SSE(\text{mean})} = PRE\]

Warning

\(R^2\) never decreases when you add predictors; even useless ones typically nudge it up. Use adjusted \(R^2\), AIC, or BIC for model selection.

SS_tot = np.sum((y - np.mean(y))**2)
SS_res = np.sum((y - y_hat)**2)
R2 = 1 - SS_res / SS_tot

Log-Likelihood

Note: Definition

\[\log L = \sum_{i=1}^n \log p(y_i \mid \hat{\theta})\]

Intuition: How probable is the observed data under the fitted model? Higher (less negative) is better.

  • Foundation for AIC, BIC, and likelihood ratio tests
  • Used for comparing models fit with Maximum Likelihood
# For a linear model with normal errors
log_lik = -n/2 * np.log(2 * np.pi * sigma2_hat) - SSE / (2 * sigma2_hat)

AIC (Akaike Information Criterion)

Note: Definition

\[AIC = -2 \log L + 2k\]

where \(k\) = number of parameters

Intuition: Balances fit against complexity. The \(2k\) term penalizes adding parameters.

  • Lower AIC = better model
  • Difference of 2+ is meaningful; <2 is negligible
k = p + 2  # slopes + intercept + sigma^2
AIC = -2 * log_lik + 2 * k

BIC (Bayesian Information Criterion)

Note: Definition

\[BIC = -2 \log L + k \log(n)\]

Intuition: Like AIC but with a stronger penalty for complexity (especially with large \(n\)).

  • Lower BIC = better model
  • Tends to select simpler models than AIC
BIC = -2 * log_lik + k * np.log(n)
Tip: AIC vs BIC

| Criterion | Penalty | Tends to select |
|---|---|---|
| AIC | \(2k\) | Better predictions (may overfit) |
| BIC | \(k \log(n)\) | Simpler models (more conservative) |

Neither is “correct” — they answer slightly different questions about model quality.

Statistical Inference

How confident are we in our conclusions? Statistical inference provides tools for testing hypotheses about parameters and comparing models.

Important: Two Sides of the Same Coin

| Test | What it tests | When to use |
|---|---|---|
| t-statistic | Is a single parameter different from 0? | Testing individual coefficients |
| F-statistic | Do multiple parameters together improve the model? | Comparing nested models |

Key insight: When testing a single parameter, \(F = t^2\). The F-test generalizes the t-test to multiple parameters simultaneously.
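
One way to see the \(F = t^2\) connection without fitting anything: the squared two-sided t cutoff equals the corresponding F cutoff with 1 numerator df (the 20 df below is an arbitrary choice):

t_crit = st.t.ppf(0.975, df=20)          # two-sided 5% cutoff for t with 20 df
f_crit = st.f.ppf(0.95, dfn=1, dfd=20)   # 5% cutoff for F with (1, 20) df
np.isclose(t_crit**2, f_crit)            # True: t^2 and F agree for a single parameter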

t-Statistic

Note: Definition

\[t = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)}\]

Intuition: Signal-to-noise ratio for a single coefficient. How many standard errors is the estimate from zero?

  • \(|t| > 2\) roughly corresponds to \(p < 0.05\)
  • Larger \(|t|\) = stronger evidence the coefficient isn’t zero
t_stat = beta_hat / se_beta
p_values = 2 * (1 - st.t.cdf(np.abs(t_stat), n - p - 1))

F-Statistic

Note: Definition

\[F = \frac{(SSE_C - SSE_A) / (p_A - p_C)}{SSE_A / (n - p_A)}\]

Or equivalently: \[F = \frac{PRE / (p_A - p_C)}{(1 - PRE) / (n - p_A)}\]

Intuition: Is the error reduction “worth it”? The F-statistic is a ratio:

  • Numerator: Error reduction per added parameter
  • Denominator: Remaining error per observation (adjusted for parameters)

If the added parameters help, \(F \gg 1\).

# Comparing compact vs augmented model
df1 = p_A - p_C  # parameters added
df2 = n - p_A    # residual df for augmented model
F_stat = (PRE / df1) / ((1 - PRE) / df2)

# p-value
p_value = 1 - st.f.cdf(F_stat, df1, df2)

p-value

Note: Definition

\[p = P(\text{test statistic at least as extreme} \mid H_0 \text{ true})\]

Intuition: If there were truly no effect, how surprising would our result be?

  • Small \(p\) (e.g., < 0.05): result is unlikely under the null → reject \(H_0\)
  • Large \(p\): result is compatible with the null → fail to reject \(H_0\)
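
A Monte Carlo sketch of this definition, assuming a hypothetical observed t statistic of 2.3 with 20 residual df (both numbers are arbitrary):

rng = np.random.default_rng(0)
t_obs = 2.3                                      # hypothetical observed t statistic
null_ts = rng.standard_t(df=20, size=100_000)    # t statistics we would see if H0 were true
np.mean(np.abs(null_ts) >= abs(t_obs))           # ≈ 0.03, matching 2 * st.t.sf(2.3, df=20)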
Warning: What p-values are NOT
  • NOT the probability the null is true
  • NOT the probability of replication
  • NOT a measure of effect size

Confidence Interval

Note: Definition

\[\hat{\beta}_j \pm t_{\alpha/2} \cdot SE(\hat{\beta}_j)\]

Intuition: A range of plausible values for the true parameter. If we repeated the study many times, 95% of intervals would contain the true \(\beta\).

  • If the CI excludes 0, the coefficient is “significant” at that level
  • Wider CI = more uncertainty
t_crit = st.t.ppf(0.975, df=n - p - 1)
ci_lower = beta_hat - t_crit * se_beta
ci_upper = beta_hat + t_crit * se_beta

Summary

| Category | What it does | Key formulas |
|---|---|---|
| Aggregation | Summarize a variable | Mean, Median, Mode, Variance, SD, SE |
| Relationships | Measure association | Covariance, Correlation |
| Linear Model | Predict outcomes | OLS, Residuals, \(\hat{\sigma}^2\) |
| Model Fit | Evaluate model quality | PRE, \(R^2\), Log-lik, AIC, BIC |
| Inference | Test hypotheses | t-stat, F-stat, p-value, CI |

The workflow:

  1. Specify → What’s your model? (Design matrix \(\mathbf{X}\), outcome \(\mathbf{y}\))
  2. Estimate → Fit parameters (\(\hat{\boldsymbol{\beta}}\))
  3. Evaluate → How good is the fit? (\(R^2\), AIC, BIC)
  4. Compare → Is complexity worth it? (PRE, F-test)
  5. Infer → What can we conclude? (t-tests, CIs, p-values)
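
A compact end-to-end sketch of this workflow on simulated data; the true coefficients, noise level, sample size, and seed are all arbitrary illustration choices:

rng = np.random.default_rng(42)
n = 100
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 0.5 * x2 + rng.normal(size=n)   # simulated outcome

# 1. Specify: design matrix with an intercept column
X = np.column_stack([np.ones(n), x1, x2])
p = X.shape[1] - 1

# 2. Estimate: OLS coefficients, predictions, residuals
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta_hat
residuals = y - y_hat

# 3. Evaluate: residual variance and R^2
SSE = np.sum(residuals**2)
sigma2_hat = SSE / (n - p - 1)
R2 = 1 - SSE / np.sum((y - np.mean(y))**2)

# 4. Compare: regression vs. mean-only model
SSE_compact = np.sum((y - np.mean(y))**2)
PRE = (SSE_compact - SSE) / SSE_compact
F_stat = (PRE / p) / ((1 - PRE) / (n - p - 1))

# 5. Infer: t statistics, p-values, and 95% CIs for each coefficient
se_beta = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))
t_stat = beta_hat / se_beta
p_values = 2 * st.t.sf(np.abs(t_stat), df=n - p - 1)
t_crit = st.t.ppf(0.975, df=n - p - 1)
ci_lower, ci_upper = beta_hat - t_crit * se_beta, beta_hat + t_crit * se_beta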