Common Formulas Reference
Definitions, intuitions, and code for statistical quantities
Each formula includes:
- Definition — the mathematical formula
- Intuition — what it means conceptually
- Code — `numpy` and `scipy` to compute it
See also: Notation Reference | Glossary
The Big Picture: Data = Model + Error
Every statistical analysis follows the same fundamental structure: \[\underbrace{y_i}_{\text{Data}} = \underbrace{\hat{y}_i}_{\text{Model}} + \underbrace{e_i}_{\text{Error}}\]
Intuition: We decompose each observation into:
- Model (\(\hat{y}\)): Our best prediction based on some systematic pattern
- Error (\(e\)): Everything we can’t explain — noise, unmeasured factors, randomness
All the formulas in this guide are variations on this theme:
- Aggregation: The simplest model predicts the mean for everyone
- Relationships: We improve predictions by using other variables
- Model fit: How well does the model explain the data?
- Inference: How confident are we in our conclusions?
Whether you’re focused on explanation (understanding why) or prediction (forecasting what), the math is the same. The difference is in what you emphasize: coefficients (\(\hat{\beta}\)) for explanation, predictions (\(\hat{y}\)) for forecasting.
Aggregation & Summarization
These formulas reduce a collection of values to a single summary. The mean is the foundation — it’s the simplest possible “model.”
Mean
Intuition: The mean is the best 1-parameter model for your data — if you had to guess a single value for everyone, the mean minimizes your total squared error.
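Code: a minimal sketch with numpy (the data are illustrative):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
mean = np.mean(x)      # equivalently x.mean()
print(mean)            # 4.4
```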
Median
Intuition: The median is the middle value — it splits the data so 50% are above and 50% below. Unlike the mean, the median minimizes absolute error rather than squared error, making it robust to outliers.
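Code: a minimal sketch contrasting the median with the mean on data containing an outlier:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 100.0])  # note the outlier
print(np.median(x))    # 4.0 -- unaffected by the outlier
print(np.mean(x))      # 23.0 -- pulled upward by it
```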
Mode
Intuition: The most frequent value — the one that occurs with highest probability. Most useful for categorical or discrete data. A distribution can have multiple modes (bimodal, multimodal).
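Code: a minimal sketch; `keepdims=False` is passed explicitly because the default behavior of `scipy.stats.mode` changed across recent scipy versions:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 2, 3, 3, 3, 4])
result = stats.mode(x, keepdims=False)
print(result.mode, result.count)    # 3 3

# pure-numpy alternative: most frequent unique value
vals, counts = np.unique(x, return_counts=True)
print(vals[np.argmax(counts)])      # 3
```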
Variance
Intuition: Variance is the average squared prediction error when your model is just the mean. It measures how spread out the data are around the center.
Variance can be written as: \[s^2 = \frac{SSE}{n - 1} = \frac{\sum(x_i - \bar{x})^2}{n - 1}\]
This is Mean Squared Error (MSE) where the “model” is \(\bar{x}\). The connection between variance and model error runs throughout statistics.
We divide by \(n-1\) rather than \(n\) because we estimated one parameter (the mean) from the data before calculating variance.
Think of it this way: to compute variance, we first need \(\bar{x}\). That’s one calculation that “uses up” information. We have \(n\) data points but only \(n-1\) independent pieces of information left for estimating spread.
Throughout this guide, when you see \(n - k\) in a denominator, \(k\) is the number of parameters estimated to make the calculation possible.
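Code: a minimal sketch showing that `ddof=1` gives the \(n-1\) denominator described above:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])

var = np.var(x, ddof=1)             # sample variance with n-1 denominator

# the same quantity built by hand: SSE of the mean-only "model" over n-1
sse = np.sum((x - x.mean()) ** 2)
print(var, sse / (len(x) - 1))      # both ~3.3
```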
Standard Deviation
Intuition: The square root of variance — puts spread back in the original units of the data.
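Code: the same `ddof=1` convention applies:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])
sd = np.std(x, ddof=1)                    # sqrt of the ddof=1 variance
print(sd, np.sqrt(np.var(x, ddof=1)))     # identical
```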
Standard Error of the Mean
Intuition: How much would the sample mean vary if we repeated the study? SE quantifies uncertainty in the mean, not spread in individual observations.
- Larger \(n\) → smaller SE (more data = more precise estimate)
- SE decreases with \(\sqrt{n}\), so 4× the sample size halves the SE
| Statistic | Question it answers |
|---|---|
| SD (\(s\)) | How spread out are individual observations? |
| SE (\(SE_{\bar{x}}\)) | How precise is our estimate of the mean? |
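Code: a minimal sketch; `scipy.stats.sem` uses the `ddof=1` standard deviation by default, so the two lines below agree:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(loc=10, scale=2, size=400)

se = x.std(ddof=1) / np.sqrt(len(x))   # s / sqrt(n)
print(se, stats.sem(x))                # identical, roughly 2/sqrt(400) = 0.1
```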
Z-Score (Standardization)
Intuition: How many standard deviations is \(x_i\) from the mean?
- \(z = 0\): at the mean
- \(z = 1\): one SD above the mean
- \(z = -2\): two SDs below the mean
Z-scores put variables on a common scale, enabling comparison across different measurements.
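Code: a minimal sketch, by hand and with `scipy.stats.zscore`:

```python
import numpy as np
from scipy import stats

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])

z = (x - x.mean()) / x.std(ddof=1)
print(z)
print(stats.zscore(x, ddof=1))   # same values
```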
Relationships & Similarity
These formulas describe how two variables relate. The core operation is the dot product — measuring how two vectors “point in the same direction.”
Dot Product
Intuition: Multiply corresponding elements and sum. If \(x\) and \(y\) tend to be large together (or small together), the dot product is large. This is the foundation for covariance, correlation, and regression.
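Code: `np.dot` and the `@` operator are equivalent for 1-D arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(np.dot(x, y))   # 1*4 + 2*5 + 3*6 = 32
print(x @ y)          # same thing
```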
Covariance
Intuition: The mean-centered dot product — do \(x\) and \(y\) move together around their respective means?
- Positive: when \(x\) is above its mean, \(y\) tends to be too
- Negative: when \(x\) is above its mean, \(y\) tends to be below
- Zero: no linear relationship
The magnitude depends on the scales of \(x\) and \(y\), making covariance hard to interpret directly.
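Code: a minimal sketch, by hand and via `np.cov` (which divides by \(n-1\) by default):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

# mean-centered dot product, divided by n-1
cov = np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)
print(cov, np.cov(x, y)[0, 1])   # identical
```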
Pearson Correlation
Intuition: Covariance scaled to [-1, 1]. Measures strength and direction of linear relationship.
- \(r = 1\): perfect positive relationship
- \(r = -1\): perfect negative relationship
- \(r = 0\): no linear relationship
Correlation is the average dot product of z-scores: \[r = \frac{1}{n-1} \sum_{i=1}^n z_{x_i} \cdot z_{y_i}\]
When both variables are standardized, their dot product (similarity) is their correlation.
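Code: a minimal sketch showing the z-score identity, plus the library calls:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0])

print(np.corrcoef(x, y)[0, 1])          # r from the correlation matrix

# the z-score identity from above
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
print(np.sum(zx * zy) / (len(x) - 1))   # same r

r, p = stats.pearsonr(x, y)             # scipy also returns a p-value
```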
The Linear Model
Now we move from describing single variables and relationships to building predictive models. The General Linear Model (GLM) is the workhorse of statistics.
The Model Equation
Intuition: Each observation equals a linear combination of predictors plus random error: \[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}\] The coefficients (\(\beta\)) tell us how much each predictor contributes.
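Code: a minimal sketch simulating data from this model; the true coefficients and noise level are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])   # design matrix: intercept + one predictor
beta = np.array([1.0, 2.0])             # true coefficients (chosen for illustration)
epsilon = rng.normal(size=n)            # random error
y = X @ beta + epsilon                  # Data = Model + Error
```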
Predicted Values
Intuition: The model’s best guess for each observation — the “Model” part of Data = Model + Error.
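Code: a minimal sketch with the same simulated data; `np.linalg.lstsq` does the fitting here, and the estimator itself is covered two sections below:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # fitted coefficients
y_hat = X @ beta_hat                               # the "Model" part
```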
Residuals
Intuition: The “Error” part of Data = Model + Error. What’s left after the model makes its prediction.
- Positive residual: model under-predicted
- Negative residual: model over-predicted
- Patterns in residuals suggest the model is missing something
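Code: the residuals are simply what the prediction leaves behind:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta_hat    # the "Error" part
print(residuals.mean())         # ~0 by construction when X has an intercept
```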
OLS Estimator
Intuition: OLS (Ordinary Least Squares) finds coefficients that minimize total squared error: \[\min_{\boldsymbol{\beta}} \sum_i (y_i - \hat{y}_i)^2\] The solution has a closed form: \[\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}\]
Reading the formula:
- \(\mathbf{X}^\top\mathbf{y}\): how each predictor relates to the outcome
- \((\mathbf{X}^\top\mathbf{X})^{-1}\): adjusts for relationships among predictors
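Code: a minimal sketch implementing the formula directly, next to the numerically preferable library routine:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=100)

# the normal equations, verbatim: solve (X'X) beta = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# preferred in practice: avoids explicitly forming X'X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)   # both near the true [1, 2]
```

Solving the normal equations is shown because it mirrors the formula; for ill-conditioned design matrices, `lstsq` (which uses an SVD) is the safer choice.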
Residual Variance (Model Error)
Intuition: The average squared prediction error of the model. This is the same structure as variance, but now:
- The “model” is the regression line (not just the mean)
- We’ve estimated \(p + 1\) parameters (intercept + \(p\) slopes)
- So we divide by \(n - (p + 1) = n - p - 1\)
| Statistic | Model | Parameters | Formula |
|---|---|---|---|
| Variance | Mean only | 1 | \(SSE / (n - 1)\) |
| Residual variance | Regression | \(p + 1\) | \(SSE / (n - p - 1)\) |
Both are “average squared error” — they differ only in what model generated the predictions.
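Code: a minimal sketch; with one predictor plus an intercept, the denominator is \(n - 2\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1                                   # one predictor plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = sse / (n - p - 1)                  # SSE / (n - p - 1)
print(sigma2_hat)                               # near the true noise variance of 1
```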
Variance of Coefficients
Intuition: How uncertain are we about each coefficient? The estimated covariance matrix of the coefficients is \(\widehat{\text{Var}}(\hat{\boldsymbol{\beta}}) = \hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}\).
- Larger \(\hat{\sigma}^2\) (noisier data) → more uncertainty
- Diagonal of \((\mathbf{X}^\top\mathbf{X})^{-1}\) reflects predictor spread and collinearity
Standard Error of Coefficients
Intuition: Uncertainty in the same units as the coefficient. Used for t-tests and confidence intervals.
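Code: a minimal sketch covering this section and the previous one: \(\hat{\sigma}^2 (\mathbf{X}^\top\mathbf{X})^{-1}\) gives the covariance matrix, and the square roots of its diagonal are the standard errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 1
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)

cov_beta = sigma2_hat * np.linalg.inv(X.T @ X)   # variance-covariance matrix
se_beta = np.sqrt(np.diag(cov_beta))             # one standard error per coefficient
print(se_beta)
```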
Model Fit
How well does the model explain the data?
PRE: Proportional Reduction in Error
Intuition: What proportion of the simpler model’s error is eliminated by the complex model?
- \(PRE = 0\): complex model is no better
- \(PRE = 1\): complex model explains everything the simple model missed
- \(PRE = 0.10\): complex model reduces error by 10%
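Code: a minimal sketch comparing a mean-only compact model with a regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

sse_simple = np.sum((y - y.mean()) ** 2)          # compact model: the mean
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_complex = np.sum((y - X @ beta_hat) ** 2)     # augmented model: the regression

pre = (sse_simple - sse_complex) / sse_simple
print(pre)   # proportion of the simple model's error eliminated
```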
R-squared
Intuition: What proportion of variance does the model explain?
- \(R^2 = 0\): model explains nothing
- \(R^2 = 1\): model explains everything
- \(R^2 = 0.3\): model explains 30% of variance
\(R^2\) is PRE when:
- Compact model = “predict the mean for everyone”
- Augmented model = “predict using regression”
\[R^2 = \frac{SSE(\text{mean}) - SSE(\text{regression})}{SSE(\text{mean})} = PRE\]
\(R^2\) always increases when you add predictors, even useless ones. Use adjusted \(R^2\), AIC, or BIC for model selection.
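Code: a minimal sketch using the equivalent \(1 - SSE/SST\) form; the result is identical to the PRE computation above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)   # regression error
sst = np.sum((y - y.mean()) ** 2)       # mean-only error

r2 = 1 - sse / sst
print(r2)
```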
Log-Likelihood
Intuition: How probable is the observed data under the fitted model? Higher (less negative) is better.
- Foundation for AIC, BIC, and likelihood ratio tests
- Used for comparing models fit with Maximum Likelihood
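Code: a minimal sketch for an OLS fit, assuming normally distributed errors; note the ML variance estimate divides by \(n\), not \(n-1\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma_ml = np.sqrt(np.mean(resid ** 2))   # ML estimate: divide by n

log_lik = stats.norm.logpdf(resid, loc=0, scale=sigma_ml).sum()
print(log_lik)   # higher (less negative) = better fit
```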
AIC (Akaike Information Criterion)
Intuition: Balances fit against complexity: \(AIC = 2k - 2\ln\hat{L}\). The \(2k\) term penalizes adding parameters.
- Lower AIC = better model
- Difference of 2+ is meaningful; <2 is negligible
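Code: a minimal sketch, assuming a Gaussian model so the log-likelihood can be written in terms of SSE; the convention of counting the error variance as a parameter is one common choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)

# Gaussian log-likelihood at the ML estimates, in terms of SSE
log_lik = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)
k = X.shape[1] + 1          # coefficients plus the error variance
aic = 2 * k - 2 * log_lik
print(aic)                  # lower = better
```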
BIC (Bayesian Information Criterion)
Intuition: Like AIC but with a stronger penalty for complexity (especially with large \(n\)): \(BIC = k\ln(n) - 2\ln\hat{L}\).
- Lower BIC = better model
- Tends to select simpler models than AIC
| Criterion | Penalty | Tends to select |
|---|---|---|
| AIC | \(2k\) | Better predictions (may overfit) |
| BIC | \(k \log(n)\) | Simpler models (more conservative) |
Neither is “correct” — they answer slightly different questions about model quality.
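Code: the same Gaussian setup as the AIC sketch, with the \(k\log(n)\) penalty swapped in:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ beta_hat) ** 2)
log_lik = -n / 2 * (np.log(2 * np.pi) + np.log(sse / n) + 1)

k = X.shape[1] + 1
bic = k * np.log(n) - 2 * log_lik   # stronger penalty than AIC once n > 7 or so
print(bic)
```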
Statistical Inference
How confident are we in our conclusions? Statistical inference provides tools for testing hypotheses about parameters and comparing models.
| Test | What it tests | When to use |
|---|---|---|
| t-statistic | Is a single parameter different from 0? | Testing individual coefficients |
| F-statistic | Do multiple parameters together improve the model? | Comparing nested models |
Key insight: When testing a single parameter, \(F = t^2\). The F-test generalizes the t-test to multiple parameters simultaneously.
t-Statistic
Intuition: Signal-to-noise ratio for a single coefficient: \(t = \hat{\beta}_j / SE_{\hat{\beta}_j}\). How many standard errors is the estimate from zero?
- \(|t| > 2\) roughly corresponds to \(p < 0.05\)
- Larger \(|t|\) = stronger evidence the coefficient isn’t zero
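Code: a minimal sketch building the t-statistics (and their two-sided p-values) from the pieces defined earlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 1
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
se_beta = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

t = beta_hat / se_beta                          # signal-to-noise per coefficient
p_values = 2 * stats.t.sf(np.abs(t), df=n - p - 1)
print(t, p_values)
```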
F-Statistic
Intuition: Is the error reduction “worth it”? The F-statistic is a ratio: \[F = \frac{(SSE_C - SSE_A)/(p_A - p_C)}{SSE_A/(n - p_A)}\] where \(C\) is the compact model, \(A\) is the augmented model, and \(p_C\), \(p_A\) are their parameter counts.
- Numerator: Error reduction per added parameter
- Denominator: Remaining error per observation (adjusted for parameters)
If the added parameters help, \(F \gg 1\).
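Code: a minimal sketch comparing a mean-only compact model against a one-predictor regression:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

# compact model: mean only (1 parameter); augmented: regression (2 parameters)
sse_c = np.sum((y - y.mean()) ** 2)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_a = np.sum((y - X @ beta_hat) ** 2)
pa, pc = 2, 1

F = ((sse_c - sse_a) / (pa - pc)) / (sse_a / (n - pa))
p_value = stats.f.sf(F, pa - pc, n - pa)
print(F, p_value)
```

Since only one parameter is added here, this F equals the square of the slope's t-statistic, illustrating the \(F = t^2\) insight above.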
p-value
Intuition: If there were truly no effect, how surprising would our result be?
- Small \(p\) (e.g., < 0.05): result is unlikely under the null → reject \(H_0\)
- Large \(p\): result is compatible with the null → fail to reject \(H_0\)
- NOT the probability the null is true
- NOT the probability of replication
- NOT a measure of effect size
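Code: converting a t-statistic to a two-sided p-value (the values are illustrative):

```python
from scipy import stats

t, df = 2.3, 97                            # illustrative values

p_two_sided = 2 * stats.t.sf(abs(t), df)   # area in both tails beyond |t|
print(p_two_sided)                         # about 0.024
```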
Confidence Interval
Intuition: A range of plausible values for the true parameter. If we repeated the study many times, 95% of intervals would contain the true \(\beta\).
- If the CI excludes 0, the coefficient is “significant” at that level
- Wider CI = more uncertainty
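Code: a minimal sketch for 95% intervals, \(\hat{\beta} \pm t_{crit} \cdot SE\):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 100, 1
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
sigma2_hat = np.sum((y - X @ beta_hat) ** 2) / (n - p - 1)
se_beta = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

t_crit = stats.t.ppf(0.975, df=n - p - 1)   # two-sided 95%
lower = beta_hat - t_crit * se_beta
upper = beta_hat + t_crit * se_beta
print(np.column_stack([lower, upper]))      # one row per coefficient
```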
Summary
| Category | What it does | Key formulas |
|---|---|---|
| Aggregation | Summarize a variable | Mean, Median, Mode, Variance, SD, SE |
| Relationships | Measure association | Covariance, Correlation |
| Linear Model | Predict outcomes | OLS, Residuals, \(\hat{\sigma}^2\) |
| Model Fit | Evaluate model quality | PRE, \(R^2\), Log-lik, AIC, BIC |
| Inference | Test hypotheses | t-stat, F-stat, p-value, CI |
The workflow:
- Specify → What’s your model? (Design matrix \(\mathbf{X}\), outcome \(\mathbf{y}\))
- Estimate → Fit parameters (\(\hat{\boldsymbol{\beta}}\))
- Evaluate → How good is the fit? (\(R^2\), AIC, BIC)
- Compare → Is complexity worth it? (PRE, F-test)
- Infer → What can we conclude? (t-tests, CIs, p-values)