{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 03-05: Transformations and Assumptions\n", "\n", "This notebook provides a brief reference guide for **data transformation**, specifically:\n", "1. Why we transform data\n", "2. How to check if you should transform (illustrative example with log transform)\n", "3. Common transformations in social science research\n", "4. Additional reading and reference materials\n", "\n", "**Recommended Readings:** \n", "*The following chapters are available on the course website for week 9 and are highly recommended references for additional context summarized by this notebook*\n", "\n", "- Regression and Other Stories\n", " - Chapter 11: Assumptions, Diagnostics, and Model Evaluation \n", " - Chapter 12: Transformations and Regression\n", " - 10 Quick Tips to Improve your Regression Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Why Transform?\n", "\n", "You'll often be in a situation where you've collected data on different measurement scales or those that look a bit quirky when you visualize them. In these circumstances it can often be help to **transform** your data. We typically transform data for 3 primary reasons:\n", "1. To better match the **assumptions** of our model\n", "2. To improve **interpretation**\n", "3. To reduce the **influence** of outliers (or rare extreme values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### GLM Assumptions\n", "\n", "Let's remind ourselves of the core assumption of the General-Linear-Model: **I**dependent and **I**dentitically **D**istributed errors (**i.i.d**)\n", "\n", "This overall idea captures a few critical pieces that we've encountered in different forms:\n", "\n", "- **Normality:** assumes that the *residuals* ($y - \\hat{y}$) are normally distributed. It’s actually okay if the predictors \n", " and the outcome are non-normal, so long as the residuals are normal.\n", "- **Additivity & Linearity:** assumes that the outcome $y$ is a linear function of separate predictors $\\beta_0 + \\beta_1X_1 +...$\n", "- **Homoscedasticity:** assumes that the variance of our residuals doesn't change as function of our predictors; we shouldn't be getting more or less wrong ($y - \\hat{y}$) depending upon what value our predictor $X$ takes on; this matters *a lot* when we are using categorical predictors and calculating ANOVA statistics\n", "- **Independent of errors:** assumes our residuals don't depend upon each other - this really only gets violated when you have repeated-measures, time-series, geospatial, or multi-level\n", "- **No perfect multi-collinearity:** assumes that our predictors are just linear combinations of each other, otherwise we can figure out what the \"unique variance\" each one contributes to explaining $y$!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| **Assumption** | **How to Notice** | **Effect on Model** | **Example Data** |\n", "|--------------|--------------|-----------------|-------------------|\n", "| **Linearity** | Curved relationships in data plot | Poor model fit| Perceptual measurements, learning curves |\n", "| **Homoscedasticity** | Residuals fan out in residual plot | Incorrect standard errors | Data with a huge range, e.g. 1-100000 | \n", "| **Normality of Residuals** | Skewed residuals | Invalid statistical tests | Reaction times, income |\n", "| **Multicollinearity** | Highly correlated predictors | Inflated standard errors | Z-score Standardization | Highly correlated predictors |\n", "\n", "
| country | region | group | fertility | ppgdp | lifeExpF | pctUrban | infantMortality |
|---|---|---|---|---|---|---|---|
| str | str | str | f64 | f64 | f64 | f64 | f64 |
| "Afghanistan" | "Asia" | "other" | 5.968 | 499.0 | 49.49 | 23.0 | 124.535 |
| "Albania" | "Europe" | "other" | 1.525 | 3677.2 | 80.4 | 53.0 | 16.561 |
| "Algeria" | "Africa" | "africa" | 2.142 | 4473.0 | 75.0 | 67.0 | 21.458 |
| "Angola" | "Africa" | "africa" | 5.135 | 4321.9 | 53.17 | 59.0 | 96.191 |
| "Argentina" | "Latin Amer" | "other" | 2.172 | 9162.1 | 79.89 | 93.0 | 12.337 |