{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 01-29 Relationships & Similarity\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "So far, we spent explored how to **summarize single variables** in our datasets using measures of *central tendency*: the mean, median, mode, and measure of *spread*: variance, standard deviation, standard error (standard deviation of a *sampling distribution*).\n", "\n", "When we have a column of numbers – say, students' test scores or reaction times from an experiment – we can *compress* that information into a meaningful summary: the mean tells us about the typical value, the variance describes spread, the median gives us the middle value, etc. These are all ways of **aggregating** - *thoughtfully throwing away information to gain insight* - one of the **4 fundamental concepts** we learned about (aggregation, learning, sampling, & uncertainty). \n", "\n", "But this approach misses something crucial – it doesn't tell us anything about how the variables **change together**. \n", "\n", "How might we come up with a *new statistic* that summarize this change?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Thinking About Similarity\n", "\n", "Consider a real-world example that many of you have firsthand experience with 😅: \n", "\n", "How would we **summarize the relationship** between the number of *hours* you spend working on an assignment and the *score* you receive on it?\n", "\n", "Just as we can compress one column of numbers into a single meaningful value (like the mean), perhaps we can *compress the relationship* between these two variables in a way that tells us how \"similar\" their patterns are.\n", "\n", "This notion of **summarizing similarity** is foundational to many of the concepts we'll learn later in this course.\n", "\n", "Let's think about what it means for 2 variables to be \"similar.\"\n", "\n", "If *hours* and *score* are **similar**, then we should be able to express that in a number that reflects how they \"move together\": do students who study *more* tend to score *higher*?\n", "\n", "Let's play with some data to make this concrete - we'll load a dataset with observations from 100 students measuring study habits and performance. These data contain a column called `study_time` for the number of hours students prepared and `test_score` for the score they received on their exam:\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (100, 3)
" ], "text/plain": [ "shape: (100, 3)\n", "ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”\n", "│ student ┆ study_time ┆ test_score │\n", "│ --- ┆ --- ┆ --- │\n", "│ str ┆ f64 ┆ f64 │\n", "ā•žā•ā•ā•ā•ā•ā•ā•ā•ā•ā•Ŗā•ā•ā•ā•ā•ā•ā•ā•ā•ā•ā•ā•ā•Ŗā•ā•ā•ā•ā•ā•ā•ā•ā•ā•ā•ā•ā•”\n", "│ A ┆ 23.820262 ┆ 100.0 │\n", "│ B ┆ 17.000786 ┆ 80.523981 │\n", "│ C ┆ 19.89369 ┆ 87.08253 │\n", "│ D ┆ 26.204466 ┆ 100.0 │\n", "│ E ┆ 24.33779 ┆ 96.944346 │\n", "│ … ┆ … ┆ … │\n", "│ RRRR ┆ 18.532866 ┆ 95.350268 │\n", "│ SSSS ┆ 15.0525 ┆ 97.822906 │\n", "│ TTTT ┆ 23.929352 ┆ 100.0 │\n", "│ UUUU ┆ 15.63456 ┆ 100.0 │\n", "│ VVVV ┆ 17.009947 ┆ 100.0 │\n", "ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import a helper function we provide to generate some data\n", "from helpers import generate_student_data\n", "\n", "# This function returns a polars DataFrame\n", "df = generate_student_data()\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a scatter plot of the relationship we want to *summarize*:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 384, "width": 384 } }, "output_type": "display_data" } ], "source": [ "import polars as pl\n", "from polars import col\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "sns.set(style='ticks', palette='pastel')\n", "\n", "grid = sns.relplot(\n", " data=df,\n", " kind=\"scatter\",\n", " x=\"study_time\",\n", " y=\"test_score\",\n", " height=4,\n", ")\n", "\n", "grid.set_axis_labels('Study Time (hours)', 'Test Score (out of 100)')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This looks it might be approximately *linear* - as values on the x-axis (study time) *increases*, so do values on the y-axis (test scores)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A First Attempt: Dot Products\n", "\n", "So our goal is to come up with a single number that captures how well `study_time` and `test_score` *move together*.\n", "\n", "One way to capture this \"moving together\" mathematically is **simply through multiplication**. When two numbers have the same sign (both positive or both negative), their product is positive. When they have opposite signs, their product is negative. This suggests a simple approach: multiply corresponding values and add up the products. 
In statistics and linear algebra, this is called the **dot product**:\n", "\n", "$$\n", "\\text{dot\\ product}(x, y) = \\sum_{i=1}^n x_i y_i\n", "$$\n", "\n", "Let's try this out in Python - fortunately this is a common operation, so we can use the `np.dot` function from NumPy to do it for us:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Dot product: 139478.581\n" ] } ], "source": [ "import numpy as np\n", "\n", "# Break out the variables into NumPy arrays to work with them directly\n", "study_time = df['study_time'].to_numpy()\n", "test_score = df['test_score'].to_numpy()\n", "\n", "dot_product = np.dot(study_time, test_score)\n", "\n", "print(f\"Dot product: {dot_product:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The Problem of Measurement Scale\n", "\n", "However, our simple dot product has a problem: it's sensitive to the *scale of the data* - in other words, the *units* that each variable is measured in. If we measured study time in hours versus minutes, or test scores in percentages versus raw points, our dot product would change drastically. \n", "\n", "You can see this by adjusting the score and time multipliers in the widget below - they control the *units* of each variable by multiplying them by a constant value. Notice how the points in the scatter plot don't move, but the *axis limits* and *dot product* do.\n", "\n", "The dot product is also sensitive to the *amount* of data we have - since we're just multiplying raw values and adding them up, simply increasing the sample size will increase the dot product!" 
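, "\n", "\n", "As a rough illustration of both problems, here is a minimal sketch in plain Python. The toy `hours` and `scores` lists and the `dot` helper are made up for this example (they are not part of `helpers` or the student dataset above):\n", "\n", "
```python
# Toy data, invented for illustration only (not the student dataset above)
hours = [2.0, 4.0, 6.0, 8.0]
scores = [60.0, 70.0, 80.0, 90.0]


def dot(x, y):
    # Multiply corresponding values and add up the products
    return sum(a * b for a, b in zip(x, y))


# The same study times expressed in minutes instead of hours
minutes = [h * 60 for h in hours]

dot_hours = dot(hours, scores)
dot_minutes = dot(minutes, scores)

# The same pattern again, but with every observation recorded twice
dot_doubled = dot(hours * 2, scores * 2)

print('hours:', dot_hours)
print('minutes:', dot_minutes)
print('doubled sample:', dot_doubled)
```
", "\n", "Nothing about the underlying pattern differs across the three calls; only the units or the number of rows do, yet the dot product is different each time."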
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2c7807f681ea43ad8c8f4d6c65163028", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=100, description='Sample size', max=500, min=50, step=10), FloatSlider(v…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from helpers import dot_widget\n", "\n", "dot_widget();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could *average* the products of the two variables instead of *summing* them. This is referred to as the **mean inner product** in linear algebra. \n", "\n", "$$ \\text{mean\\ inner\\ product}(x, y) = \\frac{1}{n}\\sum_{i=1}^n x_i y_i $$\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average Inner (dot) Product: 1394.786\n" ] } ], "source": [ "mean_inner_product = np.dot(study_time, test_score) / len(study_time)\n", "\n", "# OR \n", "\n", "mean_inner_product = np.mean(study_time * test_score)\n", "\n", "\n", "print(f\"Average Inner (dot) Product: {mean_inner_product:.3f}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Averaging makes the dot product a bit less sensitive to the number of observations - notice how increasing the sample size no longer increases its magnitude as dramatically as it did before.\n", "\n", "However, it's still sensitive to the scale of the variables!" 
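, "\n", "\n", "To make that trade-off concrete, here is a minimal sketch in plain Python. The toy `hours` and `scores` lists and the `mean_inner_product` helper are made up for this example, and duplicating every row is an idealized stand-in for collecting a larger sample:\n", "\n", "
```python
# Toy data, invented for illustration only (not the student dataset above)
hours = [2.0, 4.0, 6.0, 8.0]
scores = [60.0, 70.0, 80.0, 90.0]


def mean_inner_product(x, y):
    # Average (rather than sum) of the products of corresponding values
    return sum(a * b for a, b in zip(x, y)) / len(x)


mip = mean_inner_product(hours, scores)

# Recording every observation twice leaves the average unchanged
mip_doubled = mean_inner_product(hours * 2, scores * 2)

# Switching study time from hours to minutes still rescales the result
mip_minutes = mean_inner_product([h * 60 for h in hours], scores)

print('original:', mip)
print('doubled sample:', mip_doubled)
print('minutes:', mip_minutes)
```
", "\n", "With real data a larger sample would shift the average a little rather than leave it exactly unchanged, but it would no longer grow just because there are more rows. The change of units, on the other hand, still multiplies the result by 60."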
] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "7fbe73b0170140e1a5ada188e4d143fd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=100, description='Sample size', max=500, min=50, step=10), FloatSlider(v…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from helpers import dot_avg_widget\n", "\n", "dot_avg_widget();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Second Attempt: Co-variance\n", "\n", "The shortcoming of the average inner/dot product as a measurement of *similarity* is that we can't tell whether the summary we get back is due to the scale of our measurements or some *underlying relationship* between them - simply changing the scale of our data changes the \"scale\" of our summary.\n", "\n", "Instead, we might hone in on a more *specific* measure of similarity: how much these variables move together *with respect to their typical values* - in other words - **how similar are the spreads of these variables**?\n", "\n", "We already know about a measure of spread for a single variable - its *variance* - the average squared difference between each data point and the mean.\n", "\n", "$$ var(x) = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})^2 $$\n", "\n", "What if we generalized this idea? We could first **center** each variable around its mean, and *then* compute the average inner product. 
\n", "\n", "This is known as **co-variance**: it improves on the dot product by summarizing *how similarly variables deviate from their means together*\n", "\n", "\n", "$$ \\text{cov}(x,y) = \\frac{1}{n} \\sum_{i=1}^n (x_i - \\bar{x})(y_i - \\bar{y}) $$\n", "\n", "Here we divide by $n$ to match the variance formula above; the *sample* covariance divides by $n-1$ instead." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's calculate it manually:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Covariance: 41.368349718712906\n" ] } ], "source": [ "study_time_centered = study_time - study_time.mean()\n", "test_score_centered = test_score - test_score.mean()\n", "\n", "covariance = np.mean(study_time_centered * test_score_centered)\n", "\n", "print(f\"Covariance: {covariance}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "NumPy also provides a `np.cov` function that returns a **covariance matrix** - where each row/column represents a single variable - in this case our matrix is 2x2 because we have 2 variables: study time and test score.\n", "\n", "This matrix contains the *variance* of each variable along the *diagonal* and the *covariance* between each pair of variables on the *off-diagonals*" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 25.39566548, 41.36834972],\n", " [ 41.36834972, 133.55365192]])" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Off diagonals are the same calculation!\n", "# Diagonals are variances of each variable\n", "np.cov(study_time, test_score, ddof=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can verify this quickly:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Study time variance: 25.395665480373285\n", "Test score variance: 133.5536519193348\n" ] } ], "source": [ "study_time_variance = np.var(study_time, 
ddof=0)\n", "test_score_variance = np.var(test_score, ddof=0)\n", "\n", "print(f\"Study time variance: {study_time_variance}\")\n", "print(f\"Test score variance: {test_score_variance}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Covariance improves on the raw dot product by measuring how variables deviate from their means together. This value will be far from zero when individual data points deviate by similar amounts from their respective means; if they deviate in the same direction then the covariance is *positive*, whereas if they deviate in opposite directions the covariance is *negative*.\n", "\n", "**Visually** you can think about co-variance as the average dot product after we *move* the data to the *origin* of the plot - indicated by the dashed black lines below.\n", "\n", "But notice: it *still depends on the scale of our variables*. You can see that when you increase the scales, the covariance changes, because rescaling a variable rescales its deviations from the mean (its *dispersion*) by the same factor." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "def3842f16d047d59bdf05cca237d3dc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=100, description='Sample size', max=500, min=50, step=10), FloatSlider(v…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from helpers import cov_widget\n", "\n", "cov_widget();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Third Attempt: Cosine Similarity\n", "\n", "One way we can make our summary statistic **invariant** to the scale of the data is to convert our measurement units, i.e. \"hours\" and \"score\", into a common \"unitless\" measure before we multiply them. A very common choice is to use a unit that reflects \"distance from the origin\". 
In linear algebra, we call this the **magnitude** or **norm** of a vector - how far the data is spread away from the value `0`.\n", "\n", "This looks a lot like the formula for *variance*, except that instead of looking at the \"spread\" around the *mean*, we're looking at the \"total distance\" from the value 0.\n", "\n", "$$\n", "\\text{norm}(x) = \\sqrt{\\sum_{i=1}^n x_i^2}\n", "$$\n", "\n", "If we calculate this for both variables and divide the dot product by the product of the two norms, we get a measure of **cosine similarity**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "$$\n", "\\cos(x,y) = \\frac{\\text{dot\\ product}(x,y)}{\\text{norm}(x) \\times \\text{norm}(y)} = \\frac{\\sum_{i=1}^n x_i y_i}{\\sqrt{\\sum_{i=1}^n x_i^2} \\sqrt{\\sum_{i=1}^n y_i^2}}\n", "$$" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Study time norm: 161.07647027998854\n", "Test score norm: 892.1586038893493\n", "Cosine similarity: 0.9705844944670725\n" ] } ], "source": [ "# Euclidean (L2) norm: square root of the sum of squared values\n", "study_norm = np.sqrt(np.sum(study_time**2))\n", "test_norm = np.sqrt(np.sum(test_score**2))\n", "\n", "print(f\"Study time norm: {study_norm}\")\n", "print(f\"Test score norm: {test_norm}\")\n", "\n", "print(f\"Cosine similarity: {np.dot(study_time, test_score) / (study_norm * test_norm)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check our work using a function from the `scipy` library that returns a cosine *distance*. 
We just need to subtract this from 1 to convert it to a similarity:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Scipy Cosine Similarity: 0.9705844944670723\n" ] } ], "source": [ "from scipy.spatial.distance import cosine\n", "\n", "scipy_cos = 1 - cosine(study_time, test_score)\n", "print(f\"Scipy Cosine Similarity: {scipy_cos}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This step of normalizing our dot-product makes our new measure **scale invariant**. Notice how changing the scale of each variable does **not** change the value of the cosine similarity. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e56316d5ddc1414c829835ec725c9640", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=100, description='Sample size', max=500, min=50, step=10), FloatSlider(v…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from helpers import cos_widget\n", "\n", "cos_widget();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While cosine similarity isn't that popular in basic descriptive statistics, it plays an important role when we start building models. You might notice that the way we're calculating a **norm** of each variable includes a \"sum-of-squares\" operation. 
In fact this particular type of \"norm\" is called a **Euclidean norm** or **L2 norm** - because it's fundamentally calculating the straight-line (Euclidean) **distance** of the data from the origin: the square root of a sum of squares.\n", "\n", "However, there's still one issue with this summary statistic...\n", "\n", "We've achieved *scale invariance* by assuming that 0 is a meaningful reference point for both variables, but our real question concerns how similar variables are **with respect to their typical values**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Final Attempt: Correlation\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Making Variables Comparable: Z-Scores\n", "\n", "This is where z-scores come in: when we convert a value to a z-score, we're *normalizing* the data **with respect to its mean**. \n", "\n", "$$ x_{z} = \\frac{x - \\bar{x}}{\\sigma} $$\n", "\n", "Because the denominator here contains the standard deviation, which depends upon the *mean* of a variable, we're converting our variable into units that reflect \"standardized distance from the mean\"\n", "\n", "What we're really doing is profound – we now have a scale that lets us translate *any variable* into the *same* units. \n", "\n", "A z-score of +1 always means one standard deviation above the mean, regardless of whether we're talking about hours, percentages, or milliseconds. 
When we z-score our variables, we're not just changing their units; we're making them \"speak the same language\" while preserving their relative patterns.\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import zscore\n", "\n", "test_score_z = (test_score - test_score.mean())/ test_score.std()\n", "\n", "#OR\n", "study_time_z = zscore(study_time)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's visualize the effect of z-scoring:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 584, "width": 984 } }, "output_type": "display_data" } ], "source": [ "f, ax = plt.subplots(2, 2, figsize=(10,6));\n", "ax[0,0].hist(study_time);\n", "ax[0,0].set_title(f'Study Time (hours)\\nMean: {study_time.mean():.2f}, Std: {study_time.std():.2f}');\n", "ax[0,1].hist(study_time_z);\n", "ax[0,1].set_title(f'Study Time (z-scores)\\nMean: {study_time_z.mean():.2f}, Std: {study_time_z.std():.2f}');\n", "\n", "ax[1,0].hist(test_score);\n", "ax[1,0].set_title(f'Test Score (points)\\nMean: {test_score.mean():.2f}, Std: {test_score.std():.2f}');\n", "ax[1,1].hist(test_score_z);\n", "ax[1,1].set_title(f'Test Score (z-scores)\\nMean: {test_score_z.mean():.2f}, Std: {test_score_z.std():.2f}');\n", "plt.tight_layout();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how we're not changing the *shape* of the data - just re-scaling it to a different range of standardized units:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "### Correlation: The Dot Product of Z-Scores\n", "\n", "And here's where everything comes together - what if we used a *summary statistic* that was simply the product of z-scores?\n", "\n", "This would combine the best properties of dot products - *higher* values means *more* similar - while accounting for different units of measurement - everything is in \"distance from mean\" units.\n", "\n", "In fact, this is **exactly** what Pearson correlation is: the **average dot product of z-scores**! 
\n", "\n", "$$ r_{x,y} = \\frac{1}{n} \\sum_{i=1}^n z_{x_i} z_{y_i} $$\n", "\n", "(This assumes the z-scores use the same $\\frac{1}{n}$ convention for the standard deviation, as in the code above; using $n-1$ in both places gives the same $r$.)\n", "\n", "We can also see this through the lens of *co-variance*: \n", "\n", "If co-variance is equivalent to the *centered* average dot-product of two variables, then correlation is just **normalized covariance**: covariance divided by the product of both variables' standard deviations.\n", "\n", "$$ r_{x,y} = \\frac{\\text{cov}(x,y)}{\\sigma_x \\sigma_y} $$\n", "\n", "Or written out more completely:\n", "\n", "$$\n", "r_{x,y} = \\frac{\\sum_{i=1}^n (x_i - \\bar{x})(y_i - \\bar{y})}{\\sqrt{\\sum_{i=1}^n (x_i - \\bar{x})^2} \\sqrt{\\sum_{i=1}^n (y_i - \\bar{y})^2}}\n", "$$\n", "\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e1854900982e4d1284df1f60cbda30e5", "version_major": 2, "version_minor": 0 }, "text/plain": [ "interactive(children=(IntSlider(value=100, description='Sample size', max=500, min=50, step=10), FloatSlider(v…" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from helpers import corr_widget\n", "\n", "corr_widget();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping Up" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key Takeaways\n", "\n", "Building up this notion of a **summary statistic that captures similarity** by hand should give you a bit more intuition for some key properties of correlation:\n", "\n", "1. *Centering* data makes our metric **more interpretable with respect to the central tendency** of each variable \n", "\n", "2. *Normalizing* data makes our metric **scale invariant** \n", "\n", "3. Using *z-scores* (standardized distance from the mean) as our normalization factor **guarantees our metric is always between -1 and 1**\n", "\n", "4. 
Using a *dot-product* means we can only capture approximately **linear relationships**\n", "\n", "\n", "Now that you can see how measures like correlation *summarize relationships* between variables, it should be abundantly clear that **correlation measures association, not causation** - it's a mathematical *necessity* that follows from how correlation is constructed.\n", "\n", "Nothing we've calculated above takes into account which variable *caused* the other - correlation only sees patterns in standardized deviations; it has no way to know whether X causes Y, Y causes X, or if both are caused by some third variable Z. When we find a correlation of some value `r` **all we know** is that their z-scores tend to move together.\n", "\n", "To go into additional depth on your own here are some additional resources:\n", "\n", "- [Cheatsheet comparing these measures](https://eshinjolly.com/2018/04/12/similarity_metrics/)\n", "- [Chap 13 - Modeling continuous relationships](https://statsthinking21.github.io/statsthinking21-core-site/modeling-continuous-relationships.html#covariance-and-correlation)\n", "- [Nice visual animation of co-variance](https://www.youtube.com/watch?v=TPcAnExkWwQ)\n", "- [Longer walkthrough video of co-variance](https://www.youtube.com/watch?v=qtaqvPAeEJYf)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Other Measures of Association\n", "\n", "We've seen how correlation emerged from a simple idea (dot products) refined through centering and scaling. But correlation isn't always the best tool for the job. Here are some other measures you're likely to encounter, each capturing different aspects of relationships:\n", "\n", "**Spearman Correlation**\n", "Instead of using raw values, Spearman correlation works with *ranks*. It's calculated exactly like Pearson correlation, but after converting values to their rank order. 
This makes it:\n", "- More robust to outliers\n", "- Able to capture *monotonic* relationships (consistently increasing/decreasing, even if not linear)\n", "- Useful when you care about order but not magnitude\n", "\n", "**Euclidean Distance** The straight-line distance between points \n", "- Intuitive for spatial data\n", "- Sensitive to scale (like the raw dot product)\n", "- Used in many clustering algorithms\n", "\n", "**Manhattan Distance** Sum of absolute differences \n", "- More robust to outliers than Euclidean\n", "- Natural for grid-like spaces (like city blocks)\n", "- Often used in high-dimensional data\n", "\n", "\n", "Remember: There's no \"best\" measure of association. Each captures different aspects of relationships between variables and provides different approaches for *summarizing* them. Understanding what each measure preserves and discards helps you choose the right tool for your research questions.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Looking Forward: Regression\n", "\n", "Next week we'll dive into *linear models* and regression, which builds naturally from correlation. \n", "\n", "The key connection we'll explore in more depth is:\n", "\n", "$$ β = r \\frac{σ_Y}{σ_X} $$\n", "\n", "which states that the *slope* of a \"line-of-best-fit\" is equivalent to correlation *scaled* by the ratio of standard deviations between variables. \n", "\n", "And critically, if we z-score our variables *first*, then correlation and simple regression are equivalent! β == r\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### For Next Time (also on course website)\n", "\n", "Please try to find some time to watch the following videos before next week. We won't cover them in depth but will build upon these ideas as we move forward\n", "\n", "- [The Essence of Linear Algebra](https://www.3blue1brown.com/topics/linear-algebra) by 3blue1brown. 
These are bite-sized videos to give you some *high level* intuitions about linear algebra basics, with particularly lovely visuals. If you never formally took any linear algebra (like Eshin), feel math-phobic, or simply need a refresher - this series offers a fresh and fun perspective on the mathematics that underlies most of the modeling you're likely to do. You don't have to watch the full series (unless you want to!), but please check out the following chapters: \n", " - [Chap 1: Vectors, what even are they?](https://www.3blue1brown.com/lessons/vectors) *~10m*\n", " - [Chap 2: Linear combinations, span, and basis vectors](https://www.3blue1brown.com/lessons/span) *~10m*\n", " - [Chap 3: Linear transformations and matrices](https://www.3blue1brown.com/lessons/linear-transformations) ~*11m*\n", " - [Chap 4: Matrix multiplication as composition](https://www.3blue1brown.com/lessons/matrix-multiplication) ~*10m*\n", " - [Chap 5: Three-dimensional linear transformations](https://www.3blue1brown.com/lessons/3d-transformations) ~*5m*\n", " - [Chap 7: Inverse matrices, column space, and null space](https://www.3blue1brown.com/lessons/inverse-matrices) ~*12m*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenge exercises - Your Turn\n", "\n", "Just as some summary statistics of *central tendency* like the *mean* can hide important features of a distribution (like bimodality or skewness), correlation can hide important features of *relationships*.\n", "\n", "A correlation of 0 doesn't mean there's no relationship - it just means there's no detectable *linear* relationship: a perfect U-shaped relationship could have a correlation of zero!\n", "\n", "At the same time, a correlation close to 1 or -1 doesn't mean the relationship is perfectly linear: we need to *visualize* the underlying data.\n", "\n", "Here's a classic example we saw in a previous lab: Anscombe's quartet\n", "\n", "*[Figure: Anscombe's quartet]*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"In the next cell we've loaded this dataset for you. Use it to create some figures and answer the following questions:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 3)
" ], "text/plain": [ "shape: (5, 3)\n", "ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”€ā”€ā”\n", "│ dataset ┆ x ┆ y │\n", "│ --- ┆ --- ┆ --- │\n", "│ str ┆ f64 ┆ f64 │\n", "ā•žā•ā•ā•ā•ā•ā•ā•ā•ā•ā•Ŗā•ā•ā•ā•ā•ā•ā•Ŗā•ā•ā•ā•ā•ā•ā•”\n", "│ I ┆ 10.0 ┆ 8.04 │\n", "│ I ┆ 8.0 ┆ 6.95 │\n", "│ I ┆ 13.0 ┆ 7.58 │\n", "│ I ┆ 9.0 ┆ 8.81 │\n", "│ I ┆ 11.0 ┆ 8.33 │\n", "ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”“ā”€ā”€ā”€ā”€ā”€ā”€ā”˜" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "anscombe = pl.DataFrame(sns.load_dataset(\"anscombe\"))\n", "anscombe.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Add 2 new columns that reflect: mean-centered and z-scored versions of x and y **separately** per dataset" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "2. Create 3 scatterplot grids using `sns.relplot` to visualize the relationship between x & y, their centered version, and their z-scored versions separately per dataset. Does anything change? Why or why not?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "3. Create 2 new columns that reflect: ranked version of x and y **separately** per dataset\n", "\n", "*Hint: you can use `col('x').rank()` in [polars](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.rank.html) to get the rank of a column*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "4. Create two scatterplot grids this time using `sns.lmplot`. Have one plot the original (non-rank transformed) data with a line-of-best-fit and the other plot the rank-transformed data with a line-of-best fit. Do they look the same? 
Why or why not?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "5. Inspecting the raw-data scatterplots for datasets 3 and 4, it looks like there might be one \"outlier\" in each dataset. Use `.filter` in Polars to remove each one and create 3 *new* scatterplots using `sns.relplot` visualizing the relationship between x & y, their centered versions, and their z-scored versions. Did anything change? Why or why not?\n", "\n", "*Hint: Remember to re-calculate centered and z-scored versions of x and y after filtering!*\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "201b", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 2 }