01-13: Summarization & the Central Limit Theorem#
Adapted from Statistical Thinking for the 21st Century
NOTE: This notebook makes use of interactive notebook widgets which won’t appear if you’re viewing this notebook on the course website. Make sure to clone and run it locally using the GitHub Classroom link provided for this week!
Why summarize data?#
One of the big discoveries of statistics is the idea that we can better understand the world by simplifying and summarizing information.
When we summarize data, we are necessarily throwing away information, and one might plausibly object to this. One reason that we summarize data is that it provides us with a way to generalize - that is, to make general statements that extend beyond specific observations. The importance of generalization was highlighted by the writer Jorge Luis Borges in his short story “Funes the Memorious”, which describes an individual who loses the ability to forget. Borges focuses on the relationship between generalization (i.e. throwing away data) and thinking:
“To think is to forget a difference, to generalize, to abstract. In the overly replete world of Funes, there were nothing but details.”
Psychologists have long studied all of the ways in which generalization is central to thinking. One example is categorization: We are able to easily recognize different examples of the category of “birds” even though the individual examples may be very different in their surface features (such as an ostrich, a robin, and a chicken). Importantly, generalization lets us make predictions about these individuals – in the case of birds, we can predict that they can fly and eat seeds, and that they probably can’t drive a car or speak English. These predictions won’t always be right, but they are often good enough to be useful in the world.
Why summarization works: The Central Limit Theorem (CLT)#
The Central Limit Theorem (CLT) is one of the most profound and elegant results in statistics. Its importance cannot be overstated because it underpins almost every method of statistical inference we use today. The roots of the CLT trace back to the 18th century, when mathematicians began exploring probability distributions and their behaviors.
The central limit theorem has an interesting history. The first version of this theorem was postulated by the French-born mathematician Abraham de Moivre who, in a remarkable article published in 1733, used the normal distribution to approximate the distribution of the number of heads resulting from many tosses of a fair coin. This finding was far ahead of its time, and was nearly forgotten until the famous French mathematician Pierre-Simon Laplace rescued it from obscurity in his monumental work Théorie analytique des probabilités, which was published in 1812. Laplace expanded De Moivre’s finding by approximating the binomial distribution with the normal distribution. But as with De Moivre, Laplace’s finding received little attention in his own time. It was not until the nineteenth century was at an end that the importance of the central limit theorem was discerned, when, in 1901, Russian mathematician Aleksandr Lyapunov defined it in general terms and proved precisely how it worked mathematically. Nowadays, the central limit theorem is considered to be the unofficial sovereign of probability theory. ~ Henk Tijms (2004)
What does it say? Convergence of the sample mean#
The CLT builds on the statistical principle that, given a sufficiently large sample size from a population with finite variance, the mean of the sampled values will be approximately equal to the mean of the whole population.
Or, from a different perspective: a sufficiently large sample lets us estimate the characteristics of a population more accurately.
This convergence of the sample mean to the population mean is known as the law of large numbers. Let’s see it in action:
NOTE: You can ignore the code below and just run the cell
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatText, IntSlider
# Widget for LLN demonstration
def lln_widget(pop_mean=50, pop_sd=10, max_sample_size=1000):
    np.random.seed(0)
    population = np.random.normal(loc=pop_mean, scale=pop_sd, size=100000)
    sample_sizes = np.arange(1, max_sample_size + 1, 1)
    # One sample mean per sample size, drawn from the same population
    sample_means = [np.mean(np.random.choice(population, size=n)) for n in sample_sizes]

    # Plot convergence
    plt.figure(figsize=(10, 5))
    plt.plot(sample_sizes, sample_means, label="Sample Mean", alpha=0.7)
    plt.axhline(pop_mean, color="red", linestyle="--", label="True Population Mean")
    plt.title(f"Law of Large Numbers: Convergence of Sample Mean\nSample Mean = {sample_means[-1]:.3f}")
    plt.xlabel("Increasing Sample Size ->")
    plt.ylabel("Mean")
    plt.ylim(pop_mean - 5, pop_mean + 5)
    plt.legend()
    plt.show()

interact(
    lln_widget,
    pop_mean=FloatText(value=50., description="Pop Mean"),
    pop_sd=FloatText(value=5., step=1, description="Pop SD"),
    max_sample_size=IntSlider(
        value=10, min=10, max=1000, step=1, description="Max Sample Size"
    ),
)
Here’s another way of seeing it based on the shape of the sampling distribution:
NOTE: You can ignore the code below and just run the cell
# Widget for LLN demonstration with histogram visualization
def lln_histogram_widget(pop_mean=50, pop_sd=0.5, sample_size=10):
    np.random.seed(0)
    population = np.random.normal(loc=pop_mean, scale=pop_sd, size=100000)
    sample = np.random.choice(population, size=sample_size)

    # Calculate max height of the normal distribution PDF
    max_height = 1 / (pop_sd * np.sqrt(2 * np.pi))
    y_max = max_height * 1.3  # add 30% headroom above the PDF peak

    # Plot both histograms on a shared density scale
    plt.figure(figsize=(12, 6))
    plt.hist(
        population,
        bins=50,
        alpha=0.5,
        label="Population",
        density=True,
    )
    plt.hist(
        sample,
        bins=50,
        alpha=0.5,
        label="Sample",
        density=True,
    )
    plt.axvline(pop_mean, color="blue", linestyle="--", label="Population Mean")
    plt.axvline(sample.mean(), color="orange", linestyle="--", label="Sample Mean")
    plt.title(
        f"Convergence of Sample Mean {sample.mean():.3f} to Population Mean {pop_mean:.3f}"
    )
    plt.xlabel("Value")
    plt.ylabel("Density")
    plt.ylim(0, y_max)
    plt.xlim(-6 + pop_mean, 6 + pop_mean)
    plt.legend()
    plt.show()

interact(
    lln_histogram_widget,
    pop_mean=FloatText(value=50., description="Pop Mean"),
    pop_sd=FloatText(value=1.0, step=.1, description="Pop SD"),
    sample_size=IntSlider(value=10, min=10, max=1000, step=1, description="Sample Size"),
)
What else does it say? Sampling distributions are “normal”#
Importantly, the CLT also states that the distribution of these sample means will approximate a normal distribution as the size of each sample increases, regardless of the shape of the population’s actual distribution.
In other words, if you take repeated random samples of size \(n\) from a population, the distribution of the sample means will tend toward a normal distribution, even if the original population is not normal! And the mean of this sampling distribution will approximate the population mean.
Why do we care about this sampling distribution? Because its width (spread, variance) is determined by the sample size! As sample size increases, the sampling distribution’s width (what we call the standard error) decreases, and the distribution of sample means becomes more and more similar to a normal distribution.
This is why we say that the sampling distribution converges to a normal distribution as the sample size increases!
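Before playing with the widget, here is a minimal non-widget sketch of the standard-error claim. It is our own addition (the population mean of 50 and SD of 10 are assumed for illustration), comparing the empirical spread of sample means against the theoretical standard error \(\sigma/\sqrt{n}\):
import numpy as np

rng = np.random.default_rng(0)
pop_sd = 10  # assumed population standard deviation for this sketch

for n in [5, 25, 100, 400]:
    # Draw 2000 samples of size n and record each sample's mean
    sample_means = rng.normal(loc=50, scale=pop_sd, size=(2000, n)).mean(axis=1)
    print(
        f"n={n:4d}  empirical SE={sample_means.std():.3f}"
        f"  theoretical SE={pop_sd / np.sqrt(n):.3f}"
    )
The two columns should agree closely, and both shrink as \(n\) grows - exactly the narrowing you’ll see in the widget below.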
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider, RadioButtons

def clt_widget(population_type="Normal", sample_size=30, n_simulations=1000):
    np.random.seed(0)
    if population_type == "Normal":
        population = np.random.normal(loc=50, scale=10, size=100000)
    elif population_type == "Exponential":
        population = np.random.exponential(scale=10, size=100000)
    elif population_type == "Uniform":
        population = np.random.uniform(low=20, high=80, size=100000)

    # Generate sampling distribution of the mean
    means = [
        np.mean(np.random.choice(population, size=sample_size))
        for _ in range(n_simulations)
    ]

    # Plot the sampling distribution
    plt.figure(figsize=(8, 5))
    plt.hist(means, bins=30, edgecolor="k", alpha=0.7, density=True)
    plt.title(
        f"Sampling Distribution of the Mean\nPopulation: {population_type}, Sample Size: {sample_size}"
    )
    plt.xlabel("Mean")
    plt.ylabel("Density")
    plt.axvline(
        np.mean(population), color="red", linestyle="--", label="True Population Mean", linewidth=4
    )
    plt.axvline(
        np.mean(means),
        color="blue",
        linestyle="--",
        linewidth=2,
        label="Mean of Sampling Distribution",
    )
    plt.legend()
    plt.xlim(population.mean() - 5, population.mean() + 5)
    plt.show()

interact(
    clt_widget,
    population_type=RadioButtons(
        options=["Normal", "Exponential", "Uniform"],
        value="Normal",
        description="Population Distribution",
    ),
    sample_size=IntSlider(value=30, min=5, max=100, step=5, description="Sample Size"),
    n_simulations=IntSlider(value=1, min=1, max=500, step=1, description="Simulations"),
)
Other intuitions about the CLT#
Analytic Intuition: The Math Behind the CLT#
Mathematically, the CLT describes how the sampling distribution of the mean approaches normality. For a population with mean \(\mu\) and variance \(\sigma^2\), the sampling distribution of the sample mean \(\bar{X}\) has:
Mean: \(\mu\)
Variance: \(\frac{\sigma^2}{n}\)
As \(n\) increases, \(\frac{\sigma^2}{n}\) decreases, resulting in a narrower distribution centered around \(\mu\).
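A short derivation (assuming the \(n\) draws are independent and identically distributed) shows where the \(\frac{\sigma^2}{n}\) comes from:
\[
\mathrm{Var}(\bar{X}) = \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{n\,\sigma^2}{n^2} = \frac{\sigma^2}{n}
\]
Taking the square root gives the standard error \(\frac{\sigma}{\sqrt{n}}\), the narrowing quantity visualized by the widgets above.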
Algorithmic Intuition: Simulating the CLT in Code#
We can also think of the CLT as an “algorithm” or recipe. Here’s how you might manually simulate it:
# 1. Draw a sample of size N from a population
# 2. Summarize that sample using its mean (arithmetic average)
# 3. Store it in a list
# 4. Repeat 1-3 as many times as you want (e.g. 1000)
# The resulting list of sample means will be approximately normal,
# and their mean will be close to the population mean
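As a concrete illustration, here is one minimal way that recipe might look in numpy. This is a sketch of our own (the population parameters and names like a_sample are assumptions, not part of the course code); the exercise below asks you to build the same thing with only the standard library:
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=100_000)  # assumed population

sampling_distribution = []
for _ in range(1000):                                # 4. repeat steps 1-3 many times
    a_sample = rng.choice(population, size=50)       # 1. draw a sample of size N
    sampling_distribution.append(np.mean(a_sample))  # 2-3. summarize and store the mean

# The stored means are approximately normal and centered near 50
print(np.mean(sampling_distribution), np.std(sampling_distribution))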
Why Is the CLT the Basis of Statistics?#
The CLT provides a bridge from the chaotic, unpredictable world of raw data to the orderly realm of statistical inference. It justifies:
Using sample means to estimate population means.
Constructing confidence intervals (see the sketch after this list).
Performing hypothesis tests.
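For instance, here is a minimal sketch (our own illustration, with assumed data) of the normal-approximation 95% confidence interval for a mean, which the CLT justifies even when the underlying data are skewed:
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=10, size=200)  # skewed data; the CLT still applies to its mean

sample_mean = data.mean()
se = data.std(ddof=1) / np.sqrt(len(data))  # standard error of the mean

# 95% confidence interval via the normal approximation the CLT licenses
lower, upper = sample_mean - 1.96 * se, sample_mean + 1.96 * se
print(f"mean = {sample_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")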
By understanding the CLT, you’re not just learning a theorem; you’re embracing the foundation of modern computational statistics, where we use computers to sample and re-sample from data to build up distributions of the quantities we estimate.
In summary, the CLT and the law of large numbers work together to make statistics possible. They remind us that while individual data points may be noisy, if there is a true signal that the data points reflect, aggregation can help us reveal that signal amid natural, random variation.
Exercise: Simulate the CLT#
Your turn! See if you can use the pseudo-code formula above to make a plot of the sampling distribution of the mean and understand it:
Use the cell below to simulate the CLT
Does it look like a normal distribution? If not, why not?
What can you adjust to make it look more normal? (hint: how big is each sample?)
How does this relate to standard-error (i.e. the spread of the sampling distribution)?
Tips:
use interactive help to get more info about the functions we’re importing for you, e.g.
gauss?
or hist?
you can make multiple copies of the same cell to generate additional plots - check out the copy-cell button/keyboard shortcut/command in VSCode or JupyterLab
# Here are some imports from the standard Python library
# Later on, we'll be using scientific Python libraries like numpy
from random import gauss, sample
from statistics import mean
from matplotlib.pyplot import hist  # for plotting a histogram

population_size = 10000
population = [gauss(0, 1) for _ in range(population_size)]  # standard normal: mean 0, sd 1

# SET ME
sample_size =
n_repetitions =

# To store the sample means
sampling_distribution = []

# Loop over the number of repetitions
for _ in range(n_repetitions):

    # Draw a sample from the population
    this_sample =

    # Calculate the sample mean
    sample_mean =

    # Store it in a list
# Visualize it; customize as needed!
hist(
sampling_distribution,
bins=30,
density=True,
alpha=0.7,
label="Sampling Distribution",
color="skyblue",
);