Models I: GLM & OLS#

Learning Goals#

This week’s lab notebooks are designed to give you hands-on experience with the topics we covered in last week’s lectures. This first notebook will review some of the basics we learned about modeling and walk you through how to estimate a model by-hand using numpy.

Specifically, we’ll cover:

  • What a model is

  • The General Linear Model

  • Estimating a GLM with Ordinary Least Squares (OLS)

Slides for reference#

Modeling Data I (slides)
Modeling Data II (slides)
Modeling Data II 1/2 (slides)

How to use this notebook#

This notebook is designed for you to work through at your own pace. When you’re finished you should save and commit your changes and push them to your GitHub Classroom repository.

Try experimenting by creating new code cells and playing around with the demonstrated functionality.

Remember to use ? or help() from within this notebook to look up how functionality works.

What is a model?#

In the physical world, “models” are generally simplifications of things in the real world that nonetheless convey the essence of the thing being modeled. A model of a building conveys the structure of the building while being small and light enough to pick up with one’s hands; a model of a cell in biology is much larger than the actual thing, but again conveys the major parts of the cell and their relationships.

In statistics, a model is meant to provide a similarly condensed description, but for data rather than for a physical structure. Like physical models, a statistical model is generally much simpler than the data being described; it is meant to capture the structure of the data as simply as possible. In both cases, we realize that the model is a convenient fiction that necessarily glosses over some of the details of the actual thing being modeled. As the statistician George Box famously said: “All models are wrong but some are useful.” It can also be useful to think of a statistical model as a theory of how the observed data were generated; our goal then becomes to find the model that most efficiently and accurately summarizes the way in which the data were actually generated. But as we will see below, the desires of efficiency and accuracy will often be diametrically opposed to one another.

The basic structure of a statistical model is:

\[ data = model + error \]

This expresses the idea that the data can be broken into two portions: one portion that is described by a statistical model, which expresses the values that we expect the data to take given our knowledge, and another portion that we refer to as the error that reflects the difference between the model’s predictions and the observed data.

In essence we would like to use our model to predict the value of the data for any given observation \(i\). We would write the equation like this:

\[ \widehat{data}_i = model_i \]

The “hat” over the data denotes that it’s our prediction rather than the actual value of the data. This means that the predicted value of the data for observation \(i\) is equal to the value of the model for that observation. Once we have a prediction from the model, we can then compute the error:

\[ error_i = data_i - \widehat{data}_i \]

That is, the error for any observation is the difference between the observed value of the data and the predicted value of the data from the model. We refer to these errors as model residuals: what’s “left-over” that our model doesn’t explain.

The mean as a model#

Let’s look at an example of building a model for data, using the toy example we discussed in class: whether chocolate consumption is related to how happy people are. In particular, we will try to build a model of the happiness score of people in our sample. First let’s load the data and plot them:

import numpy as np
np.set_printoptions(suppress=True)
import polars as pl
from polars import col
import seaborn as sns
import matplotlib.pyplot as plt

df = pl.read_csv("./data/chocolate.csv")
df.head()

Let’s make a histogram of Happiness scores:

sns.displot(x='Happiness',data=df)

Remember that we want to describe these data as simply as possible while still capturing their important features. The simplest model that we can imagine would involve only a single number; that is, the model would predict the same value for each observation (each person’s happiness), regardless of what else we might know about those observations (how much chocolate they eat). We generally describe a model in terms of its parameters, which are values that we can change in order to modify the predictions of the model.

You’ll most often see these denoted using the Greek letter beta (\(\beta\)); when the model has more than one parameter, we use subscripted numbers to denote the different betas (e.g. \(\beta_0\), \(\beta_1\)). It’s also customary to refer to the values of the data using the letter \(y\), and to use a subscripted version \(y_i\) to refer to each individual observation \(i\).

We generally don’t know the true values of the parameters, so we have to estimate them from the data. For this reason, we will generally put a “hat” over the \(\beta\) symbol to denote that we are using an estimate of the parameter value rather than its true value (which we generally don’t know). Thus, our simple model for happiness using a single parameter would be:

\[ \hat{y}_i = \hat{\beta} \]

The subscript \(i\) doesn’t appear on the right side of the equation, which means that the prediction of the model doesn’t depend on which observation we are looking at — it’s the same for all of them. The question then becomes: how do we estimate the best values of the parameter(s) in the model? In this particular case, what single value is the best estimate for \(\beta\) ? And, more importantly, how do we even define best?

Error minimization#

One notion of best is that we want our model to be as accurate as possible, i.e. have the smallest error \(y_i - \hat{y}_i\).

We learned in class that to account for the negative/positive signs of our errors we often prefer models that minimize the sum of squared errors (SSE):

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

We also learned that the best single parameter model that minimizes this error is the mean.
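To convince yourself, here’s a quick numerical check (a sketch, reusing the df we loaded above): compute the SSE for the mean and for a couple of nearby candidate values and confirm the mean gives the smallest error.

# Sketch: the mean should produce the smallest SSE of any single-value model
happiness = df['Happiness'].to_numpy()

for candidate in [happiness.mean() - 1, happiness.mean(), happiness.mean() + 1]:
    sse = np.sum((happiness - candidate) ** 2)
    print(f"candidate = {candidate:.2f}, SSE = {sse:.2f}")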

You can play with the widget below to make sure you understand the relationship between “best” and “smallest error”.
X1 = Chocolate; y = Happiness

from helpers import mean_model_widget
mean_model_widget()

Note: Other measures of error#

Because our notion of “best” depends on the way we calculate our error, there are other “best” measures according to different error-minimization criteria.

Most commonly, you’ll encounter the median, which is the best single parameter model that minimizes the sum-of-absolute errors (SAE). Because the error associated with each observation \(y_i\) isn’t squared, larger errors (\(y_i - \hat{y}_i\)) aren’t penalized disproportionately more than smaller ones.

This is why we say the median is a “robust” measure - it treats errors linearly - whereas SSE treats them quadratically:

[Figure: absolute (linear) vs. squared (quadratic) error penalties]
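Here’s a small check you can run yourself (a sketch, reusing the Happiness column loaded above): the median gives a smaller (or at worst equal) sum of absolute errors than the mean, while the mean gives a smaller SSE.

# Compare the mean and the median under the two error criteria
happiness = df['Happiness'].to_numpy()

for name, candidate in [('mean', np.mean(happiness)), ('median', np.median(happiness))]:
    sae = np.sum(np.abs(happiness - candidate))
    sse = np.sum((happiness - candidate) ** 2)
    print(f"{name}: SAE = {sae:.2f}, SSE = {sse:.2f}")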

Improving our model: Univariate Regression#

Can we imagine a better model? Clearly we’re ignoring additional information we know about each observation: how much chocolate a person eats. In other words, rather than predicting the same value for each observation (mean happiness), we want our model to be able to change its predictions based on the amount of chocolate eaten by each person.

You might remember from basic algebra that a line is defined as:

\[ y = intercept + slope * x \]

We can extend our model to include this information by adding a parameter that controls the slope of our predictions.

\[ \hat{y_i} = \hat{\beta}_0 + \hat{\beta}_1 * chocolate_i \]

where \(\hat{\beta}_1\) is our estimate of the parameter that we multiply by a person’s chocolate consumption to generate a prediction for their happiness.

And \(\hat{\beta}_0\) is our estimate of the intercept of our model, which is a constant value added to the prediction for each individual. We call it the intercept because it maps onto the intercept in the equation for a straight line. In machine-learning this is called the model’s bias term.

How do we know what the best slope is? We can use the same criterion we did for a 1-parameter model: the value that minimizes the SSE.

You can play with the widget below to make sure you understand the relationship between “best” and “smallest error”.
X1 = Chocolate; y = Happiness

from helpers import slope_model_widget
slope_model_widget()
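If you’d like to see the “guess and check” idea without the widget, here’s a brute-force sketch. It assumes the chocolate-consumption column in df is named 'Chocolate' and uses an arbitrary, illustrative grid of candidate values (so it approximates, rather than exactly finds, the best-fitting line):

# Brute-force search over a coarse grid of intercepts and slopes,
# keeping the pair with the smallest SSE
x_choc = df['Chocolate'].to_numpy()   # assumed column name
y_choc = df['Happiness'].to_numpy()

best_sse = np.inf
best_params = None
for b0 in np.linspace(y_choc.min(), y_choc.max(), 50):    # candidate intercepts
    for b1 in np.linspace(-2, 2, 50):                     # candidate slopes (illustrative range)
        sse = np.sum((y_choc - (b0 + b1 * x_choc)) ** 2)
        if sse < best_sse:
            best_sse, best_params = sse, (b0, b1)

best_params, best_sse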

The General Linear Model#

We can keep extending this idea by adding additional parameters to our model and minimizing the error between the model and the data to evaluate how well we’re able to predict the data.

It turns out that simple univariate regression and even multiple regression are just special cases of the General Linear Model (GLM):

An approach for modeling a single scalar variable \(y\) as a linear function of predictor variables \(X\) plus an independent and identically distributed (iid) set of errors (\(\epsilon\)).

[Figure: the General Linear Model]

Building a deep understanding of the GLM will lay a strong foundation for the rest of your statistics journey. Why?
Because it turns out that most of the statistical tests you’ll use on a regular basis are just GLMs (or extensions) in disguise…

[Figure: common statistical tests as special cases of the GLM]

This week we’ll focus on the “standard” GLM - with uncorrelated errors. Later on we’ll discuss extensions that allow us to handle correlated errors (e.g. repeated measures) and link-functions that allow us to model non-continuous dependent variables (e.g. logistic regression).

Estimating a GLM: Ordinary-Least-Squares (OLS)#

So how do we estimate the parameters of our model?

One approach we discussed is “intelligent guesstimation”, an iterative approach to estimating the parameters of a model: guess, check, update guess, repeat…

However, for linear models, we can use a faster “closed-form” equation called Ordinary-Least-Squares:

\[ \beta = (X^T X)^{-1} X^T y \]

This is what most statistics software estimates for you automatically when you use a formula, e.g. lm(Happiness ~ Chocolate) in R. In the next notebook we’ll see how to use the same Wilkinson formula syntax in Python.
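For a quick preview (a sketch, assuming statsmodels is installed and the chocolate-consumption column is named 'Chocolate'), the Python equivalent of that R call looks something like this:

import statsmodels.formula.api as smf

# Formula-based OLS, analogous to R's lm(Happiness ~ Chocolate)
# (statsmodels expects a pandas DataFrame, so we convert from polars)
smf.ols('Happiness ~ Chocolate', data=df.to_pandas()).fit().params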

But it’s first important to understand what’s happening under-the-hood and not only rely on these easy-to-use functions.
Let’s build some intuitions by implementing OLS by-hand using numpy and some basic linear algebra.

Let’s start with some fundamentals using the Internet Access dataset we discussed in class.

[Figure: the Internet Access dataset]
df_net = pl.read_csv("./data/internet.csv")
df_net.head()

Each row of the dataframe contains measurements about a single state. Our goal is to estimate a model that uses a state’s socio-economic status (SES) to predict the % of residents with internet access:

\[ \hat{internet}_i = \hat{\beta_0} + \hat{\beta_1} SES_i \]

Let’s see how to do this using linear algebra.

Linear Algebra Fundamentals#

At its core, OLS is about solving a system of equations by minimizing SSE. What equations?

One per observation in our dataset:

\[\begin{split} \begin{align*} 79 &= \beta_0 + \beta_1*41.26 \\ 63.5 &= \beta_0 + \beta_1*32.15 \\ 60.9 &= \beta_0 + \beta_1*31.43 \\ \vdots \end{align*} \end{split}\]
# E.g. first 3 rows 
df_net.select(['State','Internet','SES']).head(3)

Linear algebra gives us a set of tools for representing a system of many equations: vectors and matrices.

\[\begin{split} \begin{bmatrix} 79 \\ 63.5 \\ 60.9 \end{bmatrix} = \begin{bmatrix} 1 & 41.26 \\ 1 & 32.15 \\ 1 & 31.43 \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} \end{split}\]

Here’s the correspondence:

[Figure: correspondence between the system of equations and the matrix form]

Vectors#

A vector is simply a list of numbers (e.g., a column of values from your DataFrame). In Python we can represent a vector as a 1-dimensional numpy array:

\[\begin{split} y = \begin{bmatrix} 79 \\ 63.5 \\ 60.9 \end{bmatrix} \end{split}\]
# Convert column to numpy array
y = df_net['Internet'].to_numpy()

# Print first 3 values
y[:3]
# Check dimensionality
y.ndim
# Shape
# Only has 1 value hence 1 dimensional
y.shape

Matrices#

A matrix is a collection of vectors arranged column-wise into a rectangular shape. In the GLM, we store all the predictor variables in a matrix called the Design Matrix. We’ll be exploring this matrix in more detail in the future, especially when we talk about categorical predictors.

\[\begin{split} X = \begin{bmatrix} 1 & 41.26 \\ 1 & 32.15 \\ 1 & 31.43 \end{bmatrix} \end{split}\]

It’s standard practice to set the first column of this matrix to all ones, which allows us to model the mean of \(y\) in addition to the slopes of the other predictors:

X = df_net['SES'].to_numpy()

# Intercept
intercept_column = np.ones(len(X))

# Concatenate 2 arrays column-wise
X = np.column_stack([intercept_column, X])

# First 3 rows
X[:3, :]
# 2 dimensions
X.ndim
# Shape has 2 values (rows x columns) hence 2d
X.shape

Extensibility#

Vectors and matrices allow us to formulate our model in an extensible way:

\[ y = X\beta \]

Where:

  • \( y \) is an \(n \times 1\) vector of observed responses (the dependent variable).

  • \( X \) is the \(n \times p\) design matrix (it contains the values of the predictors, including a column of ones for the intercept).

  • \( \beta \) is a \(p \times 1\) vector of the parameters we want to estimate; 1 per predictor.

Expanded out to many observations and many predictors, the matrix multiplication looks like this:

\[\begin{split} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \ldots & x_{1p} \\ 1 & x_{21} & x_{22} & \ldots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \ldots & x_{np} \\ \end{bmatrix} * \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \\ \end{bmatrix} \end{split}\]

Which is the same as multiplying each column of \(X\) by its corresponding \(\beta\):

\[\begin{split} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \end{bmatrix} = \beta_0 * \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \\ \end{bmatrix} + \beta_1 * \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \\ \end{bmatrix} + \beta_2 * \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \\ \end{bmatrix} \ldots + \beta_p * \begin{bmatrix} x_{1p} \\ x_{2p} \\ \vdots \\ x_{np} \\ \end{bmatrix} \end{split}\]
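Here’s a tiny sketch with made-up numbers (X_demo and beta_demo are hypothetical) showing that the two views above give the same result: a matrix-vector product is just a weighted sum of the matrix’s columns.

# Hypothetical example: matrix-vector product == weighted sum of columns
X_demo = np.array([[1., 2.],
                   [1., 4.],
                   [1., 6.]])
beta_demo = np.array([3., 0.5])

# Matrix-vector dot-product
print(np.dot(X_demo, beta_demo))

# Weighted sum of the columns
print(beta_demo[0] * X_demo[:, 0] + beta_demo[1] * X_demo[:, 1])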

The Ordinary-Least-Squares Solution#

The key OLS equation is:

\[ \beta = (X^T X)^{-1} X^T y \]
[Figure: the OLS equation annotated piece-by-piece]

Let’s understand this equation by implementing each part of the diagram above in separate steps.
Here are the relevant linear algebra operations and how to think about them:

If you prefer visuals make sure to watch The Essence of Linear Algebra videos we added to the course website in Week 4

Transpose#

The transpose of a matrix, denoted \(X^T\), flips rows into columns and vice-versa.

Example:

\[\begin{split} X = \begin{bmatrix} 1 & 41.26 \\ 1 & 32.15 \\ 1 & 31.43 \end{bmatrix} \end{split}\]
\[\begin{split} X^T = \begin{bmatrix} 1 & 1 & 1 \\ 41.26 & 32.15 & 31.43 \end{bmatrix} \end{split}\]

We can transpose a matrix in numpy using .T

# Normal
# Rows = observations
# Columns = predictors
X[:3,:]
# Transposed
# Rows = predictors
# Columns = observations
X[:3,:].T
# And back again
X[:3,:].T.T

Similarity between predictors: Matrix-Matrix dot-product (inner product)#

Remember from our previous class on summarizing relationships that the dot-product allows us to capture a notion of how similar two sets of elements are based on the sum of their products.
The dot-product (or inner product) of two matrices is built from these pairwise similarities: each element of the result measures the similarity between a row of the first matrix and a column of the second.

Each element \(C_{ij}\) in the resulting matrix is the sum of the element-wise products of the \(i\)-th row of \(A\) and the \(j\)-th column of \(B\). To perform this calculation, the matrices must have matching inner dimensions: the number of columns \(k\) in matrix \(A\) must equal the number of rows \(k\) in matrix \(B\):

\[ C_{ij} = \sum_k A_{ik} B_{kj} \]

When we calculate the inner product of a matrix \(X\) with its own transpose, \(X^TX\), it’s like calculating the dot-product between all pairs of columns in that single matrix, including each column with itself. This results in a square matrix with as many rows and columns as \(X\) has columns.
The diagonal elements capture the (unscaled) variance of each individual column and the off-diagonal elements capture the (unscaled) covariance between columns:

\[ C_{kk} = \sum_i X^T_{ki} X_{ik} \]
XTX = np.dot(X.T, X)
XTX

In our internet dataset, the top left value is the dot-product of the intercept with itself - just the total number of observations in the dataset:

np.dot(X[:,0], X[:, 0])


# Equivalent to:
# np.sum(X[:,0] * X[:, 0]) 

The bottom right value is the dot-product of SES with itself:

np.dot(X[:,1], X[:, 1]) 

And the off-diagonal elements are the dot-product between the intercept and SES:

np.dot(X[:,0], X[:, 1]) 

The Gram Matrix#

In OLS this part of the equation, \(X^TX\), has a special name: the “Gram Matrix”.

You can think of it like an unscaled version of the covariance of our predictors. If we mean-center each column of \(X\) before computing the inner product, we can calculate the covariance matrix of \(X\):

\[\begin{split} \begin{align*} G &= X^TX \\ \\ cov(X) &\approx G\frac{1}{n-1} \end{align*} \end{split}\]
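Here’s a quick numerical check of that claim (a sketch using the design matrix X we built above): after mean-centering, the intercept column becomes all zeros, and the bottom-right entry of the scaled Gram matrix should match the sample variance of SES.

# Mean-center each column of X, then compute the scaled Gram matrix
n = X.shape[0]
X_centered = X - X.mean(axis=0)
gram_centered = np.dot(X_centered.T, X_centered)

# Bottom-right entry should equal the sample variance of SES
print(gram_centered / (n - 1))
print(np.cov(X[:, 1]))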

This symmetric matrix allows us to do a few important things:

  1. Capture how similar different predictors are to each other (off-diagonals), so we can remove this shared similarity when estimating the parameters \(\beta\)

  2. Capture the variance (uncertainty) associated with each predictor \(var(\beta)\) and later perform parameter inference by calculating t-statistics and p-values

  3. Estimate the degree of multi-collinearity of our predictors - how correlated they are with each other - an important assumption of the GLM that we’ll explore when we discuss multiple regression

Linear Algebra “undo”: Matrix Inverse#

Now that we’ve captured the similarity between predictors we want to remove it so we can calculate the resulting similarity with \(y\). This allows us to isolate the unique similarity between each column of \(X\) and \(y\) - i.e. the \(\beta\) coefficients we want to solve for with OLS.

Just like you can “undo” or “cancel-out” the effect of multiplication in basic algebra using division or multiplying by the inverse:

\[\begin{split} \begin{align*} 5 / 5 &= 1 \\ 5 * \frac{1}{5} &= 1 \\ 5^{-1} &= \frac{1}{5} \\ 5 * 5^{-1} &= 1 \\ \end{align*} \end{split}\]

In linear algebra, you can “undo” the effect of a matrix by multiplying by its matrix inverse, denoted by raising it to the \(-1\) power.
The linear algebra equivalent of the scalar value \(1\) is called the identity-matrix \(I\), which has 0s for all off-diagonals and 1s across the diagonal:

\[\begin{split} \begin{pmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{pmatrix} = I \end{split}\]

And just like in basic algebra, multiplying a matrix by its inverse is equal to the identity matrix:

\[ (X^TX)({X^TX})^{-1} = I \]

Not all matrices have inverses, but if they do you can calculate them using np.linalg.inv:

XTX_inv = np.linalg.inv(XTX)
# 2x2 identity matrix because X has 2 columns
np.dot(XTX, XTX_inv)

Let’s see how using the inverse undoes the dot product. Let’s start with a vector of 2 values:

vector = np.array([1., 2.])
vector

If we take the dot product of the vector and the matrix \(X^T X\) we get a new vector:

new_vector = np.dot(vector, XTX) 

new_vector

But if we take the dot product of this new vector and the inverse \((X^T X)^{-1}\), we can recover our original vector!

np.dot(new_vector, XTX_inv)  

Similarity between predictors & outcome: Matrix-vector dot-product#

You just saw the last piece of our puzzle: the matrix-vector dot-product!

Just like the matrix-matrix dot-product computes the similarity between columns of two matrices, the matrix-vector dot-product \(X^Ty\) computes the similarity between each column of \(X\) and \(y\).

The result is another vector with the same number of rows as \(X^T\), i.e. the same number of columns as \(X\):

np.dot(X.T, y)

Challenge: Putting the pieces together#

Now we have all the pieces to put together the OLS equation:

\[ \beta = (X^T X)^{-1} X^T y \]

Take a shot writing this out using numpy:

# Your turn
# Fill out the right-hand side of each equation below

XTX = 

XTX_inv = 

XTy = 

betas = 

# display betas
betas

If you did things right you should get a vector of 2 \(\beta\) values:

\[\begin{split} \beta = \begin{bmatrix} 11.28 \\ 1.68 \end{bmatrix} \end{split}\]

Which are our parameter estimates:

\[\begin{split} \begin{align*} \hat{\beta_0} &= 11.28 \\ \hat{\beta_1} &= 1.68 \end{align*} \end{split}\]

That allow us to complete our original model equation:

\[\begin{split} \begin{align*} \hat{internet}_i &= \hat{\beta_0} + \hat{\beta_1} SES_i \\ \hat{internet}_i &= 11.28 + 1.68 * SES_i \end{align*} \end{split}\]
# Solution
XTX = np.dot(X.T, X)

XTX_inv = np.linalg.inv(XTX)

XTy = np.dot(X.T, y)

betas = np.dot(XTX_inv, XTy)

betas

# OR

# np.linalg.inv(np.dot(X.T, X)).dot(np.dot(X.T, y))
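As an aside, explicitly inverting \(X^TX\) is great for building intuition, but in practice a dedicated least-squares solver is usually preferred for numerical stability. A minimal sketch using numpy’s built-in solver, which should recover the same betas:

# Solve the least-squares problem directly instead of forming the inverse
betas_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
betas_lstsq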

Using betas for prediction#

Now that we’ve estimated values for \(\beta\) we can use them to generate predictions for each observation \(\hat{internet}_i\).
Once again linear algebra makes this easy!
Rather than looping over each row of our DataFrame and plugging in our parameter estimates, we can use the dot-product:

\[ \hat{y} = X\hat{\beta} \]
y_hats = np.dot(X,betas)

Let’s visualize these along with the true values of \(y\):

# Add a column to our dataframe with the predictions
df_net = df_net.with_columns(
    Internet_predictions = y_hats
)

df_net

We’ll use sns.FacetGrid to overlay 2 scatterplots:

  1. \(Internet \sim SES\) - the observed data

  2. \(\hat{Internet} \sim SES\) - the predicted data

You can refer to the layering plots section of our previous lab to refresh yourself on seaborn

# Create grid
grid = sns.FacetGrid(data=df_net, height=4)
# Original
grid.map(sns.scatterplot, 'SES', 'Internet', color='black', label='Observed')
# Predicted
grid.map(sns.scatterplot, 'SES', 'Internet_predictions', color='steelblue', label='Predicted');
# Labels
grid.set(ylabel='Internet');
grid.add_legend();

Evaluating our predictions#

Let’s see how well our model did by calculating the residuals: \(y - \hat{y}\)

# Add another column with residuals
df_net = df_net.with_columns(
    residuals = col('Internet') - col('Internet_predictions')
)
df_net

To inspect our model assumptions we can plot these residuals and look for any structure

grid = sns.FacetGrid(data=df_net, height=4)
grid.map(sns.scatterplot, 'SES', 'residuals', color='black');
grid.map(plt.axhline, y=0, color='gray', linestyle='--');

Nothing too obvious pops out which is good!

It’s also helpful to see them as a histogram ignoring our predictor \(SES\)

# Todo: map kde and histogram along with axvline
sns.displot(df_net,x='residuals', color='black', kde=True)

Gauss Markov Theorem#

Hopefully that math wasn’t too bad! We’ll go over more details about regression in the next notebook, but now you should have a decent idea of what’s actually happening in the background.

OLS is one of the most fundamental results in all of statistics and is motivated by the Gauss-Markov Theorem:

If the errors in a regression model have mean zero, are homoskedastic, and uncorrelated, then the OLS \(\beta\) is BLUE: the Best Linear Unbiased Estimator.

In other words, if we meet these assumptions, then the slopes we calculate using OLS will have the lowest variance of any unbiased linear estimator we could have used!

Revisiting the bias-variance tradeoff#

Remember when we discussed what makes a model good?

First, we want it to describe our data well; that is, we want it to have the lowest possible error when predicting our observed data.

Second, we want it to generalize to new data; that is, we want its error to be as low as possible when we use it to predict unobserved data.

It turns out that these two features are often in conflict with each other:

[Figure: the bias-variance tradeoff]

This happens because our error is actually driven by 2 sources:

Bias error (underfitting): error that occurs because we’re making the wrong assumptions about our data, e.g. assuming a straight line is the best model when in reality the relationship is non-linear (left-most panel below)

Variance error (overfitting): error that occurs because our model is too sensitive to small changes in the data that could be caused by random noise, e.g. a curve that fits every data point perfectly (“memorizes” the data) but fails catastrophically when used to make predictions on new data (right-most panel below)

[Figure: underfitting (left), good fit (middle), and overfitting (right)]
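If you’d like to see this in action, here’s a small simulation (a sketch with made-up data; exact numbers will vary with the random seed): we fit a straight line and a very flexible degree-9 polynomial to noisy observations from a truly linear process, then evaluate both on new observations. The flexible fit typically shows near-zero training error but a much larger error on the new data.

# Simulate a truly linear relationship with noise
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.3, size=10)

# New data from the same process, at slightly different x locations
x_new = np.linspace(0.05, 0.95, 10)
y_new = 2 * x_new + rng.normal(0, 0.3, size=10)

for degree in [1, 9]:
    coefs = np.polyfit(x_train, y_train, degree)                      # fit on training data
    train_sse = np.sum((y_train - np.polyval(coefs, x_train)) ** 2)   # error on data used to fit
    new_sse = np.sum((y_new - np.polyval(coefs, x_new)) ** 2)         # error on unseen data
    print(f"degree {degree}: training SSE = {train_sse:.3f}, new-data SSE = {new_sse:.3f}")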

From this perspective the Gauss-Markov theorem tells us that when we use OLS to estimate our model parameters, if our assumptions hold, then the estimates we get are unbiased and have the lowest variance of all possible linear unbiased estimators.

This doesn’t mean that OLS is the best model, only that we can’t find a better slope that minimizes SSE without introducing some additional bias or variance into the model.

Challenge#

Let’s use the same internet dataset to estimate a model with 2 additional parameters:

\[ \hat{internet}_i = \hat{\beta_0} + \hat{\beta_1} SES_i + \hat{\beta_2} College_i + \hat{\beta_3} Auto_i \]

Where \(\hat{\beta_2}\) and \(\hat{\beta_3}\) capture the slopes of:

  • \(College\) - the % of a state’s population that are college educated

  • \(Auto\) - the average number of cars per household

df_net = pl.read_csv("./data/internet.csv")
df_net.head()

(1)#

Use numpy to implement OLS and estimate values for \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\)

For reference the OLS formula is:

\[ \beta = (X^T X)^{-1} X^T y \]

Where \(X\) is a matrix of your independent variables (and an intercept) and \(y\) is a vector of your dependent variable.

If you did things right you should get a vector of 4 \(\beta\) values:

\[\begin{split} \beta = \begin{bmatrix} 18.24 \\ 1.27 \\ 0.31 \\ -0.58 \end{bmatrix} \end{split}\]
# Your code here

(2)#

Using your parameter estimates, compute the predictions of the model \(\hat{internet_i}\) and use them to calculate the sum-of-squared-errors (SSE) for the model

You can generate predictions using:

\[ \hat{y} = X\beta \]

And as a reminder the formula for SSE is:

\[SSE = \sum_{i=1}^n (y_i - \hat{y_i})^2\]

where \(y_i\) is the true value of your dependent variable and \(\hat{y_i}\) is the predicted value of the dependent variable from your model

If you did things right you should get

\[ SSE = 150.183 \]
# Your code here

Wrapping Up#

In this notebook we calculated a GLM by-hand by implementing the OLS solution using numpy. Our goal was to demystify the process of estimating a linear model, by understanding the basic inner-workings using linear algebra. In practice, we’ll often use additional Python libraries and functions to automate this process.

You’re ready to move on to the next notebook and meet statsmodels, a library that allows us to much more easily estimate, evaluate, and compare models to each other.