GLMs

\[ \newcommand{\sumN}{\sum_{n = 1}^N} \newcommand{\sumn}{\sum_n} \newcommand{\prodN}{\prod_{n = 1}^N} \newcommand{\by}{\mathbf{y}} \newcommand{\bX}{\mathbf{X}} \newcommand{\bx}{\mathbf{x}} \newcommand{\bbeta}{\boldsymbol{\beta}} \newcommand{\btheta}{\boldsymbol{\theta}} \newcommand{\blambda}{\boldsymbol{\lambda}} \newcommand{\bbetahat}{\boldsymbol{\hat{\beta}}} \newcommand{\bthetahat}{\boldsymbol{\hat{\theta}}} \newcommand{\bSigma}{\boldsymbol{\Sigma}} \newcommand{\bT}{\mathbf{T}} \newcommand{\dadb}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\iid}{\overset{\small{\text{i.i.d.}}}{\sim}} \]

Ordinary linear regression comes with several assumptions that can be relaxed with a more flexible model class: generalized linear models (GLMs). Specifically, OLS assumes

  1. The expected value of the target variable is a linear function of the input variables

  2. The errors are Normally distributed

  3. The variance of the errors is constant

When these assumptions are violated, GLMs might be the answer.

GLM Structure

A GLM consists of a link function and a random component. The random component identifies the distribution of the target variable \(Y_n\) conditional on the input variables \(\bx_n\). For instance, we might model \(Y_n\) as a Poisson random variable where the rate parameter \(\lambda_n\) depends on \(\bx_n\).

The link function specifies how \(\bx_n\) relates to the expected value of the target variable, \(\mu_n = E(Y_n)\). Let \(\eta_n\) be a linear function of the input variables, i.e. \(\eta_n = \bbeta^\top \bx_n\) for some coefficients \(\bbeta\). We then choose a (typically nonlinear) link function \(g\) to relate \(\mu_n\) to \(\eta_n\):

\[ \eta_n = g(\mu_n). \]

In a GLM, we calculate \(\eta\) before calculating \(\mu\), so we often work with the inverse of \(g\):

\[ \mu_n = g^{-1}(\eta_n) \]

Note

Note that because \(\eta_n\) is a function of the data, it will vary for each observation (though the \(\beta\)s will not).

In total then, a GLM assumes

\[\begin{split} \begin{aligned} Y_n &\sim F_{\mu_n} \\ \mu_n &= g^{-1}(\eta_n) \\ \eta_n &= \bbeta^\top \bx_n, \end{aligned} \end{split}\]

where \(F\) is some distribution with mean parameter \(\mu_n\).
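
To make this structure concrete, here is a minimal numpy sketch of how a GLM maps inputs to expected values. It assumes a log link purely for illustration; the arrays and coefficients are made up.

```python
import numpy as np

# Illustrative data: 3 observations, an intercept plus 2 predictors.
X = np.array([[1.0,  0.5, -1.2],
              [1.0,  1.3,  0.4],
              [1.0, -0.7,  2.1]])   # each row is x_n
beta = np.array([0.1, 0.8, -0.3])   # coefficients (taken as known here)

eta = X @ beta    # linear predictor: eta_n = beta^T x_n
mu = np.exp(eta)  # mean: mu_n = g^{-1}(eta_n), here with the log link g = log
```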

Fitting a GLM

“Fitting” a GLM, like fitting ordinary linear regression, really consists of estimating the coefficients, \(\bbeta\). Once we know \(\bbeta\), we have \(\eta\). Once we have a link function, \(\eta\) gives us \(\mu\) through \(g^{-1}\). A GLM can be fit in these four steps:

  1. Specify the distribution of \(Y_n\), indexed by its mean parameter \(\mu_n\).

  2. Specify the link function \(\eta_n = g(\mu_n)\).

  3. Identify a loss function. This is typically the negative log-likelihood.

  4. Find the \(\bbetahat\) that minimizes this loss function.

In general, we can write the log-likelihood across our observations for a GLM as follows.

\[ \log L\left(\{\mu_n\}_{n = 1}^N; \{Y_n\}_{n = 1}^N\right) = \sumN \log L(\mu_n; Y_n) = \sumN \log L(g^{-1}(\eta_n); Y_n) = \sumN \log L(g^{-1}(\bbeta^\top \bx_n); Y_n). \]

This shows how the log-likelihood depends on \(\bbeta\), the parameters we want to estimate. To fit the GLM, we want to find the \(\bbetahat\) to maximize this log-likelihood.
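
In code, this recipe can be written generically: pick a per-observation negative log-likelihood and an inverse link, then hand the resulting loss to a numerical optimizer. Below is a hedged sketch using scipy.optimize.minimize as a stand-in optimizer; the names fit_glm, neg_log_lik_one, and inv_link are illustrative, not a standard API.

```python
import numpy as np
from scipy.optimize import minimize

def fit_glm(X, y, neg_log_lik_one, inv_link):
    """Estimate beta by minimizing the summed negative log-likelihood.

    neg_log_lik_one(mu, y): per-observation negative log-likelihood
    inv_link(eta):          g^{-1}, mapping the linear predictor to the mean
    """
    def loss(beta):
        mu = inv_link(X @ beta)                # mu_n = g^{-1}(beta^T x_n)
        return np.sum(neg_log_lik_one(mu, y))  # sum the per-observation losses
    beta0 = np.zeros(X.shape[1])               # arbitrary starting point
    return minimize(loss, beta0).x
```

The Poisson regression example below is a special case: a Poisson random component with a log link.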

Example: Poisson Regression

Step 1

Suppose we choose to model \(Y_n\) conditional on \(\bx_n\) as a Poisson random variable with rate parameter \(\lambda_n\):

\[ Y_n|\bx_n \sim \text{Pois}(\lambda_n). \]

Since the expected value of a Poisson random variable is its rate parameter, \(E(Y_n) = \mu_n = \lambda_n\).

Step 2

To determine the link function, let’s think in terms of its inverse, \(\lambda_n = g^{-1}(\eta_n)\). We know that \(\lambda_n\) must be positive, while \(\eta_n\) can be any real number since it is a linear function of \(\bx_n\). One function that maps the reals onto the positive reals is

\[ \lambda_n = \exp(\eta_n), \]

meaning

\[ \eta_n = g(\lambda_n) = \log(\lambda_n). \]

This is the “canonical link” function for Poisson regression.
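
To make Steps 1 and 2 concrete, the following sketch simulates data from this model. The sample size, coefficients, and names (X, y, beta_true) are arbitrary choices for illustration and are reused in the sketches below.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 500
X = np.column_stack([np.ones(N), rng.normal(size=N)])  # intercept plus one predictor
beta_true = np.array([0.5, 0.3])                       # arbitrary "true" coefficients

lam = np.exp(X @ beta_true)  # lambda_n = exp(beta^T x_n), i.e. the log link above
y = rng.poisson(lam)         # Y_n | x_n ~ Pois(lambda_n), the random component
```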

Step 3

Let’s derive the negative log-likelihood for the Poisson. Let \(\blambda = \begin{bmatrix} \lambda_1, \dots, \lambda_N\end{bmatrix}^\top\).

Math Note

The PMF for \(Z \sim \text{Pois}(\lambda)\) is

\[ p(z) = \frac{e^{-\lambda}\lambda^z}{z!} \propto e^{-\lambda}\lambda^z. \]
Working with the proportional form of the PMF (the \(Y_n!\) terms do not involve \(\blambda\) and can be dropped), the likelihood and log-likelihood are, up to constants,

\[\begin{split} \begin{align*} L(\blambda; \{Y_n\}_{n = 1}^N) &\propto \prodN \exp(-\lambda_n) \lambda_n^{Y_n} \\ \log L(\blambda; \{Y_n\}_{n = 1}^N) &= \sumN \left( Y_n\log\lambda_n - \lambda_n \right). \end{align*} \end{split}\]

Now let’s get our loss function, the negative log-likelihood. Recall that this should be in terms of \(\bbeta\) rather than \(\blambda\) since \(\bbeta\) is what we control.

\[\begin{split} \begin{align} \mathcal{L}_N(\bbeta) &= -\left(\sumN Y_n\log(\exp(\eta_n)) - \exp(\eta_n)\right) \\ &= \sumN \left(\exp(\eta_n) - Y_n \eta_n \right) \\ &= \sumN \left(\exp(\bbeta^\top \bx_n) - Y_n\bbeta^\top\bx_n\right). \end{align} \end{split}\]
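
This loss translates directly into code. Below is a minimal numpy sketch, assuming X and y are arrays like those in the simulation sketch above; the function name is illustrative.

```python
import numpy as np

def poisson_neg_log_lik(beta, X, y):
    """Poisson regression negative log-likelihood (up to constants)."""
    eta = X @ beta                        # eta_n = beta^T x_n
    return np.sum(np.exp(eta) - y * eta)  # sum_n ( exp(eta_n) - Y_n eta_n )
```

Equivalently, in the generic fit_glm sketch above, the per-observation loss would be mu - y * np.log(mu) with inv_link = np.exp.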

Step 4

We obtain \(\bbetahat\) by minimizing this loss function. Let’s take the gradient of the loss function with respect to \(\bbeta\).

\[ \dadb{\mathcal{L}_N(\bbeta)}{\bbeta} = \sumN\left(\exp(\bbeta^\top\bx_n)\bx_n - y_n\bx_n \right). \]

Ideally, we would solve for \(\bbetahat\) by setting this gradient equal to 0. Unfortunately, there is no closed-form solution. Instead, we can approximate \(\bbetahat\) through gradient descent. This is done in the construction section.

Since gradient descent calculates this gradient a large number of times, it’s important to calculate it efficiently. Let’s see if we can clean this expression up. First recall that

\[ \hat{y}_n = \hat{\lambda}_n = \exp(\bbetahat^\top \bx_n). \]

The gradient can then be written as

\[ \dadb{\mathcal{L}_N(\bbetahat)}{\bbetahat} = \sumN\left(\hat{y}_n\bx_n - y_n\bx_n \right). \]

Further, this can be written in matrix form as

\[ \dadb{\mathcal{L}_N(\bbetahat)}{\bbetahat} = \bX^\top(\hat{\by} - \by), \]

where \(\hat{\by}\) is the vector of fitted values. Finally note that this vector can be calculated as

\[ \hat{\by} = \exp(\bX \bbetahat), \]

where the exponential function is applied element-wise to each observation.
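
Putting these pieces together, a minimal gradient-descent sketch for approximating \(\bbetahat\) might look like the following. The step size and iteration count are arbitrary and may need tuning, and this is only a sketch of the idea, not the construction section’s implementation.

```python
import numpy as np

def fit_poisson_regression(X, y, lr=1e-3, n_iters=10_000):
    """Approximate beta-hat by gradient descent on the Poisson loss."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        y_hat = np.exp(X @ beta)  # fitted values: y-hat = exp(X beta)
        grad = X.T @ (y_hat - y)  # gradient: X^T (y-hat - y)
        beta -= lr * grad         # step against the gradient
    return beta

# Usage (e.g. with the simulated X and y from the earlier sketch):
# beta_hat = fit_poisson_regression(X, y)
```

Writing the gradient in the matrix form \(\bX^\top(\hat{\by} - \by)\) keeps each iteration to a couple of matrix operations rather than a Python loop over observations.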


Many other GLMs exist. One important example is logistic regression, the topic of the next chapter.