Approach 1: Minimizing Loss¶
1. Simple Linear Regression¶
Model Structure¶
Simple linear regression models the target variable, \(y\), as a linear function of just one predictor variable, \(x\), plus an error term, \(\epsilon\). We can write the entire model for the \(n^\text{th}\) observation as

$$
y_n = \beta_0 + \beta_1 x_n + \epsilon_n.
$$
Fitting the model then consists of estimating two parameters: \(\beta_0\) and \(\beta_1\). We call our estimates of these parameters \(\hat{\beta}_0\) and \(\hat{\beta}_1\), respectively. Once we’ve made these estimates, we can form our prediction for any given \(x_n\) with

$$
\hat{y}_n = \hat{\beta}_0 + \hat{\beta}_1 x_n.
$$
One way to find these estimates is by minimizing a loss function. Typically, this loss function is the residual sum of squares (RSS). The RSS is calculated with

$$
\text{RSS} = \frac{1}{2}\sumN \left(y_n - \hat{y}_n\right)^2.
$$
We divide the sum of squared errors by 2 in order to simplify the math, as shown below. Note that doing this does not affect our estimates because it does not affect which \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimize the RSS.
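To make the loss concrete, here is a minimal NumPy sketch of this halved RSS (the function and variable names are illustrative, not from the text):

```python
import numpy as np

def rss(y, y_hat):
    """Half the sum of squared residuals, matching the loss defined above."""
    return 0.5 * np.sum((y - y_hat) ** 2)
```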
Parameter Estimation¶
Having chosen a loss function, we are ready to derive our estimates. First, let’s rewrite the RSS in terms of the estimates:

$$
\text{RSS} = \frac{1}{2}\sumN \left(y_n - \left(\hat{\beta}_0 + \hat{\beta}_1 x_n\right)\right)^2.
$$
To find the intercept estimate, start by taking the derivative of the RSS with respect to \(\hat{\beta}_0\):

$$
\frac{\partial \text{RSS}}{\partial \hat{\beta}_0} = -\sumN\left(y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n\right) = -N\left(\bar{y} - \hat{\beta}_0 - \hat{\beta}_1\bar{x}\right),
$$
where \(\bar{y}\) and \(\bar{x}\) are the sample means. Then set that derivative equal to 0 and solve for \(\hat{\beta}_0\):

$$
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}.
$$
This gives our intercept estimate, \(\hat{\beta}_0\), in terms of the slope estimate, \(\hat{\beta}_1\). To find the slope estimate, again start by taking the derivative of the RSS:

$$
\frac{\partial \text{RSS}}{\partial \hat{\beta}_1} = -\sumN x_n\left(y_n - \hat{\beta}_0 - \hat{\beta}_1 x_n\right).
$$
Setting this equal to 0 and substituting for \(\hat{\beta}_0\), we get

$$
\hat{\beta}_1 = \frac{\sumN x_n\left(y_n - \bar{y}\right)}{\sumN x_n\left(x_n - \bar{x}\right)}.
$$
To put this in a more standard form, we use a slight algebra trick. Note that

$$
\sumN c\left(z_n - \bar{z}\right) = 0
$$
for any constant \(c\) and any collection \(z_1, \dots, z_N\) with sample mean \(\bar{z}\) (this can easily be verified by expanding the sum). Since \(\bar{x}\) is a constant, we can then subtract \(\sumN \bar{x}(y_n - \bar{y})\) from the numerator and \(\sumN \bar{x}(x_n - \bar{x})\) from the denominator without affecting our slope estimate. Finally, we get

$$
\hat{\beta}_1 = \frac{\sumN \left(x_n - \bar{x}\right)\left(y_n - \bar{y}\right)}{\sumN \left(x_n - \bar{x}\right)^2}.
$$
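As a quick sanity check (not code from the text), the following NumPy sketch simulates data from a known line and computes \(\hat{\beta}_1\) and \(\hat{\beta}_0\) directly from the closed-form expressions above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from y = 2 + 3x + noise
N = 100
x = rng.uniform(0, 10, size=N)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=N)

# Closed-form estimates derived above
x_bar, y_bar = x.mean(), y.mean()
beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta_0_hat = y_bar - beta_1_hat * x_bar

print(beta_0_hat, beta_1_hat)  # should be close to 2 and 3
```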
2. Multiple Regression¶
Model Structure¶
In multiple regression, we assume the target variable is a linear combination of multiple predictor variables. Letting \(x_{nj}\) be the \(j^\text{th}\) predictor for observation \(n\) and \(D\) be the number of predictors, we can write the model as

$$
y_n = \beta_0 + \sum_{j=1}^{D} \beta_j x_{nj} + \epsilon_n.
$$
Using the vectors \(\bx_n\) and \(\bbeta\) defined in the previous section, this can be written more compactly as

$$
y_n = \bbeta^\top \bx_n + \epsilon_n.
$$
Then define \(\bbetahat\) in the same way as \(\bbeta\), but with the parameters replaced by their estimates. We again want to find the vector \(\bbetahat\) that minimizes the RSS:

$$
\text{RSS} = \frac{1}{2}\sumN \left(y_n - \bbetahat^\top\bx_n\right)^2.
$$
Minimizing this loss function is easier when working with matrices rather than sums. Define \(\by\) and \(\bX\) with

$$
\by = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad \bX = \begin{pmatrix} \bx_1^\top \\ \vdots \\ \bx_N^\top \end{pmatrix},
$$
which gives \(\hat{\by} = \bX\bbetahat \in \mathbb{R}^N\). Then, we can equivalently write the loss function as

$$
\text{RSS} = \frac{1}{2}\left(\by - \bX\bbetahat\right)^\top\left(\by - \bX\bbetahat\right).
$$
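As a small numerical check with made-up data (not from the text), the matrix form of the loss agrees with the sum-over-observations form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up design matrix (leading column of ones for the intercept), targets, and candidate estimates
N, D = 50, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
y = rng.normal(size=N)
beta_hat = rng.normal(size=D + 1)

residuals = y - X @ beta_hat
rss_matrix = 0.5 * residuals @ residuals          # (1/2) * (y - X b)^T (y - X b)
rss_sum = 0.5 * np.sum((y - X @ beta_hat) ** 2)   # (1/2) * sum of squared residuals
assert np.isclose(rss_matrix, rss_sum)
```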
Parameter Estimation¶
We can estimate the parameters in the same way as we did for simple linear regression, only this time calculating the derivative of the RSS with respect to the entire parameter vector. First, note the commonly used matrix derivative below [1].
Math Note
For a symmetric matrix \(\mathbf{W}\), a vector \(\mathbf{q}\), a matrix \(\mathbf{A}\), and a vector \(\mathbf{s}\) of conformable dimensions,

$$
\frac{\partial}{\partial \mathbf{s}}\left(\mathbf{q} - \mathbf{A}\mathbf{s}\right)^\top \mathbf{W} \left(\mathbf{q} - \mathbf{A}\mathbf{s}\right) = -2\mathbf{A}^\top\mathbf{W}\left(\mathbf{q} - \mathbf{A}\mathbf{s}\right).
$$
Applying the result of the Math Note with \(\mathbf{q} = \by\), \(\mathbf{A} = \bX\), and \(\mathbf{s} = \bbetahat\), we get the derivative of the RSS with respect to \(\bbetahat\) (note that the identity matrix takes the place of \(\mathbf{W}\)):

$$
\frac{\partial \text{RSS}}{\partial \bbetahat} = -\bX^\top\left(\by - \bX\bbetahat\right).
$$
We get our parameter estimates by setting this derivative equal to 0 and solving for \(\bbetahat\):

$$
\bX^\top\left(\by - \bX\bbetahat\right) = 0 \quad\Longrightarrow\quad \bbetahat = \left(\bX^\top\bX\right)^{-1}\bX^\top\by,
$$

assuming \(\bX^\top\bX\) is invertible.
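A minimal NumPy sketch of this closed-form solution (illustrative only; it solves the normal equations with np.linalg.solve rather than forming the inverse explicitly):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: design matrix with an intercept column and known coefficients
N, D = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# beta_hat = (X^T X)^{-1} X^T y, computed by solving X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # should be close to beta_true
```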
[1] A helpful guide for matrix calculus is The Matrix Cookbook.