4/3/2020

Multiple Linear Regression

Several Predictors

Suppose our model is instead

\[ Y = \beta_0 + \beta_1x_1 + \dots + \beta_kx_k + \epsilon \]

and we make observations \(y_1,\dots,y_n\). We can assemble all this data into vectors and matrices: \[ \hspace{-1em} Y = \begin{pmatrix}y_1\\\vdots\\y_n\end{pmatrix} \quad X = \begin{pmatrix} 1 & x_{11} & \dots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{nk} \end{pmatrix} \quad \beta = \begin{pmatrix}\beta_0\\\vdots\\\beta_k\end{pmatrix} \quad \epsilon = \begin{pmatrix}\epsilon_1\\\vdots\\\epsilon_n\end{pmatrix} \]

Then the model equations for all the observations can be summarized as

\[ Y = X\beta + \epsilon \]

Least Squares

The Least Squares approach would be to minimize

\[ \sum(y_i-\hat y_i)^2 = \sum\epsilon_i^2 = \epsilon^T\epsilon \]

Since \(\epsilon=Y-X\beta\), we need to focus on

\[ (\color{green}{Y}-\color{purple}{X\beta})^T (\color{blue}{Y}-\color{orange}{X\beta}) = \color{green}{Y^T}\color{blue}{Y} - \color{green}{Y^T}\color{orange}{X\beta} - \color{purple}{\beta^TX^T}\color{blue}{Y} + \color{purple}{\beta^TX^T}\color{orange}{X\beta} \]

Since \(Y^TX\beta\) is a scalar, it equals its own transpose \(\beta^TX^TY\), so the two middle terms combine. Seek a critical point:

\[ \frac{\partial}{\partial\beta}(Y^TY - 2 \beta^TX^TY + \beta^TX^TX\beta) = -2X^TY + 2 X^TX\beta \\ =0 \text{ when } X^TY = X^TX\beta \text{ so } \color{blue}{\hat\beta = (X^TX)^{-1}X^TY} \]
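As a quick illustration, the normal equations can be solved directly in R. This is only a sketch, using the built-in mtcars data with wt and hp as arbitrarily chosen predictors:

y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)        # column of ones for the intercept
beta.hat <- solve(t(X) %*% X, t(X) %*% y)  # solves X^T X beta = X^T y
beta.hat
coef(lm(mpg ~ wt + hp, data = mtcars))     # lm gives the same estimates

Solving the linear system directly, rather than explicitly inverting \(X^TX\), is the numerically preferred route.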

The Simple Linear Regression case

With simple linear regression, the matrix formalism looks like:

\[ Y = \begin{pmatrix}y_1\\\vdots\\y_n\end{pmatrix} \quad X = \begin{pmatrix} 1 & x_{1}\\ \vdots & \vdots \\ 1 & x_{n} \end{pmatrix} \quad \beta = \begin{pmatrix}\beta_0\\\beta_1\end{pmatrix} \quad \epsilon = \begin{pmatrix}\epsilon_1\\\vdots\\\epsilon_n\end{pmatrix} \\ X^TX = \begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix} \\ (X^TX)^{-1} = \begin{pmatrix} \sum x_i^2/(nS_{xx}) & -\overline x/S_{xx} \\ -\overline x/S_{xx} & 1/S_{xx} \end{pmatrix} = \begin{pmatrix} c_{00} & c_{01} \\ c_{10} & c_{11} \end{pmatrix} \]
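A small numerical check of that inverse in R (the x values below are arbitrary):

x <- c(1.2, 2.5, 3.1, 4.8, 5.0)
n <- length(x)
X <- cbind(1, x)
Sxx <- sum((x - mean(x))^2)
solve(t(X) %*% X)                              # computed directly
matrix(c(sum(x^2)/(n*Sxx), -mean(x)/Sxx,
         -mean(x)/Sxx,     1/Sxx), nrow = 2)   # the closed form above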

The “Hat matrix”

From all this we get an estimator \(\hat Y = X\hat\beta = X(X^TX)^{-1}X^TY\).

The matrix \(X(X^TX)^{-1}X^T\) is often called the hat matrix - \(H = X(X^TX)^{-1}X^T\).

“It puts the hat on the Y…”
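A minimal sketch in R, again with mtcars as a stand-in dataset, confirming that \(HY\) reproduces the fitted values from lm:

y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)
H <- X %*% solve(t(X) %*% X) %*% t(X)    # the hat matrix
y.hat <- H %*% y                         # puts the hat on the Y
all.equal(as.vector(y.hat), unname(fitted(lm(mpg ~ wt + hp, data = mtcars))))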

SSE

Using this we can calculate the sum of squared residuals \[ SSE = \sum(y_i-\hat y_i)^2 = (Y-X\hat\beta)^T(Y-X\hat\beta) = \\ Y^TY - 2\hat\beta^TX^TY + \hat\beta^TX^TX\hat\beta \]

Here, \[ \hat\beta^TX^TX\hat\beta = \hat\beta^TX^TX(X^TX)^{-1}X^TY = \hat\beta^TX^TY \]

So \(SSE = Y^TY - \hat\beta^TX^TY\).
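Continuing the same sketch, the shortcut formula agrees with summing the squared residuals (mtcars again, purely for illustration):

y <- mtcars$mpg
X <- cbind(1, mtcars$wt, mtcars$hp)
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
SSE <- sum(y^2) - t(beta.hat) %*% t(X) %*% y        # Y^T Y - beta.hat^T X^T Y
c(SSE, sum(residuals(lm(mpg ~ wt + hp, data = mtcars))^2))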

Properties of \(\hat\beta\)

All our previous properties of \(\hat\beta\) carry over to the multiple case. Assume the \(\epsilon_i\) are iid with \(\mathbb{E}\epsilon_i=0\) and \(\mathbb{V}\epsilon_i=\sigma^2\):

  1. \(\mathbb{E}\hat\beta_i=\beta_i\)
  2. \(\mathbb{V}\hat\beta_i = c_{ii}\sigma^2\), where \(c_{ii} = (X^TX)^{-1}_{ii}\)
  3. \(\text{cov}(\hat\beta_i,\hat\beta_j) = c_{ij}\sigma^2\) where \(c_{ij} = (X^TX)^{-1}_{ij}\)
  4. \(SSE/(n-k-1)\) is an unbiased estimator of \(\sigma^2\) (see the sketch below)

If all \(\epsilon_i\) are normal:

  1. Each \(\hat\beta_i\) is normal
  2. \(SSE/\sigma^2 \sim \chi^2(n-k-1)\)
  3. \(SSE\) and \(\hat\beta_i\) are independent
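A sketch checking property 4 (the unbiased estimator of \(\sigma^2\)) against the residual standard error that summary() reports; mtcars is just an example dataset:

fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); k <- 2                  # two predictors
SSE <- sum(residuals(fit)^2)
SSE / (n - k - 1)                          # our estimate of sigma^2
summary(fit)$sigma^2                       # summary()'s residual standard error, squared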

Inferences on \(a^T\beta\)

\(a^T\beta\)

Just as in the simple regression case, we are interested in inferences on linear combinations of the model parameters. These can be used to

  • Make inferences about each \(\beta_i\) separately (using \(a=(0,0,\dots,1,\dots,0)\))
  • Make inferences about the expected response at a new point \(a = x^*\)
  • Make predictions using \(a = x^*\)

The Estimator

\(a^T\hat\beta\) estimates \(a^T\beta\). It is unbiased because of linearity of expectations. If each \(\hat\beta_i\) is normally distributed, then \(a^T\hat\beta\) is also normally distributed.

By expanding \(\mathbb{V}(a^T\hat\beta)\), we can recognize the resulting expressions as \[ \mathbb{V}(a^T\hat\beta) = (a^T(X^TX)^{-1}a)\sigma^2 \]

Tests and confidence intervals

When \(a^T\hat\beta\) is normally distributed and the variance is known, we get a Wald test with a standard normal test statistic

\[ Z = \frac{a^T\hat\beta - (a^T\beta)_0}{\sigma\sqrt{a^T(X^TX)^{-1}a}} \]

and confidence intervals

\[ a^T\beta \in a^T\hat\beta \pm z_{\alpha/2}\sigma\sqrt{a^T(X^TX)^{-1}a} \]

where \(z_\alpha = F_{\mathcal{N}(0,1)}^{-1}(1-\alpha)\).

Tests and confidence intervals

Since \(SSE/\sigma^2\sim\chi^2(n-k-1)\) and \(S^2=SSE/(n-k-1)\), the usual construction produces a T-test whose statistic has \(n-k-1\) degrees of freedom:

\[ T = \frac{a^T\hat\beta - (a^T\beta)_0}{S\sqrt{a^T(X^TX)^{-1}a}} \]

and confidence intervals

\[ a^T\beta \in a^T\hat\beta \pm t_{\alpha/2}S\sqrt{a^T(X^TX)^{-1}a} \]

where \(t_\alpha = F_{t(n-k-1)}^{-1}(1-\alpha)\).
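As a sketch, the interval for a single coefficient (so \(a\) is a unit vector) can be built by hand and compared with confint(); mtcars and the wt coefficient are arbitrary choices:

fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
S <- summary(fit)$sigma
a <- c(0, 1, 0)                                           # picks out the wt coefficient
est <- sum(a * coef(fit))
se <- S * sqrt(as.numeric(t(a) %*% solve(t(X) %*% X) %*% a))
est + c(-1, 1) * qt(0.975, df.residual(fit)) * se         # 95% interval, alpha = 0.05
confint(fit, "wt")                                        # same interval from R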

Prediction

For prediction, we turn to the error \(Y^*-\hat Y^*\). The exact same derivation as in the simple linear regression case produces an extra \(1+\) in the variance, yielding \[ \mathbb{V}\text{error} = (1+a^T(X^TX)^{-1}a)\sigma^2 \] and corresponding Wald and T confidence intervals, so that the prediction interval is

\[ Y^* \in a^T\hat\beta \pm t_{\alpha/2}S\sqrt{1+a^T(X^TX)^{-1}a} \]

where \(t_\alpha = F_{t(n-k-1)}^{-1}(1-\alpha)\).
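In practice predict() produces both interval types directly; the manual construction below is a sketch of the formula, with mtcars and an arbitrary new point:

fit <- lm(mpg ~ wt + hp, data = mtcars)
new <- data.frame(wt = 3.0, hp = 150)
predict(fit, newdata = new, interval = "prediction")      # interval for a new observation Y*
predict(fit, newdata = new, interval = "confidence")      # interval for the mean response
X <- model.matrix(fit)
a <- c(1, 3.0, 150)
S <- summary(fit)$sigma
se.pred <- S * sqrt(1 + as.numeric(t(a) %*% solve(t(X) %*% X) %*% a))
sum(a * coef(fit)) + c(-1, 1) * qt(0.975, df.residual(fit)) * se.pred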

The Partial F-Test

Variable Selection

Using more variables makes for a more flexible model that can capture more complex relationships. But complexity comes with a risk of overfitting: creating a model that fits the data points so precisely that it fails to generalize and interpolate sensibly.

We can approach the variable selection problem by fitting two models - one with all variables, one with a subset - and compare their variances.

Models

We would have a complete model

\[ Y = \beta_0+\beta_1x_1+\dots+\beta_gx_g+\beta_{g+1}x_{g+1}+\dots+\beta_kx_k+\epsilon \]

And a reduced model

\[ Y = \beta_0+\beta_1x_1+\dots+\beta_gx_g+\epsilon \]

We may calculate \(SSE\) for each, yielding quantities \(SSE_C\) for the complete model and \(SSE_R\) for the reduced model.

\(SSE_R\) and \(SSE_C\)

Since the complete model is at least as expressive as the reduced model, we always have \(SSE_C\leq SSE_R\). The quantity \(SSE_R-SSE_C\) measures how much the additional predictors improve the fit.

When the corresponding model is correct, it is straightforward that \[ \frac{SSE_R}{\sigma^2}\sim\chi^2(n-g-1) \qquad \frac{SSE_C}{\sigma^2}\sim\chi^2(n-k-1) \]

By the theorem on differences of \(\chi^2\) variables, it follows that \((SSE_R-SSE_C)/\sigma^2\sim\chi^2(k-g)\).

A null hypothesis for a reduced model

We set out to test \(H_0:\beta_{g+1}=\dots=\beta_k=0\) vs \(H_A:\) at least one \(\neq0\).

If the null hypothesis is false, the additional predictors explain real variation, so the reduction in \(SSE\) from the reduced to the complete model, and hence the value of \((SSE_R-SSE_C)/\sigma^2\), will tend to be larger.

Comparing the complete model and the reduction, we are led to define an \(F\) statistic

\[ F = \frac{(SSE_R-SSE_C)/(k-g)}{SSE_C/(n-k-1)} \sim F_{n-k-1}^{k-g} \]

In R

This is accessible by creating the complete and the reduced linear models separately, and then comparing them using the anova command:

model.r = lm(response ~ x1 + x2, data)
model.c = lm(response ~ x1 + x2 + x3 + x4, data)
anova(model.r, model.c)
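The F statistic can also be reproduced by hand from the two residual sums of squares (deviance() returns the RSS of an lm fit); the sketch below uses mtcars in place of the placeholder variables above:

model.r <- lm(mpg ~ wt, data = mtcars)
model.c <- lm(mpg ~ wt + hp + qsec, data = mtcars)
SSE.r <- deviance(model.r)                         # RSS of the reduced model
SSE.c <- deviance(model.c)                         # RSS of the complete model
n <- nrow(mtcars); k <- 3; g <- 1
Fstat <- ((SSE.r - SSE.c) / (k - g)) / (SSE.c / (n - k - 1))
c(Fstat, pf(Fstat, k - g, n - k - 1, lower.tail = FALSE))   # matches anova()'s F and p-value
anova(model.r, model.c)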

By default, the comparison of \(Y = \beta_0+\epsilon\) with the full model \(Y = \beta_0+\beta_1x_1+\dots+\beta_kx_k+\epsilon\) is computed automatically and reported as the F-statistic in the output of summary(lm).

In R

## 
## Call:
## lm(formula = hwy ~ cty, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3408 -1.2790  0.0214  1.0338  4.0461 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89204    0.46895   1.902   0.0584 .  
## cty          1.33746    0.02697  49.585   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.752 on 232 degrees of freedom
## Multiple R-squared:  0.9138, Adjusted R-squared:  0.9134 
## F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16

Comparing to \(Y=\beta_0+\epsilon\)

When the reduced model removes all predictors, \(SSE_R=S_{YY}\). The proportion of variance explained \[ R^2 = \frac{SSE_R-SSE_C}{SSE_R} = \frac{S_{YY}-SSE_C}{S_{YY}} = 1 - \frac{SSE_C}{S_{YY}} \] is called the Multiple Coefficient of Determination and plays the same role as \(r^2\) does for simple linear regression.

\(R^2\) never decreases when predictors are added, so it can be inflated by including more of them. To compensate, it is common to use the Adjusted \(R^2\), defined as \[ \overline R^2 = 1-\frac{SSE_C/(n-k-1)}{SSE_R/(n-1)} \]
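A sketch computing both quantities from the residual sums of squares and checking them against summary() (mtcars as the example):

fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars); k <- 2
SSE.c <- sum(residuals(fit)^2)
Syy <- sum((mtcars$mpg - mean(mtcars$mpg))^2)                            # SSE of the intercept-only model
c(1 - SSE.c / Syy, summary(fit)$r.squared)                               # multiple R-squared
c(1 - (SSE.c/(n - k - 1)) / (Syy/(n - 1)), summary(fit)$adj.r.squared)   # adjusted R-squared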

ANOVA as a Linear Model

Similar \(H_0\) structures

Recall the ANOVA null hypothesis

\[ H_0:\mu_1=\dots=\mu_k \]

Compare this to the F-test null hypothesis

\[ H_0:\beta_1=\dots=\beta_k=0 \]

Dummy Variables

Given data for an ANOVA, provided as pairs \((Y_i,g_i)\) of observation and group membership, we can construct a new dataset \((Y_i, e_{g_i})\), where \(e_n\) is the \(n\)th unit vector \((0,0,\dots,0,1,0,\dots,0)\).

Then \(Y=X\beta+\epsilon\) reduces straight to the ANOVA model.

Linear Independence

If the columns of \(X\) are not linearly independent, then \((X^TX)^{-1}\) does not exist and least squares estimation fails. Even when the columns are only close to linearly dependent, the matrix inversion step becomes numerically unstable.

In the ANOVA construction, linear dependence is guaranteed by including \(\beta_0\) alongside all the unit vectors: the dummy columns sum to the intercept's column of ones. A common solution is to drop one of the dummy variables and use \(\beta_0\) in its place.
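This is exactly what R does when a factor appears in a model formula; a small sketch with a hypothetical grouping factor:

group <- factor(c("a", "a", "b", "b", "c", "c"))   # hypothetical group labels
model.matrix(~ group)        # intercept plus dummies for levels "b" and "c" (one level dropped)
model.matrix(~ group - 1)    # alternative: one dummy per level, no intercept column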