3/31/2020

Probabilistic Linear Modeling

Modeling

By modeling here we mean attempting to construct a function \(f\) from some family of functions that best agrees with (fits to) observed input/output pairs.

The process starts with observations \((x_1,y_1),\dots,(x_n,y_n)\) and tries to construct an \(f\) that lets us predict \(y_* = f(x_*)\) for new inputs \(x_*\).

Modeling can be done deterministically: \(y = f(x)\)

Deterministic models will have difficulty accounting for noise and uncertainty. As an alternative, we can explicitly include a random variable \(\epsilon\) in the model: \(y = f(x) + \epsilon\)

Here \(x\) is a predictor variable, \(y\) is a response variable and \(\epsilon\) is a noise term.

Normal Noise

A very common choice is to construct the model as

\[ Y = f(x) + \epsilon \qquad\text{where } \epsilon\sim\mathcal{N}(0,\sigma^2) \]

Equivalently, we can write this as \(Y\sim\mathcal{N}(f(x), \sigma^2)\)

Once a random model is established, we can inspect, analyze and predict from it using familiar tools from probability.
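As a concrete illustration, here is a minimal Python sketch of simulating from such a model; the mean function \(f\), the sample size and \(\sigma\) are arbitrary choices for the example, not anything fixed by the material above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # an arbitrary "true" mean function, chosen only for this illustration
    return 2.0 + 0.5 * x

n, sigma = 100, 1.5                    # illustrative sample size and noise level
x = np.linspace(0.0, 10.0, n)          # predictor values
eps = rng.normal(0.0, sigma, size=n)   # noise term: N(0, sigma^2)
y = f(x) + eps                         # equivalently, Y ~ N(f(x), sigma^2)

# the deviations y - f(x) should have mean near 0 and spread near sigma
print((y - f(x)).mean(), (y - f(x)).std())
```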

Linear Modeling

Our focus here is on linear models:

\[ Y = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n + \epsilon \\ \epsilon\sim\mathcal{N}(0,\sigma^2) \]

Linearity here refers to the coefficients \(\beta_i\); the predictor variables themselves may enter through non-linear transformations and more complicated functions. Examples of linear models in this sense include

\[ \beta_0+\beta_1t+\beta_2e^t \qquad \beta_0+\beta_1x+\beta_2\sin x \]
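To emphasize that "linear" refers to the coefficients, the following sketch fits the first example model \(\beta_0+\beta_1t+\beta_2e^t\) by ordinary least squares on a design matrix whose columns are non-linear functions of \(t\). The data and the true coefficient values are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 2.0, 50)
y = 1.0 + 2.0 * t - 0.5 * np.exp(t) + rng.normal(0, 0.1, t.size)

# Design matrix with columns 1, t, e^t: non-linear in t, linear in the betas
X = np.column_stack([np.ones_like(t), t, np.exp(t)])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [1.0, 2.0, -0.5]
```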

Zero-centered noise

We will throughout assume that \(\mathbb{E}\epsilon=0\).

If this were not the case, we could replace \(\beta_0\) with \(\beta_0-\mathbb{E}\epsilon\) to create an equivalent model with zero-centered noise.

Simple vs. Multiple

This week, we focus on simple linear regression models, where the model is \[ Y = \beta_0+\beta_1 x +\epsilon \]

Simple: only one predictor; regression: continuous response

Next week we will look closer at multiple regression models, where the model is \[ Y = \beta_0+\beta_1 x_1 + \dots + \beta_nx_n +\epsilon \]

Least Squares Estimation

Given observations \((x_1,y_1),\dots,(x_n,y_n)\), our goal is to estimate \(\beta_0\) and \(\beta_1\) such that \(\hat y_i = \hat\beta_0+\hat\beta_1x_i\approx y_i\).

The process will depend fundamentally on how we measure the approximation error, comparing \(\hat y_i\) with \(y_i\).

If we choose to measure by the sum of squared deviations (or errors, or residuals), then we can produce explicit, closed-form, easy-to-use estimators!

Simple Linear Model

Write \(\hat y_i = \hat\beta_0+\hat\beta_1x_i\) for the predicted value and \(y_i-\hat y_i\) for the deviation (or error, or residual), and as the error measure use

\[ SSE = \sum(y_i-\hat y_i)^2 = \sum(y_i-(\hat\beta_0+\hat\beta_1x_i))^2 \]

To find the minimizer of this error measure, we take partial derivatives and seek their vanishing points.

Simple Linear Model

\[ \begin{aligned} \frac{\partial SSE}{\partial\hat\beta_0} &= \sum 2(y_i-(\hat\beta_0+\hat\beta_1x_i))\cdot(-1) \\ &= -2\left(\sum y_i - n\hat\beta_0 - \hat\beta_1\sum x_i\right) \\ =0 &\text{ when} \\ \hat\beta_0 &= \frac{1}{n}\left(\sum y_i-\hat\beta_1\sum x_i\right) \\ &= \overline y - \hat\beta_1\overline x \end{aligned} \]

Simple Linear Model

\[ \begin{aligned} \frac{\partial SSE}{\partial\hat\beta_1} &= \sum 2(y_i-(\hat\beta_0+\hat\beta_1x_i))\cdot(-x_i) \\ &= -2\left(\sum y_ix_i - \hat\beta_0\sum x_i - \hat\beta_1\sum x_i^2\right) \\ =0 &\text{ when} \\ \hat\beta_1\sum x_i^2 &= \sum x_iy_i - \hat\beta_0\sum x_i = \sum x_iy_i - \hat\beta_0n\overline x\\ &= \sum x_iy_i - n\overline x\overline y + \hat\beta_1 n \overline x^2 \\ \text{so }\hat\beta_1\left(\sum x_i^2-n\overline x^2\right) &= \sum x_iy_i-n\overline x\overline y \\ \hat\beta_1 &= \left(\sum x_iy_i-n\overline x\overline y\right)/\left(\sum x_i^2-n\overline x^2\right) \\ &=\frac{\sum(x_i-\overline x)(y_i-\overline y)}{\sum(x_i-\overline x)^2} \end{aligned} \]

Notation

We will adopt the convention

\[ S_{ab} = \sum(a_i-\overline a)(b_i-\overline b) \] and observe that \(S_{ab} = \sum a_ib_i - n\overline a\overline b\). Both the notation and the observation will get a lot of use in the coming material.
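A quick numerical check of the identity \(S_{ab} = \sum a_ib_i - n\overline a\overline b\), using arbitrary vectors (the data here are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=20), rng.normal(size=20)
n = a.size

S_ab_centered = np.sum((a - a.mean()) * (b - b.mean()))
S_ab_shortcut = np.sum(a * b) - n * a.mean() * b.mean()
print(np.isclose(S_ab_centered, S_ab_shortcut))  # True
```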

With this notation, we have established that for a simple linear model,

\[ \begin{aligned} \hat\beta_1 &= S_{xy}/S_{xx} & \hat\beta_0 &= \overline y - \hat\beta_1\overline x \end{aligned} \]

are the least squares estimators.
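A minimal sketch of the closed-form estimators in Python, using simulated data (the true intercept and slope, 2 and 0.5, are illustrative). `np.polyfit` is included only as an independent cross-check of the arithmetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 50, 1.0
x = np.linspace(0.0, 10.0, n)
y = 2.0 + 0.5 * x + rng.normal(0, sigma, n)   # simulated observations

S_xy = np.sum((x - x.mean()) * (y - y.mean()))
S_xx = np.sum((x - x.mean()) ** 2)

beta1_hat = S_xy / S_xx                 # slope estimator
beta0_hat = y.mean() - beta1_hat * x.mean()   # intercept estimator
print(beta0_hat, beta1_hat)

# cross-check: np.polyfit returns [slope, intercept] for degree 1
print(np.polyfit(x, y, deg=1))
```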

No-intercept Model

The same method works in more settings. Consider the model \(Y = \beta_1x+\epsilon\). Then \[ \begin{aligned} SSE &= \sum(y_i-\hat\beta_1x_i)^2 \\ \text{so } \frac{\partial SSE}{\partial\hat\beta_1} &= \sum2(y_i-\hat\beta_1x_i)\cdot(-x_i) \\ &= -2\left(\sum x_iy_i-\hat\beta_1\sum x_i^2\right) \\ =0&\text{ when } \\ \hat\beta_1\sum x_i^2 &= \sum x_iy_i \\ \hat\beta_1 &= \left.\sum x_iy_i \right/\sum x_i^2 \end{aligned} \]
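The no-intercept estimator is just as easy to compute; a short sketch with made-up data (true slope 0.7 chosen for the example):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(1.0, 10.0, 30)
y = 0.7 * x + rng.normal(0, 0.5, x.size)   # simulated no-intercept model

beta1_hat = np.sum(x * y) / np.sum(x ** 2)
print(beta1_hat)   # approximately 0.7
```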

Properties of \(\hat\beta_0\) and \(\hat\beta_1\)

Assume a homoscedastic (constant variance) linear model.

  • \(\hat\beta_0\) and \(\hat\beta_1\) are unbiased.
  • We can calculate their variances and covariance.
  • We can estimate \(\sigma^2\) using \(SSE\).
  • If all \(\epsilon_i\) are normal, then \(\hat\beta_0\) and \(\hat\beta_1\) are normal and \(SSE/\sigma^2\) is \(\chi^2\).

We will now prove these assertions and calculate all quantities.

Random Variable Setup

Assume we have made \(n\) observations. We can view these as random variables: \[ Y_i = \beta_0+\beta_1x_i+\epsilon_i \] where all \(\epsilon_i\sim\mathcal{N}(0,\sigma^2)\) iid. In particular, \(\mathbb{V}\epsilon_i\) does not vary with \(x_i\).

In this setup, we view all \(x_i\) as deterministic, fixed constants and all \(Y_i\) as random variables, linearly dependent on \(\epsilon_i\). This makes \(\hat\beta_0\) and \(\hat\beta_1\) random variables.

\(\hat\beta_1\)

Claim

\(\hat\beta_1\) is unbiased.

Proof

\[ \hspace{-1em} \hat\beta_1 = S_{xy}/S_{xx} = \frac{\sum(x_i-\overline x)(Y_i-\overline Y)}{S_{xx}} = \frac{\sum(x_i-\overline x)Y_i-\overline Y\sum(x_i-\overline x)}{S_{xx}} \]

Recall that \(\sum(x_i-\overline x)=0\). Hence \[ \hat\beta_1 = \frac{\sum(x_i-\overline x)Y_i}{S_{xx}} \]

\(\hat\beta_1\)

\[ \text{Now, } \mathbb{E}\hat\beta_1 = \frac{\sum(x_i-\overline x)\mathbb{E}Y_i}{S_{xx}} = \frac{\sum(x_i-\overline x)(\beta_0+\beta_1x_i)}{S_{xx}} = \\ \beta_0\frac{\sum(x_i-\overline x)}{S_{xx}} + \beta_1\frac{\sum(x_i-\overline x)x_i}{S_{xx}} \]

Again, \(\sum(x_i-\overline x)=0\). By the same argument we used for \((Y_i-\overline Y)\) previously, \(\sum(x_i-\overline x)x_i=\sum(x_i-\overline x)(x_i-\overline x)=\sum(x_i-\overline x)^2=S_{xx}\).

This shows that \(\mathbb{E}\hat\beta_1 = \beta_1\): \(\hat\beta_1\) is unbiased.

\(\mathbb{V}\hat\beta_1\)

We can pick up our intermediate expression for \(\hat\beta_1\) for the calculation.

\[ \mathbb{V}\hat\beta_1 = \mathbb{V}\left(\frac{\sum(x_i-\overline x)Y_i}{S_{xx}}\right) = \\ \frac{1}{\color{green}{S_{xx}}^2}\color{green}{\sum(x_i-\overline x)^2} \color{purple}{\mathbb{V}Y_i} = \frac{\color{purple}{\sigma^2}}{\color{green}{S_{xx}}} \]

Covariance terms vanish because the \(Y_i\)s are assumed to be independent.
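Both facts (unbiasedness and \(\mathbb{V}\hat\beta_1=\sigma^2/S_{xx}\)) lend themselves to a Monte Carlo check. The sketch below uses arbitrary illustrative values for \(\beta_0\), \(\beta_1\), \(\sigma\) and the design points; the empirical mean and variance of \(\hat\beta_1\) over many simulations should land near \(\beta_1\) and \(\sigma^2/S_{xx}\).

```python
import numpy as np

rng = np.random.default_rng(5)
beta0, beta1, sigma = 1.0, 2.0, 1.5        # illustrative true parameters
x = np.linspace(0.0, 5.0, 25)              # fixed design points
S_xx = np.sum((x - x.mean()) ** 2)

reps = 20000
b1_hats = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, x.size)
    b1_hats[r] = np.sum((x - x.mean()) * (y - y.mean())) / S_xx

print(b1_hats.mean(), beta1)               # unbiasedness: approx beta1
print(b1_hats.var(), sigma**2 / S_xx)      # variance formula: approx sigma^2/S_xx
```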

\(\hat\beta_0\)

Claim

\(\hat\beta_0\) is unbiased.

Proof

\[ \mathbb{E}\hat\beta_0 = \mathbb{E}[\overline Y-\hat\beta_1\overline x] = \color{green}{\mathbb{E}\overline Y}- \color{purple}{\mathbb{E}[\hat\beta_1]\overline x} = \\ \color{green}{\beta_0+\beta_1\overline x+\mathbb{E}\overline\epsilon}- \color{purple}{\beta_1\overline x} = \beta_0 \]

Cancel the \(\beta_1\overline x\) terms against each other, and use \(\mathbb{E}[\epsilon_i]=0\), which gives \(\mathbb{E}[\overline\epsilon]=0\).

\(\mathbb{V}\hat\beta_0\)

By the formulas for variances of sums, \[ \mathbb{V}\hat\beta_0 = \mathbb{V}\left[\overline Y-\hat\beta_1\overline x\right] = \mathbb{V}\overline Y+\overline x^2\mathbb{V}\hat\beta_1-2\overline x\text{cov}(\overline Y,\hat\beta_1) \]

We already know \(\mathbb{V}\hat\beta_1=\sigma^2/S_{xx}\); it remains to find \(\mathbb{V}\overline Y\) and the covariance term.

\[ \mathbb{V}\overline Y = \mathbb{V}\left[\beta_0+\beta_1\overline x+\overline\epsilon\right] = \mathbb{V}\overline\epsilon = \mathbb{V}\left[\sum\epsilon_i/n\right] = \\ \frac{1}{n^2}\sum\mathbb{V}\epsilon_i = \frac{1}{n^2}\cdot n\sigma^2 = \sigma^2/n \]

\(\text{cov}(\overline Y,\hat\beta_1)\)

Let’s write \(c_i = (x_i-\overline x)/S_{xx}\). Then \(\hat\beta_1=\sum c_iY_i\) and by bilinearity of covariances, \[ \text{cov}\left(\overline Y, \hat\beta_1\right) = \text{cov}\left(\frac{1}{n}\sum Y_i, \sum c_iY_i\right) = \sum_i \sum_j \frac{c_i}{n}\text{cov}(Y_i,Y_j) \]

Since the \(Y_i\) are independent of each other, the only terms that survive are when \(i=j\), leaving us with: \[ \text{cov}\left(\overline Y, \hat\beta_1\right) = \sum_i\frac{c_i}{n}\mathbb{V}Y_i = \sum\frac{c_i}{n}\sigma^2 = \frac{\sigma^2}{n}\frac{\sum(x_i-\overline x)}{S_{xx}} = 0 \]

\(\mathbb{V}\hat\beta_0\)

Returning to \(\mathbb{V}\hat\beta_0\):

\[ \mathbb{V}\hat\beta_0 = \color{green}{\mathbb{V}\overline Y}+ \overline x^2\color{purple}{\mathbb{V}\hat\beta_1} -2\overline x\color{blue}{\text{cov}(\overline Y,\hat\beta_1)} = \\ \color{green}{\frac{\sigma^2}{n}} + \overline{x}^2\color{purple}{\frac{\sigma^2}{S_{xx}}} -2\overline{x}\cdot\color{blue}{0} = \sigma^2\left(\frac{1}{n}+\frac{\overline x^2}{S_{xx}}\right) = \\ \sigma^2\left(\frac{S_{xx}+n\overline x^2}{nS_{xx}}\right) = \sigma^2\frac{\sum x_i^2}{nS_{xx}} \]

\(\text{cov}(\hat\beta_0,\hat\beta_1)\)

\[ \text{cov}(\hat\beta_0,\hat\beta_1) = \text{cov}(\overline Y-\hat\beta_1\overline x, \hat\beta_1) = \\ \color{green}{\text{cov}(\overline Y,\hat\beta_1)} -\overline x\text{cov}(\hat\beta_1,\hat\beta_1) = \\ \color{green}{0}-\overline x\mathbb{V}\hat\beta_1 = -\overline x\sigma^2/S_{xx} \]
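The three formulas depend only on the design points and \(\sigma^2\), so they can be evaluated before seeing any response data. A small sketch, with an arbitrary illustrative design and noise level:

```python
import numpy as np

sigma = 1.5                               # illustrative noise standard deviation
x = np.linspace(0.0, 5.0, 25)             # fixed design points
n = x.size
S_xx = np.sum((x - x.mean()) ** 2)

var_beta1 = sigma**2 / S_xx                          # V beta1-hat
var_beta0 = sigma**2 * np.sum(x**2) / (n * S_xx)     # V beta0-hat
cov_b0_b1 = -x.mean() * sigma**2 / S_xx              # cov(beta0-hat, beta1-hat)

print(var_beta0, var_beta1, cov_b0_b1)
```

Note that the covariance formula also shows that if the design is centered so that \(\overline x=0\), then \(\hat\beta_0\) and \(\hat\beta_1\) are uncorrelated.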

Estimating \(\sigma^2\)

Let’s consider the \(SSE\):

\[ SSE = \sum(Y_i-\hat Y_i)^2 = \sum(Y_i-(\color{purple}{\hat\beta_0}+\hat\beta_1x_i))^2 = \\ \sum(Y_i\color{purple}{-\overline Y+\hat\beta_1\overline x}-\hat\beta_1x_i)^2 = \\ \sum(Y_i-\overline Y)^2 -2\hat\beta_1\sum(x_i-\overline x)(Y_i-\overline Y) + \hat\beta_1^2\sum(x_i-\overline x)^2 \\ =S_{YY}-2\hat\beta_1S_{xY}+\hat\beta_1^2S_{xx} \]

Recall that \(\hat\beta_1=S_{xY}/S_{xx}\), so \(S_{xY} = \hat\beta_1S_{xx}\). Substituting this into the middle term: \[ SSE = S_{YY}-2\hat\beta_1\cdot\hat\beta_1S_{xx} + \hat\beta_1^2S_{xx} = S_{YY} - \hat\beta_1^2 S_{xx} \]

Estimating \(\sigma^2\)

\[ \mathbb{E}[SSE] = \mathbb{E}[S_{YY}-\hat\beta_1^2 S_{xx}] = \mathbb{E}\left[\sum(Y_i-\overline Y)^2 - \hat\beta_1^2S_{xx}\right] = \\ \mathbb{E}\left[\sum Y_i^2 - n\overline Y^2-\hat\beta_1^2S_{xx}\right] = \sum\color{green}{\mathbb{E}[Y_i^2]} -n\color{purple}{\mathbb{E}[\overline Y^2]} - \color{blue}{\mathbb{E}[\hat\beta_1^2]}S_{xx} \]

Recall that \(\mathbb{E}[X^2] = \mathbb{V}X + \mathbb{E}[X]^2\).

\[ \mathbb{E}[SSE] = \\ \sum\color{green}{(\mathbb{V}Y_i + \mathbb{E}[Y_i]^2)} -n\color{purple}{(\mathbb{V}\overline Y+\mathbb{E}[\overline Y]^2)} - \color{blue}{(\mathbb{V}\hat\beta_1+\mathbb{E}[\hat\beta_1]^2)}S_{xx} = \\ \hspace{-1.5em}\color{green}{n\sigma^2 + \sum(\beta_0+\beta_1x_i)^2} -n\color{purple}{\left(\frac{\sigma^2}{n}+(\beta_0+\beta_1\overline x)^2\right)} -S_{xx}\color{blue}{\left(\frac{\sigma^2}{S_{xx}}+\beta_1^2\right)} \]

Estimating \(\sigma^2\)

\[ \mathbb{E}[SSE] = \\ \hspace{-1.5em}{n\sigma^2 + \sum(\beta_0+\beta_1x_i)^2} -n{\left(\frac{\sigma^2}{n}+(\beta_0+\beta_1\overline x)^2\right)} -S_{xx}{\left(\frac{\sigma^2}{S_{xx}}+\beta_1^2\right)} \\ \hspace{-1.5em}=\sigma^2\left(n-\frac{n}{n}-\frac{S_{xx}}{S_{xx}}\right) + \sum(\beta_0+\beta_1x_i)^2-n(\beta_0+\beta_1\overline x)^2-\beta_1^2S_{xx} \]

Let’s take a closer look at the part with all the \(\beta_i\): \[ \sum(\beta_0+\beta_1x_i)^2-n(\beta_0+\beta_1\overline x)^2-\beta_1^2S_{xx} =\\ \hspace{-2em}\left(n\color{green}{\beta_0^2}+\color{purple}{2\beta_0\beta_1\sum x_i}+\beta_1^2\sum x_i^2\right) -n\left(\color{green}{\beta_0^2}+\color{purple}{2\beta_0\beta_1\overline x}+\beta_1^2\overline x^2\right) -\beta_1^2S_{xx} \\ =\beta_1^2\left(\sum x_i^2 - n\overline x^2 - S_{xx}\right) = 0 \]

Estimating \(\sigma^2\)

\[ \mathbb{E}[SSE] = \sigma^2\left(n-\frac{n}{n}-\frac{S_{xx}}{S_{xx}}\right) = \sigma^2(n-2) \]

It follows that \(S^2=SSE/(n-2)\) is an unbiased estimator for \(\sigma^2\).
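A Monte Carlo check of this unbiasedness claim, under the same kind of illustrative setup as before (arbitrary \(\beta_0\), \(\beta_1\), \(\sigma\) and design): the average of \(S^2\) over many simulated data sets should be close to \(\sigma^2\).

```python
import numpy as np

rng = np.random.default_rng(6)
beta0, beta1, sigma = 1.0, 2.0, 1.5
x = np.linspace(0.0, 5.0, 25)
n = x.size
S_xx = np.sum((x - x.mean()) ** 2)

reps = 20000
s2 = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x + rng.normal(0, sigma, n)
    S_xy = np.sum((x - x.mean()) * (y - y.mean()))
    S_yy = np.sum((y - y.mean()) ** 2)
    b1 = S_xy / S_xx
    sse = S_yy - b1**2 * S_xx          # SSE = S_YY - beta1_hat^2 * S_xx
    s2[r] = sse / (n - 2)              # S^2 = SSE/(n-2)

print(s2.mean(), sigma**2)             # approximately equal
```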

Normal Distribution Assumption

If we assume \(\epsilon_i\sim\mathcal{N}(0,\sigma^2)\), then

  1. \(\hat\beta_1=\sum(x_i-\overline x)Y_i/S_{xx}\) is linear in the \(Y_i\), and thus normal.
  2. \(\hat\beta_0=\overline Y-\hat\beta_1\overline x\) is linear in the \(Y_i\), and thus normal.
  3. \((n-2)S^2/\sigma^2 = SSE/\sigma^2\sim\chi^2(n-2)\).

Summary

We have proven and calculated:

  1. \(\hat\beta_0\) and \(\hat\beta_1\) are unbiased
  2. \(\mathbb{V}\hat\beta_0 = c_{00}\sigma^2\) where \(c_{00}=\sum x_i^2/(nS_{xx})\)
  3. \(\mathbb{V}\hat\beta_1 = c_{11}\sigma^2\) where \(c_{11}=1/S_{xx}\)
  4. \(\text{cov}(\hat\beta_0,\hat\beta_1)=c_{01}\sigma^2\) where \(c_{01}=-\overline x/S_{xx}\)
  5. \(S^2=SSE/(n-2)\) is unbiased for \(\sigma^2\). SSE can be written as \(S_{yy}-\hat\beta_1S_{xy}\) or as \(S_{yy}-\hat\beta_1^2S_{xx}\).

If all \(\epsilon_i\) are iid normal, then

  1. \(\hat\beta_0\) and \(\hat\beta_1\) are normal
  2. \((n-2)S^2/\sigma^2 = SSE/\sigma^2 \sim \chi^2(n-2)\)
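Tying the summary together, the following sketch computes \(\hat\beta_0\), \(\hat\beta_1\), \(S^2\) and the resulting estimated standard deviations \(\sqrt{c_{00}S^2}\) and \(\sqrt{c_{11}S^2}\) on a simulated data set; the true parameter values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 1.0, 2.0, 1.5
x = np.linspace(0.0, 5.0, 25)
n = x.size
y = beta0 + beta1 * x + rng.normal(0, sigma, n)

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
S_yy = np.sum((y - y.mean()) ** 2)

beta1_hat = S_xy / S_xx
beta0_hat = y.mean() - beta1_hat * x.mean()
s2 = (S_yy - beta1_hat * S_xy) / (n - 2)   # S^2 = SSE/(n-2)

c00 = np.sum(x**2) / (n * S_xx)
c11 = 1 / S_xx
c01 = -x.mean() / S_xx

print(beta0_hat, np.sqrt(c00 * s2))   # estimate and estimated sd of beta0-hat
print(beta1_hat, np.sqrt(c11 * s2))   # estimate and estimated sd of beta1-hat
print(c01 * s2)                       # estimated covariance of the two estimators
```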