4/2/2020

T-tests for \(\beta_i\)

Facts about \(\beta_i\)

Recall that, for normally distributed residuals,

  1. \(\mathbb{V}\hat\beta_i = c_{ii}\sigma^2\)
  2. \(\hat\beta_i\) is normally distributed
  3. \(S^2=SSE/(n-2)\) estimates \(\sigma^2\)
  4. \((n-2)S^2/\sigma^2 = SSE/\sigma^2\sim\chi^2(n-2)\)

We can use this information to construct confidence intervals and hypothesis tests for the coefficients \(\beta_i\).

Wald test

Large sample confidence intervals and tests can be constructed from any normally distributed statistics with known variance. Here, we get a test statistic \[ Z = \frac{\hat\beta_i-\beta_{i0}}{\sigma\sqrt{c_{ii}}}\sim\mathcal{N}(0,1) \]

Using this as a pivot and \(z_\alpha=F^{-1}_{\mathcal{N}(0,1)}(1-\alpha)\), we get confidence intervals \[ \beta_i\in\hat\beta_i\pm z_{\alpha/2}\sigma\sqrt{c_{ii}} \]

T-test

Notice that \(Z\) is normal, \(W=(n-2)S^2/\sigma^2\) is \(\chi^2(n-2)\), and these quantities are independent of each other.

\[ T = \frac{Z}{\sqrt{W/(n-2)}} = \frac{\hat\beta_i-\beta_{i0}}{\sigma\sqrt{c_{ii}}}\cdot\frac{1}{\sqrt{\frac{(n-2)S^2/\sigma^2}{n-2}}} = \\ \frac{\hat\beta_i-\beta_{i0}}{\sigma\sqrt{c_{ii}}}\cdot\frac{\sigma}{S} = \frac{\hat\beta_i-\beta_{i0}}{S\sqrt{c_{ii}}}\sim t(n-2) \]

\(T\) works as a test statistic for us, using the \(t(n-2)\) distribution to create cutoff values for the rejection regions. We can also use \(T\) as a pivot: with \(t_\alpha=F^{-1}_{t(n-2)}(1-\alpha)\), we get confidence intervals

\[ \beta_i\in\hat\beta_i\pm t_{\alpha/2}S\sqrt{c_{ii}} \]
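As a quick sanity check, here is a minimal R sketch on simulated data (the names x, y, and fit are hypothetical, not part of the worked example later in these notes) that builds this interval by hand and compares it with R's built-in confint:

set.seed(1)                          # simulated data, purely for illustration
x = runif(20, 0, 10)
y = 2 + 3*x + rnorm(20, sd=1.5)
fit = lm(y ~ x)

n   = length(x)
Sxx = sum((x - mean(x))^2)
S   = summary(fit)$sigma             # S = sqrt(SSE/(n-2))
se  = S / sqrt(Sxx)                  # S*sqrt(c11), since c11 = 1/Sxx
coef(fit)[2] + c(-1, 1) * qt(0.975, df=n-2) * se

confint(fit, "x", level=0.95)        # built-in version; should agree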

T-tests for linear combinations

\(a_0\beta_0+a_1\beta_1\)

We may have reason to seek inferences about linear combinations of the coefficients. One example is to make inferences about \(\beta_0+\beta_1x^*\) for some (new) value \(x^*\).

We can create an estimator \(\hat\theta=a_0\hat\beta_0+a_1\hat\beta_1\). By linearity of \(\mathbb{E}\), this is unbiased for \(\theta=a_0\beta_0+a_1\beta_1\).

We know values for \(\mathbb{V}\hat\beta_0, \mathbb{V}\hat\beta_1, \text{cov}(\hat\beta_0,\hat\beta_1)\) - these allow us to calculate \[ \mathbb{V}\hat\theta = a_0^2\mathbb{V}\hat\beta_0 + a_1^2\mathbb{V}\hat\beta_1 + 2a_0a_1\text{cov}(\hat\beta_0,\hat\beta_1) = \\ \sigma^2(a_0^2c_{00} + a_1^2c_{11} + 2a_0a_1c_{01}) \\ c_{00} = \sum x_i^2/(nS_{xx}) \qquad c_{01} = -\overline x/S_{xx} \qquad c_{11} = 1/S_{xx} \]

Wald test

Knowing \(\mathbb{V}\hat\theta\) means we can create Wald tests and confidence intervals. Using the test statistic \[ Z = \frac{\hat\theta-\theta_0}{\sigma\sqrt{a_0^2c_{00}+a_1^2c_{11}+2a_0a_1c_{01}}} \sim\mathcal{N}(0,1) \] we use the usual rejection regions. Writing \(z_\alpha=F^{-1}_{\mathcal{N}(0,1)}(1-\alpha)\) we get confidence intervals \[ \theta\in\hat\theta\pm z_{\alpha/2}\sigma\sqrt{a_0^2c_{00}+a_1^2c_{11}+2a_0a_1c_{01}} \]

T-test

Since the standard error \(\sigma\sqrt{a_0^2c_{00}+a_1^2c_{11}+2a_0a_1c_{01}}\) has \(\sigma\) as a factor, we can do the usual manipulations and compute \[ T = \frac{Z}{\sqrt{W/(n-2)}} = Z\cdot\frac{\sigma}{S} =\\ \frac{\hat\theta-\theta_0}{S\sqrt{a_0^2c_{00}+a_1^2c_{11}+2a_0a_1c_{01}}} \sim t(n-2) \] with the usual rejection regions. Writing \(t_\alpha=F^{-1}_{t(n-2)}(1-\alpha)\) we get confidence intervals \[ \theta\in\hat\theta\pm t_{\alpha/2}S\sqrt{a_0^2c_{00}+a_1^2c_{11}+2a_0a_1c_{01}} \]
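Continuing the simulated sketch above, the same interval can be computed from vcov(fit), whose entries are the estimates \(S^2c_{ij}\); the weights \(a_0=1\), \(a_1=4\) are arbitrary choices for illustration:

a = c(1, 4)                                        # hypothetical a0, a1
theta_hat = sum(a * coef(fit))
se_theta  = sqrt(drop(t(a) %*% vcov(fit) %*% a))   # S*sqrt(a0^2 c00 + a1^2 c11 + 2 a0 a1 c01)
theta_hat + c(-1, 1) * qt(0.975, df=n-2) * se_theta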

\(\beta_0+\beta_1x^*\)

Of particular interest is the linear combination we get by evaluating the regression function at a new observation. In this case, \(a_0=1\) and \(a_1=x^*\). \[ \mathbb{V}\hat\theta = \sigma^2(a_0^2c_{00} + a_1^2c_{11} + 2a_0a_1c_{01}) = \\ \frac{\sigma^2}{S_{xx}}\left(a_0^2\frac{\sum x_i^2}{n}+a_1^2-2a_0a_1\overline x\right) = \\ \frac{\sigma^2}{S_{xx}}\left(\frac{\sum x_i^2}{n}+x^{*2}-2x^*\overline x\right) \]

\(\beta_0+\beta_1x^*\)

Add and subtract \(\overline x^2\) to get \[ \frac{\sigma^2}{S_{xx}}\left(\color{green}{\frac{\sum x_i^2}{n}-\frac{n\overline x^2}{n}}+\color{purple}{\overline x^2+x^{*2}-2x^*\overline x}\right) = \\ \frac{\sigma^2}{S_{xx}}\left(\color{green}{\frac{S_{xx}}{n}}+\color{purple}{(x^*-\overline x)^2}\right) = \\ \sigma^2\left(\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}\right) \]

Estimation Interval

We get an estimation (confidence) interval for the mean response \[ \mathbb{E}Y^*=\beta_0+\beta_1x^*\in \hat\beta_0+\hat\beta_1x^* \pm t_{\alpha/2} S \sqrt{\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}} \]

The width of this interval shrinks when:

  • \((x^*-\overline x)^2\) shrinks. Estimation is easier near \(\overline x\).
  • \(n\) grows. Estimation is easier with more data.
  • \(S_{xx}\) grows. Estimation is easier if the \(x_i\) are spread out more.
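Still with the simulated fit from above, the simplified form gives the same interval, and it matches what predict() reports with interval="confidence". (With xstar = 4 this is the same \(\theta\) as the \(a_0=1\), \(a_1=4\) combination earlier, so all three computations agree.)

xstar  = 4
se_est = S * sqrt(1/n + (xstar - mean(x))^2 / Sxx)
sum(coef(fit) * c(1, xstar)) + c(-1, 1) * qt(0.975, df=n-2) * se_est

predict(fit, data.frame(x=xstar), interval="confidence")   # same fit and bounds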

Prediction Intervals

Prediction vs. Estimation

It can be more interesting to predict a range for the next observation than a range for the long-term average value.

For the new value \(Y^*=\beta_0+\beta_1x^*+\epsilon\) we want a predictor. One choice is \[ \hat Y^*=\hat\beta_0+\hat\beta_1x^* \]

This quantity serves simultaneously as an estimator of the expected value \(\mathbb{E}Y^*=\beta_0+\beta_1x^*\) and as a predictor of the next value \(Y^*\).

Prediction error

We would seek to minimize the prediction error \(Y^*-\hat Y^*\).

\[ \mathbb{E}[Y^*-\hat Y^*] = \mathbb{E}Y^* - \mathbb{E}\hat Y^* = (\beta_0+\beta_1x^*) - (\beta_0+\beta_1x^*) = 0 \\ \mathbb{V}(Y^*-\hat Y^*) = \mathbb{V}Y^* + \mathbb{V}\hat Y^* - 2\text{cov}(Y^*,\hat Y^*) \]

Since the new observation was not used in fitting the model, \(Y^*\) is independent of \(\hat\beta_0\) and \(\hat\beta_1\). Hence the covariance vanishes.

\[ \mathbb{V}(Y^*-\hat Y^*) = \mathbb{V}Y^* + \mathbb{V}\hat Y^* = \sigma^2 + \sigma^2\left(\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}\right) =\\ \sigma^2\left(1+\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}\right) \]

Wald and T prediction

If \(\epsilon\sim\mathcal{N}(0,\sigma^2)\), then \(Y^*-\hat Y^*\) is also normal. We get a Wald statistic and a T-statistic \[ Z = \frac{Y^*-\hat Y^*}{\sigma\sqrt{1+\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}}} \text{ and } T = \frac{Y^*-\hat Y^*}{S\sqrt{1+\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}}} \] with \(Z\sim\mathcal{N}(0,1)\) and \(T\sim t(n-2)\). Using \(T\) as a pivot, we get prediction intervals \[ Y^*\in\hat\beta_0+\hat\beta_1x^*\pm t_{\alpha/2}S \sqrt{1+\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}} \]

Prediction interval width

\[ Y^*\in\hat\beta_0+\hat\beta_1x^*\pm t_{\alpha/2}S \sqrt{1+\frac{1}{n}+\frac{(x^*-\overline x)^2}{S_{xx}}} \]

Notice the \(1+\) in the square root here. Estimation intervals can be made arbitrarily tight by collecting more data, but prediction interval widths are bounded below by roughly \(2z_{\alpha/2}\sigma\), no matter how much data is collected. Prediction cannot get better than this; the only way to improve it is to reduce the inherent uncertainty \(\sigma^2\).
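In the same simulated sketch, the only change is the extra 1 under the square root; predict() with interval="prediction" reports matching bounds:

se_pred = S * sqrt(1 + 1/n + (xstar - mean(x))^2 / Sxx)
sum(coef(fit) * c(1, xstar)) + c(-1, 1) * qt(0.975, df=n-2) * se_pred

predict(fit, data.frame(x=xstar), interval="prediction")   # same fit, wider bounds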

Example in R (Exercise 11.8, 11.18, 11.37, 11.45)

The Data

Two different methods to measure LC50 for various toxicants.

library(tibble)   # tribble() comes from the tibble package
toxicants = tribble(
  ~toxicant, ~flow, ~static,
  1, 23.00, 39.00,
  2, 22.30, 37.50,
  3,  9.40, 22.20,
  4,  9.70, 17.50,
  5,   .15,   .64,
  6,   .28,   .45,
  7,   .75,  2.62,
  8,   .51,  2.36,
  9, 28.00, 32.00,
  10,  .39,   .77
)

The Model

We are to fit \(\text{flow} = \beta_0 + \beta_1\cdot\text{static} + \epsilon\). This is handled in R by the command lm (linear model):

model = lm( response ~ predictor, data=dataset)

The Model

model = lm( flow ~ static, data=toxicants)
summary(model)
## 
## Call:
## lm(formula = flow ~ static, data = toxicants)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4356 -1.4347 -0.2905  0.5578  7.7430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.71104    1.48824  -0.478    0.646    
## static       0.65525    0.06818   9.610 1.14e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.313 on 8 degrees of freedom
## Multiple R-squared:  0.9203, Adjusted R-squared:  0.9103 
## F-statistic: 92.35 on 1 and 8 DF,  p-value: 1.141e-05
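The quantities used in the derivations above can be read directly off the fitted model with standard base-R accessors:

coef(summary(model))         # estimates, std. errors S*sqrt(c_ii), t values, p values
summary(model)$sigma         # S, the residual standard error (3.313 here)
confint(model, level=0.95)   # t-based confidence intervals for beta_0 and beta_1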

Model Assumptions

Recall the assumptions we made when deriving the linear model:

  • independent observations (OK from experiment setup)
  • constant variance (homoscedasticity)
  • normal residuals

We can check for homoscedasticity by examining a scatter plot of residuals vs. fitted values.

We can check for normal residuals by examining a quantile-quantile plot of the residuals.

All of these quantities are available from the fitted model - most easily using broom::augment.

Constant variance

library(ggformula)   # gf_point(), gf_qq(), gf_qqline() come from ggformula
library(dplyr)       # re-exports the %>% pipe
broom::augment(model) %>%
  gf_point(.resid ~ .fitted)

Normal residuals

broom::augment(model) %>%
  gf_qq(~.resid) %>% gf_qqline()

Predicting and estimating

Estimation and prediction are handled in R by the predict command:

predict(model, predictor.dataset)

The command needs the model and a data frame with values for all the predictors; it returns a vector of predicted values.

Predicting and estimating

Exercise 11.8 asks for an estimate of the flow-through LC50 when the static LC50 has been measured as 12 ppm.

predict(model, data.frame(static=12))
##        1 
## 7.151994
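This is just the fitted line evaluated at static = 12, so the same value comes directly from the coefficients:

sum(coef(model) * c(1, 12))   # beta0_hat + beta1_hat * 12
## [1] 7.151994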

Predicting and estimating

Exercise 11.37 asks for a 95% confidence interval for the estimate of the flow-through LC50 when the static LC50 has been measured as 12 ppm.

predict(model, data.frame(static=12), interval="confidence")
##        fit      lwr      upr
## 1 7.151994 4.674353 9.629636

Predicting and estimating

Exercise 11.45 asks for a 95% prediction interval for the flow-through LC50 when the static LC50 has been measured as 12 ppm.

predict(model, data.frame(static=12), interval="prediction")
##        fit        lwr      upr
## 1 7.151994 -0.8785919 15.18258

The prediction interval is quite a bit wider than the confidence interval.

Predicting and estimating

By estimating and predicting along a grid of values, we can get the ingredients needed to plot prediction and estimation bands in the scatter plot of the data.

tox_pred = cbind(data.frame(static=seq(0,40)),
                 predict(model, data.frame(static=seq(0,40)), interval="confidence"),
                 predict(model, data.frame(static=seq(0,40)), interval="prediction"))
names(tox_pred) <- c("static","flow","lwr_ci","upr_ci","fit","lwr_pi","upr_pi")

cbind here adds new columns to a dataset.

Predicting and estimating

static        flow      lwr_ci    upr_ci         fit     lwr_pi     upr_pi
     0  -0.7110386  -4.1429286  2.720851  -0.7110386  -9.085373   7.663296
     1  -0.0557858  -3.3778317  3.266260  -0.0557858  -8.385708   8.274136
     2   0.5994669  -2.6166726  3.815606   0.5994669  -7.688788   8.887721
     3   1.2547197  -1.8598532  4.369293   1.2547197  -6.994654   9.504094
     4   1.9099724  -1.1078116  4.927756   1.9099724  -6.303348  10.123293
     5   2.5652252  -0.3610219  5.491472   2.5652252  -5.614905  10.745355

Predicting and estimating

tox_pred %>%
  gf_line(flow ~ static) %>%
  gf_ribbon(lwr_ci + upr_ci ~ static, alpha=0.5) %>%
  gf_ribbon(lwr_pi + upr_pi ~ static, alpha=0.5) %>%
  gf_point(flow ~ static, data=toxicants, color=~"Data")