Lecture 25

MVJ

12 April, 2018

Linear regression. The F-test, t-test on coefficients, correlation test.

Linear Regression

Recall that a linear function, or straight line, is a function of the form

\[ y(x) = b_0 + b_1x \]

A linear model is a straight line relationship between two variables.

This is the next step up in complexity after the one-sample and two-sample means tests.

Statistical Models

A statistical model is some way of producing a probability distribution for a response variable, possibly dependent on values for one or more predictor variables.

We will write \(y\) for responses, and \(x\) (or \(x_1, x_2, \dots\)) for predictor(s).

All statistical tests we learn build implicitly or explicitly on a choice of a model for the data: the one-sample t test, for example, models \(y \sim \mathcal N(\mu, \sigma)\), and the two-sample test models \((y|x) \sim \mathcal N(\mu_x, \sigma_x)\) where \(x\) labels the group.


Linear models

We can replace the categorical grouping variable from two-sample or two-way table testing with a numeric variable that defines the subpopulations.

In other words, let the predictor take numeric values, not categorical values.

We would transform the two-sample model

\[ (y | x) \sim \mathcal N(\mu_x, \sigma_x) \]

to a linear model

\[ (y|x) \sim \mathcal N(\beta_0 + \beta_1x, \sigma) \]

The model implicitly assumes that errors are normally distributed.
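
To make the model concrete, here is a minimal simulation sketch; the values \(\beta_0 = 2\), \(\beta_1 = 0.5\) and \(\sigma = 1.5\) are made up purely for illustration.

set.seed(17)
x = runif(100, 0, 10)                         # predictor values
y = 2 + 0.5*x + rnorm(100, mean=0, sd=1.5)    # (y|x) ~ N(beta0 + beta1*x, sigma)
gf_point(y ~ x)                               # scatter plot of the simulated data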

Linear models - example

(Example figures shown in the lecture; not included here.)

Simple linear regression

A simple linear regression model is \((y|x) \sim \mathcal N(\beta_0 + \beta_1x, \sigma)\).

This is to distinguish it from a multiple regression model, where more than one predictor is used.

In simple linear regression, the conditional means all lie on a straight line and have the same variance for all choices of \(x\) (homoscedastic).

The true line \(\beta_0+\beta_1x\) that contains all the means is called the population regression line. Inference in linear regression is about finding values \(\hat\beta_0\) and \(\hat\beta_1\) that estimate the population regression line with a sample regression line, and about finding confidence intervals for these quantities.

We can also study the correlation alone, producing confidence intervals as well as tests for values of the correlation coefficient.

Preliminary data analysis

Always start with plotting your data. In which of these cases is a simple linear regression reasonable?

Estimation, fit and residuals

Any model will split the variation in the response variable into a model (or fit) component and a residual component.

DATA = FIT + RESIDUAL

For a simple linear regression, the parameters \(\beta_0\) and \(\beta_1\) can be estimated using the sample correlation \(r_{xy}\), the sample means \(\overline x, \overline y\), and the sample standard deviations \(s_x, s_y\). We write \(\hat y\) for the model means function, and \(b_0\) and \(b_1\) for the estimates of \(\beta_0\) and \(\beta_1\):

\[\begin{align*} \hat y &= b_0 + b_1 x \\ b_1 &= r_{xy}\frac{s_y}{s_x} \\ b_0 &= \overline y - b_1\overline x \end{align*}\]
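
As a quick check, the estimates can be computed directly from these summary statistics and compared with the output of lm. This sketch assumes the tuit data used later in these slides is already loaded, with Y2014 as response and Y2000 as predictor.

b1 = cor(tuit$Y2000, tuit$Y2014) * sd(tuit$Y2014) / sd(tuit$Y2000)   # b1 = r * sy/sx
b0 = mean(tuit$Y2014) - b1 * mean(tuit$Y2000)                        # b0 = ybar - b1*xbar
c(b0, b1)
coef(lm(Y2014 ~ Y2000, data=tuit))   # should agree with the values above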


The residuals are the differences between observed and predicted responses. For data \((x_1,y_1), \dots, (x_n, y_n)\) and predictions \(\hat y_i = \hat y(x_i)\):

\[ e_i = y_i - \hat y_i = y_i - b_0 - b_1x_i \]

The only source of randomness in the model is from the residual component. We can estimate the standard deviation \(\sigma\) of the model as the standard deviation of the residuals:

\[ s^2 = \frac{\sum e_i^2}{n-2} \]
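
A sketch of this computation, assuming the model object fitted on the next slide; the value agrees with the residual standard error reported by summary(model).

e = resid(model)                    # residuals y_i - yhat_i
sqrt(sum(e^2) / (length(e) - 2))    # model standard deviation s; compare model.summary$sigma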

In R

A linear model can be fitted using the command lm. Assign a name to make information easily accessible. Additional important information can be found by assigning a name to summary(model). Plots are often easier using augment from the package broom.

model = lm(Y2014 ~ Y2000, data=tuit)
model
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Coefficients:
## (Intercept)        Y2000  
##     5367.62         1.61

In R

model.summary = summary(model)
model.summary
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4196  -1489    -54   1647   3647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5367.621   1342.377     4.0  0.00037 ***
## Y2000          1.615      0.311     5.2  1.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2130 on 31 degrees of freedom
## Multiple R-squared:  0.466,  Adjusted R-squared:  0.449 
## F-statistic:   27 on 1 and 31 DF,  p-value: 1.21e-05

In R

library(broom)                   # provides augment()
model.augment = augment(model)
model.augment
Y2014 Y2000 .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
17955 6852 16430.2 916.35 1524.806 0.18483 2144.6 0.07117 0.79235
18075 7002 16672.4 959.13 1402.630 0.20249 2147.6 0.06893 0.73689
14126 6513 15882.9 821.19 -1756.878 0.14844 2138.6 0.06954 -0.89322
14297 6333 15592.3 771.74 -1295.268 0.13110 2151.8 0.03206 -0.65193
13771 5170 13714.6 487.05 56.399 0.05221 2166.7 0.00002 0.02718
9734 5136 13659.7 480.27 -3925.708 0.05077 2038.0 0.09557 -1.89042
15938 4994 13430.4 453.55 2507.552 0.04528 2115.4 0.03438 1.20403
14889 4877 13241.6 433.68 1647.449 0.04140 2144.8 0.01346 0.78944
10186 4436 12529.6 381.23 -2343.556 0.03199 2122.6 0.02064 -1.11753
8784 4715 12980.0 409.90 -4196.002 0.03698 2021.2 0.07727 -2.00606
10991 4405 12479.5 379.13 -1488.506 0.03164 2149.0 0.00823 -0.70967
10995 4383 12444.0 377.79 -1448.987 0.03142 2149.9 0.00774 -0.69075
13373 4160 12084.0 371.04 1289.047 0.03030 2153.5 0.00589 0.61415
15589 4072 11941.9 371.91 3647.123 0.03045 2058.4 0.04741 1.73776
14421 4047 11901.5 372.52 2519.486 0.03055 2115.7 0.02271 1.20053
14686 3969 11775.6 375.46 2910.417 0.03103 2098.4 0.03081 1.38716
10868 3872 11619.0 381.24 -750.976 0.03199 2162.2 0.00212 -0.35811
14785 3848 11580.2 383.02 3204.772 0.03229 2083.4 0.03898 1.52845
10254 3819 11533.4 385.35 -1279.407 0.03269 2153.6 0.00629 -0.61031
11429 3788 11483.4 388.06 -54.358 0.03315 2166.7 0.00001 -0.02594
13757 3761 11439.8 390.60 2317.234 0.03358 2123.5 0.02125 1.10589
14224 3701 11342.9 396.82 2881.104 0.03466 2099.5 0.03398 1.37576
11094 3575 11139.5 412.32 -45.469 0.03742 2166.7 0.00001 -0.02174
8724 3522 11053.9 419.76 -2329.900 0.03878 2122.8 0.02508 -1.11494
8807 3204 10540.5 474.05 -1733.488 0.04947 2142.2 0.01811 -0.83418
10388 3188 10514.7 477.16 -126.656 0.05012 2166.6 0.00010 -0.06097
8430 3132 10424.2 488.28 -1994.244 0.05248 2134.2 0.02559 -0.96119
8616 2768 9836.6 568.29 -1220.565 0.07109 2154.3 0.01351 -0.59415
10760 2725 9767.1 578.48 992.859 0.07366 2158.5 0.00931 0.48398
11205 2348 9158.5 672.51 2046.526 0.09955 2130.6 0.05659 1.01184
6748 1941 8501.4 781.04 -1753.371 0.13428 2139.2 0.06062 -0.88412
11094 3454 10944.1 430.04 149.886 0.04071 2166.5 0.00011 0.07180
9461 3374 10815.0 443.12 -1353.954 0.04322 2151.9 0.00953 -0.64942

Validity for Simple Linear Regression

The simple linear regression assumes:

- the mean of \(y\) is a linear function of \(x\),
- the spread around the line is the same for every value of \(x\) (constant variance),
- the residuals are normally distributed,
- the observations are independent of each other.

To check these conditions, we use:

- a scatter plot of the response against the predictor, to look for non-linearity,
- a residual plot (.resid against .fitted), to look for curvature or changing spread,
- a QQ-plot of the residuals, to check normality.

All of the important information is available in augment(model).

Validity for Simple Linear Regression

Scatter plot: check for obvious departures from linearity.

model.augment %>% gf_point(Y2014 ~ Y2000)

Validity for Simple Linear Regression

Residual plot: plot .resid against .fitted. Look for the data to be evenly distributed above and below the line \(y=0\).

model.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0)

Validity for Simple Linear Regression

QQ-plot of the residuals.

model.augment %>% gf_qq(~.resid) %>% gf_qqline()

Sums of squared errors

We will have a lot of use for the following quantities:

- the total sum of squares \(SST = \sum (y_i - \overline y)^2\), with \(DFT = n-1\) degrees of freedom,
- the model (fit) sum of squares \(SSM = \sum (\hat y_i - \overline y)^2\), with \(DFM = 1\) degree of freedom,
- the residual (error) sum of squares \(SSE = \sum (y_i - \hat y_i)^2 = \sum e_i^2\), with \(DFE = n-2\) degrees of freedom.

These split the variation just like DATA = FIT + RESIDUAL: \(SST = SSM + SSE\) and \(DFT = DFM + DFE\).

For now, we will also introduce the corresponding mean squares \(MST = SST/DFT\), \(MSM = SSM/DFM\) and \(MSE = SSE/DFE\).

Inference for linear regression

A lot of our ability to do statistical testing and inference with linear regression comes from identifying several quantities whose sampling distributions follow Student's T distribution. If they follow Student's T, we can build tests and confidence intervals.

The T distribution often shows up to produce confidence intervals of the form \[ \text{parameter} = \text{estimate} \pm t^*SE_{\text{estimate}} \]

Recall that we defined \(b_0, b_1\) as the estimated line coefficients, \(s\) as the model standard deviation, and \(r_{xy}\) as the sample correlation.
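
For example, a confidence interval for the slope \(\beta_1\) can be built from the estimate and standard error in the summary output shown earlier; confint computes the same interval directly. (A sketch, assuming model and model.summary from the previous slides.)

est = coef(model.summary)["Y2000", "Estimate"]     # 1.615
se  = coef(model.summary)["Y2000", "Std. Error"]   # 0.311
tstar = qt(0.975, df=31)                           # critical value, df = n - 2
est + c(-1, 1) * tstar * se                        # 95% CI for the slope
confint(model, "Y2000", level=0.95)                # same interval directly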

ANOVA tables: is the slope non-zero?

To test whether \(\beta_1\neq 0\), the usual way is to compare the model to the one-sample means model where only \(\beta_0\) influences the mean.

The basic idea is that if \(\beta_1\) helps build an accurate model, then the variation explained by the slope term is large compared to the inherent error in the residuals.

Since all the parts of the calculation have different degrees of freedom, this is usually organized in an ANalysis Of VAriance (ANOVA) table:

| Type | Sum of Squares | Degrees of Freedom | Mean Squares | F-statistic | p-value |
|------|----------------|--------------------|--------------|-------------|---------|
| Model | SSM | \(1\) | MSM | MSM/MSE | 1-pf(F, df1=DFM, df2=DFE) |
| Residual (Error) | SSE | \(n-2\) | MSE | | |
| Total | SST | \(n-1\) | MST | | |

DFM + DFE = DFT
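
In R, the table is produced by anova; a sketch using the model fitted earlier:

anova(model)   # Df, Sum Sq, Mean Sq, F value and Pr(>F) for Y2000 and Residuals
# the p-value equals 1 - pf(F, df1=1, df2=31) for this model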

Correlation test

Input: Vectors x, y of equal length, or variables dataset$x and dataset$y
Null hypothesis: \(\rho_{xy} = 0\)
Alternative: Greater, less, or not equal
Test statistic: \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\)

Requirements: the relationship between x and y should be linear, with approximately normally distributed scatter around the line; check the residual plots and QQ-plots for both possible linear models (see below).
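
As a check, the test statistic can be computed directly from the sample correlation reported in the cor.test output below (n = 33 observations here):

r = 0.68247; n = 33
r * sqrt(n - 2) / sqrt(1 - r^2)   # about 5.2, matching the t value from cor.test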

Correlation test

Command: cor.test with arguments alternative and conf.level.

test = cor.test(Y2014 ~ Y2000, data=tuit, alternative="greater", conf.level=0.99)
test
## 
##  Pearson's product-moment correlation
## 
## data:  Y2014 and Y2000
## t = 5.2, df = 31, p-value = 6.1e-06
## alternative hypothesis: true correlation is greater than 0
## 99 percent confidence interval:
##  0.38761 1.00000
## sample estimates:
##     cor 
## 0.68247

Correlation test

Effect size: R-squared

test$estimate^2
##     cor 
## 0.46576

Correlation test

To test assumptions: check residual plots and QQ-plots for both possible linear models:

model1 = lm(Y2014 ~ Y2000, data=tuit)
model2 = lm(Y2000 ~ Y2014, data=tuit)
model1.augment = augment(model1)
model2.augment = augment(model2)
library(GGally)                  # provides ggmatrix for arranging plots in a grid
ggmatrix(list(
  model1.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
  model2.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
  model1.augment %>% gf_qq(~.resid) %>% gf_qqline(),
  model2.augment %>% gf_qq(~.resid) %>% gf_qqline()
), nrow=2, ncol=2)

Linear Regression test for association

Input: Vectors x, y of equal length, or variables dataset$x and dataset$y
Null hypothesis: \(\rho_{xy} = 0\)
Alternative: Greater, less, or not equal
Test statistic: \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\)

Requirements: the relationship should be linear, with residuals that have constant spread and are approximately normally distributed; check the residual plot and QQ-plot (see below).
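
The connection to the previous test: the t value reported for the slope in summary(model) is exactly the correlation test statistic \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\). A sketch, using the model from the earlier slides:

coef(summary(model))["Y2000", "t value"]   # about 5.2, same as the cor.test statistic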

Linear Regression test for association

Command: lm.

model = lm(Y2014 ~ Y2000, data=tuit)
model
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Coefficients:
## (Intercept)        Y2000  
##     5367.62         1.61

Linear Regression test for association

Make sure you report all relevant information, including model coefficients, confidence intervals, p-values, and test statistics. The summary(model) printout includes most of these as well as the effect size R-squared; confidence intervals for the coefficients come from confint(model).

model.summary = summary(model)
model.summary
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4196  -1489    -54   1647   3647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5367.621   1342.377     4.0  0.00037 ***
## Y2000          1.615      0.311     5.2  1.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2130 on 31 degrees of freedom
## Multiple R-squared:  0.466,  Adjusted R-squared:  0.449 
## F-statistic:   27 on 1 and 31 DF,  p-value: 1.21e-05

Linear Regression test for association

To report the summary information individually, the following helps:

| Value | Command |
|-------|---------|
| Model coefficients | coef(model) |
| Model coefficients with all details | coef(model.summary) |
| Test statistic with degrees of freedom | model.summary$fstatistic |
| Effect size (percentage of explained variance) | model.summary$r.squared |
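
A minimal usage sketch, assuming model and model.summary from the earlier slides; confint is standard R and supplies the confidence intervals mentioned above:

coef(model)                   # b0 and b1
coef(model.summary)           # estimates, standard errors, t values, p-values
model.summary$fstatistic      # F statistic with its degrees of freedom
model.summary$r.squared       # proportion of variance explained
confint(model, level=0.95)    # confidence intervals for the coefficients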

Linear Regression test for association

To test assumptions: check residual plots and QQ-plots:

model.augment = augment(model)
ggmatrix(list(
  model.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
  model.augment %>% gf_qq(~.resid) %>% gf_qqline()
), nrow=1, ncol=2)

Matched pairs t-test

   
Input: Vectors x, y of equal length, or variables dataset$x, dataset$y
Null hypothesis: Population mean difference \(\mu_{x-y}\) is equal to mu.0 (usually 0)
Alternative hypothesis: Population mean difference \(\mu_{x-y}\) is [less than / not equal / greater than] mu.0
Test statistic: \(T = (\overline{x-y} - \mu_0)/(s_{x-y}/\sqrt{n})\), which follows a T distribution with \(n-1\) degrees of freedom

Requirements: The differences x-y need to be appropriate for a one-sample test:

| Dataset size | Requires | How to check |
|--------------|----------|--------------|
| \(N\leq 15\) | Normal distribution | gf_qq |
| \(15<N\leq 40\) | No skew, no outliers | gf_histogram and gf_boxplot |
| \(40<N\leq\) hundreds | No outliers | gf_boxplot |
| hundreds \(<N\) | Few outliers | gf_boxplot |

Matched pairs t-test

Command: t.test with arguments mu, alternative, conf.level

# x and y here stand for paired measurement vectors of equal length
test = t.test(x, y, mu=5, alternative="two.sided", conf.level=0.99, paired=TRUE)
test

Matched pairs t-test

Effect size: Cohen’s \(d\) - library effsize, command cohen.d

library(effsize)                 # provides cohen.d
d = cohen.d(x, y, paired=TRUE)
d

Matched pairs t-test

To check validity, check the data size first, then use a QQ-plot or a boxplot to check the distribution.

NROW(mpg)                                       # sample size of the mpg dataset
gf_boxploth("hwy-cty" ~ hwy-cty, data=mpg)      # boxplot of the paired differences