Lecture 25

MVJ

12 April, 2018

Linear regression. The F-test, t-test on coefficients, correlation test.

Linear Regression

Recall that a linear function, or straight line, is a function of the form

\[ y(x) = b_0 + b_1x \]

A linear model is a straight line relationship between two variables.

This is the next step up in complexity after the one-sample and two-sample means tests.

Statistical Models

A statistical model is some way of producing a probability distribution for a response variable, possibly dependent on values for one or more predictor variables.

We will write \(y\) for responses, and \(x\) (or \(x_1, x_2, \dots\)) for predictor(s).

All statistical tests we learn build implicitly or explicitly on a choice of a model for the data: the one-sample t test, for example, models \(y \sim \mathcal N(\mu, \sigma)\), and the two-sample test models \((y|x) \sim \mathcal N(\mu_x, \sigma_x)\) where \(x\) labels the group.


Linear models

We can replace the categorical grouping variable from two-sample or two-way table testing with a numeric variable that defines the subpopulations.

In other words, let the predictor take numeric values, not categorical values.

We would transform the two-sample model

\[ (y | x) \sim \mathcal N(\mu_x, \sigma_x) \]

to a linear model

\[ (y|x) \sim \mathcal N(\beta_0 + \beta_1x, \sigma) \]

The model implicitly assumes that errors are normally distributed.
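
To make the model concrete, here is a minimal simulation sketch; the values \(\beta_0 = 2\), \(\beta_1 = 0.5\) and \(\sigma = 1.5\) are made up purely for illustration.

set.seed(17)
x = runif(100, 0, 10)                         # predictor values
y = 2 + 0.5*x + rnorm(100, mean=0, sd=1.5)    # (y|x) ~ N(beta0 + beta1*x, sigma)
gf_point(y ~ x)                               # scatter plot of the simulated data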

Linear models - example

(Example figures shown in the lecture; not included here.)

Simple linear regression

A simple linear regression model is \((y|x) \sim \mathcal N(\beta_0 + \beta_1x, \sigma)\).

This is to distinguish it from a multiple regression model, where more than one predictor is used.

In simple linear regression, the conditional means all lie on a straight line and have the same variance for all choices of \(x\) (homoscedastic).

The true line \(\beta_0+\beta_1x\) that contains all the means is called the population regression line. Inference in linear regression is about finding values \(\hat\beta_0\) and \(\hat\beta_1\) that estimate the population regression line with a sample regression line, and about finding confidence intervals for these quantities.

We can also study the correlation alone, producing confidence intervals as well as tests for values of the correlation coefficient.

Preliminary data analysis

Always start with plotting your data. In which of these cases is a simple linear regression reasonable?

Estimation, fit and residuals

Any model will split the variation in the response variable into a model (or fit) component and a residual component.

DATA = FIT + RESIDUAL

For a simple linear regression, the parameters \(\beta_0\) and \(\beta_1\) can be estimated using the sample correlation \(r_{xy}\), the sample means \(\overline x, \overline y\), and the sample standard deviations \(s_x, s_y\). We write \(\hat y\) for the model means function, and \(b_0\) and \(b_1\) for the estimates of \(\beta_0\) and \(\beta_1\):

\[\begin{align*} \hat y &= b_0 + b_1 x \\ b_1 &= r_{xy}\frac{s_y}{s_x} \\ b_0 &= \overline y - b_1\overline x \end{align*}\]
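
As a quick check, the estimates can be computed directly from these summary statistics and compared with the output of lm. This sketch assumes the tuit data used later in these slides is already loaded, with Y2014 as response and Y2000 as predictor.

b1 = cor(tuit$Y2000, tuit$Y2014) * sd(tuit$Y2014) / sd(tuit$Y2000)   # b1 = r * sy/sx
b0 = mean(tuit$Y2014) - b1 * mean(tuit$Y2000)                        # b0 = ybar - b1*xbar
c(b0, b1)
coef(lm(Y2014 ~ Y2000, data=tuit))   # should agree with the values above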


The residuals are the differences between observed and predicted responses. For data \((x_1,y_1), \dots, (x_n, y_n)\) and predictions \(\hat y_i = \hat y(x_i)\):

\[ e_i = y_i - \hat y_i = y_i - b_0 - b_1x_i \]

The only source of randomness in the model is from the residual component. We can estimate the standard deviation \(\sigma\) of the model as the standard deviation of the residuals:

\[ s^2 = \frac{\sum e_i^2}{n-2} \]
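
A sketch of this computation, assuming the model object fitted on the next slide; the value agrees with the residual standard error reported by summary(model).

e = resid(model)                    # residuals y_i - yhat_i
sqrt(sum(e^2) / (length(e) - 2))    # model standard deviation s; compare model.summary$sigma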

In R

A linear model can be fitted using the command lm. Assign a name to make information easily accessible. Additional important information can be found by assigning a name to summary(model). Plots are often easier using augment from the package broom.

model = lm(Y2014 ~ Y2000, data=tuit)
model
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Coefficients:
## (Intercept)        Y2000  
##     5367.62         1.61

In R

model.summary = summary(model)
model.summary
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4196  -1489    -54   1647   3647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5367.621   1342.377     4.0  0.00037 ***
## Y2000          1.615      0.311     5.2  1.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2130 on 31 degrees of freedom
## Multiple R-squared:  0.466,  Adjusted R-squared:  0.449 
## F-statistic:   27 on 1 and 31 DF,  p-value: 1.21e-05

In R

library(broom)                   # provides augment()
model.augment = augment(model)
model.augment
Y2014 Y2000 .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
17955 6852 16430.2 916.35 1524.806 0.18483 2144.6 0.07117 0.79235
18075 7002 16672.4 959.13 1402.630 0.20249 2147.6 0.06893 0.73689
14126 6513 15882.9 821.19 -1756.878 0.14844 2138.6 0.06954 -0.89322
14297 6333 15592.3 771.74 -1295.268 0.13110 2151.8 0.03206 -0.65193
13771 5170 13714.6 487.05 56.399 0.05221 2166.7 0.00002 0.02718
9734 5136 13659.7 480.27 -3925.708 0.05077 2038.0 0.09557 -1.89042
15938 4994 13430.4 453.55 2507.552 0.04528 2115.4 0.03438 1.20403
14889 4877 13241.6 433.68 1647.449 0.04140 2144.8 0.01346 0.78944
10186 4436 12529.6 381.23 -2343.556 0.03199 2122.6 0.02064 -1.11753
8784 4715 12980.0 409.90 -4196.002 0.03698 2021.2 0.07727 -2.00606
10991 4405 12479.5 379.13 -1488.506 0.03164 2149.0 0.00823 -0.70967
10995 4383 12444.0 377.79 -1448.987 0.03142 2149.9 0.00774 -0.69075
13373 4160 12084.0 371.04 1289.047 0.03030 2153.5 0.00589 0.61415
15589 4072 11941.9 371.91 3647.123 0.03045 2058.4 0.04741 1.73776
14421 4047 11901.5 372.52 2519.486 0.03055 2115.7 0.02271 1.20053
14686 3969 11775.6 375.46 2910.417 0.03103 2098.4 0.03081 1.38716
10868 3872 11619.0 381.24 -750.976 0.03199 2162.2 0.00212 -0.35811
14785 3848 11580.2 383.02 3204.772 0.03229 2083.4 0.03898 1.52845
10254 3819 11533.4 385.35 -1279.407 0.03269 2153.6 0.00629 -0.61031
11429 3788 11483.4 388.06 -54.358 0.03315 2166.7 0.00001 -0.02594
13757 3761 11439.8 390.60 2317.234 0.03358 2123.5 0.02125 1.10589
14224 3701 11342.9 396.82 2881.104 0.03466 2099.5 0.03398 1.37576
11094 3575 11139.5 412.32 -45.469 0.03742 2166.7 0.00001 -0.02174
8724 3522 11053.9 419.76 -2329.900 0.03878 2122.8 0.02508 -1.11494
8807 3204 10540.5 474.05 -1733.488 0.04947 2142.2 0.01811 -0.83418
10388 3188 10514.7 477.16 -126.656 0.05012 2166.6 0.00010 -0.06097
8430 3132 10424.2 488.28 -1994.244 0.05248 2134.2 0.02559 -0.96119
8616 2768 9836.6 568.29 -1220.565 0.07109 2154.3 0.01351 -0.59415
10760 2725 9767.1 578.48 992.859 0.07366 2158.5 0.00931 0.48398
11205 2348 9158.5 672.51 2046.526 0.09955 2130.6 0.05659 1.01184
6748 1941 8501.4 781.04 -1753.371 0.13428 2139.2 0.06062 -0.88412
11094 3454 10944.1 430.04 149.886 0.04071 2166.5 0.00011 0.07180
9461 3374 10815.0 443.12 -1353.954 0.04322 2151.9 0.00953 -0.64942

Validity for Simple Linear Regression

The simple linear regression assumes:

- the mean of \(y\) is a linear function of \(x\),
- the spread around the line is the same for every value of \(x\) (constant variance),
- the residuals are normally distributed,
- the observations are independent of each other.

To check these conditions, we use:

- a scatter plot of the response against the predictor, to look for non-linearity,
- a residual plot (.resid against .fitted), to look for curvature or changing spread,
- a QQ-plot of the residuals, to check normality.

All of the important information is available in augment(model).

Validity for Simple Linear Regression

Scatter plot: check for obvious departures from linearity.

model.augment %>% gf_point(Y2014 ~ Y2000)

Validity for Simple Linear Regression

Residual plot: plot .resid against .fitted. Look for the data to be evenly distributed above and below the line \(y=0\).

model.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0)

Validity for Simple Linear Regression

QQ-plot of the residuals.

model.augment %>% gf_qq(~.resid) %>% gf_qqline()

Sums of squared errors

We will have a lot of use for the following quantities:

- the total sum of squares \(SST = \sum (y_i - \overline y)^2\), with \(DFT = n-1\) degrees of freedom,
- the model (fit) sum of squares \(SSM = \sum (\hat y_i - \overline y)^2\), with \(DFM = 1\) degree of freedom,
- the residual (error) sum of squares \(SSE = \sum (y_i - \hat y_i)^2 = \sum e_i^2\), with \(DFE = n-2\) degrees of freedom.

These split the variation just like DATA = FIT + RESIDUAL: \(SST = SSM + SSE\) and \(DFT = DFM + DFE\).

For now, we will also introduce the corresponding mean squares \(MST = SST/DFT\), \(MSM = SSM/DFM\) and \(MSE = SSE/DFE\).

Inference for linear regression

A lot of our ability to do statistical testing and inference with linear regression comes from identifying several quantities whose sampling distributions follow Student's T distribution. If they follow Student's T, we can build tests and confidence intervals.

The T distribution often shows up to produce confidence intervals of the form \[ \text{parameter} = \text{estimate} \pm t^*SE_{\text{estimate}} \]

Recall that we defined \(b_0, b_1\) as the estimated line coefficients, \(s\) as the model standard deviation, and \(r_{xy}\) as the sample correlation.
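
For example, a confidence interval for the slope \(\beta_1\) can be built from the estimate and standard error in the summary output shown earlier; confint computes the same interval directly. (A sketch, assuming model and model.summary from the previous slides.)

est = coef(model.summary)["Y2000", "Estimate"]     # 1.615
se  = coef(model.summary)["Y2000", "Std. Error"]   # 0.311
tstar = qt(0.975, df=31)                           # critical value, df = n - 2
est + c(-1, 1) * tstar * se                        # 95% CI for the slope
confint(model, "Y2000", level=0.95)                # same interval directly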

ANOVA tables: is the slope non-zero?

To test whether \(\beta_1\neq 0\), the usual way is to compare the model to the one-sample means model where only \(\beta_0\) influences the mean.

The basic idea is that if \(\beta_1\) helps build an accurate model, then the variation explained by the slope term is large compared to the inherent error in the residuals.

Since all the parts of the calculation have different degrees of freedom, this is usually organized in an ANalysis Of VAriance (ANOVA) table:

| Type | Sum of Squares | Degrees of Freedom | Mean Squares | F-statistic | p-value |
|------|----------------|--------------------|--------------|-------------|---------|
| Model | SSM | \(1\) | MSM | MSM/MSE | 1-pf(F, df1=DFM, df2=DFE) |
| Residual (Error) | SSE | \(n-2\) | MSE | | |
| Total | SST | \(n-1\) | MST | | |

DFM + DFE = DFT
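
In R, the table is produced by anova; a sketch using the model fitted earlier:

anova(model)   # Df, Sum Sq, Mean Sq, F value and Pr(>F) for Y2000 and Residuals
# the p-value equals 1 - pf(F, df1=1, df2=31) for this model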

Correlation test

Input: Vectors x, y of equal length, or variables dataset$x and dataset$y
Null hypothesis: \(\rho_{xy} = 0\)
Alternative: Greater, less, or not equal
Test statistic: \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\)

Requirements: the relationship between x and y should be linear, with approximately normally distributed scatter around the line; check the residual plots and QQ-plots for both possible linear models (see below).
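
As a check, the test statistic can be computed directly from the sample correlation reported in the cor.test output below (n = 33 observations here):

r = 0.68247; n = 33
r * sqrt(n - 2) / sqrt(1 - r^2)   # about 5.2, matching the t value from cor.test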

Correlation test

Command: cor.test with arguments alternative and conf.level.

test = cor.test(Y2014 ~ Y2000, data=tuit, alternative="greater", conf.level=0.99)
test
## 
##  Pearson's product-moment correlation
## 
## data:  Y2014 and Y2000
## t = 5.2, df = 31, p-value = 6.1e-06
## alternative hypothesis: true correlation is greater than 0
## 99 percent confidence interval:
##  0.38761 1.00000
## sample estimates:
##     cor 
## 0.68247

Correlation test

Effect size: R-squared

test$estimate^2
##     cor 
## 0.46576

Correlation test

To test assumptions: check residual plots and QQ-plots for both possible linear models:

model1 = lm(Y2014 ~ Y2000, data=tuit)
model2 = lm(Y2000 ~ Y2014, data=tuit)
model1.augment = augment(model1)
model2.augment = augment(model2)
library(GGally)                  # provides ggmatrix for arranging plots in a grid
ggmatrix(list(
  model1.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
  model2.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
  model1.augment %>% gf_qq(~.resid) %>% gf_qqline(),
  model2.augment %>% gf_qq(~.resid) %>% gf_qqline()
), nrow=2, ncol=2)

Linear Regression test for association

Input: Vectors x, y of equal length, or variables dataset$x and dataset$y
Null hypothesis: \(\rho_{xy} = 0\)
Alternative: Greater, less, or not equal
Test statistic: \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\)

Requirements: the relationship should be linear, with residuals that have constant spread and are approximately normally distributed; check the residual plot and QQ-plot (see below).
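
The connection to the previous test: the t value reported for the slope in summary(model) is exactly the correlation test statistic \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\). A sketch, using the model from the earlier slides:

coef(summary(model))["Y2000", "t value"]   # about 5.2, same as the cor.test statistic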

Linear Regression test for association

Command: lm.

model = lm(Y2014 ~ Y2000, data=tuit)
model
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Coefficients:
## (Intercept)        Y2000  
##     5367.62         1.61

Linear Regression test for association

Make sure you report all relevant information, including model coefficients, confidence intervals, p-values, and test statistics. The summary(model) printout includes most of these as well as the effect size R-squared; confidence intervals for the coefficients come from confint(model).

model.summary = summary(model)
model.summary
## 
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4196  -1489    -54   1647   3647 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5367.621   1342.377     4.0  0.00037 ***
## Y2000          1.615      0.311     5.2  1.2e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2130 on 31 degrees of freedom
## Multiple R-squared:  0.466,  Adjusted R-squared:  0.449 
## F-statistic:   27 on 1 and 31 DF,  p-value: 1.21e-05

Linear Regression test for association

To report the summary information individually, the following helps:

| Value | Command |
|-------|---------|
| Model coefficients | coef(model) |
| Model coefficients with all details | coef(model.summary) |
| Test statistic with degrees of freedom | model.summary$fstatistic |
| Effect size (percentage of explained variance) | model.summary$r.squared |
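
A minimal usage sketch, assuming model and model.summary from the earlier slides; confint is standard R and supplies the confidence intervals mentioned above:

coef(model)                   # b0 and b1
coef(model.summary)           # estimates, standard errors, t values, p-values
model.summary$fstatistic      # F statistic with its degrees of freedom
model.summary$r.squared       # proportion of variance explained
confint(model, level=0.95)    # confidence intervals for the coefficients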

Linear Regression test for association

To test assumptions: check residual plots and QQ-plots:

model.augment = augment(model)
ggmatrix(list(
  model.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
  model.augment %>% gf_qq(~.resid) %>% gf_qqline()
), nrow=1, ncol=2)

Matched pairs t-test

   
Input: Vectors x, y of equal length, or variables dataset$x, dataset$y
Null hypothesis: Population mean difference \(\mu_{x-y}\) is equal to mu.0 (usually 0)
Alternative hypothesis: Population mean difference \(\mu_{x-y}\) is [less than / not equal / greater than] mu.0
Test statistic: \(T = (\overline{x-y} - \mu_0)/(s_{x-y}/\sqrt{n})\), which follows a T distribution with \(n-1\) degrees of freedom

Requirements: The differences x-y need to be appropriate for a one-sample test:

| Dataset size | Requires | How to check |
|--------------|----------|--------------|
| \(N\leq 15\) | Normal distribution | gf_qq |
| \(15<N\leq 40\) | No skew, no outliers | gf_histogram and gf_boxplot |
| \(40<N\leq\) hundreds | No outliers | gf_boxplot |
| hundreds \(<N\) | Few outliers | gf_boxplot |

Matched pairs t-test

Command: t.test with arguments mu, alternative, conf.level

# x and y here stand for paired measurement vectors of equal length
test = t.test(x, y, mu=5, alternative="two.sided", conf.level=0.99, paired=TRUE)
test

Matched pairs t-test

Effect size: Cohen’s \(d\) - library effsize, command cohen.d

library(effsize)                 # provides cohen.d
d = cohen.d(x, y, paired=TRUE)
d

Matched pairs t-test

To check validity, check the data size first, then use a QQ-plot or a boxplot to check the distribution.

NROW(mpg)                                       # sample size of the mpg dataset
gf_boxploth("hwy-cty" ~ hwy-cty, data=mpg)      # boxplot of the paired differences