MVJ
12 April, 2018
Linear regression. The F-test, t-test on coefficients, correlation test.
Recall that a linear function, or straight line, is a function of the form
\[ y(x) = b_0 + b_1x \]
A linear model is a straight line relationship between two variables.
This is the next step up in complexity after the one-sample and two-sample means tests.
A statistical model is some way of producing a probability distribution for a response variable, possibly dependent on values for one or more predictor variables.
We will write \(y\) for responses, and \(x\) (or \(x_1, x_2, \dots\)) for predictor(s).
All statistical tests we learn build implicitly or explicitly on a choice of a model for the data.
We can replace the categorical subpopulations in two-sample or two-way table testing with subpopulations controlled by a numeric variable.
In other words, we let the predictor take numeric values instead of categorical values.
We would transform the two-sample model
\[ (y | x) \sim \mathcal N(\mu_x, \sigma_x) \]
to a linear model
\[ (y|x) \sim \mathcal N(\beta_0 + \beta_1x, \sigma) \]
The model implicitly assumes that errors are normally distributed.
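To make the model concrete, here is a minimal simulation sketch; all numeric values (beta0, beta1, sigma, the seed, the range of x) are arbitrary illustration choices, not taken from the lecture data.
set.seed(1)                         # arbitrary seed, for reproducibility only
beta0 = 5; beta1 = 2; sigma = 1.5   # arbitrary illustration values
x = runif(50, 0, 10)
y = rnorm(50, mean = beta0 + beta1 * x, sd = sigma)  # (y|x) ~ N(beta0 + beta1*x, sigma)
plot(x, y)
abline(beta0, beta1)                # the population regression line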
A simple linear regression model is \((y|x) \sim \mathcal N(\beta_0 + \beta_1x, \sigma)\).
This is to distinguish it from a multiple regression model, where more than one predictor is used.
In simple linear regression, the conditional means all lie on a straight line and have the same variance for all choices of \(x\) (homoscedastic).
The true line \(\beta_0+\beta_1x\) that contains all the means is called the population regression line. Inference in linear regression is about finding values \(\hat\beta_0\) and \(\hat\beta_1\) to estimate the population regression line with a sample regression line, and about finding confidence intervals for these quantities.
We can also study the correlation alone, and produce inference as well as testing for values of the correlation coefficient.
Always start by plotting your data, and ask whether a simple linear regression is reasonable for it.
Any model will split the variation in the response variable into a model (or fit) component and a residual component.
DATA = FIT + RESIDUAL
For a simple linear regression, the parameters \(\beta_0\) and \(\beta_1\) can be estimated using the sample correlation \(r_{xy}\), the sample means \(\overline x, \overline y\), and the sample standard deviations \(s_x, s_y\). We write \(\hat y\) for the model means function, and \(b_0\) and \(b_1\) for the estimates of \(\beta_0\) and \(\beta_1\):
\[\begin{align*} \hat y &= b_0 + b_1 x \\ b_1 &= r_{xy}\frac{s_y}{s_x} \\ b_0 &= \overline y - b_1\overline x \end{align*}\]
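These formulas are easy to check numerically. A minimal sketch, assuming `x` and `y` are numeric vectors of equal length holding the data:
b1 = cor(x, y) * sd(y) / sd(x)   # slope from correlation and spreads
b0 = mean(y) - b1 * mean(x)      # intercept from the means
c(b0, b1)
coef(lm(y ~ x))                  # lm returns the same two estimates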
The residuals are the differences between observed and expected responses. For data \((x_1,y_1), \dots, (x_n, y_n)\) and predictions \(\hat y_i = \hat y(x_i)\):
\[ e_i = y_i - \hat y_i = y_i - b_0 - b_1x_i \]
The only source of randomness in the model is the residual component. We can estimate the model variance \(\sigma^2\) (and hence the standard deviation \(\sigma\)) from the residuals:
\[ s^2 = \frac{\sum e_i^2}{n-2} \]
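A quick numerical sketch, under the same assumption that `x` and `y` hold the data and `b0`, `b1` are the estimates computed above:
e = y - (b0 + b1 * x)                 # residuals
s = sqrt(sum(e^2) / (length(y) - 2))  # divide by n - 2, not n - 1
s
summary(lm(y ~ x))$sigma              # the same estimate, from lm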
A linear model can be fitted using the command `lm`. Assign a name to make the information easily accessible. Additional important information can be found by assigning a name to `summary(model)`. Plots are often easier using `augment` from the package `broom`.
model = lm(Y2014 ~ Y2000, data=tuit)
model
##
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
##
## Coefficients:
## (Intercept) Y2000
## 5367.62 1.61
model.summary = summary(model)
model.summary
##
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4196 -1489 -54 1647 3647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5367.621 1342.377 4.0 0.00037 ***
## Y2000 1.615 0.311 5.2 1.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2130 on 31 degrees of freedom
## Multiple R-squared: 0.466, Adjusted R-squared: 0.449
## F-statistic: 27 on 1 and 31 DF, p-value: 1.21e-05
model.augment = augment(model)
model.augment
Y2014 | Y2000 | .fitted | .se.fit | .resid | .hat | .sigma | .cooksd | .std.resid |
---|---|---|---|---|---|---|---|---|
17955 | 6852 | 16430.2 | 916.35 | 1524.806 | 0.18483 | 2144.6 | 0.07117 | 0.79235 |
18075 | 7002 | 16672.4 | 959.13 | 1402.630 | 0.20249 | 2147.6 | 0.06893 | 0.73689 |
14126 | 6513 | 15882.9 | 821.19 | -1756.878 | 0.14844 | 2138.6 | 0.06954 | -0.89322 |
14297 | 6333 | 15592.3 | 771.74 | -1295.268 | 0.13110 | 2151.8 | 0.03206 | -0.65193 |
13771 | 5170 | 13714.6 | 487.05 | 56.399 | 0.05221 | 2166.7 | 0.00002 | 0.02718 |
9734 | 5136 | 13659.7 | 480.27 | -3925.708 | 0.05077 | 2038.0 | 0.09557 | -1.89042 |
15938 | 4994 | 13430.4 | 453.55 | 2507.552 | 0.04528 | 2115.4 | 0.03438 | 1.20403 |
14889 | 4877 | 13241.6 | 433.68 | 1647.449 | 0.04140 | 2144.8 | 0.01346 | 0.78944 |
10186 | 4436 | 12529.6 | 381.23 | -2343.556 | 0.03199 | 2122.6 | 0.02064 | -1.11753 |
8784 | 4715 | 12980.0 | 409.90 | -4196.002 | 0.03698 | 2021.2 | 0.07727 | -2.00606 |
10991 | 4405 | 12479.5 | 379.13 | -1488.506 | 0.03164 | 2149.0 | 0.00823 | -0.70967 |
10995 | 4383 | 12444.0 | 377.79 | -1448.987 | 0.03142 | 2149.9 | 0.00774 | -0.69075 |
13373 | 4160 | 12084.0 | 371.04 | 1289.047 | 0.03030 | 2153.5 | 0.00589 | 0.61415 |
15589 | 4072 | 11941.9 | 371.91 | 3647.123 | 0.03045 | 2058.4 | 0.04741 | 1.73776 |
14421 | 4047 | 11901.5 | 372.52 | 2519.486 | 0.03055 | 2115.7 | 0.02271 | 1.20053 |
14686 | 3969 | 11775.6 | 375.46 | 2910.417 | 0.03103 | 2098.4 | 0.03081 | 1.38716 |
10868 | 3872 | 11619.0 | 381.24 | -750.976 | 0.03199 | 2162.2 | 0.00212 | -0.35811 |
14785 | 3848 | 11580.2 | 383.02 | 3204.772 | 0.03229 | 2083.4 | 0.03898 | 1.52845 |
10254 | 3819 | 11533.4 | 385.35 | -1279.407 | 0.03269 | 2153.6 | 0.00629 | -0.61031 |
11429 | 3788 | 11483.4 | 388.06 | -54.358 | 0.03315 | 2166.7 | 0.00001 | -0.02594 |
13757 | 3761 | 11439.8 | 390.60 | 2317.234 | 0.03358 | 2123.5 | 0.02125 | 1.10589 |
14224 | 3701 | 11342.9 | 396.82 | 2881.104 | 0.03466 | 2099.5 | 0.03398 | 1.37576 |
11094 | 3575 | 11139.5 | 412.32 | -45.469 | 0.03742 | 2166.7 | 0.00001 | -0.02174 |
8724 | 3522 | 11053.9 | 419.76 | -2329.900 | 0.03878 | 2122.8 | 0.02508 | -1.11494 |
8807 | 3204 | 10540.5 | 474.05 | -1733.488 | 0.04947 | 2142.2 | 0.01811 | -0.83418 |
10388 | 3188 | 10514.7 | 477.16 | -126.656 | 0.05012 | 2166.6 | 0.00010 | -0.06097 |
8430 | 3132 | 10424.2 | 488.28 | -1994.244 | 0.05248 | 2134.2 | 0.02559 | -0.96119 |
8616 | 2768 | 9836.6 | 568.29 | -1220.565 | 0.07109 | 2154.3 | 0.01351 | -0.59415 |
10760 | 2725 | 9767.1 | 578.48 | 992.859 | 0.07366 | 2158.5 | 0.00931 | 0.48398 |
11205 | 2348 | 9158.5 | 672.51 | 2046.526 | 0.09955 | 2130.6 | 0.05659 | 1.01184 |
6748 | 1941 | 8501.4 | 781.04 | -1753.371 | 0.13428 | 2139.2 | 0.06062 | -0.88412 |
11094 | 3454 | 10944.1 | 430.04 | 149.886 | 0.04071 | 2166.5 | 0.00011 | 0.07180 |
9461 | 3374 | 10815.0 | 443.12 | -1353.954 | 0.04322 | 2151.9 | 0.00953 | -0.64942 |
The simple linear regression model assumes that the mean response depends linearly on the predictor, that the residuals are normally distributed, and that the variance is the same for all values of the predictor.
To check these conditions:
All of the important information is available in `augment(model)`.
Scatter plot: check for obvious deviations from linearity.
model.augment %>% gf_point(Y2014 ~ Y2000)
Residual plot: plot `.resid` against `.fitted`. Look for the data to be evenly distributed above and below the line \(y=0\).
model.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0)
QQ-plot of the residuals.
model.augment %>% gf_qq(~.resid) %>% gf_qqline()
A lot of our ability to do statistical testing and inference with linear regressions comes from identifying several quantities whose sampling distributions follow Student’s T distribution. If they follow Student’s T, we can build tests and confidence intervals.
The T distribution often shows up to produce confidence intervals of the form \[ \text{parameter} = \text{estimate} \pm t^*SE_{\text{estimate}} \]
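As a sketch of this recipe, the confidence interval for the slope in the tuition model above can be assembled by hand from the summary coefficient table; `confint` performs the same computation.
est = coef(model.summary)["Y2000", "Estimate"]
se = coef(model.summary)["Y2000", "Std. Error"]
tstar = qt(0.975, df = model$df.residual)  # t* for 95% confidence
est + c(-1, 1) * tstar * se                # estimate ± t* SE
confint(model, "Y2000", level = 0.95)      # the same interval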
Recall we defined \(b_0, b_1\) as the estimated line coefficients, \(s\) as the model standard deviation, and \(r_{xy}\) as the sample correlation.
To test whether \(\beta_1\neq 0\), the usual way is to compare the model to the one-sample means model where only \(\beta_0\) influences the mean.
The basic idea is that if \(\beta_1\) helps build an accurate model, then the variance introduced by \(\beta_1\) is larger than the inherent error in the residuals.
Since all the parts of the calculation have different degrees of freedom, this is usually studied using an ANalysis Of VAriance (ANOVA) table:
Type | Sum of Squares | Degrees of Freedom | Mean Squares | F-statistic | p-value |
---|---|---|---|---|---|
Model | SSM | \(1\) | MSM | MSM/MSE | 1-pf(F, df1=DFM, df2=DFE) |
Residual (Error) | SSE | \(n-2\) | MSE | | |
Total | SST | \(n-1\) | MST | | |
Note that the degrees of freedom add up: DFM + DFE = DFT, just as SSM + SSE = SST.
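In R, this table is produced by applying `anova` to a fitted model. Continuing with the tuition model from above:
model.anova = anova(model)
model.anova
# The reported p-value is exactly 1 - pf(F, df1=DFM, df2=DFE):
1 - pf(model.anova$`F value`[1], df1 = model.anova$Df[1], df2 = model.anova$Df[2])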
Input | Vectors `x`, `y` of equal length, or variables `dataset$x` and `dataset$y` |
---|---|
Null hypothesis | \(\rho_{xy} = 0\) |
Alternative | Greater, less, not equal |
Test statistic | \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\) |
Requirements: the data should be appropriate for a simple linear regression; see the residual and QQ-plot checks below.
Command: `cor.test` with arguments `alternative` and `conf.level`.
test = cor.test(~ Y2014 + Y2000, data=tuit, alternative="greater", conf.level=0.99)
test
##
## Pearson's product-moment correlation
##
## data: Y2014 and Y2000
## t = 5.2, df = 31, p-value = 6.1e-06
## alternative hypothesis: true correlation is greater than 0
## 99 percent confidence interval:
## 0.38761 1.00000
## sample estimates:
## cor
## 0.68247
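The reported t value follows from the test statistic formula above; a quick check using the printed sample correlation and df = 31 (so n = 33):
r = 0.68247
n = 33
r * sqrt(n - 2) / sqrt(1 - r^2)   # approximately 5.2, matching the output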
Effect size: R-squared
test$estimate^2
## cor
## 0.46576
To test assumptions: check residual plots and QQ-plots for both possible linear models:
model1 = lm(Y2014 ~ Y2000, data=tuit)
model2 = lm(Y2000 ~ Y2014, data=tuit)
model1.augment = augment(model1)
model2.augment = augment(model2)
ggmatrix(list(
model1.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
model2.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
model1.augment %>% gf_qq(~.resid) %>% gf_qqline(),
model2.augment %>% gf_qq(~.resid) %>% gf_qqline()
), nrow=2, ncol=2)
Input | Vectors `x`, `y` of equal length, or variables `dataset$x` and `dataset$y` |
---|---|
Null hypothesis | \(\beta_1 = 0\) (equivalently, \(\rho_{xy} = 0\)) |
Alternative | Greater, less, not equal |
Test statistic | \(r_{xy}\sqrt{n-2}/\sqrt{1-r_{xy}^2}\), which equals the slope t value \(b_1/SE_{b_1}\) |
Requirements: see the residual and QQ-plot checks below.
Command: `lm`.
model = lm(Y2014 ~ Y2000, data=tuit)
model
##
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
##
## Coefficients:
## (Intercept) Y2000
## 5367.62 1.61
Make sure you report all relevant information, including model coefficients, confidence intervals, p-values, and test statistics. The `summary(model)` printout includes all of these as well as the effect size R-squared.
model.summary = summary(model)
model.summary
##
## Call:
## lm(formula = Y2014 ~ Y2000, data = tuit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4196 -1489 -54 1647 3647
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5367.621 1342.377 4.0 0.00037 ***
## Y2000 1.615 0.311 5.2 1.2e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2130 on 31 degrees of freedom
## Multiple R-squared: 0.466, Adjusted R-squared: 0.449
## F-statistic: 27 on 1 and 31 DF, p-value: 1.21e-05
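Note that the slope t value (5.2) agrees with the correlation test statistic computed earlier; a quick check, assuming `test` still holds the cor.test result from above:
coef(model.summary)["Y2000", "t value"]   # slope t value from the summary
test$statistic                            # the cor.test statistic is the same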
To report the summary information individually, the following helps:
Value | Command |
---|---|
Model coefficients | `coef(model)` |
Model coefficients with all details | `coef(model.summary)` |
Test statistic with degrees of freedom | `model.summary$fstatistic` |
Effect size - percentage of explained variance | `model.summary$r.squared` |
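For instance, continuing with the tuition model:
coef(model)                  # b0 and b1
coef(model.summary)          # estimates, standard errors, t values, p-values
model.summary$fstatistic     # F statistic with its degrees of freedom
model.summary$r.squared      # proportion of variance explained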
To test assumptions: check residual plots and QQ-plots:
model.augment = augment(model)
ggmatrix(list(
model.augment %>% gf_point(.resid ~ .fitted) %>% gf_hline(yintercept=0),
model.augment %>% gf_qq(~.resid) %>% gf_qqline()
), nrow=1, ncol=2)
Input | Vectors `x`, `y` of equal length, or variables `dataset$x` and `dataset$y` |
---|---|
Null hypothesis | Population mean difference \(\mu_{x-y}\) is equal to `mu.0` (usually 0) |
Alternative hypothesis | Population mean difference \(\mu_{x-y}\) is [less than / not equal / greater than] `mu.0` |
Test statistic | \(T = (\overline{x-y} - \mu_0)/(s_{x-y}/\sqrt{n})\), with \(n-1\) degrees of freedom |
Requirements: the differences `x-y` need to be appropriate for a one-sample test:
Dataset size | Requires | How to check |
---|---|---|
\(N\leq 15\) | Normal distribution | `gf_qq` |
\(15<N\leq40\) | No skew, no outliers | `gf_histogram` and `gf_boxplot` |
\(40<N\leq\) hundreds | No outliers | `gf_boxplot` |
hundreds \(<N\) | Few outliers | `gf_boxplot` |
Command: `t.test` with arguments `mu`, `alternative`, and `conf.level`.
test = t.test(x, y, mu=5, alternative="two.sided", conf.level=0.99, paired=TRUE)
test
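A hand computation of the paired test statistic, using small hypothetical vectors (not from any dataset in these notes) and `mu.0 = 5` as in the example above:
x = c(12, 15, 11, 14, 13)   # hypothetical measurements
y = c(7, 9, 6, 8, 7)
d = x - y
(mean(d) - 5) / (sd(d) / sqrt(length(d)))   # equals t.test(x, y, mu=5, paired=TRUE)$statistic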
Effect size: Cohen’s \(d\) - library `effsize`, command `cohen.d`.
d = cohen.d(x, y, paired=TRUE)
d
To check validity, check the data size first, then use a QQ-plot or a boxplot to check the distribution:
NROW(mpg)
gf_boxploth("hwy-cty" ~ hwy-cty, data=mpg)