Lecture 6

MVJ

12 April, 2018

Simple Models

One task for statistics is to make predictions with quantifiable certainty.

To produce a prediction, we use covariates (also called predictors or independent variables) to inform the analysis.

A model takes in predictors and produces a prediction.

Most models are determined by parameters. Using data to find good values for parameters is called fitting the model.

Simple Models

One very simple model uses no predictors, and has two parameters: \(\mu\), \(\sigma\)

The model would predict \(y\sim\mathcal N(\mu,\sigma)\).
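As a sketch in R (using Petal.Length from iris purely as an illustration), fitting this model amounts to estimating \(\mu\) and \(\sigma\) from the data:

# Estimates of mu and sigma: the sample mean and sample standard deviation
mu_hat    <- mean(iris$Petal.Length)
sigma_hat <- sd(iris$Petal.Length)

# Equivalently, an intercept-only linear model recovers the same mu_hat
lm(Petal.Length ~ 1, data=iris)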

One step up in complexity would be to use a straight line: using a predictor \(x\) and parameters \(a, b, \sigma\), a linear model would predict

\[ y \sim \mathcal N(ax+b, \sigma) \]

A different way to think about this model would be

\[ y = ax + b + \varepsilon, \qquad \varepsilon \sim \mathcal N(0,\sigma) \]

The model is trained on paired data: values \(x_i, y_i\) that are pairwise connected.
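A minimal simulation sketch of such paired data (the parameter values \(a=2\), \(b=1\), \(\sigma=0.5\) are made up for illustration):

set.seed(1)
a <- 2; b <- 1; sigma <- 0.5
x <- runif(50, 0, 10)                       # predictor values x_i
y <- rnorm(50, mean = a*x + b, sd = sigma)  # y_i ~ N(a*x_i + b, sigma)
sim <- data.frame(x, y)                     # the paired training data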

Simple Models: which is closer to a straight line?

Correlation

Correlation measures the extent to which one variable tends to be large when the other variable is large.

It does this by averaging the products of the z-scores of the paired values:

\[ r_{xy} = \frac{1}{n-1}\sum z_{x_i}z_{y_i} = \frac{1}{n-1}\sum \left(\frac{x_i-\overline x}{s_x}\right) \left(\frac{y_i-\overline y}{s_y}\right) \]

\(r_{xy}\) is calculated with the cor function:

cor(V1 ~ V2, data=dataset)
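For example (with the iris data, and assuming the mosaic package is loaded to provide the formula interface), the cor function and the z-score definition agree:

library(mosaic)

cor(Petal.Length ~ Petal.Width, data=iris)

# The same value from the z-score definition
with(iris, sum(scale(Petal.Length) * scale(Petal.Width)) / (length(Petal.Length) - 1))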

Correlation

If \(r_{xy}\) is near 0, it only means that the data do not lie close to a straight line; the variables may still be strongly related, just through a more complex (non-linear) model.
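A quick sketch of this: a perfect quadratic relationship whose correlation is essentially zero.

x <- seq(-3, 3, by=0.1)
y <- x^2          # y is completely determined by x
cor(x, y)         # yet the correlation is (essentially) 0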

Simple Models: which is closer to a straight line?

Some example correlations

Linear Model

gf_point(Petal.Length ~ Petal.Width, data=iris) %>% 
  gf_smooth(method="lm")

Refresher: Straight Lines

A straight line is a function described by

\[ y = b_0 + b_1\cdot x \]

Linear Model

Given paired values \((x_1,y_1), (x_2,y_2), \dots, (x_n, y_n)\), a least squares regression line is a straight line

\[ y = b_0 + b_1\cdot x \]

such that the sum of square prediction errors is minimized.
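Written out, the quantity being minimized and its standard closed-form solution are

\[ \text{SSE}(b_0, b_1) = \sum_{i=1}^n \big(y_i - (b_0 + b_1 x_i)\big)^2 \]

\[ b_1 = r_{xy}\,\frac{s_y}{s_x}, \qquad b_0 = \overline y - b_1 \overline x \]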

Least Squares Model

Using R:

lm(y ~ x, data=dataset)

Least Squares Model

lm(Petal.Length ~ Petal.Width, data=iris)
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Petal.Width  
##       1.084        2.230

Least Squares Model

summary(lm(Petal.Length ~ Petal.Width, data=iris))
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16

Commands with fitted linear models

coef(lm(Petal.Length ~ Petal.Width, data=iris)) %>% kable

(Intercept)   1.083558
Petal.Width   2.229941

residuals(lm(Petal.Length ~ Petal.Width, data=iris)) %>% favstats %>% kable(digits=2)

  min    Q1  median    Q3   max  mean    sd    n  missing
-1.34  -0.3   -0.03  0.26  1.39     0  0.48  150        0

R-squared - explained variance

The squared correlation measures the proportion of the variance in \(y\) that is explained by the model.

Total variance decomposes into the variance of the predicted values and the variance of the residuals.

\[ r_{xy}^2 = \frac{\text{variance of }\hat y}{\text{variance of all }y_i} \]
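As a check (a sketch using the iris model fitted above; both values match the Multiple R-squared of 0.9271 reported by summary):

fit <- lm(Petal.Length ~ Petal.Width, data=iris)

cor(iris$Petal.Length, iris$Petal.Width)^2   # r squared
var(fitted(fit)) / var(iris$Petal.Length)    # variance of y-hat over variance of y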

Detailed example

BEER                          Brewery         Calories  Carbohydrates  Alcohol  Type
American Amber Lager          Straub Brewery       136           10.5      4.1  Domestic
American Lager                Straub Brewery       132           10.5      4.1  Domestic
American Light                Straub Brewery        96            7.6      3.2  Domestic
Anchor Steam                  Anchor               153           16.0      4.9  Domestic
Anheuser Busch Natural Light  Anheuser Busch        95            3.2      4.2  Domestic
Anheuser Busch Natural Ice    Anheuser Busch       157            8.9      5.9  Domestic

Detailed example

Detailed example

lm(Calories ~ Alcohol, data=beerd) %>% summary %>% tidy %>% kable
term         estimate  std.error  statistic    p.value
(Intercept)   6.41895   5.758141   1.114761  0.2666564
Alcohol      28.34687   1.062580  26.677404  0.0000000
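A rough use of these coefficients (5% alcohol is just an illustrative value):

# predicted calories for a beer with 5% alcohol
6.41895 + 28.34687 * 5   # about 148 calories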

Diagnostics

A linear regression can fail in different ways:

Influential outliers

Curved relationship

Heteroscedasticity


How can we tell?

Residual plot: scatter plot of \(\hat y\) vs. residuals.

Using \(\hat y\) instead of \(x_i\) makes this generalizable to models with more than one predictor.
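A sketch of such a plot for the iris model fitted earlier, in the same ggformula style:

fit <- lm(Petal.Length ~ Petal.Width, data=iris)

# Look for curvature, funnel shapes, or isolated extreme points
gf_point(resid(fit) ~ fitted(fit))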

Caution