MVJ
12 April, 2018
One task for statistics is to make predictions with quantifiable certainty.
To produce a prediction, we use covariates (also called predictors or independent variables) to inform the analysis.
A model takes in predictors and produces a prediction.
Most models are determined by parameters. Using data to find good values for parameters is called fitting the model.
One very simple model uses no predictors and has two parameters: \(\mu\) and \(\sigma\).
The model would predict \(y\sim\mathcal N(\mu,\sigma)\).
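As a minimal sketch (assuming a numeric vector `y` of observed values), fitting this model amounts to estimating the two parameters from the data:

```r
# Fit the no-predictor model: estimate mu and sigma from the data.
# `y` is assumed to be a numeric vector of observed values.
mu.hat    <- mean(y)   # sample mean estimates mu
sigma.hat <- sd(y)     # sample standard deviation estimates sigma

# The model then predicts any new observation as N(mu.hat, sigma.hat).
```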
One step up in complexity would be to use a straight line: using a predictor \(x\) and parameters \(a,b,\sigma\) a linear model would predict
\[ y \sim \mathcal N(ax+b, \sigma) \]
A different way to think about this model is
\[ y = ax + b + \epsilon, \qquad \epsilon \sim \mathcal N(0, \sigma) \]
where \(\epsilon\) is the random prediction error.
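To make this generative reading concrete, here is a small simulation sketch; the parameter values are made up purely for illustration:

```r
# Simulate data from the linear model y = a*x + b + noise,
# with illustrative (made-up) parameter values.
set.seed(42)
a <- 2; b <- 1; sigma <- 0.5
x <- runif(50, 0, 10)
y <- a * x + b + rnorm(50, mean = 0, sd = sigma)
```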
The model is trained on paired data: observations \((x_i, y_i)\) where each \(x_i\) is connected to its \(y_i\).
Correlation measures the extent to which one variable is large at the same time as another variable is large.
It does this by measuring the average product of z-scores across the paired observations:
\[ r_{xy} = \frac{1}{n-1}\sum z_{x_i}z_{y_i} = \frac{1}{n-1}\sum \left(\frac{x_i-\overline x}{s_x}\right) \left(\frac{y_i-\overline y}{s_y}\right) \]
\(r_{xy}\) is calculated with the `cor` function:

```r
cor(V1 ~ V2, data=dataset)
```
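As a quick sanity check, the z-score formula agrees with `cor`; a sketch on the built-in `iris` data, assuming mosaic is loaded for the formula interface:

```r
# Check the z-score formula against cor() on the iris data.
library(mosaic)  # provides the formula interface to cor()
x <- iris$Petal.Width
y <- iris$Petal.Length
sum(scale(x) * scale(y)) / (length(x) - 1)    # average product of z-scores
cor(Petal.Length ~ Petal.Width, data = iris)  # same value, about 0.963
```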
If \(r_{xy}\) is near 0, it only means that the data does not lie close to a straight line; the variables may still be strongly related through a more complex model.
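For instance, in this made-up example a perfect quadratic relationship has correlation essentially zero:

```r
# A strong but nonlinear relationship with correlation near zero.
x <- seq(-3, 3, length.out = 100)
y <- x^2        # y is completely determined by x
cor(x, y)       # essentially 0: no *linear* association
```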
In contrast, the `iris` petal measurements do lie close to a straight line:

```r
gf_point(Petal.Length ~ Petal.Width, data=iris) %>%
  gf_smooth(method="lm")
```
A straight line is a function described by
\[ y = b_0 + b_1\cdot x \]
Given paired values \((x_1,y_1), (x_2,y_2), \dots, (x_n, y_n)\), a least squares regression line is a straight line
\[ y = b_0 + b_1\cdot x \]
such that the sum of square prediction errors is minimized.
Using R:

```r
lm(y ~ x, data=dataset)
```

```r
lm(Petal.Length ~ Petal.Width, data=iris)
```

```
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Petal.Width  
##       1.084        2.230
```
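As a sanity check, we can minimize the sum of squared prediction errors directly; a sketch using `optim` that should recover roughly the same coefficients:

```r
# Minimize the sum of squared prediction errors directly and compare
# with the coefficients lm() reports.
sse <- function(b) {
  with(iris, sum((Petal.Length - (b[1] + b[2] * Petal.Width))^2))
}
optim(c(0, 0), sse)$par   # close to (1.084, 2.230) from lm()
```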
```r
summary(lm(Petal.Length ~ Petal.Width, data=iris))
```

```
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
```
| Function | Purpose |
|---|---|
| `lm` | Fit a model |
| `summary` | Print out a detailed summary with significance tests |
| `coef` | Return the coefficients of the model |
| `residuals` | Return the residuals from fitting |

```r
coef(lm(Petal.Length ~ Petal.Width, data=iris)) %>% kable
```
|             |          |
|-------------|----------|
| (Intercept) | 1.083558 |
| Petal.Width | 2.229941 |
```r
residuals(lm(Petal.Length ~ Petal.Width, data=iris)) %>% favstats %>% kable(digits=2)
```

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| -1.34 | -0.3 | -0.03 | 0.26 | 1.39 | 0 | 0.48 | 150 | 0 |
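Base R's `predict` (not listed in the table above) turns a fitted model into predictions for new data; a short sketch:

```r
# Predict petal lengths for some new petal widths with the fitted model.
model <- lm(Petal.Length ~ Petal.Width, data = iris)
predict(model, newdata = data.frame(Petal.Width = c(0.5, 1.5, 2.5)))
```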
Correlation squared measures the proportion of variance in the data that is explained by the model.
Total variance decomposes into the variance of the predicted values and the variance of the residuals.
\[ r_{xy}^2 = \frac{\text{variance of }\hat y}{\text{variance of all }y_i} \]
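This identity can be checked numerically on the `iris` fit; a quick sketch:

```r
# Numerical check of the variance decomposition on the iris fit.
model <- lm(Petal.Length ~ Petal.Width, data = iris)
var(fitted(model)) / var(iris$Petal.Length)  # proportion of variance explained
cor(iris$Petal.Width, iris$Petal.Length)^2   # equals r^2, about 0.9271
```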
As an example, consider the `beerd` dataset of beer nutritional information:

| BEER | Brewery | Calories | Carbohydrates | Alcohol | Type |
|---|---|---|---|---|---|
| American Amber Lager | Straub Brewery | 136 | 10.5 | 4.1 | Domestic |
| American Lager | Straub Brewery | 132 | 10.5 | 4.1 | Domestic |
| American Light | Straub Brewery | 96 | 7.6 | 3.2 | Domestic |
| Anchor Steam | Anchor | 153 | 16.0 | 4.9 | Domestic |
| Anheuser Busch Natural Light | Anheuser Busch | 95 | 3.2 | 4.2 | Domestic |
| Anheuser Busch Natural Ice | Anheuser Busch | 157 | 8.9 | 5.9 | Domestic |
```r
lm(Calories ~ Alcohol, data=beerd) %>% summary %>% tidy %>% kable
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6.41895 | 5.758141 | 1.114761 | 0.2666564 |
| Alcohol | 28.34687 | 1.062580 | 26.677404 | 0.0000000 |
A linear regression can fail in different ways:
| cor | cor.lo | cor.hi |
|---|---|---|
| -0.7204232 | -0.3517975 | -0.4488909 |
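The table above presumably comes from a resampling or testing procedure not shown here; one standard way to get a correlation estimate with confidence bounds is `cor.test`, sketched below on the `iris` data (the values will differ from the table):

```r
# A correlation estimate with a confidence interval via cor.test().
ct <- cor.test(iris$Sepal.Width, iris$Petal.Length)
ct$estimate   # point estimate of the correlation
ct$conf.int   # 95% confidence interval (cor.lo, cor.hi)
```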
Residual plot: a scatter plot of the predicted values \(\hat y\) against the residuals.
Using \(\hat y\) instead of \(x_i\) makes the plot generalize to models with more than one predictor.
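A sketch of such a residual plot with `ggformula`, using the `iris` fit from earlier:

```r
# Residual plot: fitted values on the x-axis, residuals on the y-axis.
library(ggformula)
model <- lm(Petal.Length ~ Petal.Width, data = iris)
gf_point(residuals(model) ~ fitted(model)) %>%
  gf_labs(x = "fitted values", y = "residuals")
```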