MVJ
12 April, 2018
One task for statistics is to make predictions with quantifiable certainty.
To produce a prediction, we use covariates (also called predictors or independent variables) to inform the analysis.
A model takes in predictors and produces a prediction.
Most models are determined by parameters. Using data to find good values for parameters is called fitting the model.
One very simple model uses no predictors and has two parameters: \(\mu\) and \(\sigma\).
The model would predict \(y\sim\mathcal N(\mu,\sigma)\).
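As a minimal sketch (assuming a numeric vector `y` of observed values), fitting this model amounts to estimating the two parameters from the data:

```r
# Fit the no-predictor model: estimate mu and sigma from the data.
# `y` is assumed to be a numeric vector of observed values.
mu.hat    <- mean(y)   # sample mean estimates mu
sigma.hat <- sd(y)     # sample standard deviation estimates sigma

# The model then predicts any new observation as N(mu.hat, sigma.hat).
```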
One step up in complexity would be to use a straight line: using a predictor \(x\) and parameters \(a,b,\sigma\) a linear model would predict
\[ y \sim \mathcal N(ax+b, \sigma) \]
A different way to think about this model is
\[ y = ax + b + \epsilon, \qquad \epsilon \sim \mathcal N(0, \sigma) \]
where \(\epsilon\) is the random prediction error.
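To make this generative reading concrete, here is a small simulation sketch; the parameter values are made up purely for illustration:

```r
# Simulate data from the linear model y = a*x + b + noise,
# with illustrative (made-up) parameter values.
set.seed(42)
a <- 2; b <- 1; sigma <- 0.5
x <- runif(50, 0, 10)
y <- a * x + b + rnorm(50, mean = 0, sd = sigma)
```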
The model is trained on paired data: observations \((x_i, y_i)\) where each \(x_i\) is connected to its \(y_i\).
Correlation measures the extent to which one variable is large at the same time as another variable is large.
It does this by measuring the average product of z-scores across the paired observations:
\[ r_{xy} = \frac{1}{n-1}\sum z_{x_i}z_{y_i} = \frac{1}{n-1}\sum \left(\frac{x_i-\overline x}{s_x}\right) \left(\frac{y_i-\overline y}{s_y}\right) \]
\(r_{xy}\) is calculated with the `cor` function:

```r
cor(V1 ~ V2, data=dataset)
```
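As a quick sanity check, the z-score formula agrees with `cor`; a sketch on the built-in `iris` data, assuming mosaic is loaded for the formula interface:

```r
# Check the z-score formula against cor() on the iris data.
library(mosaic)  # provides the formula interface to cor()
x <- iris$Petal.Width
y <- iris$Petal.Length
sum(scale(x) * scale(y)) / (length(x) - 1)    # average product of z-scores
cor(Petal.Length ~ Petal.Width, data = iris)  # same value, about 0.963
```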
If \(r_{xy}\) is near 0, it only means that the data does not lie close to a straight line; the variables may still be strongly related through a more complex model.
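For instance, in this made-up example a perfect quadratic relationship has correlation essentially zero:

```r
# A strong but nonlinear relationship with correlation near zero.
x <- seq(-3, 3, length.out = 100)
y <- x^2        # y is completely determined by x
cor(x, y)       # essentially 0: no *linear* association
```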
In contrast, the `iris` petal measurements do lie close to a straight line:

```r
gf_point(Petal.Length ~ Petal.Width, data=iris) %>%
  gf_smooth(method="lm")
```
A straight line is a function described by
\[ y = b_0 + b_1\cdot x \]
Given paired values \((x_1,y_1), (x_2,y_2), \dots, (x_n, y_n)\), a least squares regression line is a straight line
\[ y = b_0 + b_1\cdot x \]
such that the sum of square prediction errors is minimized.
Using R:

```r
lm(y ~ x, data=dataset)
```

```r
lm(Petal.Length ~ Petal.Width, data=iris)
```

```
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Coefficients:
## (Intercept)  Petal.Width  
##       1.084        2.230
```
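As a sanity check, we can minimize the sum of squared prediction errors directly; a sketch using `optim` that should recover roughly the same coefficients:

```r
# Minimize the sum of squared prediction errors directly and compare
# with the coefficients lm() reports.
sse <- function(b) {
  with(iris, sum((Petal.Length - (b[1] + b[2] * Petal.Width))^2))
}
optim(c(0, 0), sse)$par   # close to (1.084, 2.230) from lm()
```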
```r
summary(lm(Petal.Length ~ Petal.Width, data=iris))
```

```
## 
## Call:
## lm(formula = Petal.Length ~ Petal.Width, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.33542 -0.30347 -0.02955  0.25776  1.39453 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08356    0.07297   14.85   <2e-16 ***
## Petal.Width  2.22994    0.05140   43.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4782 on 148 degrees of freedom
## Multiple R-squared:  0.9271, Adjusted R-squared:  0.9266 
## F-statistic:  1882 on 1 and 148 DF,  p-value: < 2.2e-16
```
| Function | Purpose |
|---|---|
| `lm` | Fit a model |
| `summary` | Print out a detailed summary with significance tests |
| `coef` | Return the coefficients of the model |
| `residuals` | Return the residuals from fitting |

```r
coef(lm(Petal.Length ~ Petal.Width, data=iris)) %>% kable
```
|             |          |
|-------------|----------|
| (Intercept) | 1.083558 |
| Petal.Width | 2.229941 |
```r
residuals(lm(Petal.Length ~ Petal.Width, data=iris)) %>% favstats %>% kable(digits=2)
```

| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| -1.34 | -0.3 | -0.03 | 0.26 | 1.39 | 0 | 0.48 | 150 | 0 |
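Base R's `predict` (not listed in the table above) turns a fitted model into predictions for new data; a short sketch:

```r
# Predict petal lengths for some new petal widths with the fitted model.
model <- lm(Petal.Length ~ Petal.Width, data = iris)
predict(model, newdata = data.frame(Petal.Width = c(0.5, 1.5, 2.5)))
```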
Correlation squared measures the proportion of variance in the data that is explained by the model.
Total variance decomposes into the variance of the predicted values and the variance of the residuals.
\[ r_{xy}^2 = \frac{\text{variance of }\hat y}{\text{variance of all }y_i} \]
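This identity can be checked numerically on the `iris` fit; a quick sketch:

```r
# Numerical check of the variance decomposition on the iris fit.
model <- lm(Petal.Length ~ Petal.Width, data = iris)
var(fitted(model)) / var(iris$Petal.Length)  # proportion of variance explained
cor(iris$Petal.Width, iris$Petal.Length)^2   # equals r^2, about 0.9271
```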
As an example, consider the `beerd` dataset of beer nutritional information:

| BEER | Brewery | Calories | Carbohydrates | Alcohol | Type |
|---|---|---|---|---|---|
| American Amber Lager | Straub Brewery | 136 | 10.5 | 4.1 | Domestic |
| American Lager | Straub Brewery | 132 | 10.5 | 4.1 | Domestic |
| American Light | Straub Brewery | 96 | 7.6 | 3.2 | Domestic |
| Anchor Steam | Anchor | 153 | 16.0 | 4.9 | Domestic |
| Anheuser Busch Natural Light | Anheuser Busch | 95 | 3.2 | 4.2 | Domestic |
| Anheuser Busch Natural Ice | Anheuser Busch | 157 | 8.9 | 5.9 | Domestic |
```r
lm(Calories ~ Alcohol, data=beerd) %>% summary %>% tidy %>% kable
```

| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 6.41895 | 5.758141 | 1.114761 | 0.2666564 |
| Alcohol | 28.34687 | 1.062580 | 26.677404 | 0.0000000 |
A linear regression can fail in different ways:
| cor | cor.lo | cor.hi |
|---|---|---|
| -0.7204232 | -0.3517975 | -0.4488909 |
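The table above presumably comes from a resampling or testing procedure not shown here; one standard way to get a correlation estimate with confidence bounds is `cor.test`, sketched below on the `iris` data (the values will differ from the table):

```r
# A correlation estimate with a confidence interval via cor.test().
ct <- cor.test(iris$Sepal.Width, iris$Petal.Length)
ct$estimate   # point estimate of the correlation
ct$conf.int   # 95% confidence interval (cor.lo, cor.hi)
```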
Residual plot: a scatter plot of the predicted values \(\hat y\) against the residuals.
Using \(\hat y\) instead of \(x_i\) makes the plot generalize to models with more than one predictor.
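A sketch of such a residual plot with `ggformula`, using the `iris` fit from earlier:

```r
# Residual plot: fitted values on the x-axis, residuals on the y-axis.
library(ggformula)
model <- lm(Petal.Length ~ Petal.Width, data = iris)
gf_point(residuals(model) ~ fitted(model)) %>%
  gf_labs(x = "fitted values", y = "residuals")
```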