Week 3

statistics-dot-com


Week 3 of this class is about the basic linear model of statistics. We
look at it in terms of simple linear regression, but at the right
level of abstraction it can serve as a framework for most of
the first two weeks as well. That is, the one-sample t-test is the
linear model with no slope, the two-sample paired t-test is the linear
model with slope 1, and the two-sample t-test is a special case of one-way
ANOVA, which is the linear model with a categorical predictor…

But that is too much too early. For now let's think about the linear
model as a means to understand the relationship between two numeric
variables, one a predictor and one a response. The basic idea, of
course, is that the response is somehow modeled by the predictor: if
you knew the model, a value of the response would be the model
evaluated at that value of the predictor plus some random error. The
simple linear model uses a linear equation for the model. This gives
us three things to estimate: the intercept of the line, the slope of
the line, and the standard deviation of the random error.


The basic plot for showing a relationship between two numeric
variables is the scatterplot. R has several ways to plot this. Suppose
x and y are variables in a data set our_data. Then these all
should work:
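For example (a sketch, with a toy our_data standing in for your own data frame):

```r
our_data <- data.frame(x = 1:10, y = 2 * (1:10) + rnorm(10))  # toy data

plot(our_data$x, our_data$y)   # dollar-sign access
with(our_data, plot(x, y))     # with() handles the variable lookup
plot(y ~ x, data = our_data)   # model-formula interface
```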

The last uses R's model-formula notation: the left side is the
response, the right side a description of the predictor. The
data= argument allows variable lookup within the data frame
our_data, thereby avoiding the need to type it twice or use with.

For the heartrate data set in the UsingR package we have two
variables age and maxrate. Make a scatter plot with maxrate as
the response variable.

Which seems true?



Add two lines to your graph: a horizontal line at the mean of maxrate and a vertical line at the mean of age:

with(heartrate, {
    abline(h = mean(maxrate))
    abline(v = mean(age))
})

Why is the intersection the “center” of the diagram?


The center is useful for identifying correlation in a scatterplot. If
the center is used as a new coordinate system how many of the 15 data
points are in the upper right or lower left quadrant?

Which of the following R commands will list the number of cases in a
data frame (15 for the heartrate data set)?
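As a hedged aside (using a toy data frame), note that nrow counts rows (cases) of a data frame, while length counts its columns:

```r
df <- data.frame(a = 1:15, b = 15:1)  # toy data frame with 15 cases

nrow(df)    # number of rows (cases): 15
dim(df)     # rows and columns: 15 2
length(df)  # number of columns, not cases: 2
```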



Correlation is described in the notes. Visually, the centered picture
you made above captures the idea: data is correlated if it tends to
align in two opposite quadrants of the four quadrants formed by
centering. That is a rough idea. Basically, in the heartrate data,
large values of age are associated with smaller-than-average values of
maxrate. This is an example of negative correlation.

Enough with graphics; numerically, we can find the correlation with
cor.

Find the correlation of age and max heartrate using cor. Use

with(heartrate, cor(age, maxrate))

Does with(heartrate, cor(maxrate, age)) give the same answer as the
previous one:


Does with(heartrate, cor(scale(maxrate), scale(age))) give the same
answer as the previous one:
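A quick check with toy values (hypothetical data) illustrates that cor is symmetric in its arguments and unchanged by standardizing:

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

cor(x, y) == cor(y, x)   # TRUE: correlation is symmetric
# scale() returns a matrix, so we extract the single entry:
all.equal(cor(x, y), cor(scale(x), scale(y))[1, 1])  # TRUE: standardizing changes nothing
```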



The linear model is fit in R using formula notation. For us here, a
template is simply:

response ~ predictor, data=data_set

More general models (multiple predictors, mathematical transformations
of predictors, categorical predictors, interactions) are also fit with
extensions of this notation.
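For instance, each of these (a sketch using the built-in mtcars data set) is a valid extension of the template:

```r
lm(mpg ~ wt + hp, data = mtcars)           # multiple predictors
lm(mpg ~ log(wt), data = mtcars)           # mathematical transformation
lm(mpg ~ factor(cyl), data = mtcars)       # categorical predictor
lm(mpg ~ wt * factor(cyl), data = mtcars)  # interaction
```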

Here is how we would fit the heartrate data:

res <- lm(maxrate ~ age, heartrate)
summary(res)
## 
## Call:
## lm(formula = maxrate ~ age, data = heartrate)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.926 -2.538  0.388  3.187  6.624 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  210.048      2.867    73.3  < 2e-16 ***
## age           -0.798      0.070   -11.4  3.8e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 4.58 on 13 degrees of freedom
## Multiple R-squared: 0.909,   Adjusted R-squared: 0.902 
## F-statistic:  130 on 1 and 13 DF,  p-value: 3.85e-08 
## 

The key values for interpreting the model are the parameter
estimates. Here we have estimates for the intercept (210.048), the
slope (-0.798), and the residual standard error (4.578).

For the built-in mtcars data set, fit mpg as a response and wt
as a predictor. What is the estimate for the slope? (The
interpretation: each additional 1000 pounds of weight costs this many
miles per gallon.)

For the Cars93 data set in the MASS package, the following does a
similar computation:

require(MASS)
lm(MPG.city ~ I(Weight/1000), data=Cars93)

What is the value of the slope now?

The construct I(Weight/1000) divides the Weight variable by 1000 so that the scale matches that of the mtcars data set. Within formulas, ordinary arithmetic must be wrapped "as is" with I(), since the formula notation co-opts the familiar symbols +, -, *, and /.
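To see why I() matters, compare these two fits on mtcars (a sketch): in a formula, + means "add a predictor", while inside I() it is ordinary addition:

```r
length(coef(lm(mpg ~ wt + hp, data = mtcars)))     # 3: intercept plus two predictors
length(coef(lm(mpg ~ I(wt + hp), data = mtcars)))  # 2: intercept plus one predictor, the sum
```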


The model object contains more than meets the eye. This is because R,
unlike SPSS, say, is very terse in its output; you need to ask for
more. Look at the basic output of the model and compare it to the
following when a summary is requested:

res <- lm(mpg ~ wt, mtcars)
res
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##       37.29        -5.34  
## 
summary(res)
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.543 -2.365 -0.125  1.410  6.873 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   37.285      1.878   19.86  < 2e-16 ***
## wt            -5.344      0.559   -9.56  1.3e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 3.05 on 30 degrees of freedom
## Multiple R-squared: 0.753,   Adjusted R-squared: 0.745 
## F-statistic: 91.4 on 1 and 30 DF,  p-value: 1.29e-10 
## 

The summary function in R is generic and is implemented for many
different classes. For linear models it provides a summary of the
coefficients and the model fit, whereas the lm object itself (res)
just shows the estimates.

The summary function is an accessor function (or extractor function). It is defined for most modeling functions you will meet, though perhaps defined differently for each. Other such functions include coef for the coefficients, residuals for the residuals, and plot for diagnostic plots.
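For example, with the mtcars fit:

```r
res <- lm(mpg ~ wt, data = mtcars)

coef(res)             # intercept and slope estimates
head(residuals(res))  # first few residuals
fitted(res)[1:3]      # fitted values for the first three cars
```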

Define res by:

res <- lm(mpg ~ wt, mtcars)

Find mean(resid(res)), the sample mean of the residuals.

Define res as above:

res <- lm(mpg ~ wt, mtcars)

Find sd(resid(res)), the sample standard deviation of the residuals. Compare it to the value from this line of summary(res):

Residual standard error: 3.046 on 30 degrees of freedom
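The two values differ slightly because sd() divides by n - 1, while the residual standard error divides by n - 2 (one degree of freedom is lost for each estimated coefficient). A sketch of the check:

```r
res <- lm(mpg ~ wt, data = mtcars)
n <- nrow(mtcars)

rse <- sqrt(sum(resid(res)^2) / (n - 2))  # residual standard error, about 3.046
sd(resid(res)) * sqrt((n - 1) / (n - 2))  # rescaled sample sd equals the rse
```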