Correlation and linear regression

R provides functions to compute the correlation and the coefficients that arise in linear regression. Here we explore how they are used.

First, note that R works with bivariate data in two different ways: either as two separate variables or using a formula.

Two separate variables

Consider the built-in mtcars data set. We can make a scatter plot with the command plot(x, y) in either of these ways:

plot(mtcars$wt, mtcars$mpg)

or, by using with(dataset, ...) to avoid repeating the data set name:

with(mtcars, plot(wt, mpg))

Using a formula

When x and y are both numeric variables, plot draws the formula y ~ x as a scatter plot. For example, the same plot as above is produced with:

plot(mpg ~ wt, mtcars)

Note that in the first style the x variable comes first, while in the formula style the x variable comes after the “tilde”, ~.

Why?

Because some functions use one but not both of these styles:

  • scatter.smooth and cor use the two-separate variables style

  • lm uses the formula style.
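As a small illustration of why the style matters (a sketch, not one of the lab questions): cor expects numeric vectors, so handing it a formula fails. The name attempt is used only for this example, and tryCatch captures the error so that the script keeps running.

```r
# cor() wants two numeric vectors, so a formula is rejected.
attempt <- tryCatch(
  cor(mpg ~ wt, mtcars),
  error = function(e) e   # return the error object instead of stopping
)
inherits(attempt, "error")   # TRUE: cor() does not accept the formula style

# The two-separate-variables style works.
r <- with(mtcars, cor(wt, mpg))
r
```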

Correlation

The cor function computes the correlation. For the two variables above we have:

cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594

or

with(mtcars, cor(wt, mpg))
## [1] -0.8676594
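Unlike plot, cor does not care which variable is listed first: the correlation of x with y equals the correlation of y with x. A quick check (the names r_xy and r_yx are chosen only for this sketch):

```r
r_xy <- with(mtcars, cor(wt, mpg))
r_yx <- with(mtcars, cor(mpg, wt))

# Swapping the arguments gives the same value.
all.equal(r_xy, r_yx)
```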

The scatter.smooth function adds a “smoother” to the scatter plot, as illustrated in class, and is called similarly:

scatter.smooth(mtcars$wt, mtcars$mpg)
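The smoother drawn by scatter.smooth is a local regression (loess) curve. As a hedged sketch, the span argument controls how much of the data each local fit uses; the value 0.5 and the name curve_pts are arbitrary choices for illustration.

```r
# A smaller span follows the points more closely; a larger span is smoother.
scatter.smooth(mtcars$wt, mtcars$mpg, span = 0.5)

# loess.smooth() computes the points on the curve without plotting them.
curve_pts <- loess.smooth(mtcars$wt, mtcars$mpg, span = 0.5)
str(curve_pts)   # a list with components x and y
```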

Linear regression

Linear regression is the task of fitting a linear model of the form \(\hat{y} = \beta_0 + \beta_1 x\) to a bivariate data set. The lm function does the computation. It needs a formula:

res = lm(mpg ~ wt, mtcars)
res
## 
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344

Printing res shows the coefficients. The same object can also be used to draw the “regression” line, and later we will query it for more details (but not today). We use abline to add the line to a scatter plot:

plot(mpg ~ wt, mtcars)
res = lm(mpg ~ wt, mtcars)
abline(res)
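The two coefficients define the fitted line \(\hat{y} = \beta_0 + \beta_1 x\). As a small sketch of what that means, coef extracts them from the fit, and a prediction can be computed by hand and compared with predict. The value wt = 3 and the names b, by_hand, and by_predict are chosen only for this example; the result is about 21.25 miles per gallon.

```r
res <- lm(mpg ~ wt, mtcars)
b <- coef(res)   # named vector with entries "(Intercept)" and "wt"

# Predicted mpg for a car with wt = 3 (wt is in units of 1000 lbs in mtcars).
by_hand <- b["(Intercept)"] + b["wt"] * 3
by_predict <- predict(res, newdata = data.frame(wt = 3))

c(by_hand = unname(by_hand), by_predict = unname(by_predict))
```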

Questions

We will use a dataset downloaded from a website:

bdims <- read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/bdims.csv")

Documentation is available at https://www.openintro.org/stat/data/?data=bdims.

(Variable names ending in gi are girth; di are diameter; and de are depth.)
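After reading a file, it helps to look at what was imported before plotting. The sketch below uses a tiny made-up data frame called toy (not the real bdims data) so that it runs without a network connection; the same three commands can be applied to bdims.

```r
# Toy data with made-up values, only to show the inspection commands.
toy <- read.csv(text = "x,y\n1,2.5\n2,3.5\n3,5.0")

names(toy)   # column names
str(toy)     # column types and the first few values
head(toy)    # first rows of the data frame
```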

Question

Make a scatter plot of wgt versus hgt with weight on the \(y\) axis.

Is there a trend?

Use scatter.smooth to add a trend line. Does it appear to be linear?

From looking at the graph, is the correlation close to \(0\), positive, or negative?

From looking at the graph, would you expect \(r^2\) to be close to \(0\), close to \(1\), or somewhere in the middle?

Compute both \(r\) and \(r^2\) using cor.

The data set uses metric units. We can convert into inches and pounds through:

hgt.in = bdims$hgt/2.54
wgt.lbs = bdims$wgt*2.2

After this conversion, what will be \(r\) and \(r^2\)?

Find the standard deviation for both hgt and wgt.

Compute \(r\cdot s_y/s_x\), where \(s_x\) and \(s_y\) are the standard deviations of hgt and wgt.

Compare this last value with the output of lm(wgt ~ hgt, bdims).

Fill in the blank: For every additional centimeter of height we have an average increase of \(XXX\) kilograms in weight.

Question

Make this graphic:

plot(wgt ~ factor(sex), bdims)

Based on the graph, does the sex variable use a code of 0 or 1 for females? (This data set records only a binary option for this variable.)

Question

Conventional wisdom has people getting heavier as they age. Is this shown in this data set? Let’s consider a scatterplot for the values with sex == 1:

plot(wgt ~ age, bdims, subset=sex == 1)

Based on the trend in the graph, does the wisdom hold?

Repeat for the subset sex == 0. How about for this cohort?
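The subset argument used above takes a logical condition that is evaluated inside the data frame and keeps only the rows where it is TRUE. A sketch of the same idea with mtcars, where am is coded 0 for automatic and 1 for manual transmissions (the name manual is chosen only for this example):

```r
# Scatter plot of only the manual-transmission cars (am == 1).
plot(mpg ~ wt, mtcars, subset = am == 1)

# The same rows can be pulled out explicitly with subset().
manual <- subset(mtcars, am == 1)
nrow(manual)   # number of rows kept
```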

Question

The Gulliver’s Travels rubric says roughly that once around the waist is twice around the neck, and once around the neck is twice around the wrist. In this data set we have wai.gi and wri.gi for the waist and wrist measurements.

Make a scatter plot of the two variables. Is there basically no correlation, a positive correlation, or a negative correlation?

Is there a linear trend?

Compute the regression coefficients. What is the estimated slope?

Based on Gulliver’s Travels, what is the slope implied by the observation?

If a shirt were made based on wrist size following Gulliver’s Travels, would we expect the waist to be too tight or too loose? (There are also variables for hip and navel measurements.)

Question

A scatter plot of waist based on hip girth is given by:

plot(wai.gi ~ hip.gi, bdims)

Does it appear there are two “clusters”?

We can add linear regression lines for both sexes through:

plot(wai.gi ~ hip.gi, bdims)
res.0 = lm(wai.gi ~ hip.gi, bdims, subset = sex == 0)
res.1 = lm(wai.gi ~ hip.gi, bdims, subset = sex == 1)
abline(res.0, col="blue")
abline(res.1, col="red")

Comment on why these are different.

The following commands find the correlation between waist girth and hip girth for all the data, just the sex==0 data, and just the sex==1 data:

all_data = with(bdims, cor(wai.gi, hip.gi))
just_0 = with(subset(bdims, sex==0), cor(wai.gi, hip.gi))
just_1 = with(subset(bdims, sex==1), cor(wai.gi, hip.gi))

These are:

c(all=all_data, just_0 = just_0, just_1 = just_1)
##       all    just_0    just_1 
## 0.6923506 0.8120754 0.7996779

For linear relationships, the “square of the correlation”, \(r^2\), is the fraction of variation in the values of \(y\) explained by the least-squares regression of \(y\) on \(x\). Do the values above suggest that the value of sex explains some of the variation in addition to the value of hip.gi? Why?
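The reading of \(r^2\) as a fraction of explained variation can be checked numerically. The sketch below uses mtcars rather than bdims; the names sst and sse are chosen here for the total and residual sums of squares.

```r
res <- lm(mpg ~ wt, mtcars)

sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation in y
sse <- sum(residuals(res)^2)                    # variation left over after the fit
explained <- 1 - sse / sst                      # fraction explained by the line

r <- with(mtcars, cor(wt, mpg))
all.equal(explained, r^2)   # the two quantities agree
```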