R provides functions to compute the correlation and the coefficients that arise in linear regression. Here we explore how they are used.
First, note that R works with bivariate data in two different ways: either as two separate variables or using a formula.
Consider the built-in mtcars data set. We can make a scatter plot with the command plot(x, y) in either of these ways:
plot(mtcars$wt, mtcars$mpg)
or, by using with(dataset, ...), we can write:
with(mtcars, plot(wt, mpg))
When x and y are two numeric variables, the formula y ~ x is plotted as a scatter plot. For example, the same plot as above is produced with
plot(mpg ~ wt, mtcars)
Note that in the first case we specify the x variable first, and in the second case the x variable comes after the “tilde”, ~.
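To make the two styles concrete, the short sketch below inspects a formula as an ordinary R object and checks that with() hands back the same column as the $ notation. The names f and same_columns are introduced here only for illustration.

```r
# A formula is an ordinary R object; the response sits to the left of the tilde.
f <- mpg ~ wt
class(f)      # "formula"
all.vars(f)   # variable names in order of appearance: "mpg" "wt"

# with() evaluates an expression using the columns of a data frame,
# so both of these refer to the same numbers.
same_columns <- identical(with(mtcars, wt), mtcars$wt)
same_columns
```

Reading the formula as “mpg modeled by wt” is a handy way to remember that the y variable comes first.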
This matters because some functions accept one but not both of these styles:
scatter.smooth and cor use the two-separate-variables style;
lm uses the formula style.
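The difference can be seen directly. The sketch below (using the hypothetical names attempt, fit_vectors, and fit_formula) tries to hand cor a formula, which is expected to fail, and shows that lm still requires a formula even when the data live in separate vectors.

```r
# cor() wants two numeric vectors, so handing it a formula fails.
# tryCatch() captures the error instead of stopping the script.
attempt <- tryCatch(cor(mpg ~ wt, mtcars), error = function(e) e)
inherits(attempt, "error")

# lm() wants a formula, but the formula may still refer to separate vectors.
fit_vectors <- lm(mtcars$mpg ~ mtcars$wt)
fit_formula <- lm(mpg ~ wt, mtcars)

# Both fits give the same intercept and slope (only the labels differ).
all.equal(unname(coef(fit_vectors)), unname(coef(fit_formula)))
```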
The cor function computes the correlation. For the two variables above we have:
cor(mtcars$wt, mtcars$mpg)
## [1] -0.8676594
or
with(mtcars, cor(wt, mpg))
## [1] -0.8676594
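As a check on what cor computes, here is a minimal sketch that rebuilds the value by hand, assuming the usual sample definition of the correlation as the sum of products of z-scores divided by n - 1. The names zx, zy, and r_by_hand are illustrative.

```r
x <- mtcars$wt
y <- mtcars$mpg
n <- length(x)

# Standardize each variable: subtract the mean, divide by the standard deviation.
zx <- (x - mean(x)) / sd(x)
zy <- (y - mean(y)) / sd(y)

# The correlation is the sum of the products of z-scores divided by n - 1.
r_by_hand <- sum(zx * zy) / (n - 1)

r_by_hand
cor(x, y)
```

The two printed values should agree, and swapping the arguments, cor(y, x), gives the same number.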
The scatter.smooth function adds a “smoother” to the scatter plot, as illustrated in class, and is called similarly:
scatter.smooth(mtcars$wt, mtcars$mpg)
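If you want the points of the smooth curve rather than the picture, the companion function loess.smooth returns them. This is a small sketch with default settings; curve_pts is a name chosen here for illustration.

```r
# loess.smooth() returns the points of the smooth curve instead of drawing them.
# (Assumption: default settings, as in the scatter.smooth call above.)
curve_pts <- loess.smooth(mtcars$wt, mtcars$mpg)

# The result is a list with components x (grid of weights) and y (smoothed mpg).
str(curve_pts)

# The same kind of picture could be assembled by hand:
# plot(mtcars$wt, mtcars$mpg); lines(curve_pts)
```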
Linear regression is the task of fitting a linear model (of the form \(\hat{y} = \beta_0 + \beta_1x\)) to a bivariate data set. The lm function does the computation. It needs a formula:
res = lm(mpg ~ wt, mtcars)
res
##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Coefficients:
## (Intercept)           wt
##      37.285       -5.344
The res output shows the coefficients, but res can also be used to draw the “regression” line, and later we will query it for more details (but not today). We use abline to add a line to a scatter plot:
plot(mpg ~ wt, mtcars)
res = lm(mpg ~ wt, mtcars)
abline(res)
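The line that abline draws is the one with the intercept and slope stored in res. As a rough check, the sketch below pulls the coefficients out with coef and recomputes the predicted values by hand; b and by_hand are illustrative names.

```r
res <- lm(mpg ~ wt, mtcars)
b <- coef(res)   # named vector with entries "(Intercept)" and "wt"

# Predicted mpg for each car: intercept + slope * weight.
by_hand <- b["(Intercept)"] + b["wt"] * mtcars$wt

# fitted() returns the predictions stored in the model object;
# they should match the hand computation.
all.equal(unname(by_hand), unname(fitted(res)))
```

Calling abline(a = b["(Intercept)"], b = b["wt"]) on an existing scatter plot would draw the same line as abline(res).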
We will use a dataset downloaded from a website:
bdims <- read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/bdims.csv")
Documentation is available at https://www.openintro.org/stat/data/?data=bdims.
(Variable names ending in gi are girth; di are diameter; and de are depth.)
Make a scatter plot of wgt versus hgt with weight on the \(y\) axis.
Is there a trend?
Use scatter.smooth to add a trend line. Does it appear to be linear?
From looking at the graph, is the correlation close to \(0\), positive, or negative?
From looking at the graph, would you expect \(r^2\) to be close to \(0\)? Close to \(1\)? Somewhere in the middle?
Compute both \(r\) and \(r^2\) using cor.
The data set uses metric units. We can convert into inches and pounds through:
hgt.in = bdims$hgt/2.54
wgt.lbs = bdims$wgt*2.2
After this conversion, what will be \(r\) and \(r^2\)?
Find the standard deviation for both hgt and wgt.
Compute \(r\cdot s_y/s_x\), where \(s_x\) and \(s_y\) are the standard deviations of hgt and wgt.
Compare this last value with the output of lm(wgt ~ hgt, bdims).
Fill in the blank: For every additional centimeter of height we have an average increase of \(XXX\) kilograms in weight.
Make this graphic:
plot(wgt ~ factor(sex), bdims)
Based on the graph, does the sex variable use a code of 0 or 1 for females? (This data set only has a binary option for this variable.)
Conventional wisdom has people getting heavier as they age. Is this shown in this data set? Let’s consider a scatter plot for the values with sex == 1:
plot(wgt ~ age, bdims, subset=sex == 1)
Based on the trend in the graph, does the wisdom hold?
Repeat for the subset sex == 0. How about for this cohort?
The Gulliver’s Travels rule of thumb says that, roughly, once around the waist is twice around the neck, and once around the neck is twice around the wrist. In this data set we have wai.gi and wri.gi for the waist and wrist measurements.
Make a scatter plot of the two variables. Is there basically no correlation, a positive correlation, or a negative correlation?
Is there a linear trend?
Compute the regression coefficients. What is the estimated slope?
Based on Gulliver’s Travels, what is the slope implied by the observation?
If a shirt were made based on wrist size following Gulliver’s Travels, would we expect the waist to be too tight or too loose? (There are also variables for hip and navel measurements.)
A scatter plot of waist based on hip girth is given by:
plot(wai.gi ~ hip.gi, bdims)
Does it appear there are two “clusters”?
We can add linear regression lines for both genders through:
plot(wai.gi ~ hip.gi, bdims)
res.0 = lm(wai.gi ~ hip.gi, bdims, sex == 0)
res.1 = lm(wai.gi ~ hip.gi, bdims, sex == 1)
abline(res.0, col="blue")
abline(res.1, col="red")
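In the lm calls above, the condition sex == 0 is matched to lm's third argument, subset. Since the bdims file has to be downloaded, the sketch below illustrates the same idea on the built-in mtcars data, using its am column as a stand-in grouping variable (an assumption made only for this example).

```r
# Passing the condition as lm's third argument (subset) ...
fit_arg <- lm(mpg ~ wt, mtcars, am == 1)

# ... selects the same rows as filtering the data frame first.
fit_pre <- lm(mpg ~ wt, subset(mtcars, am == 1))

all.equal(coef(fit_arg), coef(fit_pre))
nobs(fit_arg)   # number of rows actually used in the fit
```

Writing the argument by name, as in lm(mpg ~ wt, mtcars, subset = am == 1), does the same thing and may be easier to read.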
Comment on why these are different.
The following commands find the correlation between waist girth and hip girth for all the data, just the sex==0 data, and just the sex==1 data:
all_data = with(bdims, cor(wai.gi, hip.gi))
just_0 = with(subset(bdims, sex==0), cor(wai.gi, hip.gi))
just_1 = with(subset(bdims, sex==1), cor(wai.gi, hip.gi))
These are:
c(all=all_data, just_0 = just_0, just_1 = just_1)
##       all    just_0    just_1
## 0.6923506 0.8120754 0.7996779
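The same per-group computation can be written more compactly with split and sapply. This is one possible sketch, shown on the built-in mtcars data with am standing in for the grouping variable (an assumption for illustration; by_group and overall are hypothetical names).

```r
# Split the rows by group, then compute the correlation within each piece.
by_group <- sapply(split(mtcars, mtcars$am), function(d) cor(d$wt, d$mpg))

# Correlation using all rows, for comparison.
overall <- cor(mtcars$wt, mtcars$mpg)

# One named vector: "all" plus one entry per group level ("0" and "1").
c(all = overall, by_group)
```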
For linear relationships, the “square of the correlation”, \(r^2\), is the fraction of variation in the values of \(y\) explained by the least-squares regression of \(y\) on \(x\). Do the values above suggest that the value of sex explains some of the variation, in addition to the value of hip.gi? Why?
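The “fraction of variation explained” reading of \(r^2\) can be checked numerically. The sketch below does so on the mtcars regression from earlier, assuming the usual decomposition into total and leftover (residual) sums of squares; the variable names are illustrative.

```r
res <- lm(mpg ~ wt, mtcars)
y <- mtcars$mpg

total_var    <- sum((y - mean(y))^2)    # variation of y about its mean
leftover_var <- sum(residuals(res)^2)   # variation left after the regression

# Fraction of the variation in y accounted for by the regression line.
explained_fraction <- 1 - leftover_var / total_var

# Square of the correlation between x and y.
r_squared <- cor(mtcars$wt, mtcars$mpg)^2

c(explained_fraction = explained_fraction, r_squared = r_squared)
```

The two entries should agree, which is what justifies reading \(r^2\) as a fraction between 0 and 1.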