Lab 12
Tobias Johnson
5/10/2018
Getting Started
A few quick reminders:
The console lets you try commands and see the results right away, but nothing you do there is saved.
To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as `lab12.R` or something like that. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps

    # Exercise 2
To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.
If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.
To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.
You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, enter a question mark followed by the name of the command, like `?mean`.
Goals
Today, we’ll get practice with linear regression.
Exercise 1
We’ll go back to the North Carolina births dataset from Lab 6 and Lab 7, available at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/nc.csv
Recall that each observation represents one birth, and that the variables are:
variable | definition |
---|---|
fage | age of father in years |
mage | age of mother in years |
mature | either "mature mom" or "younger mom", depending on the age of the mother |
weeks | length of pregnancy in weeks |
premie | either "full term" or "premie", depending on whether the baby was born prematurely |
visits | number of hospital visits by the mother during pregnancy |
marital | either "married" or "not married", depending on status of parents |
gained | weight gained by mother during pregnancy in pounds |
weight | birthweight of baby in pounds |
lowbirthweight | either "low" or "not low", depending on whether baby was categorized as having a low birth weight |
gender | gender of baby, either "male" or "female" |
habit | either "nonsmoker" or "smoker", depending on whether the mother smokes cigarettes |
whitemom | either "white" or "nonwhite", referring to race of the mother |
Task. Load the dataset into an object called `nc`.
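As a reminder of how loading a CSV file works, here is a minimal sketch. It assumes nothing about the real dataset: the temporary file and its two rows are made up purely for illustration, and `dat` is a hypothetical object name. `read.csv` accepts either a file path or a URL as its first argument.

```r
# Write a tiny, made-up CSV to a temporary file (a stand-in for a real file or URL)
path <- tempfile(fileext = ".csv")
writeLines(c("weeks,weight", "39,7.5", "41,8.0"), path)

# read.csv returns a data frame with one column per variable
dat <- read.csv(path)

# Inspect the structure and the first rows
str(dat)
head(dat)
```

After loading, `str` and `head` are a quick way to confirm that the variables came through with the names and types you expect.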
Exercise 2
One would expect that babies born prematurely tend to weigh less than babies born at full term. The question we investigate today: what sort of relationship does the length of pregnancy have with the weight of the baby?
Task. Make a scatterplot with the pregnancy length as the explanatory variable and the baby’s birthweight as the response variable.
You might find my notes on scatterplots and regression lines helpful.
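For reference, here is a rough sketch of a scatterplot with a regression line in base R. It uses a simulated data frame `sim` with hypothetical columns `x` (explanatory) and `y` (response), not the births data, so the variable names would need to be adapted.

```r
# Simulated stand-in data (all values are made up)
set.seed(1)
sim <- data.frame(x = runif(100, min = 0, max = 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(100)

# Formula interface: response ~ explanatory
plot(y ~ x, data = sim,
     xlab = "explanatory variable", ylab = "response variable")

# Fit the least-squares line and draw it on top of the scatterplot
fit <- lm(y ~ x, data = sim)
abline(fit)
```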
Exercise 3
Now, we’ll investigate the suitability of using a simple linear regression model for explaining birthweight in terms of pregnancy length. In this model, the relationship between pregnancy length \((x_i)\) and birthweight \((y_i)\) is given as \[y_i = \beta_0 + \beta_1 x_i + \varepsilon_i.\] In this equation, \(\beta_0\) and \(\beta_1\) are parameters of the model representing the y-intercept and slope, respectively, of the regression line. The \(\varepsilon_i\) variable represents the deviation away from the regression line, and it’s assumed to be normal with mean 0 and standard deviation \(\sigma\), where \(\sigma\) is another parameter of the model.
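One way to get a feel for this model is to generate data from it. The sketch below does that with made-up parameter values (`beta0`, `beta1`, and `sigma` are arbitrary choices for illustration, not estimates from the births data).

```r
# Simulate data from the simple linear regression model (all values are made up)
set.seed(1)
n     <- 200
beta0 <- 2     # hypothetical y-intercept
beta1 <- 0.5   # hypothetical slope
sigma <- 1     # hypothetical standard deviation of the deviations

x   <- runif(n, min = 0, max = 10)     # explanatory variable
eps <- rnorm(n, mean = 0, sd = sigma)  # normal deviations from the line
y   <- beta0 + beta1 * x + eps         # response generated by the model

sim <- data.frame(x = x, y = y)
head(sim)
```

Plotting `y` against `x` for data generated this way shows what a sample looks like when the model's assumptions hold exactly.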
There are three assumptions about the model that we make:
- linear trend: The data has some underlying linear trend to it. If the data in fact has some nonlinear trend, inference based on this model will be nonsense.
- normal residuals: The model assumes that the deviations from the regression line are independent of each other and normally distributed.
- constant variability: The model assumes that all of the residuals have the same standard deviation.
We need to check that these assumptions are reasonable, based on our sample.
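The usual tools for checking these assumptions are plots of the residuals. Here is a minimal sketch in base R, again using a simulated data frame `sim` with hypothetical columns `x` and `y` rather than the births data.

```r
# Simulated stand-in data (all values are made up)
set.seed(1)
sim <- data.frame(x = runif(100, min = 0, max = 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(100)

fit <- lm(y ~ x, data = sim)
res <- resid(fit)   # observed response minus fitted value

# Linear trend and constant variability: residuals against fitted values.
# Look for curvature (nonlinear trend) or a fan shape (changing spread).
plot(fitted(fit), res, xlab = "fitted values", ylab = "residuals")
abline(h = 0, lty = 2)

# Normal residuals: histogram and normal quantile plot
hist(res)
qqnorm(res)
qqline(res)
```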
Task. Give your judgment on whether the linear trend assumption is satisfied.
Exercise 4
Task. Check the normal residuals assumption and state if you think it’s satisfied.
Exercise 5
Task. Check the constant variability assumption and state if you think it’s satisfied.
Exercise 6
There’s one more condition to be checked for valid inference, which is that the data is a simple random sample from the population.
Task. Does this condition hold for our sample?
Exercise 7
Task. What are the point estimates for \(\beta_0\) and \(\beta_1\) based on the sample?
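As a sketch of where point estimates come from in R, the fitted coefficients of an `lm` object can be extracted with `coef`. The data frame `sim` and its columns `x` and `y` are simulated stand-ins, not the births data.

```r
# Simulated stand-in data (all values are made up)
set.seed(1)
sim <- data.frame(x = runif(100, min = 0, max = 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(100)

fit <- lm(y ~ x, data = sim)

# Named vector: "(Intercept)" estimates beta_0, "x" estimates beta_1
est <- coef(fit)
est

# summary() adds standard errors, t statistics, and p-values
summary(fit)$coefficients
```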
Exercise 8
Task. Give 95% confidence intervals for \(\beta_0\) and \(\beta_1\). Give a sentence in English explaining the confidence interval for \(\beta_1\).
Hint: Your sentence in English should look something like this: “With 95% confidence, in the linear model for the population, an extra week of pregnancy length increases the birthweight on average by something between __________ and ________ pounds.”
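One way to get such intervals in R is `confint`, applied to a fitted `lm` object. The sketch below uses a simulated data frame `sim` with hypothetical columns `x` and `y`, so its numbers say nothing about the births data.

```r
# Simulated stand-in data (all values are made up)
set.seed(1)
sim <- data.frame(x = runif(100, min = 0, max = 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(100)

fit <- lm(y ~ x, data = sim)

# One row per parameter, with lower and upper endpoints as columns
ci <- confint(fit, level = 0.95)
ci
```

The row labelled `x` holds the interval for the slope, and the row labelled `(Intercept)` holds the interval for the y-intercept.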
Exercise 9
Task. Give a plot showing the 95% confidence region for the linear model.
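There are several ways to draw such a plot. The sketch below is one base R approach, assuming a simulated data frame `sim` with hypothetical columns `x` and `y`: it asks `predict` for confidence limits on a grid of x values and draws them as dashed curves around the fitted line.

```r
# Simulated stand-in data (all values are made up)
set.seed(1)
sim <- data.frame(x = runif(100, min = 0, max = 10))
sim$y <- 2 + 0.5 * sim$x + rnorm(100)

fit <- lm(y ~ x, data = sim)

# Evenly spaced x values covering the observed range
grid <- data.frame(x = seq(min(sim$x), max(sim$x), length.out = 50))

# Matrix with columns "fit", "lwr", "upr" for the mean response at each x
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

plot(y ~ x, data = sim)
lines(grid$x, band[, "fit"])            # fitted regression line
lines(grid$x, band[, "lwr"], lty = 2)   # lower confidence limit
lines(grid$x, band[, "upr"], lty = 2)   # upper confidence limit
```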