Lab 6
Tobias Johnson
3/15/2018
Getting Started
A few quick reminders:
The console lets you try commands and see the results right away, but nothing you do there is saved.
To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab6.R or something like that. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps

    # Exercise 2
To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.
If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.
To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.
You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, enter a question mark followed by the name of the command, like ?mean.
Goals
Today, we’ll get practice making confidence intervals.
Exercise 1
Today’s lab uses a dataset of births randomly sampled from those occurring in the state of North Carolina in 2004. It’s posted on the website at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/nc.csv
Each observation represents one birth. Here is a list of the variables and what they mean:
variable | definition |
---|---|
fage | age of father in years |
mage | age of mother in years |
mature | either "mature mom" or "younger mom", depending on the age of the mother |
weeks | length of pregnancy in weeks |
premie | either "full term" or "premie", depending on whether the baby was born prematurely |
visits | number of hospital visits by the mother during pregnancy |
marital | either "married" or "not married", depending on status of parents |
gained | weight gained by mother during pregnancy in pounds |
weight | birthweight of baby in pounds |
lowbirthweight | either "low" or "not low", depending on whether baby was categorized as having a low birth weight |
gender | gender of baby, either "male" or "female" |
habit | either "nonsmoker" or "smoker", depending on whether the mother smokes cigarettes |
whitemom | either "white" or "nonwhite", referring to race of the mother |
Task. Load the dataset into an object called nc.
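As a rough sketch of how loading a CSV file works, the example below writes a tiny made-up file and reads it back with read.csv. The file contents, the temporary path, and the object name demo are all hypothetical stand-ins, not the real nc data.

```r
# Hypothetical miniature file standing in for nc.csv (values are made up)
tmp <- tempfile(fileext = ".csv")
writeLines(c("weeks,weight", "39,7.63", "NA,6.88", "41,8.00"), tmp)

# read.csv returns a data frame; the text NA is read as a missing value
demo <- read.csv(tmp)
dim(demo)    # number of rows, then number of columns
head(demo)   # first few observations
```

read.csv also accepts a web address in place of a file path, so the same call should work with the URL given above.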
Exercise 2
Task. Find the mean and standard deviation of pregnancy length. Say if these are population parameters or sample statistics.
In this problem, you’ll face the difficulty that some of the data in this set is missing, indicated by a value of NA (not available). When you try to find a mean and some data is missing, the mean command will return NA as its answer. To tell it to ignore missing data, do this:
mean(..., na.rm=TRUE)
Here … should be replaced by the variable you’re trying to find the mean of. The sd command accepts the same na.rm=TRUE argument.
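To see the effect of na.rm on a small scale, here is a sketch using a made-up vector. The name x and its values are hypothetical and have nothing to do with the nc dataset.

```r
# Hypothetical toy vector with one missing value
x <- c(38, 40, NA, 39)

mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # 39, the mean of 38, 40 and 39
sd(x, na.rm = TRUE)     # 1
```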
Exercise 3
Task. How many observations is the mean in Exercise 2 based on?
To answer this question, you have to count the number of observations that include data for the weeks variable. To do this, you can as usual use the counting command
sum(...)
where … is replaced by a condition. Here, your condition should be !is.na(dat$variable), with dat$variable replaced by the name of your dataset and variable. The command is.na(...) tests if the variable is missing a value, and the ! symbol means not.
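The counting idea can be tried on a small made-up vector first. In this sketch, x is a hypothetical stand-in for a column of your dataset.

```r
# Hypothetical toy vector with one missing value
x <- c(38, 40, NA, 39)

is.na(x)         # FALSE FALSE  TRUE FALSE
!is.na(x)        # TRUE  TRUE FALSE  TRUE
sum(!is.na(x))   # 3, since each TRUE counts as 1
```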
Exercise 4
In this problem and all following ones, make the following assumptions:
- The population standard deviation is equal to the sample standard deviation that you found in Exercise 2.
- The sampling distribution for the sample mean is approximately normal.
It is not good practice to make these assumptions without justification. We’ll learn to handle this properly when we cover Chapter 7.
Task. Compute the standard deviation of the sampling distribution of the sample mean for the weeks variable. Store it in a new object called sampling.sd, and print out the number. (Note: “store it in a new object” means to do sampling.sd <- ..., where … is a command to do the computation. To print out the number after this, just put sampling.sd on a line by itself.)
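The standard deviation of the sampling distribution of the sample mean is the population standard deviation divided by the square root of the sample size. A minimal sketch on made-up data follows; the vector x and the object name toy.sampling.sd are hypothetical, and the sample standard deviation is used in place of the population one, as the assumptions above allow.

```r
# Hypothetical toy data, not the nc dataset
x <- c(38, 40, NA, 39, 41, 37)

n <- sum(!is.na(x))        # number of non-missing observations (5)
s <- sd(x, na.rm = TRUE)   # sample sd, standing in for the population sd

toy.sampling.sd <- s / sqrt(n)   # store the result in a new object
toy.sampling.sd                  # print it by putting it on its own line
```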
Exercise 5
Recall: 95% of a normal distribution lies within 1.96 standard deviations from the mean.
Task. Compute a 95% confidence interval for the true mean pregnancy length for all births in North Carolina in 2004, based on the mean in your sample.
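A 95% confidence interval of this kind is the sample mean plus or minus 1.96 standard deviations of the sampling distribution. The sketch below shows only the arithmetic, on made-up data; x, xbar, se and toy.ci are hypothetical names, and a sample this small would not really justify the normal approximation.

```r
# Hypothetical toy data, not the nc dataset
x <- c(38, 40, NA, 39, 41, 37)

xbar <- mean(x, na.rm = TRUE)                      # sample mean
se <- sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))   # sd of the sampling distribution

# Lower and upper endpoints of the interval
toy.ci <- c(lower = xbar - 1.96 * se, upper = xbar + 1.96 * se)
toy.ci
```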
Exercise 6
Task. Which of the following is the correct interpretation of your confidence interval?
- There is approximately a 95% probability that the true mean pregnancy length for 2004 North Carolina births is in the interval.
- There is approximately a 95% probability that the mean pregnancy length for births in the sample is in the interval.
- The procedure of taking a sample and computing a confidence interval as you did will produce a confidence interval capturing the true mean pregnancy length about 95% of the time.
- Approximately 95% of births have pregnancy length falling in the interval.
Exercise 7
Task. Without doing any calculations, answer this question: Will a 98% confidence interval be wider or narrower than the 95% confidence interval? Why?
Exercise 8
Task. Fill in the blank in this sentence: 98% of the normal distribution falls within _____________ standard deviations of the mean.
Hint. In the picture below, the blue region has area .98 and the two red regions have area .01 each. Remember the qnorm command!
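As a sketch of how qnorm is used for this kind of question, here is how the 1.96 from Exercise 5 can be recovered for the 95% case. The object name z95 is hypothetical. For a central 95% region, each tail has area .025, so the upper cutoff is the .975 quantile of the standard normal distribution.

```r
# Central 95%: area .025 in each tail, so the cutoff is the .975 quantile
z95 <- qnorm(0.975)
z95                          # about 1.96

# Check: the area between -z95 and z95 is .95
pnorm(z95) - pnorm(-z95)
```

The same reasoning applies to other confidence levels once you work out the tail area.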
Exercise 9
Task. Compute a 98% confidence interval for the true mean pregnancy length for all births in North Carolina in 2004, based on the mean in your sample.