Lab 10

Getting Started

A few quick reminders:

The console lets you try commands and see the results right away, but nothing you do there is saved.
To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab10.R or something like that. In this file, put your answers like this:
```
# Exercise 1
mean(dataset$left.bicep.thickness)
mean(dataset$right.bicep.thickness)
# On average, people's right biceps are thicker than their left biceps

# Exercise 2
```
To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.
If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.
To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.
You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, enter a question mark followed by the name of the command, like ?mean.

Goals

Today’s lab is about finding confidence intervals and doing tests of significance for proportions.

Exercise 1

Today’s dataset comes from the paper “Twenty five year follow-up for breast cancer incidence and mortality of the Canadian National Breast Screening Study” by A.B. Miller et al. I got the example from OpenIntro Statistics, Section 6.2.3.

The paper describes an experiment conducted over 30 years on about 90,000 women in Canada to study whether mammogram screening helps to reduce death from breast cancer. The women were divided into two groups. For the first five years of the study, the women in the first group received regular mammograms to try to detect breast cancers early. Women in the second group received regular non-mammogram screening for breast cancer for five years. In the next 25 years, no further intervention was made in the women’s lives, and the number of deaths from breast cancer was recorded for both groups.

The dataset contains the information on these women. Each observation represents one woman. There are two variables, cancer.death and group. The values of cancer.death are "yes" or "no", depending on whether the subject died of breast cancer over the 25 years of monitoring. The values of group are either "mam" or "ctrl", depending on whether the woman received mammograms or non-mammogram screening in the first five years.

Task. Load the dataset into an object called can.study. You can download it from http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/cancer.csv
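As a reminder of how loading a CSV file works, here is a minimal sketch. It uses a tiny made-up file (the names toy.file and toy.study are hypothetical, and the three rows are invented, not taken from the study) so that it runs without a network connection. read.csv also accepts a URL in place of a file path.

```r
# Hypothetical toy file standing in for a downloaded CSV (not the real study data)
toy.file <- tempfile(fileext = ".csv")
writeLines(c("cancer.death,group",
             "no,mam",
             "yes,ctrl",
             "no,ctrl"), toy.file)

# read.csv takes a local path or a URL and returns a data frame
toy.study <- read.csv(toy.file)

str(toy.study)   # variable names and types
head(toy.study)  # first few rows
```

After loading, it is worth running str or head to check that the variable names and values look the way the description says they should.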

Exercise 2

Task. Determine the total number of subjects in the "mam" group and in the "ctrl" group. (Recall: the sum command can be used to count the number of observations satisfying some condition.)
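Here is a small sketch of counting with sum, using a made-up five-row data frame called toy (hypothetical, not the study data). A comparison like toy$group == "mam" produces TRUE or FALSE for each row, and sum counts the TRUEs.

```r
# Hypothetical toy data, invented for illustration
toy <- data.frame(
  group        = c("mam", "mam", "ctrl", "ctrl", "ctrl"),
  cancer.death = c("no", "yes", "no", "no", "yes")
)

# Each comparison is TRUE/FALSE per row; sum() counts the TRUE values
n.mam  <- sum(toy$group == "mam")
n.ctrl <- sum(toy$group == "ctrl")

n.mam   # 2
n.ctrl  # 3

# table() gives the same counts in one step, which is a handy cross-check
table(toy$group)
```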

Exercise 3

Task. Determine the number within each group who died of cancer. (Hint: to count the number of observations satisfying one condition and satisfying some other condition, use the & operator. See Lab 4, for example.)
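The & operator can be sketched on the same kind of made-up data (the data frame toy below is hypothetical, not the study data). It combines two row-by-row conditions, giving TRUE only where both hold.

```r
# Hypothetical toy data, invented for illustration
toy <- data.frame(
  group        = c("mam", "mam", "ctrl", "ctrl", "ctrl"),
  cancer.death = c("no", "yes", "no", "no", "yes")
)

# TRUE only for rows that are in the "ctrl" group AND have cancer.death "yes"
both <- toy$group == "ctrl" & toy$cancer.death == "yes"
both

# Counting the rows that satisfy both conditions
deaths.ctrl <- sum(both)
deaths.mam  <- sum(toy$group == "mam" & toy$cancer.death == "yes")

deaths.ctrl  # 1
deaths.mam   # 1
```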

Exercise 4

The goal now is to do a two-sample proportion test to try to determine if the two populations of women (those who receive mammogram screening and those who don’t) have different proportions dying of breast cancer.

Task. Write a null and an alternative hypothesis for your test.

Exercise 5

Task. Determine if the conditions for validity of inference for a proportion test are satisfied. (Hint: you should check the conditions using the pooled proportion.)
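One way to organize the check is sketched below with made-up counts (x1, n1, x2, n2 are hypothetical, not the study's numbers). It assumes the success-failure condition as stated in OpenIntro Statistics: each group should have at least 10 expected successes and at least 10 expected failures, computed with the pooled proportion. If your class uses a different threshold, change the 10. Independence of the observations still has to be argued from how the study was run, not from a computation.

```r
# Hypothetical counts, invented for illustration
x1 <- 30; n1 <- 200   # successes and sample size, group 1
x2 <- 45; n2 <- 250   # successes and sample size, group 2

# Pooled proportion: all successes over all observations
p.pool <- (x1 + x2) / (n1 + n2)
p.pool

# Expected successes and failures in each group under the pooled proportion
expected <- c(n1 * p.pool, n1 * (1 - p.pool),
              n2 * p.pool, n2 * (1 - p.pool))
expected

# Success-failure condition (threshold of 10 is an assumption, see above)
all(expected >= 10)
```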

Exercise 6

Task. Carry out the test at a .05 significance level, and state the result.
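A sketch of a two-sample proportion test on made-up counts (hypothetical, not the study's numbers) is below. It assumes you are using R's built-in prop.test with the continuity correction turned off, which matches the usual by-hand z test with the pooled proportion. If your class computes the test a different way, follow that instead.

```r
# Hypothetical counts, invented for illustration
x <- c(30, 45)     # successes in group 1 and group 2
n <- c(200, 250)   # sample sizes of group 1 and group 2

# Two-sided test of H0: the two population proportions are equal.
# correct = FALSE turns off the continuity correction.
result <- prop.test(x = x, n = n, alternative = "two.sided", correct = FALSE)
result$p.value

# The same test by hand, using the pooled proportion in the standard error
p.hat  <- x / n
p.pool <- sum(x) / sum(n)
se     <- sqrt(p.pool * (1 - p.pool) * (1 / n[1] + 1 / n[2]))
z      <- (p.hat[1] - p.hat[2]) / se
p.manual <- 2 * pnorm(-abs(z))
p.manual

# Reject H0 at the .05 level only if the p-value is below .05
result$p.value < 0.05
```

The two p-values agree because the chi-squared statistic that prop.test reports is the square of the z statistic.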

Here’s some more information about the study from when it first came out: https://www.nytimes.com/2014/02/12/health/study-adds-new-doubts-about-value-of-mammograms.html. This study and others have led to a controversial reconsideration of mammograms: https://www.nytimes.com/2015/10/21/health/breast-cancer-screening-guidelines.html.

Exercise 7

For this exercise, suppose that you want to design a study to determine the proportion of women who will die of breast cancer over the next 25 years. (For example, in the Canadian study, this proportion in the mammogram sample was 1.11% and in the other sample was 1.12%.) Your goal is to determine what sample size is necessary to make the margin of error smaller than 0.1% for a 95% confidence interval.

When doing a problem like this, you have two options. You can do the computation assuming that $p=.5$, the worst-case scenario for the size of the margin of error, or you can use a prior estimate for $p$.
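Both options can be sketched with different numbers from the ones in this exercise: a hypothetical target margin of error of 3% and a hypothetical prior estimate of $p=.2$. The sketch assumes the usual margin of error formula for a proportion, $z^*\sqrt{p(1-p)/n}$, solved for $n$.

```r
# Critical value for a 95% confidence interval (about 1.96)
z <- qnorm(0.975)

# Hypothetical target margin of error: 3%
m <- 0.03

# Solving z * sqrt(p * (1 - p) / n) <= m for n gives n >= z^2 * p * (1 - p) / m^2.
# Round up, since the sample size must be a whole number.

# Option 1: worst case, p = .5
n.worst <- ceiling(z^2 * 0.5 * (1 - 0.5) / m^2)
n.worst   # 1068

# Option 2: hypothetical prior estimate, p = .2
p.prior <- 0.2
n.prior <- ceiling(z^2 * p.prior * (1 - p.prior) / m^2)
n.prior   # 683
```

The worst-case answer is always at least as large, because $p(1-p)$ is biggest when $p=.5$.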

Task. Determine the sample size to make the margin of error smaller than 0.1% for a 95% confidence interval. Do it first using the worst-case assumption $p=.5$ and then using a better estimate.