Lab 5

Getting Started

A few quick reminders:

  • The console lets you try commands and see the results right away, but nothing you do there is saved.

  • To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab5.R or a similar name. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps
    
    # Exercise 2

  • To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.

  • If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.

  • To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.

  • You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, enter a question mark followed by the name of the command, like ?mean.

Goals

Today, we’ll practice computations involving the sampling distribution of the sample mean. We’ll also get more practice counting and selecting subsets of datasets that satisfy some criterion, as we did in Lab 4.

If you don’t finish this lab, I highly recommend that you work on it outside of class! It’s good practice for the midterm.

Exercise 1

We will use a dataset consisting of some information about students at a large high school. (Note: this high school is fictional, and the data is actually taken from a survey of different high schools across the nation. But just pretend that it’s a single high school.) The dataset is available at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/hs.csv.

Task. Load the dataset into an object called hs.
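
As a reminder of how loading works, here is a minimal sketch using read.csv. So that it runs without a network connection, it first writes a tiny made-up file; the object name toy and its columns are invented for illustration and are not the contents of hs.csv. read.csv also accepts a URL in place of a file path.

```r
# Write a tiny, made-up CSV file so the example runs offline.
# These columns are hypothetical; they are not the columns of hs.csv.
toy.path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "a,10", "b,20", "c,30"), toy.path)

# read.csv returns a data frame; it accepts a file path or a URL.
toy <- read.csv(toy.path)

nrow(toy)   # number of observations
head(toy)   # first few rows
```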

Exercise 2

Task. List the variable names in this dataset. List the possible values that the categorical variables (i.e., the nonnumerical variables) have. (Hint: you might find the levels command helpful here.)
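
A short sketch of the commands involved, on a made-up data frame (the names toy, group, and score are hypothetical). One thing to watch for: levels only works on factors, and since R 4.0 text columns are read in as plain character vectors by default, so you may need to convert with factor or use unique instead.

```r
# Toy data frame with hypothetical columns (not the real hs variables).
toy <- data.frame(
  group = c("a", "b", "a", "c"),
  score = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

names(toy)                  # variable names

levels(toy$group)           # NULL, because group is a character vector
levels(factor(toy$group))   # convert to a factor first, then ask for levels
unique(toy$group)           # alternative that works on character vectors
```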

Exercise 3

Task. Find the population mean and standard deviation of the height and weight variables. (Note: This is not a typical situation! Usually, you’d have data only from a sample, not from the entire population!)
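
For reference, a small sketch on made-up numbers showing what mean and sd compute. Note that R's sd divides by n - 1 (the sample formula), while the population formula divides by n. For a dataset with many rows the two are nearly the same; the comparison below is only an illustration, and the name pop.sd is introduced here for that purpose.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values
n <- length(x)

mean(x)   # 5
sd(x)     # uses the n - 1 divisor

# Population version, dividing by n instead of n - 1.
pop.sd <- sqrt(mean((x - mean(x))^2))
pop.sd    # 2

# The two are related by a factor of sqrt((n - 1) / n).
sd(x) * sqrt((n - 1) / n)
```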

Exercise 4

The goal in this exercise is to find the mean and standard deviation of weight for the population of males only. To do this, use the subset command that we learned in last week’s lab. First, use this command to choose all observations of males. Store them in a new dataset called hs.males. The command to do this would look something like hs.males <- subset(hs, ...). Then, use the usual mean and sd commands to find the mean and standard deviation of the weight variable.

Task. Find the mean and standard deviation of weight for the population of males, as explained above.
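
Here is the same pattern sketched on a made-up data frame. The column names group and score and the value "a" are hypothetical; in hs you will need the actual variable name and value that you found in Exercise 2.

```r
# Toy data frame with hypothetical columns.
toy <- data.frame(
  group = c("a", "b", "a", "b", "a"),
  score = c(10, 20, 30, 40, 50),
  stringsAsFactors = FALSE
)

# Keep only the rows where group equals "a" (note the double equals sign).
toy.a <- subset(toy, group == "a")

nrow(toy.a)         # 3
mean(toy.a$score)   # 30
sd(toy.a$score)     # 20
```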

Exercise 5

Suppose you take a simple random sample of size 200 from the entire dataset. Let \(\bar{x}\) be the mean weight for this sample.

Task. What are the mean and the standard deviation of the sampling distribution of \(\bar{x}\)? (Recall: the sampling distribution of \(\bar{x}\) is the distribution of the values of \(\bar{x}\) that you would get by drawing a new sample of the same size and computing \(\bar{x}\) over and over again.)
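
A sketch of the calculation with made-up numbers, assuming the standard result that the sampling distribution of \(\bar{x}\) has mean equal to the population mean and standard deviation equal to the population standard deviation divided by \(\sqrt{n}\). The values of mu, sigma, and n below are hypothetical, not taken from hs.

```r
mu    <- 100   # hypothetical population mean
sigma <- 15    # hypothetical population standard deviation
n     <- 25    # hypothetical sample size

mean.xbar <- mu                # mean of the sampling distribution
sd.xbar   <- sigma / sqrt(n)   # standard deviation of the sampling distribution

mean.xbar   # 100
sd.xbar     # 3
```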

Exercise 6

In this exercise, assume that the sampling distribution of \(\bar{x}\) is very close to normal (which is true—later in the class, we’ll learn some rules for judging when the sample size is large enough for this to be the case).

Task. What is the probability that \(\bar{x}\) is larger than 155 pounds for your sample? (Remember the pnorm command!)
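
As a reminder of how pnorm handles "larger than" questions, here is a sketch with made-up numbers (mean.xbar, sd.xbar, and cutoff are hypothetical values, not the answers for hs). By default pnorm gives the probability of being below a value, so for the upper tail either subtract from 1 or use lower.tail = FALSE.

```r
mean.xbar <- 100   # hypothetical mean of the sampling distribution
sd.xbar   <- 3     # hypothetical standard deviation of the sampling distribution
cutoff    <- 106   # hypothetical cutoff, two standard deviations above the mean

# Probability of landing above the cutoff, computed two equivalent ways.
p.above     <- pnorm(cutoff, mean = mean.xbar, sd = sd.xbar, lower.tail = FALSE)
p.above.alt <- 1 - pnorm(cutoff, mean = mean.xbar, sd = sd.xbar)

p.above   # about 0.0228
```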

Exercise 7

In Exercise 8, you’ll choose a random sample from your dataset. Since you don’t want your answers to change every time you recompile your file, you need to tell R to make the same random choices each time by setting a random number seed.

Task. Run the command set.seed(1234), replacing 1234 with some other number of your choice.
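
To see what the seed does, here is a small sketch: resetting the same seed before a random command makes R repeat the same random choices. The seed 42 and the names first and second are arbitrary choices for this illustration.

```r
set.seed(42)
first <- sample.int(10, size = 3)    # 3 random numbers from 1 to 10

set.seed(42)                         # reset to the same seed
second <- sample.int(10, size = 3)   # the same 3 numbers again

identical(first, second)   # TRUE
```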

Exercise 8

To choose a random sample from a dataset called dat, use the following command:

sample <- subset(dat, 1:nrow(dat) %in% sample.int(nrow(dat), size=100))

This would choose a sample of size 100 and put it into a new dataset called sample.
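
If the command above looks mysterious, this sketch breaks it into steps on a made-up 10-row data frame. The names chosen, keep, and toy.sample are introduced here for illustration only.

```r
# Made-up data frame with 10 rows.
dat <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

set.seed(1)

# Step 1: pick 4 distinct row numbers at random from 1..nrow(dat).
chosen <- sample.int(nrow(dat), size = 4)

# Step 2: build a TRUE/FALSE vector marking which rows were picked.
keep <- 1:nrow(dat) %in% chosen

# Step 3: keep only the marked rows.
toy.sample <- subset(dat, keep)

nrow(toy.sample)   # 4
```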

Task. Create a new dataset called sample chosen as a random sample from hs of size 200. What is the mean weight in this sample?

Exercise 9

Task. How likely was it for your sample mean to deviate from the true population mean by at least as much as it did?
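
One way to approach this, sketched with made-up numbers: measure how far the sample mean is from the population mean in units of the standard deviation of \(\bar{x}\), then find the probability of being at least that far away in either direction. Counting both directions is an interpretation of the question, and the values of mean.xbar, sd.xbar, and observed below are hypothetical.

```r
mean.xbar <- 100     # hypothetical mean of the sampling distribution
sd.xbar   <- 3       # hypothetical standard deviation of the sampling distribution
observed  <- 104.5   # hypothetical sample mean

# Distance from the mean, measured in standard deviations of x-bar.
z <- (observed - mean.xbar) / sd.xbar   # 1.5

# Probability of being at least this far from the mean, above or below,
# assuming the sampling distribution is approximately normal.
p.two.sided <- 2 * pnorm(-abs(z))

p.two.sided   # about 0.134
```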