Lab 2, Math 214

Getting Started

A few quick reminders:

  • The console lets you try commands and see the results right away, but nothing you do there is saved.

  • To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab2.R or something like that. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps
    
    # Exercise 2

  • To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.

  • If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.

  • To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.

  • You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, type a question mark followed by the name of the command, like ?mean.

  • You’ll probably need to look up some commands from the first lab. It’s available at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/lab1/index.html

Today’s lab

Let’s start by downloading the dataset we’ll use today. It contains body measurements taken from 507 active individuals. It also includes their age and gender. I’ve posted the dataset on the website at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/bdims.csv.

Exercise 1. Load in the dataset and give it the name bdims. (The command you’ll need is read.csv—see Lab 1 for an example.)

Exercise 2. Use the head command to print out the first few observations in the dataset.

It’s not so easy to figure out what the different variables are! There’s some documentation for the dataset available at https://www.openintro.org/stat/data/?data=bdims. Look at it and answer the following question.

Exercise 3. Your answers on this problem should be given as a comment only. What’s the name of the variable…

  • …giving bicep girth (i.e., the length around the biceps)?
  • …giving age?
  • …giving height?
  • …giving weight?
  • …giving sex? And how are male and female values represented?

Exercise 4. Find the 5-number summary of the bicep girth, age, height, and weight variables. (Hint: use the summary command. Also, remember that a variable called var.name in a data frame called data.frame is referred to as data.frame$var.name.)
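As a rough illustration of the $ notation and the summary command, here is a sketch that uses mtcars, a dataset that ships with R. Neither mtcars nor its mpg column is part of this lab; they are stand-ins chosen only so the example runs without downloading anything.

```r
# mtcars is built into R; it is used here only for illustration.
# mtcars$mpg refers to the variable mpg in the data frame mtcars.
mpg.summary <- summary(mtcars$mpg)

# Prints the minimum, first quartile, median, mean, third quartile, and maximum.
print(mpg.summary)
```

Note that summary reports the mean alongside the five numbers of the 5-number summary, so the output has six entries.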

Investigating normality

You learned in class that the best way to check whether a variable fits the normal distribution is to make a Q-Q plot, short for quantile-quantile plot. We will learn to do this, but not until the next lab. For now, we’ll make some more primitive comparisons to the normal distribution.

First, recall that for a distribution with mean \(\mu\) and standard deviation \(\sigma\), the z-score of the value \(x\) is given by the formula \[ z = \frac{x-\mu}{\sigma}. \] The z-score of a value is its number of standard deviations above or below the mean.
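The z-score formula translates directly into R arithmetic. The sketch below uses made-up numbers (a mean of 100, a standard deviation of 15, and a value of 130); they are assumptions for illustration and have nothing to do with the bdims data.

```r
# Hypothetical values chosen for illustration; they are not from bdims.
mu <- 100     # mean of the distribution
sigma <- 15   # standard deviation of the distribution
x <- 130      # value whose z-score we want

# z-score: how many standard deviations x lies above (or below) the mean
z <- (x - mu) / sigma
z
```

Here z comes out to 2, meaning that 130 is two standard deviations above the mean. A value below the mean would give a negative z-score.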

Exercise 5. Find the mean and standard deviation of the bicep girth variable. (Hint: the commands you’ll need are mean and sd.) Then find the z-scores for the following values: 27, 35, and 40 centimeters.

Exercise 6. Suppose that the bicep girth variable is normally distributed. What percentiles would the values 27, 35, and 40 correspond to? (Hint: the command pnorm(z) gives the proportion of the normal distribution that lies at or below the given z-score, as a number between 0 and 1. Multiply it by 100 to express it as a percentile.)
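To see how pnorm behaves, here is a small sketch with three z-scores picked only for illustration (they are not the z-scores from the exercises). It relies on pnorm's default settings, which describe the standard normal distribution.

```r
# z-scores chosen for illustration only
z.values <- c(-1, 0, 1)

# pnorm gives the proportion of the standard normal distribution
# that lies at or below each z-score (a number between 0 and 1)
proportions <- pnorm(z.values)

# Multiply by 100 to express the proportions as percentiles
percentiles <- 100 * proportions
round(percentiles, 1)
```

A z-score of 0 sits at the 50th percentile, and z-scores of -1 and 1 land at roughly the 16th and 84th percentiles.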

Now, let’s determine the percentile of 27, 35, and 40 centimeters for the bicep girth variable according to its true distribution. For example, to find the percentile of 27, we need to count the number of observations that are less than or equal to 27 and divide this by the total number of observations.

To count the number of observations that meet a certain condition, the basic command is sum(condition). For example, the following counts the observations in the 2017 air quality dataset from the last lab in which the ozone level was 70 or higher:

sum(air17$Ozone >= 70)

You might also find it helpful that the total number of observations in a dataset is found with the command nrow, which stands for “number of rows”.
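Putting sum and nrow together gives a percentile computed straight from the data. The sketch below again uses mtcars, a dataset built into R, with a cutoff of 20 miles per gallon; both the dataset and the cutoff are assumptions chosen for illustration and are not part of this lab.

```r
# mtcars ships with R and is used here only for illustration.
# The comparison gives TRUE or FALSE for each row; sum counts the TRUEs.
count.at.or.below <- sum(mtcars$mpg <= 20)

# Total number of observations (rows) in the data frame
total <- nrow(mtcars)

# Percentage of cars whose mpg is at or below 20
percentile <- 100 * count.at.or.below / total
percentile
```

The same pattern works for any variable and cutoff: count the observations at or below the value, then divide by the number of rows.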

Exercise 7. Determine the percentile of 27, 35, and 40 centimeters for the bicep girth variable according to its true distribution.

Exercise 8. How close were these percentiles to those predicted by the normal distribution? Give your judgment as to whether bicep girth is close to the normal distribution.

Credit

This lab is based very loosely on one called The normal distribution, available at https://www.openintro.org/download.php?file=os2_lab_03A&referrer=/stat/labs.php under the Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab is released under the same license.