Lab 4

Getting Started

A few quick reminders:

  • The console lets you try commands and see the results right away, but nothing you do there is saved.

  • To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab2.R or something like that. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps
    
    # Exercise 2
  • To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.

  • If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.

  • To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.

  • You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, enter in a question mark followed by the name of the command, like ?mean.

  • If you need to look things up from the first two labs, they’re available at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/lab1/index.html, http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/lab2/index.html, and http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/lab3/index.html.

Goals

We will learn how to:

  • count the number of observations satisfying more complex criteria than we’ve done in the past;
  • look for data fitting some criterion
  • make box plots
  • make bar plots

The Dataset

Today’s lab uses a dataset giving the level of ozone in the air an the high and low temperatures on different days from January to September in 2017 in New York, Los Angeles, and Boston. On some days, no information was available, so not every day is included in the dataset for each city. I put together this dataset from data published by the EPA and by NOAA, two federal government agencies. The detaset is available at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/airweather.csv.

Exercise 1. Download this dataset using the read.csv command and put it into an object called climate.

Each observation in the dataset represents a day in a city. Here’s a guide to the variables in this dataset:

  • ozone: the level of ozone in the atmosphere, measured in parts per billion; a level higher than 70 is known to be hazardous to health
  • place: either "boston", "la", or "ny"
  • high.temp: the high temperature recorded for the observation
  • low.temp: the low temperature recorded for the observation
  • month: the month of the observation, a number from 1 to 9 (since the dataset only has observations from January to September)
  • day: the day of the month of the observation, a number from 1 to 31
  • day.of.week: the day of the week (Mon–Sun) of the observation

Matching and Counting

In previous labs, we’ve counted the number of observations fitting a given criterion with the command sum(...), where ... is a condition. For example, to count the number of individuals with bicep girth less than or equal to 27 centimeters:

sum(bdims$bic.gi <= 27)

There are several operators besides <= (less than or equal to) that we can use. Here’s a list:

  • less than : <
  • greater than: >
  • greater than or equal to: >=
  • equal to: == (warning: = doesn’t work)
  • not equal to: !=

When you test for equality or inequality with something that’s not a number, you must put that thing in quotes. For example, to count up the number of observations in the climate dataset for Boston, you’d use sum(climate$place == "boston").

We can also combine these operators using |, which means or, and &, which means and. For example, to count the number of people with bicep girth less than or equal to 27 and height greater than 175 in the bdims dataset:

sum(bdims$bic.gi <= 27 & bdims$hgt > 175)

Exercise 2 What proportion of days in New York were out of compliance with air standards for ozone (i.e., had more than 70 parts per billion ozone in the air)? Note that you’ll have to count the number of such days in New York, and then divide by the total number of days with data for New York. What’s the answer for Boston and Los Angeles?

We know how to count the number of observations that fit some criteria. But what if we want to look at the observations that match the criteria? For example, suppose we want to look at all observations in bdims of people with bicep girth less than or equal to 27 and height greater than 175. One way to do this is with the subset command. The basic command is subset(dataset, condition), where you substitute that name of your dataset for dataset and the condition you want to test for for condition. For example, here’s the command and the output for looking at all observations in bdims as above.

subset(bdims, bic.gi <= 27 & hgt > 175)
##     bia.di bii.di bit.di che.de che.di elb.di wri.di kne.di ank.di sho.gi
## 304   33.8   29.7   32.4   17.3   25.4   13.4    9.6   18.4   13.4   94.5
## 346   39.3   24.5   32.3   21.4   29.5   13.4   10.9   18.2   13.0  110.2
## 354   39.5   29.8   31.5   19.2   25.4   13.0   10.6   19.1   13.4  103.0
## 423   36.7   29.0   32.0   19.7   25.6   12.4   10.0   18.6   13.0  100.5
## 471   37.4   29.0   32.9   17.7   23.5   12.7   10.0   18.2   13.8   98.2
## 472   37.6   28.3   29.4   16.8   25.4   13.2   10.0   17.8   14.2   99.7
## 483   35.8   31.5   32.6   17.5   25.2   12.9    9.8   17.9   13.4   99.3
## 497   38.7   29.7   32.0   15.8   26.4   12.8   10.5   18.4   13.8  103.8
##     che.gi wai.gi nav.gi hip.gi thi.gi bic.gi for.gi kne.gi cal.gi ank.gi
## 304   86.5   68.3   89.2   94.0   57.0   26.0   22.6   35.9   32.5   19.7
## 346   89.9   66.0   79.9   97.1   59.0   27.0   24.2   37.2   39.0   23.2
## 354   85.5   68.0   84.5   99.5   60.8   26.5   24.2   40.5   39.4   23.2
## 423   86.1   68.3   85.3   96.5   55.1   26.0   22.6   33.9   33.7   21.8
## 471   84.1   65.9   83.7   96.3   57.0   26.6   23.7   35.6   34.0   21.7
## 472   84.2   62.4   81.4   90.4   53.2   25.8   23.0   34.5   31.0   19.2
## 483   86.2   67.0   85.3   96.7   56.2   27.0   22.3   37.0   33.0   21.6
## 497   85.3   63.7   85.3   96.3   55.5   26.4   23.0   36.5   34.1   21.3
##     wri.gi age  wgt   hgt sex
## 304   14.6  21 60.7 180.3   0
## 346   16.2  24 66.8 179.8   0
## 354   15.8  19 70.6 178.0   0
## 423   15.0  23 60.0 177.8   0
## 471   15.1  23 62.3 175.2   0
## 472   14.7  32 57.7 175.2   0
## 483   15.0  25 63.6 175.3   0
## 497   15.0  24 63.6 175.3   0

Note that in the condition, you can write bic.gi <= 27 instead of bdims$bic.gi <= 27. Observations numbered 304, 346, 354, 423, 471, 472, 483, 497 all matched the criteria. Looking at these people, we can see, for example, that they’re all women.

Exercise 3. Give the high temperatures in New York, Boston, and Los Angeles on February 15.

Box plots

Exercise 4. Load the ggplot2 library with the command library("ggplot2"). Try to compile your R file. If it gives you an error message about not finding ggplot2, type install.packages("ggplot2") in the console, and then try compiling your R file again again. (Note: this is one of the very rare times that doing something in the console will have an effect on your R file.)

Recall the basic syntax for making plots with ggplot:

ggplot(..., aes(...)) + ...

In place of the first … goes the data frame (for example, bdims). The second … contains the aesthetic mapping, which consists of instructions saying which variables in the dataset should be drawn in which ways. The final … contains instructions for the geometry, i.e., what sort of plot to make (e.g., a Q-Q plot, a box plot, a bar plot, etc.). Last week, we used geom_histogram and stat_qq.

Today, we’ll learn to make box plots. The basic command is

ggplot(dataset, aes(x=categorical.variable, y=numerical.variable)) + geom_boxplot()

Here, dataset should be replaced by the name of your dataset. categorical.variable and numerical.variable must be replaced by, well, the name of a categorical and the name of a numerical variable. This makes side-by-side boxplots, where each box shows the distribution of the numerical variable among all observations split according to the categorical variable.

Exercise 5. Make a box plot showing the distributions of ozone levels for each of the three cities.

Bar Plots

The basic command for a bar plot is

ggplot(dataset, aes(x=categorical.variable)) + geom_bar()

This will give a bar plot showing the number of observations with each value of the given categorical variable. For example, this command shows the number of observations from each of the three cities:

ggplot(climate, aes(x=place)) + geom_bar()

This is not the most informative of bar plots, but it does tell us that we have the most data for New York and the least for Los Angeles.

Exercise 6. Make a bar plot showing the number of days in each city in which the ozone levels were above the limit of 70. Hint: Instead of ggplot(climate, ...), you’ll need something like this:

ggplot(subset(climate, ...), aes(x=...)) + geom_bar()

The idea is that you restrict your dataset to only the observations in which ozone is above 70 parts per billion, and then you make a bar plot from this reduced dataset.

Extra Exercises (if you have time)

Exercise 7. Make a bar plot showing the number of days in each month with ozone levels greater than 50.

Exercise 8. Make a bar plot showing the number of days in each month with ozone levels greater than 50, using different colors to indicate how many of the days come from each city. (Hint: in the aes() part of the command, you’ll need to use aes(x=..., fill=...) to tell R to set the x-axis according to one variable and to fill the bars in different colors according to another variable.)

Exercise 9. Which day of the week has the highest ozone levels? Do you see the same patterns in all three cities?