Lab 1, Math 214
Tobias Johnson
2/1/2018
Getting started with R
The first thing to do is open up RStudio. This program is already installed on the lab computers.
The first thing you’ll see is the console. You can type commands here in R, the statistical programming language we use in this class. To get started, you can use R as a calculator. For example, try typing something like 37*21 to compute 37 times 21.
In the console, it’s very easy to try things out. If you think you know how to run a command, don’t be afraid to just give it a shot in the console! The worst thing that happens is that R will give you an error message.
When you type commands in the console, they aren’t saved for later (and you can’t hand in your work!). To make something longer lasting, go to File -> New File -> R Script.
Here, you can type commands and then save them in a file for later. You can include your own comments by starting the line with the symbol #, like this:
# Now, I'm going to add 3 and 5 and then divide the result by 6
(3 + 5) / 6
This is important because when you do labs, you’ll have to answer questions both with English text and with R commands. In particular, when you put your answer to Exercise 1, you should include a comment like # Exercise 1 so it’s clear which question you’re answering.
One common source of confusion: when you compile your R file, it runs separately from the console, so the two are totally different worlds. For example, if you type a command in the console to load a dataset, this dataset will not be available to the commands in the R file when the file is compiled. Likewise, if a command in the R file loads a dataset, compiling the file will not make this dataset available in the console. You’ll see what I’m talking about as you use R more in this class.
Exercise 1. Write an R command to compute your height in meters. Hint: one foot is 0.3048 meters.
Once you’ve put your command in the R file, you’ll want to tell RStudio to run the code so that you can see the result. This is called compiling the file. To do it, press Ctrl-Shift-K (or Command-Shift-K on a Mac), or click on the notebook icon in the toolbar. When it asks for the “Report output format”, choose PDF.
Vectors
R is good at managing vectors, which are lists of data. The following command makes a vector of numbers and stores it under the name vec:
vec <- c(1, 18, 2.5, 3, 4, 17)
The function c used here stands for “combine” and is used to make lists. The symbol <- is called the assignment operator. When you write a <- b, it tells R to store b in an object called a.
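As a small illustration of assignment, here is a sketch using names (total and scores) that are made up for this example and are not part of the lab:

```r
# Store the result of a calculation in an object called total
# (total and scores are hypothetical names used only here)
total <- 3 + 5

# Typing the name of an object prints what is stored in it
total

# c() combines several values into a single vector
scores <- c(90, 72.5, 88)
scores
```

Note that the assignment lines print nothing by themselves. Output appears only when the name of the object is typed.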
Try running this command in the console. You’ll see that it produces no output. If you want to see what’s been stored in the vec object, you can tell R to show you, like this:
vec
To look at the first number in vec, use the notation vec[1]. For the second number, use the notation vec[2], and so on.
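The bracket notation can be tried out on any vector. The sketch below uses a made-up vector called temps (a hypothetical name, not one of the lab’s objects). Note that R starts counting entries at 1.

```r
# A made-up vector, used only for this illustration
temps <- c(61, 58, 70, 66)

temps[1]   # first entry
temps[3]   # third entry

# Entries can be used in calculations like any other number
temps[1] + temps[3]
```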
Exercise 2. What would the outcome be of the following command?
vec[4] * vec[5]
Put your answer as a comment. (If you want, you can try out the command in the console to check your answer.)
Loading data
Now, let’s load some real data.
Exercise 3. Put the following two commands in your R file. They download two datasets and store them as air73 and air17.
air73 <- read.csv("http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/airquality1973.csv")
air17 <- read.csv("http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/airquality2017.csv")
These commands download two datasets and store them under the names air73 and air17. These datasets give readings of the level of ozone in the air in Manhattan in 1973 and 2017 on different days. Under today’s standards, an ozone level of more than 70 parts per billion is out of compliance with the Clean Air Act, the federal law that sets the rules on air pollution.
Try running these commands in the console. Useful hint: if you have a line of code in your R file and want to run it immediately in the console, type Control-Enter (Command-Enter on a Mac) when you’re on the line of code.
Now, let’s take a look at these datasets in the console. A useful command is head, which gives the first few entries of a dataset. This is useful because printing out the entire datasets here would be annoying, as they’re big!
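To see how head behaves before trying it on the real data, here is a sketch on a small made-up dataset. The name toy and its variables day and reading are assumptions for this illustration only.

```r
# A small made-up dataset (a data frame) with two variables
toy <- data.frame(
  day     = 1:10,
  reading = c(12, 15, 9, 20, 18, 11, 14, 16, 13, 17)
)

# By default, head() shows the first six rows
head(toy)

# A second argument asks for a different number of rows
head(toy, 3)
```

The column names printed at the top of the output are the variables in the dataset.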
Exercise 4. Use the head command to print out the beginning of each of these datasets. What variables do you see? (Put your answer to this last bit as a comment.)
The dataset is stored as a collection of vectors (recall that a vector is just a list of data). If your dataset is named dat and has a variable named x, then dat$x is the vector giving the values of the x variable for all observations in the dataset.
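As a sketch of the $ notation, the example below builds a tiny made-up dataset named dat with variables x and y. The values are invented for illustration.

```r
# A made-up dataset called dat with variables x and y
dat <- data.frame(x = c(2, 4, 6), y = c("a", "b", "c"))

# dat$x is the vector of x values, one per observation
dat$x

# It can be indexed like any other vector
dat$x[2]
```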
Exercise 5. Print out all values for the Ozone variable in both datasets. At a glance, what do you notice about the 1973 dataset compared to the 2017 one?
You might notice that some values of the Ozone variable in air73 are given as NA. This stands for not available, and it indicates that the data is missing from the dataset.
Analyzing data
R includes many, many commands for analyzing data. We’ll start with some basic ones. If you have a vector of data called vec, you can find its mean, standard deviation, median, and interquartile range with the following commands:
mean(vec)
sd(vec)
median(vec)
IQR(vec)
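Here is a sketch of these four commands on a made-up vector whose values were chosen so the results come out to whole numbers. The name readings is hypothetical and not part of the lab.

```r
# A made-up vector, used only for this illustration
readings <- c(2, 4, 4, 5, 10)

mean(readings)    # 5
sd(readings)      # 3
median(readings)  # 4
IQR(readings)     # 1
```

The mean (5) is larger than the median (4) here because the single large value 10 pulls the mean upward.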
Exercise 6. Find the mean and the median of the ozone readings in your two datasets. Hint: You’ll have a problem doing this with the 1973 dataset because of the missing data marked as NA. To tell R to ignore missing data when computing the mean, use the na.rm=TRUE option (i.e., remove NA entries). For example, if your vector is called vec, here’s how to do this:
mean(vec, na.rm=TRUE)
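To see the effect of this option, here is a sketch on a made-up vector with one missing value. The name vals is an assumption for this example.

```r
# A made-up vector with one missing value
vals <- c(10, NA, 30)

mean(vals)                # NA, because one entry is missing
mean(vals, na.rm = TRUE)  # 20, the mean of 10 and 30

# The same option works for median(), sd(), and IQR()
median(vals, na.rm = TRUE)
```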
Something very useful to know: When running a command, you can always get help on the command by putting the cursor on it and then pressing F1. Or, in the console, type ? immediately followed by the name of the command, for example, ?mean.
If you have time…
This section is just for if you have extra time. Suppose you want to count the number of observations that satisfy some criterion, e.g., the number of days in 2017 when the ozone level was above 70. To do this, use the command sum. The basic format is sum(condition). Here’s an example:
sum(air17$Ozone > 70)
Some things to note when writing conditions: The symbol for “greater than or equal to” in R is >=. The symbol for “equal to” is ==. (A very common error is writing = instead!) The symbol for “not equal to” is !=. You can combine conditions using the symbol &, which means “and”, and the symbol |, which means “or”. For example, here’s how to count the number of days in 2017 in which the ozone level was greater than 70 or less than or equal to 25:
sum(air17$Ozone > 70 | air17$Ozone <= 25)
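Counting with sum works because a condition produces one TRUE or FALSE per entry, and sum treats TRUE as 1 and FALSE as 0. The sketch below shows this on a made-up vector (the name ozone and its values are invented). It also shows that a missing value makes the count come out as NA unless na.rm = TRUE is added, which matters for data with NA entries such as the 1973 readings.

```r
# A made-up vector of readings, including one missing value
ozone <- c(80, 20, NA, 71, 50)

# A condition gives one TRUE/FALSE (or NA) per entry
ozone > 70

# Without na.rm = TRUE, the missing entry makes the count NA
sum(ozone > 70)

# sum() counts TRUE as 1 and FALSE as 0; na.rm = TRUE skips the NA
sum(ozone > 70, na.rm = TRUE)

# Combining conditions with | ("or")
sum(ozone > 70 | ozone <= 25, na.rm = TRUE)
```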
Exercise 7, if you have time. Think about what things in the dataset would be interesting to count. (The number of days with high/low ozone, the number of such days during the summer, etc.) Give commands to count them, and then put in a comment with any interesting observations you have.
By the way, you can download air quality data from https://www.epa.gov/outdoor-air-quality-data/download-daily-data, which is the website of the EPA, the Environmental Protection Agency. The 2017 data used in this lab is taken from there. The 1973 environmental data was recorded by the New York Department of Environmental Conservation and comes from a book by Chambers, Cleveland, Kleiner, and Tukey called Graphical Methods for Data Analysis.