This “lab” has 3 goals:
introduce how to use RStudio
describe 3 ways to get data to work with in R
make some of the basic plots for univariate data and summarize these plots using the language of “center”, “spread”, and “shape”.
We load RStudio using its icon on the desktop or the start menu
We can directly enter commands in the console
We can type up a set of commands in a script and run these all at once (or piece by piece).
We can mix commentary and commands in a file and create a report
Using File -> New File -> R Script, open a new script. We can mix commentary and commands by commenting out our commentary.
In R, a comment is any text following a (bare) # sign. We will use ##' to comment commentary, as RStudio will treat that in a special manner.
One can use markdown styling to format commentary.
Lines that are not commented out will be R commands that will be run when the script is executed.
For example:
##' This is a *report*
2 + 2
##' Clearly R knows how to add, what about multiply?
3 * 3
Execution will magically happen when the file is saved if the “Source on Save” box is selected. If so, the workspace will update, the graphics will update, etc.
To create a report, we click the “compile” button, just to the right of the magic wand.
Try it out, it is pretty easy.
If you have a lot of commentary and little code, then marking the code is an option.
Create a new file using File -> New File -> R Markdown.
The bit at the top is YAML code to describe the document. It can be deleted, or edited, as you want. Commentary follows after the last ---.
Commentary uses markdown formatting to add headers, etc.
Code is set off within three-backtick fences; the opening fence has {r} after it, as in:
```{r}
2 + 2
```
Some marking is possible to hide code but print results, to mark code to not be evaluated, or to modify how the results will be printed.
To compile a report and execute the commands we “knit” it together. This is achieved with the “Knit” button.
We describe three ways to enter a data set into R so that it can be analyzed.
In class, we saw one way already: enter the data into Excel, save it as a .csv file, and read it into R with read.csv. The read.csv function is one way to read in formatted data from a resource. For example, we can read a csv file from online with the command:
air73 <- read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/airquality1973.csv")
Or similarly,
air17 <- read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/airquality2017.csv")
These read the file and store the data using a name (air73 and air17, respectively).
Clicking on the “Environment” view, you can see that air73 has 153 obs (cases or observations) and 7 variables. Clicking on the data set will call View on it, so that you can inspect the data.
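The facts reported in the “Environment” view can also be checked from the console. The following is a minimal sketch: it writes a tiny made-up csv file to a temporary location (the file contents and the names csv_path and small are hypothetical, standing in for a file saved from Excel) and then reads and inspects it.

```r
# Write a tiny csv file to a temporary location (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("day,temp", "1,67", "2,72", "3,74"), csv_path)

# Read it back in, just as with a file saved from Excel or a URL.
small <- read.csv(csv_path)

dim(small)    # number of rows (cases) and columns (variables)
names(small)  # the variable names, taken from the header line
str(small)    # a compact summary of each variable
head(small)   # the first few rows
```

The same four inspection commands work on air73 and air17 once they have been read in.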
You can enter (univariate) data using c to combine different values, separated by commas, as in:
commute <- c(30, 45, 15, 5, 90, 60)
This command combines the values into one object and assigns that object a name, commute. The output of an assignment command is “invisible”. However, typing the variable name will cause the display of the stored data:
commute
## [1] 30 45 15 5 90 60
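A small sketch of a few things to try with a stored vector. Wrapping an assignment in parentheses is one way to make the otherwise invisible result print; length and sort are base R functions.

```r
# Wrapping an assignment in parentheses stores the values and prints them.
(commute <- c(30, 45, 15, 5, 90, 60))

length(commute)  # how many values were entered
sort(commute)    # the values from smallest to largest
```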
R has many built-in data sets ready to use if you know who to ask. (R of course, but you need to ask properly).
For example, the mtcars data set lists numbers about cars from the 1973-74 model years. The more modern Cars93 data set is also accessible (from the MASS package), but that is for another day.
The mtcars data is not visible in the “environment” by default but the variable is defined in your session. Issuing the command
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
will cause the stored values to display.
The command
View(mtcars)
will cause RStudio to display the data in a cell-like manner.
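Printing all of mtcars fills the console. As a sketch of some lighter-weight ways to look at a built-in data set, using only base R functions:

```r
head(mtcars, 3)  # just the first 3 rows
dim(mtcars)      # number of rows (cars) and columns (variables)
names(mtcars)    # the variable names
```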
That is actually manipulating data in R.
Above we have two different storage formats for our data sets air73, mtcars, and commute: air73 and mtcars are data frames, while commute is a plain vector of values.
The commute variable is a direct link to the univariate data set and it can be directly passed to a function. For example, to make a stem-and-leaf plot of commute times, we have:
stem(commute)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 0 | 55
## 2 | 0
## 4 | 5
## 6 | 0
## 8 | 0
In this lab we will use just a few functions from R to graphically summarize data: stem (as above), stripchart, hist, and barplot.
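Of these, barplot is the one meant for categorical data: it draws one bar per category, so the values are first counted with table. A minimal sketch, using a made-up vector (transport is hypothetical, not a data set from this lab):

```r
# Hypothetical data: how six people commute.
transport <- c("bus", "car", "bus", "train", "car", "bus")

counts <- table(transport)  # count how often each category appears
counts
barplot(counts)             # one bar per category, height = count
```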
We can make a stem and leaf plot of a variable within a multivariate data set, but we need to learn how to reference the variable.
Multivariate data is typically stored in a data.frame. A data frame records variables as named columns, with each row representing a single case or observation.
The columns are named so that they can be accessed. For example, mtcars$mpg will reference the mpg variable in the mtcars data set:
stem(mtcars$mpg)
##
## The decimal point is at the |
##
## 10 | 44
## 12 | 3
## 14 | 3702258
## 16 | 438
## 18 | 17227
## 20 | 00445
## 22 | 88
## 24 | 4
## 26 | 03
## 28 |
## 30 | 44
## 32 | 49
(We will learn shortcuts, but save that for a later day.)
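As a quick check of what the $ notation returns, here is a sketch using base R only. The double-bracket form is an equivalent way to pull out a column by name.

```r
mpg <- mtcars$mpg  # pull the mpg column out as a plain vector
length(mpg)        # one value per car

# The two forms give the same vector.
identical(mtcars$mpg, mtcars[["mpg"]])
```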
A strip chart is a crummy version of a dot plot:
stripchart(mtcars$wt)
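By default, repeated values are drawn on top of each other. The method argument of stripchart offers alternatives; the sketch below stacks repeated values, which is closer to a dot plot. The rounding is only there so that more values coincide and the stacking is easy to see (wt_rounded is a name made up for this example).

```r
# Round to one decimal place so that more weights coincide.
wt_rounded <- round(mtcars$wt, 1)

# method = "stack" piles repeated values on top of each other.
stripchart(wt_rounded, method = "stack")
```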
QUESTION: For each of the graphics above, what is the “center” or “middle” value? (There is no exactly correct answer.)
QUESTION: For each of the graphics above, what is the “range” of values? (Here we use two numbers to describe the range.)
Histograms are generated with hist, as in:
hist(mtcars$qsec)
QUESTION: Is this histogram approximately “bell shaped”?
QUESTION: What is qsec? (Hint: type the command ?mtcars and read.)
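hist also returns the numbers behind the picture, which can help when reading values off the plot. A sketch using base R: with plot = FALSE nothing is drawn and only the bins and counts are computed.

```r
h <- hist(mtcars$qsec, plot = FALSE)  # compute the bins without drawing
h$breaks       # the bin boundaries
h$counts       # how many cars fall in each bin
sum(h$counts)  # every car is counted once
```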
Variable names are case sensitive. The Ozone variable in air73 is referenced with air73$Ozone:
hist(air73$Ozone)
QUESTION: Is this histogram approximately “bell shaped”?
QUESTION: What is the “range” of the Ozone variable?
QUESTION: What is the “center” of the Ozone variable?
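To see case sensitivity in action, here is a sketch using R's built-in airquality data set, which also has an Ozone variable. Using it as a stand-in for air73 is an assumption made so the example runs without downloading the file.

```r
names(airquality)       # note the capital O in Ozone

airquality$ozone        # wrong case: no such variable, so the result is NULL
head(airquality$Ozone)  # right case: the first few ozone readings
```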
The “shape” of a distribution of values is a rough description along the lines of:
unimodal versus non-unimodal or multimodal
symmetric versus skewed (right or left). Skew right means the “tail” of the distribution is longer on the right than the left.
bell shaped
Here are commands to produce 3 unimodal histograms.
hist(rnorm(100))
hist(runif(100))
hist(rexp(100))
QUESTION: Characterize the three as symmetric, skewed left, or skewed right.
QUESTION: Are any of the three “bell-shaped”?
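The three commands draw fresh random numbers each time they are run, so the pictures change from run to run. A sketch of how to make them repeatable and view them side by side; the seed value 214 is arbitrary.

```r
set.seed(214)  # fix the random numbers so the plots can be reproduced

old_par <- par(mfrow = c(1, 3))  # 1 row, 3 columns of plots
hist(rnorm(100))
hist(runif(100))
hist(rexp(100))
par(old_par)                     # restore the previous plot layout
```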
QUESTION: Make this graphic:
hist(air73$Temp)
Is this symmetric, skewed left, or skewed right?
Center can be measured different ways, for example:
the “middle” value, with as many data points below it as above it
the “balance” point of the data
For the first characterization, the stem and leaf plot can be used to find the middle pretty quickly.
For the latter, the histogram can be used to identify a “balance” point.
QUESTION: Find the middle value from the commute data.
QUESTION: Find the balancing value for the mtcars$mpg data.
QUESTION: Find both middles for the air73$Ozone data.
QUESTION: For a symmetric data set, would you expect the two to be the same?
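Once you have an answer by eye, R can check it. In the sketch below, median computes the middle value and mean computes the balance point; both are base R functions, and the commute data is re-entered so the example stands alone.

```r
commute <- c(30, 45, 15, 5, 90, 60)

sort(commute)    # 5 15 30 45 60 90: the middle falls between 30 and 45
median(commute)  # the middle value, 37.5
mean(commute)    # the balance point, about 40.8
```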
Spread is harder to visualize. Here are two ways; a third will be prominent later but is based on a numeric computation:
The range of the data is the pair of the smallest and largest values.
The range of the middle 50% of the data is called the IQR. It is found by breaking up the data into 4 equal sized pieces (by number of data points) and using the range to summarize the middle two pieces.
These can be “eyeballed” from the histogram, since the histogram represents the amount of data in a bin with area, so the area is proportional to the number of data points.
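These can also be computed. A sketch with base R functions on the commute data: note that R's IQR function reports the width of the middle 50% as a single number, while quantile gives the pair of endpoints used in this lab. R uses one particular interpolation rule for quartiles, so hand computations may differ slightly.

```r
commute <- c(30, 45, 15, 5, 90, 60)

range(commute)                    # smallest and largest values: 5 and 90
quantile(commute, c(0.25, 0.75))  # endpoints of the middle 50%
IQR(commute)                      # the distance between those endpoints
```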
QUESTION: Look at a histogram of air73$Ozone. Consider the pair (20, 65). Is this the “range” or the “IQR” for this data?
QUESTION: Look at the histogram of mtcars$qsec. What is the range? What is the IQR?
QUESTION: Look at the histogram of mtcars$wt. Describe the shape, center, and spread in a paragraph. Mention how you measured each, if there is a choice.