Intro to using RStudio

This “lab” has 3 goals:

introduce how to use RStudio
describe 3 ways to get data to work with in R
make some of the basic plots for univariate data and summarize these plots using the language of “center”, “spread”, and “shape”.

RStudio

We load RStudio using its icon on the desktop or the start menu.
We can directly enter commands in the console.
We can type up a set of commands in a script and run these all at once (or piece by piece).
We can mix commentary and commands in a file and create a report.

Reports in R

Reports mix commentary and commands. You (the user) must mark either the commentary or the commands.

Marking the commentary

Using File -> New File -> R Script, open a new script. We can mix commentary and commands by commenting out our commentary.

In R, a comment is any text following a (bare) # sign. We will use ##' to mark commentary, as RStudio treats such lines in a special manner.

One can use markdown styling to format commentary.

Lines that are not commented out will be R commands that will be run when the script is executed.

For example

##' This is a *report*
2 + 2

##' Clearly R knows how to add, what about multiply?

3 * 3

Execution will magically happen whenever the file is saved, provided the “Source on Save” box is checked. If so, the workspace will update, the graphics will update, etc.

To create a report, we click the “compile” button, just to the right of the magic wand.

Try it out; it is pretty easy.

Marking the code

If you have a lot of commentary and little code, then marking the code is an option.

Create a new file using File -> New File -> R Markdown.

The bit at the top is YAML code to describe the document. It can be deleted, or edited, as you want. Commentary follows after the last ---.

Commentary uses markdown formatting to add headers, etc.

Code is set off within triple-backtick fences; the opening fence has {r} after it, as in:

```{r}
2 + 2
```

Additional marking makes it possible to hide code but print results, to mark code so that it is not evaluated, or to modify how the results are printed.

To compile a report and execute the commands we “knit” it together. This is achieved with the “Knit” button.

Working with data in R

We describe three ways to enter a data set into R so that it can be analyzed.

read.csv

In class, we saw one way already: enter the data into Excel, save it as a .csv file, and read it into R with read.csv. The read.csv function is one way to read in formatted data from a resource. For example, we can read a csv file from online with the command:

air73 <- read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/airquality1973.csv")

Or similarly,

air17 <- read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/airquality2017.csv")

These read the file and store the data using a name (air73 and air17, respectively).

Clicking on the “Environment” view, you can see that air73 has 153 obs (cases or observations) and 7 variables. Clicking on the data set will call View on it, so that you can inspect the data.
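
To see what read.csv does without depending on a network connection, here is a minimal sketch that writes a tiny CSV file to a temporary location and reads it back. The file contents and the names tmp and small are made up for illustration; they are not part of the lab's data.

```r
# Write a tiny CSV to a temporary file (hypothetical values, for illustration only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("day,temp", "1,67", "2,72", "3,74"), tmp)

# Read it back; read.csv treats the first line as column names by default
small <- read.csv(tmp)

dim(small)    # number of rows (cases) and columns (variables)
names(small)  # the variable names
head(small)   # the first few rows
```

The same three inspection commands can be applied to air73 or air17 once they have been read in.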

Typing in data

You can enter (univariate) data using c to combine different values, separated by commas, as in:

commute <- c(30, 45, 15, 5, 90, 60)

This command combines the values into one object and assigns that object a name, commute. The output of an assignment command is “invisible”. However, typing the variable name will cause the display of the stored data:

commute

## [1] 30 45 15  5 90 60
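
A few functions help confirm that the values were entered as intended. This sketch re-creates commute from above and uses length and sort, both part of base R:

```r
# Same values as entered above
commute <- c(30, 45, 15, 5, 90, 60)

length(commute)  # how many values were entered
sort(commute)    # the values from smallest to largest
```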

Using a built-in data set

R has many built-in data sets ready to use if you know who to ask. (R of course, but you need to ask properly.)

For example, the mtcars data set lists numbers about 32 cars, taken from the 1974 Motor Trend magazine (1973-74 models). The more modern Cars93 data set (from the MASS package) is also accessible, but that is for another day.

The mtcars data is not visible in the “environment” by default but the variable is defined in your session. Issuing the command

mtcars

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

will cause the stored values to display.

The command

View(mtcars)

will cause RStudio to display the data in a cell-like manner.
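
Since printing all of mtcars fills the console, it can help to look at just part of it. A short sketch using base R functions:

```r
head(mtcars)   # first six rows only
dim(mtcars)    # number of rows and columns
names(mtcars)  # the variable names
# ?mtcars      # opens the help page describing each variable (interactive use)
# data()       # lists the built-in data sets that are available (interactive use)
```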

Making plots and graphs from data

That is, actually manipulating data in R.

Above we have two different storage formats for our data sets: commute is a plain vector of values, while air73 and mtcars are data frames.

The commute variable is a direct link to the univariate data set and it can be directly passed to a function. For example, to make a stem-and-leaf plot of commute times, we have:

stem(commute)

## 
##   The decimal point is 1 digit(s) to the right of the |
## 
##   0 | 55
##   2 | 0
##   4 | 5
##   6 | 0
##   8 | 0

In this lab we will use just a few functions from R to graphically summarize data: stem (as above), stripchart, hist, and barplot.
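
Of these, barplot is meant for counts of categorical data rather than raw numeric values. As a sketch, assuming a made-up vector of how six people travel (the name travel and its values are hypothetical), table counts each category and barplot draws the counts:

```r
# Hypothetical categorical data: how six people get to school
travel <- c("bus", "car", "bus", "walk", "car", "bus")

counts <- table(travel)  # tally each category
counts
barplot(counts)          # one bar per category, height = count
```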

We can make a stem and leaf plot of a variable within a multivariate data set, but we need to learn how to reference the variable.

Multivariate data is typically stored in a data.frame. A data frame records variables as named columns, with each row representing one case or observation.

The columns are named so that they can be accessed. For example, mtcars$mpg will reference the mpg variable in the mtcars data set:

stem(mtcars$mpg)

## 
##   The decimal point is at the |
## 
##   10 | 44
##   12 | 3
##   14 | 3702258
##   16 | 438
##   18 | 17227
##   20 | 00445
##   22 | 88
##   24 | 4
##   26 | 03
##   28 | 
##   30 | 44
##   32 | 49

(We will learn shortcuts, but save that for a later day.)
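
To see which names are available for use after the $, and that the result is an ordinary vector like commute, a quick sketch:

```r
names(mtcars)        # the column names that can follow the $
mpg <- mtcars$mpg    # pull one column out as a plain numeric vector
length(mpg)          # one value per row of mtcars

# Double brackets with the name in quotes reach the same column
identical(mtcars$mpg, mtcars[["mpg"]])
```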

A strip chart is a crummy version of a dot plot:

stripchart(mtcars$wt)
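
By default stripchart draws repeated values on top of one another. One possible refinement, offered as a sketch rather than as part of the lab, is to round the weights so that ties occur and then stack them with the method argument, which gets closer to a dot plot:

```r
# Round to one decimal place so that some cars share a value
wt_rounded <- round(mtcars$wt, 1)

# method = "stack" piles tied values up instead of overplotting them
stripchart(wt_rounded, method = "stack", pch = 16,
           xlab = "Weight (1000 lbs)")
```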

QUESTION: For each of the graphics above, what is the “center” or “middle” value? (There is no exactly correct answer.)

QUESTION: For each of the graphics above, what is the “range” of values? (Here we use two numbers to describe the range.)

Histograms are generated with hist, as in

hist(mtcars$qsec)

QUESTION: Is this histogram approximately “bell shaped”?

QUESTION: What is qsec? (Hint, type the command ?mtcars and read.)
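
The look of a histogram can be adjusted through arguments to hist. The sketch below suggests a number of bins with breaks (R treats this as a suggestion only) and sets a title and axis label; the particular values chosen are illustrative assumptions:

```r
h <- hist(mtcars$qsec,
          breaks = 10,                        # suggested number of bins
          main = "Histogram of qsec values",  # plot title
          xlab = "qsec")                      # x-axis label

# hist also returns the bin counts; they add up to the number of cars
sum(h$counts)
```

Trying a few different values for breaks is a quick way to see how much the apparent shape depends on the choice of bins.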

Variable names are case sensitive. The Ozone variable in air73 is referenced with air73$Ozone:

hist(air73$Ozone)

QUESTION: Is this histogram approximately “bell shaped”?

QUESTION: What is the “range” of the Ozone variable?

QUESTION: What is the “center” of the Ozone variable?

Summaries of distribution

Shape

The “shape” of a distribution of values is a rough description along the lines of

unimodal versus non-unimodal or multimodal
symmetric versus skewed (right or left). Skew right means the “tail” of the distribution is longer on the right than the left.
bell shaped

Here are commands to produce 3 unimodal histograms.

hist(rnorm(100))

hist(runif(100))

hist(rexp(100))
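
Because rnorm, runif, and rexp draw random values, each run produces a slightly different picture. If a repeatable picture is wanted, one option is to call set.seed first; the seed value 214 below is an arbitrary choice:

```r
set.seed(214)     # fix the random number generator's starting point
x <- rnorm(100)   # 100 draws from a normal distribution
hist(x)

set.seed(214)     # same seed again ...
y <- rnorm(100)   # ... gives the same 100 values
identical(x, y)
```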

QUESTION: Characterize the three as symmetric, skewed left, or skewed right.

QUESTION: Are any of the three “bell-shaped”?

QUESTION: Make this graphic:

hist(air73$Temp)

Is this symmetric, skewed left, or skewed right?

Center

Center can be measured in different ways, for example:

what is the “middle” value in a data set (median)
where would the data balance? (mean)

For the first characterization, the stem and leaf plot can be used to find the middle pretty quickly.

For the latter, the histogram can be used to identify a “balance” point.
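
R can compute both measures, which is handy for checking a value read off a plot. A minimal sketch on a small made-up data set (the vector x is hypothetical, chosen so the two measures differ):

```r
# Hypothetical data with one large value pulling the mean upward
x <- c(2, 4, 4, 5, 10)

median(x)  # the middle value once sorted: 4
mean(x)    # the balance point: (2 + 4 + 4 + 5 + 10) / 5 = 5
```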

QUESTION: find the middle value from the commute data

QUESTION: find the balancing value for the mtcars$mpg data

QUESTION: find both middles for the air73$Ozone data

QUESTION: For a symmetric data set, would you expect the two to be the same?

Spread

Spread is harder to visualize. Here are two ways; a third will be prominent but is based on a numeric computation:

The range of the data is the pair of the smallest and largest values.
The range of the middle 50% of the data is called the IQR. It is found by breaking up the data into 4 equal-sized pieces (by number of data points) and using the range to summarize the middle two pieces.

These can be “eyeballed” from the histogram. A histogram represents the amount of data in a bin by area, so the area of each bar is proportional to the number of data points in its bin.
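
Both summaries can also be computed. In this sketch the vector x is made up for illustration. Note that range and quantile return the endpoints as a pair, matching the descriptions above, whereas R's IQR function returns a single number, the width of the middle 50%:

```r
# Hypothetical data: eight evenly spaced values
x <- c(1, 3, 5, 7, 9, 11, 13, 15)

range(x)                         # smallest and largest values
q <- quantile(x, c(0.25, 0.75))  # endpoints of the middle 50%
q
IQR(x)                           # width of the middle 50%: upper minus lower endpoint
```

By default quantile interpolates between data points, so its endpoints may differ slightly from values found by counting on a stem and leaf plot.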

QUESTION: Look at a histogram of air73$Ozone. Consider the pair (20, 65). Is this the “range” or the “IQR” for this data?

QUESTION: Look at the histogram of mtcars$qsec. What is the range? What is the IQR?

QUESTION: Look at the histogram of mtcars$wt. Describe the shape, center, and spread in a paragraph. Mention how you measured each, if there is a choice.
