Summary

In this lab, which we'll do together, you'll load some datasets and do some basic calculations on them. We will also look over the first week's homework together and see how to format it for submission. You'll turn in your lab report as an RMarkdown file, which combines calculations done in the R language with ordinary English text.

The first thing you should do is create a new RMarkdown file, following your professor's instructions.

First steps

Before anything else, log in to your RStudio account at http://www.math.csi.cuny.edu/rstudio; refer to your lab handout for login information.

Next, create a new RMarkdown file. Name it "Lab 1" and set the output format to Word Document.

The RMarkdown file will be filled with example text. Remove everything after and including the line ## R Markdown.

All your labs and all your homework will be written in RMarkdown, with text explaining what you are doing and code blocks that carry out the calculations and draw the plots.

At the very beginning of every RMarkdown file, in the part that reads

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

add the commands library(tidyverse), library(ggformula) and library(mosaic) so that that part reads

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(ggformula)
library(mosaic)
```

Loading in online data

We will work with the Titanic data referenced in the homework exercises. A copy of the dataset is available through the course website. You can read this data into R by running the command

titanic = read.csv("ips9e/Chapter 1/EX01-027TITANIC.txt", sep="\t")

This creates a data frame called titanic that contains the information from the file.
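If you want to see what the sep="\t" argument does without touching the course file, here is a minimal sketch. It writes a small tab-separated file to a temporary location and reads it back with the same call pattern. The rows are invented for illustration and are not taken from the real dataset; the names tmp and demo are placeholders.

```r
# Hypothetical stand-in for the course file: a header line plus three invented rows.
# "\t" inside an R string is a tab character.
tmp = tempfile(fileext = ".txt")
writeLines(c(
  "pclass\tsurvived\tsex\tage",
  "1\t1\tfemale\t29",
  "3\t0\tmale\t22",
  "2\t1\tfemale\t35"
), tmp)

# Same call pattern as in the lab: sep = "\t" says the columns are separated by tabs.
demo = read.csv(tmp, sep = "\t")

dim(demo)    # number of rows and number of columns
names(demo)  # column names, taken from the header line
```

If you leave out sep="\t", read.csv looks for commas instead, and each line ends up in a single column.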

In addition to loading this data within your RMarkdown file, load it in the console as well (use the exact same command, but switch to the console and type it in there). This way, you can play around with the data in the console as well.

Once you have the data loaded in R, you can find out the variable names using

names(titanic)

You can also type

head(titanic)

to get a glimpse of the dataset itself in data matrix form. There is also a command glimpse that will give you an easy summary of variables with example entries.

glimpse(titanic)

The titanic data has four variables:

| Variable name | Content | Type | Encoding |
|---|---|---|---|
| pclass | Passenger class | Integer | 1, 2, 3 for first, second and third class |
| survived | Did the passenger survive? | Integer | 1 for yes, 0 for no |
| sex | Sex of passenger | Factor | male / female |
| age | Age of passenger | Number | |

Does all the data have the right data type? It would make sense to change the types of both survival and passenger class: survival is a truth value (yes or no), and passenger class has exactly three valid values.

There are commands to convert between data types:

titanic$pclass = as.factor(titanic$pclass)
levels(titanic$pclass) = c("1st", "2nd", "3rd")
titanic$survived = as.logical(titanic$survived)
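To check that a conversion did what you expected, you can compare the column types before and after. The sketch below does this on a small toy data frame with invented values, so it runs without the course file; the name toy is a placeholder.

```r
# Toy data frame with invented values; the column names follow the titanic data.
toy = data.frame(
  pclass   = c(1L, 3L, 2L, 3L),
  survived = c(1L, 0L, 1L, 0L),
  sex      = c("female", "male", "female", "male"),
  age      = c(29, 22, 35, 40)
)

sapply(toy, class)  # column types before conversion

toy$pclass = as.factor(toy$pclass)           # levels come out sorted: "1", "2", "3"
levels(toy$pclass) = c("1st", "2nd", "3rd")  # relabel the levels in that same order
toy$survived = as.logical(toy$survived)      # 1 becomes TRUE, 0 becomes FALSE

sapply(toy, class)  # column types after conversion
table(toy$pclass)   # counts per class, shown with the new labels
```

The relabeling step relies on the order of the levels, so the new labels must be listed in the same order as the old ones.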

Plotting data

We will use RMarkdown's ability to produce plots inline with the text. Our preferred plotting system for this course is called ggformula. All ggformula plotting commands have the exact same pattern:

plotcommand(y-variable ~ x-variable, data=dataset)

All the plot commands start with gf_, so after typing gf_ you can press the [Tab] key on your keyboard to get a list of candidate commands.
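If you would rather see the list printed in the console, one possible approach is to ask R for the names that the ggformula package provides. This is a sketch that assumes ggformula is installed; the name gf_commands is a placeholder.

```r
library(ggformula)

# List every name in the attached ggformula package that starts with "gf_".
gf_commands = ls("package:ggformula", pattern = "^gf_")

head(gf_commands, 10)  # show the first ten
length(gf_commands)    # how many there are in total
```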

As an example, to make a histogram of ages we would use

gf_histogram( ~ age, data=titanic)

And to make a bar chart of the distribution of sex, we would use

gf_bar( ~ sex, data=titanic)

In both these cases, the y-variable (the count) is calculated from the data, so we leave the left side of the formula empty. We will later see plot types where the y-variable cannot be skipped.

To show how variables interact, we might split a chart depending on some variable. Then the pattern for the plot command changes to

plotcommand(y-variable ~ x-variable | split-variable, data=dataset)

To plot the sex distribution separately in the three passenger classes we might use

gf_bar( ~ sex | pclass, data=titanic)
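The two formula patterns can be tried side by side on a small example. The sketch below uses a toy data frame with invented values rather than the real dataset, and it assumes the ggformula package is installed. The names toy, age_plot and sex_by_class are placeholders, and the bins value is an arbitrary choice for such a small dataset.

```r
library(ggformula)

# Toy data with invented values; the column names follow the titanic data.
toy = data.frame(
  pclass = factor(c("1st", "3rd", "2nd", "3rd", "1st", "3rd")),
  sex    = c("female", "male", "female", "male", "male", "female"),
  age    = c(29, 22, 35, 40, 54, 8)
)

# One variable, with the counts calculated for us: the left side of the formula is empty.
age_plot = gf_histogram( ~ age, data = toy, bins = 5)

# The same kind of formula, split into one panel per passenger class with "|".
sex_by_class = gf_bar( ~ sex | pclass, data = toy)

# Typing the name of a plot object draws it; in an RMarkdown chunk it appears inline.
age_plot
sex_by_class
```

Storing a plot in a variable is optional. Calling gf_histogram or gf_bar directly, as in the commands above, draws the plot right away.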

Now do your homework