For this lab you should submit, on Blackboard, your .Rmd and .docx-files at the end of the lab hour.

Tidyverse and data manipulation

In many cases, the data you load is not the data you need to work with. To actually work with your data, you may need to extract subsets, summarize subpopulations, or calculate new variables from the existing ones.

The command library(tidyverse) loads a family of packages that make it much easier to work with data manipulation in this way. At the core of the tidyverse approach to everything lies the pipe %>% operation - we have already seen this for layering ggformula plots on top of each other.

The pipe operation takes whatever came out of a previous command, and puts it in the next command. Thus, the following two lines of code do the same thing:

# EXAMPLE CODE - DO NOT RUN

f(g(h(x)))
x %>% h() %>% g() %>% f()

The pipe operation can be pronounced and then, so that read.csv(...) %>% filter(...) means first load the data and then filter.

Task Load the CPS85 dataset using the command data(CPS85). Use the command glimpse to print out a brief overview of the data.

Subsetting and sampling

Tidyverse provides several useful functions for extracting parts of a dataset:

  • select will extract columns from a dataset. List column names or indices, or use - to remove columns.

Task Try the following code, and describe the result.

CPS85.1 = CPS85 %>% select(1:3)
CPS85.2 = CPS85 %>% select(-union, -sector)
CPS85.3 = CPS85 %>% select(wage, educ, union, sector)
glimpse(CPS85.1)
glimpse(CPS85.2)
glimpse(CPS85.3)

Task Create a dataset CPS.race containing the race and the hispanic variables.

  • filter will extract rows from a dataset using some predicate (ie an expression returning TRUE or FALSE). filter keeps any row for which the predicate is TRUE.

Task Try the following code, and describe the result.

CPS85.4 = CPS85.3 %>% filter(union == "Union", sector == "clerical")
CPS85.4

Task Create a dataset CPS.married.highwage containing everything with married equal to "Married" and wage greater than 25. How many entries could you find?

  • sample_n and sample_frac will pick out a simple random sample from your data. sample_n takes the number of observations you want to retain, and sample_frac takes the proportion of observations you want to retain.

Task Try the following code, and describe the result.

set.seed(12345)
CPS85.5 = CPS85 %>% sample_n(25)
CPS85.5
CPS85.6 = CPS85 %>% sample_frac(0.05)
CPS85.6

Task Create a dataset CPS85.1.10 containing 10 randomly sampled observations from CPS85.1. Describe this dataset.

Create new variables from old

It is common to need to create new variables from the existing variables, to do computations needed or combining different variables with each other.

  • mutate adds new variables, keeping any old ones.

In the documentation of the dataset (use ?CPS85 in the console to read), we see that wage is the hourly wage in US$, educ is the number of years of education and exper is the number of years of experience.

Task Try the following code. What do you expect will happen?

CPS85.ratio = CPS85 %>% mutate(ratio = exper/educ)

Task Change the dataset CPS85.ratio to include a variable that takes the value wage/ratio.

  • transmute works like mutate, but throws away anything not explicitly listed in its arguments.

It is like using mutate and then select.

Wide and Tall data

We can distinguish between wide and tall representations of some data sets: with a wide representation, variables have their own columns, whereas with a tall representation the values all sit in one column, with a second column providing information of what data it is.

To get an example of this, use data(Alcohol) to load the Alcohol data set, drawing on Gapminder for country by country statistics on alcohol consumption.

Task Use glimpse to get a first understanding of the content of Alcohol. Describe what you see.

There are many years with missing data. The following will be easier if you run the following code to subset to only data from the years 2005 and 2008.

Alcohol = Alcohol %>% filter(year %in% c(2005,2008)) %>% select(-X)

This is an example of a tall dataset: all alcohol consumption rates are gathered in the column alcohol, with year and country indicating what each number refers to.

  • spread goes from a tall to a wide dataset. It requires two parameters: one with the name of the column containing the keys (labels for the subpopulations) and one with the name for the values.

Task Try the following code. Describe the result:

Alcohol.country = Alcohol %>% spread(country, alcohol)

Task Create a new dataset Alcohol.year where each column corresponds to one of the years for the dataset. Describe the result.

  • gather goes the other way: it transforms a wide dataset to a tall one. It takes the names of a new key column, a value column, and then a select type list of what columns to spread out.

To try gather we will use the dataset Galton. Use data(Galton) to load the data and glimpse(Galton) to see the structure of the dataset.

Task Try the following code. Describe the result:

Galton.parents = Galton %>% select(father, mother, family) %>% gather(parent, height, father, mother)

Task Create a dataset Galton.family where the height of the father (in father), the height of the mother (in mother) and the height of the child (in child) are gathered, keeping the column family but no other columns.

Notice that gather is quite useful for plotting: the key can be used as a coloring or splitting variable.

Task Plot a frequency curve of height from Galton.family with color determined by the key column.

Summarize for subpopulations

We often want to calculate various sample statistics separately within subpopulations. For simple statistics, we can use the functions in library(mosaic) -- calculating summary statistics from a tall dataset by setting the subpopulation column as predictor.

More fine-grained control -- more possibility to use other functions and more ability to choose more complex conditions to group by, tidyverse provides useful commands.

  • group_by adds invisible information to a dataset picking out subpopulations.
  • summarise, summarise_at, summarise_if and summarise_all calculate summaries within each group selected from group_by.

Task Try the following code. Describe the resulting datasets.

CPS85.1.means = CPS85.1 %>% group_by(race) %>% summarise_all(mean)
CPS85.mean.sector = CPS85 %>% group_by(sector) %>% summarise(mean.age = mean(age))
CPS85.mean.sex.sector = CPS85 %>% group_by(sector, sex) %>% summarise(mean.wage = mean(wage), sd.wage = sd(wage))

Task Compute means and standard deviations of wage, educ and exper after grouping by sex, married and union.