For this lab you should submit, on Blackboard, your .Rmd
and .docx
-files at the end of the lab hour.
Tidyverse and data manipulation
In many cases, the data you load is not the data you need to work with. To actually work with your data, you may need to extract subsets, summarize subpopulations, or calculate new variables from the existing ones.
The command library(tidyverse)
loads a family of packages that make it much easier to work with data manipulation in this way.
At the core of the tidyverse approach to everything lies the pipe %>%
operation - we have already seen this for layering ggformula
plots on top of each other.
The pipe operation takes whatever came out of a previous command, and puts it in the next command. Thus, the following two lines of code do the same thing:
# EXAMPLE CODE - DO NOT RUN
f(g(h(x)))
x %>% h() %>% g() %>% f()
The pipe operation can be pronounced and then, so that read.csv(...) %>% filter(...)
means first load the data and then filter.
Task Load the CPS85 dataset using the command data(CPS85)
. Use the command glimpse
to print out a brief overview of the data.
Subsetting and sampling
Tidyverse provides several useful functions for extracting parts of a dataset:
select
will extract columns from a dataset. List column names or indices, or use-
to remove columns.
Task Try the following code, and describe the result.
CPS85.1 = CPS85 %>% select(1:3)
CPS85.2 = CPS85 %>% select(-union, -sector)
CPS85.3 = CPS85 %>% select(wage, educ, union, sector)
glimpse(CPS85.1)
glimpse(CPS85.2)
glimpse(CPS85.3)
Task Create a dataset CPS.race
containing the race
and the hispanic
variables.
filter
will extract rows from a dataset using some predicate (ie an expression returningTRUE
orFALSE
).filter
keeps any row for which the predicate isTRUE
.
Task Try the following code, and describe the result.
CPS85.4 = CPS85.3 %>% filter(union == "Union", sector == "clerical")
CPS85.4
Task Create a dataset CPS.married.highwage
containing everything with married
equal to "Married"
and wage
greater than 25. How many entries could you find?
sample_n
andsample_frac
will pick out a simple random sample from your data.sample_n
takes the number of observations you want to retain, andsample_frac
takes the proportion of observations you want to retain.
Task Try the following code, and describe the result.
set.seed(12345)
CPS85.5 = CPS85 %>% sample_n(25)
CPS85.5
CPS85.6 = CPS85 %>% sample_frac(0.05)
CPS85.6
Task Create a dataset CPS85.1.10
containing 10 randomly sampled observations from CPS85.1
. Describe this dataset.
Create new variables from old
It is common to need to create new variables from the existing variables, to do computations needed or combining different variables with each other.
mutate
adds new variables, keeping any old ones.
In the documentation of the dataset (use ?CPS85
in the console to read), we see that wage
is the hourly wage in US$, educ
is the number of years of education and exper
is the number of years of experience.
Task Try the following code. What do you expect will happen?
CPS85.ratio = CPS85 %>% mutate(ratio = exper/educ)
Task Change the dataset CPS85.ratio
to include a variable that takes the value wage/ratio
.
transmute
works likemutate
, but throws away anything not explicitly listed in its arguments.
It is like using mutate
and then select
.
Wide and Tall data
We can distinguish between wide and tall representations of some data sets: with a wide representation, variables have their own columns, whereas with a tall representation the values all sit in one column, with a second column providing information of what data it is.
To get an example of this, use data(Alcohol)
to load the Alcohol data set, drawing on Gapminder for country by country statistics on alcohol consumption.
Task Use glimpse
to get a first understanding of the content of Alcohol
.
Describe what you see.
There are many years with missing data. The following will be easier if you run the following code to subset to only data from the years 2005 and 2008.
Alcohol = Alcohol %>% filter(year %in% c(2005,2008)) %>% select(-X)
This is an example of a tall dataset: all alcohol consumption rates are gathered in the column alcohol
, with year
and country
indicating what each number refers to.
spread
goes from a tall to a wide dataset. It requires two parameters: one with the name of the column containing the keys (labels for the subpopulations) and one with the name for the values.
Task Try the following code. Describe the result:
Alcohol.country = Alcohol %>% spread(country, alcohol)
Task Create a new dataset Alcohol.year
where each column corresponds to one of the years for the dataset. Describe the result.
gather
goes the other way: it transforms a wide dataset to a tall one. It takes the names of a new key column, a value column, and then a select type list of what columns to spread out.
To try gather
we will use the dataset Galton
. Use data(Galton)
to load the data and glimpse(Galton)
to see the structure of the dataset.
Task Try the following code. Describe the result:
Galton.parents = Galton %>% select(father, mother, family) %>% gather(parent, height, father, mother)
Task Create a dataset Galton.family
where the height of the father (in father
), the height of the mother (in mother
) and the height of the child (in child
) are gathered, keeping the column family
but no other columns.
Notice that gather
is quite useful for plotting: the key can be used as a coloring or splitting variable.
Task Plot a frequency curve of height
from Galton.family
with color determined by the key column.
Summarize for subpopulations
We often want to calculate various sample statistics separately within subpopulations. For simple statistics, we can use the functions in library(mosaic)
-- calculating summary statistics from a tall dataset by setting the subpopulation column as predictor.
More fine-grained control -- more possibility to use other functions and more ability to choose more complex conditions to group by, tidyverse provides useful commands.
group_by
adds invisible information to a dataset picking out subpopulations.summarise
,summarise_at
,summarise_if
andsummarise_all
calculate summaries within each group selected fromgroup_by
.
Task Try the following code. Describe the resulting datasets.
CPS85.1.means = CPS85.1 %>% group_by(race) %>% summarise_all(mean)
CPS85.mean.sector = CPS85 %>% group_by(sector) %>% summarise(mean.age = mean(age))
CPS85.mean.sex.sector = CPS85 %>% group_by(sector, sex) %>% summarise(mean.wage = mean(wage), sd.wage = sd(wage))
Task
Compute means and standard deviations of wage
, educ
and exper
after grouping by sex
, married
and union
.