For this lab you should submit your .Rmd and .docx files on Blackboard at the end of the lab hour.

Remember to choose Word Document as output format when you create the RMarkdown file for the lab report, and to add the lines

library(tidyverse)
library(ggformula)
library(mosaic)

to the code block at the beginning. Then delete all content after this first code block.

Formulas everywhere

In the last lab we saw that plotting commands follow the format

plotcommand(y-variable ~ x-variable | split-variable, data=dataset)

The same format will give you access to summary statistics (and later on also to various types of modeling).

We will consider the Titanic dataset: load it with

titanic = read.delim("ips9e/Chapter 1/EX01-027TITANIC.txt")
titanic$pclass = as.factor(titanic$pclass)
titanic$survived = as.logical(titanic$survived)

Now, to calculate the mean age, use age as your x-variable and leave out the y-variable. Whenever only one variable is in play, you either use it as the sole x-variable, or use it as the y-variable and set x to 1:

mean(~age, data=titanic)

Task Calculate the mean age.

Missing data

Did you notice anything disappointing when calculating the mean age?

Missing observations could change the result of anything you compute from the data: if the missing age were 100, it would change the mean by a fair bit. R deals with this by making NA sticky: (almost) anything that gets NA as input will return NA.
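The stickiness can be seen on a small vector. This is a minimal sketch in base R only; the ages are made up for illustration and are not taken from the Titanic file.

```r
# Hypothetical ages with one missing observation (not real Titanic data)
ages <- c(22, 38, NA, 35)

mean_with_na <- mean(ages)     # NA: the missing value propagates
sum_with_na  <- sum(ages)      # NA for the same reason
older_than_30 <- ages > 30     # comparisons are sticky too: the third entry is NA

# is.na() is the reliable way to find and count the missing entries
n_missing <- sum(is.na(ages))

print(mean_with_na)
print(older_than_30)
print(n_missing)
```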

For the summary statistic functions, the option na.rm=TRUE removes missing data before calculating:

mean(~age, data=titanic, na.rm=TRUE)

Summaries of subgroups

Suppose we want the mean age, but separately for survivors and non-survivors. We then need to split our data up by survival. This corresponds to asking how the variable age depends on the variable survived.

The "y-variables" are more properly called dependent variables or response variables, while the "x-variables" are independent variables or predictor variables. Since we are looking for how age depends on survived, age is our dependent or response variable, and survived is our predictor.
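As a sketch of the response ~ predictor pattern on made-up data, assuming the mosaic package is installed: the data frame toy and its columns height and group are hypothetical and are not part of the Titanic file.

```r
library(mosaic)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  height = c(150, 160, NA, 170, 180, 190),
  group  = c("a", "a", "a", "b", "b", "b")
)

# Response on the left of ~, predictor on the right:
# one mean of height per level of group, with missing values removed first
group_means <- mean(height ~ group, data = toy, na.rm = TRUE)
print(group_means)
```

The result is a named vector with one entry per level of the predictor.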

Task Calculate mean ages in the survival classes. Remember to use na.rm if you need to.

Different summary statistics

Your Mosaic cheat sheets have a good list of summary statistics functions. Of particular interest are:

Function      What it does
mean()        Mean
median()      Median
sd()          Standard deviation
var()         Variance
quantile()    Quantile; example: quantile(~age, probs=c(0.1,0.5,0.9), data=titanic, na.rm=TRUE)
IQR()         Interquartile range: 0.75-quantile minus 0.25-quantile
min(); max()  Minimum and maximum
range()       Minimum and maximum returned together
prop()        Proportions
perc()        Percentages (i.e. 100*prop)
count()       Counts (i.e. nrow(data)*prop)
tally()       Count each label in a categorical variable, or each combination of labels in several variables
fivenum()     Minimum, first quartile, median, third quartile, maximum
favstats()    fivenum and also mean, standard deviation, number of non-missing values, number of missing values; all labeled
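To give a feel for two of these functions, here is a small sketch on made-up data, assuming the mosaic package is installed. The data frame toy and its columns score and label are hypothetical; favstats() suits a quantitative variable and tally() a categorical one.

```r
library(mosaic)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  score = c(1, 2, 3, 4, NA),
  label = c("x", "y", "x", "x", "y")
)

# Quantitative variable: labeled summary, including the number of missing values
stats <- favstats(~score, data = toy)
print(stats)

# Categorical variable: count of each label
counts <- tally(~label, data = toy)
print(counts)
```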

Task Describe in detail both age and sex from the Titanic dataset. Calculate all relevant summary statistics for each. Also calculate all summary statistics of age as dependent on sex.

Missed opportunities in the first lab

We should have covered one more important feature of plotting, but it was missed in the lab instructions.

The format

plotcommand(y-variable ~ x-variable | split-variable, data=dataset)

shows how to connect data variables to the x-coordinate, to the y-coordinate, and to splitting one plot into several side-by-side plots. It is at least as common to use other features of the plot to convey data:

  • Transparency (alpha)
  • Size (size)
  • Shape (e.g. for the points in a scatter plot) (shape)
  • Color (for lines, or for outlines of filled in shapes) (color)
  • Color (for the filled in parts of a filled in shape) (fill)

These can be connected to variables by writing the variable as a one-sided formula, as in fill=~survived. If you connect a categorical variable, this splits the data into subgroups and handles them separately; you can get stacked or side-by-side bar charts and histograms this way. The choice between stacked and side-by-side is made with the position parameter to gf_bar or gf_histogram, which can take the values:

position=...  Function
"dodge"       Bars side by side, height is the count of instances
"stack"       Bars on top of each other, height is the count of instances
"fill"        Bars on top of each other, total height is 1.0, each section measures proportion within the label / bin

One example could look like this:

gf_bar(~pclass, data=titanic, fill=~survived)

Task Try each possible value of the position parameter in this example.

For this to work, the conversion steps in the first lab are important:

titanic$pclass = as.factor(titanic$pclass)
levels(titanic$pclass) = c("1st", "2nd", "3rd")
titanic$survived = as.logical(titanic$survived)
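The tilde in fill=~survived is what connects the fill color to a variable. Without the tilde, the same argument sets one fixed value for the whole plot. A minimal sketch of the difference, assuming the ggformula package is installed; the data frame toy, its columns class and kept, and the color "steelblue" are chosen here for illustration only.

```r
library(ggformula)

# Hypothetical toy data, invented for illustration
toy <- data.frame(
  class = factor(c("1st", "1st", "2nd", "2nd", "2nd", "3rd")),
  kept  = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Mapped: fill follows the variable kept, so each bar is split into subgroups
p_mapped <- gf_bar(~class, data = toy, fill = ~kept)

# Set: fill is a constant, so every bar gets the same color and no split happens
p_set <- gf_bar(~class, data = toy, fill = "steelblue")

print(p_mapped)
print(p_set)
```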

Task Make a stacked bar chart of passenger class, divided by sex.

Make side-by-side boxplots of age split both by survived. Use x-axis position for one split and color for the other one. What do you think the difference is in what the two options illustrate?

Addendum: bar charts

ggformula provides a wealth of bar chart functions:

Function      Plot
gf_bar        Bar heights are counts of instances, stacked position
gf_barh       Horizontal version
gf_counts     Bar heights are counts of instances, stacked position
gf_countsh    Horizontal version
gf_props      Bar heights are proportions out of all data, stacked position
gf_propsh     Horizontal version
gf_percents   Bar heights are percentages out of all data, stacked position
gf_percentsh  Horizontal version
gf_col        Bar heights are data values, stacked position
gf_colh       Horizontal version
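The difference between counting instances and plotting data values can be sketched as follows, assuming the ggformula package is installed. Both data frames and their numbers are hypothetical: raw has one row per instance, while summary_table already holds one computed value per bar.

```r
library(ggformula)

# Hypothetical raw data: one row per instance, gf_bar does the counting
raw <- data.frame(
  class = rep(c("1st", "2nd", "3rd"), times = c(3, 5, 8))
)
p_bar <- gf_bar(~class, data = raw)

# Hypothetical precomputed table: gf_col uses the values as bar heights
summary_table <- data.frame(
  class      = c("1st", "2nd", "3rd"),
  passengers = c(3, 5, 8)
)
p_col <- gf_col(passengers ~ class, data = summary_table)

print(p_bar)
print(p_col)
```

With these made-up numbers the two plots should show the same bar heights; gf_col needs a y-variable in the formula because it does no counting of its own.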