For this lab you should submit, on Blackboard, your `.Rmd` and `.docx` files at the end of the lab hour.
Remember to choose Word Document as output format when you create the RMarkdown file for the lab report, and to add the lines
library(tidyverse)
library(ggformula)
library(mosaic)
to the code block at the beginning. Then delete all content after this first code block.
Formulas everywhere
In the last lab we saw the format for making plots as
plotcommand(y-variable ~ x-variable | split-variable, data=dataset)
The same format will give you access to summary statistics (and later on also to various types of modeling).
We will consider the Titanic dataset: load it with
titanic = read.delim("ips9e/Chapter 1/EX01-027TITANIC.txt")
titanic$pclass = as.factor(titanic$pclass)
titanic$survived = as.logical(titanic$survived)
Now, to calculate the mean age, you would use `age` as your x-variable, and not use any y-variable -- whenever only one variable is in play, you either use it as the sole x-variable, or as the y-variable with x set to `1`:
mean(~age, data=titanic)
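As a quick check of what the one-sided formula does, here is a minimal sketch. It assumes R's built-in `ToothGrowth` data rather than the lab's Titanic file, and the names `m_formula` and `m_base` are made up for the illustration.

```r
library(mosaic)

# ToothGrowth ships with R; len is a numeric column (tooth length).
# One-sided formula: the single variable goes on the x side, after the ~.
m_formula = mean(~len, data=ToothGrowth)

# The same mean computed the base R way, directly on the column.
m_base = mean(ToothGrowth$len)

m_formula
m_base
```

Both calls should print the same number; the formula version simply looks the variable up inside the dataset for you.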
Task Calculate the mean age.
Missing data
Did you notice anything disappointing when calculating the mean age?
If there are missing observations, they could potentially change anything you compute from the data: if a missing age were actually 100, it would change the mean by a fair bit. R deals with this by making `NA` sticky: (almost) anything that gets `NA` as input will return `NA`.
For the summary statistic functions, the option `na.rm=TRUE` will remove missing data before calculating:
mean(~age, data=titanic, na.rm=TRUE)
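To see the sticky behaviour on data you can inspect yourself, here is a small sketch. It assumes the built-in `airquality` dataset (whose `Ozone` column has missing values) instead of the Titanic data, and the names `with_na` and `without_na` are made up for the illustration.

```r
library(mosaic)

# NA is sticky: arithmetic and summaries that touch NA return NA.
NA + 1
mean(c(1, 2, NA))

# airquality ships with R; its Ozone column has missing values.
with_na = mean(~Ozone, data=airquality)
without_na = mean(~Ozone, data=airquality, na.rm=TRUE)

with_na      # NA, because some days have no Ozone measurement
without_na   # a number, computed from the non-missing days only
```

Keep in mind that `na.rm=TRUE` does not fill in the missing values; it only describes the observations you actually have.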
Summaries of subgroups
Suppose we want the mean age, but by survival. We want to split our data up. This would correspond to seeing how `age` depends on `survived` as variables.
The "y-variables" are more properly called dependent variables or response variables, while the "x-variables" are independent variables or predictor variables.
Since we are looking for how `age` depends on `survived`, `age` is our dependent or response variable, and `survived` is our predictor.
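The response-versus-predictor pattern can be sketched on a different dataset. This example assumes the built-in `ToothGrowth` data, with `len` (tooth length) as response and `supp` (supplement type) as predictor; the name `by_supp` is made up for the illustration.

```r
library(mosaic)

# Response on the left of ~, predictor on the right:
# mean tooth length for each supplement type.
by_supp = mean(len ~ supp, data=ToothGrowth)
by_supp

# favstats accepts the same formula and gives a fuller table per group.
favstats(len ~ supp, data=ToothGrowth)
```

The result has one value per label of the predictor, which is exactly the split described above.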
Task Calculate mean ages in the survival classes. Remember to use `na.rm` if you need to.
Different summary statistics
Your Mosaic cheat sheets have a good list of summary statistics functions. Of particular interest are:
Function | What it does |
---|---|
`mean()` | Mean |
`median()` | Median |
`sd()` | Standard deviation |
`var()` | Variance |
`quantile()` | Quantile: example `quantile(~age, probs=c(0.1,0.5,0.9), data=titanic, na.rm=TRUE)` |
`IQR()` | Interquartile range: 0.75-quantile minus 0.25-quantile. |
`min()`; `max()` | Minimum and maximum |
`range()` | Minimum and maximum returned together |
`prop()` | Proportions |
`perc()` | Percentages (i.e. `100*prop`) |
`count()` | Counts (i.e. `nrow(data)*prop`) |
`tally()` | Count each label in a categorical variable, or combination of labels in several variables |
`fivenum()` | Minimum, first quartile, median, third quartile, maximum |
`favstats()` | `fivenum` and also mean, standard deviation, number of non-missing values, number of missing values -- all labeled |
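A few of these functions in action, as a sketch only. It assumes the built-in `ToothGrowth` data, with `len` as a numerical variable and `supp` as a categorical one; the names `fs`, `counts` and `p_vc` are made up for the illustration.

```r
library(mosaic)

# Numerical variable: everything at once, labeled.
fs = favstats(~len, data=ToothGrowth)
fs

# Selected quantiles of the same variable.
quantile(~len, probs=c(0.1, 0.5, 0.9), data=ToothGrowth)

# Categorical variable: counts per label ...
counts = tally(~supp, data=ToothGrowth)
counts

# ... and the proportion of rows where a condition holds.
p_vc = prop(~supp == "VC", data=ToothGrowth)
p_vc
```

Note the division of labour: means and quantiles only make sense for numerical variables, while `tally()` and `prop()` are the natural descriptions of a categorical one.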
Task Describe in detail both `age` and `sex` from the Titanic dataset. Calculate all relevant summary statistics for each. Also calculate all summary statistics of `age` as dependent on `sex`.
Missed opportunities in the first lab
We should have covered one more important feature of plotting, but it was missed in the lab instructions.
The format
plotcommand(y-variable ~ x-variable | split-variable, data=dataset)
shows how to connect data variables to x-coordinate, to y-coordinate and to splitting one plot into several side-by-side plots. At least as common, if not more common, is to use other features of the plot to convey data:
- Transparency (`alpha`)
- Size (`size`)
- Shape (e.g. for the points in a scatter plot) (`shape`)
- Color (for lines, or for outlines of filled-in shapes) (`color`)
- Color (for the filled-in parts of a filled-in shape) (`fill`)
These can be connected to variables using `=~`, as in `fill=~survived`. If you connect a categorical variable, this splits the data into subgroups and handles them separately -- you can get stacked or side-by-side bar charts and histograms this way. Choosing between stacked or side-by-side is done by the `position` parameter to `gf_bar` or `gf_histogram`, which can take the values:
`position=...` | Function |
---|---|
`"dodge"` | Bars side by side, height is the count of instances |
`"stack"` | Bars on top of each other, height is the count of instances |
`"fill"` | Bars on top of each other, total height is 1.0, each section measures proportion within the label / bin. |
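The difference between connecting a feature to a variable (`=~`) and setting it to a fixed value (`=`) can be sketched as follows. This assumes the built-in `ToothGrowth` data, and the names `tg`, `p_dodge` and `p_const` are made up for the illustration.

```r
library(ggformula)

# Copy ToothGrowth and make dose categorical so it can label the bars.
tg = ToothGrowth
tg$dose = as.factor(tg$dose)

# Mapping with =~ : fill follows the variable supp, so each dose
# is split into one bar per supplement type, placed side by side.
p_dodge = gf_bar(~dose, data=tg, fill=~supp, position="dodge")

# Setting with = : every bar gets the same constant color, no split.
p_const = gf_bar(~dose, data=tg, fill="steelblue")

p_dodge
p_const
```

With three doses and two supplement types, the first plot should show six bars and the second only three.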
One example could look like this:
gf_bar(~pclass, data=titanic, fill=~survived)
Task Try adding the `position` parameter to this example, with each of its possible values in turn.
For this to work, the conversion steps in the first lab are important:
titanic$pclass = as.factor(titanic$pclass)
levels(titanic$pclass) = c("1st", "2nd", "3rd")
titanic$survived = as.logical(titanic$survived)
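The same three conversion steps can be tried on data that is already in R. This is a sketch only: it assumes the built-in `mtcars` data, where `cyl` and `am` are stored as plain numbers, and the copy `cars` and the labels are made up for the illustration.

```r
# mtcars ships with R; cyl (4, 6, 8) and am (0, 1) are numeric codes.
cars = mtcars
is.numeric(cars$cyl)

# Number codes to a categorical variable ...
cars$cyl = as.factor(cars$cyl)
# ... with readable labels, given in the sorted order of the old codes.
levels(cars$cyl) = c("four", "six", "eight")

# 0/1 codes to FALSE/TRUE.
cars$am = as.logical(cars$am)

str(cars[, c("cyl", "am")])
```

Without these steps the plotting functions would see ordinary numbers and would not treat the values as separate groups.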
Task
Make a stacked bar chart of passenger class, as divided by `sex`.
Make side-by-side boxplots of `age` split both by `survived` and by a second categorical variable. Use x-axis position for one split and color for the other one.
What do you think the difference is in what the two options illustrate?
Addendum: bar charts
`ggformula` provides a wealth of bar chart functions:
Function | Plot |
---|---|
`gf_bar` | Bar heights are counts of instances, stacked position |
`gf_barh` | Horizontal version |
`gf_counts` | Bar heights are counts of instances, stacked position |
`gf_countsh` | Horizontal version |
`gf_props` | Bar heights are proportions out of all data, stacked position |
`gf_propsh` | Horizontal version |
`gf_percents` | Bar heights are percentages out of all data, stacked position |
`gf_percentsh` | Horizontal version |
`gf_col` | Bar heights are data values, stacked position |
`gf_colh` | Horizontal version |
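The difference between counting functions and `gf_col` can be sketched like this. The example assumes the built-in `ToothGrowth` data, and the names `supp_counts`, `p_bar`, `p_col` and `p_props` are made up for the illustration.

```r
library(ggformula)

# gf_bar counts the rows for you: one bar per label of supp.
p_bar = gf_bar(~supp, data=ToothGrowth)

# gf_col expects the heights to be in the data already,
# so first build a small table of counts (columns supp and Freq).
supp_counts = as.data.frame(table(supp=ToothGrowth$supp))
p_col = gf_col(Freq ~ supp, data=supp_counts)

# gf_props rescales the counts to proportions of all rows.
p_props = gf_props(~supp, data=ToothGrowth)

p_bar
p_col
p_props
```

The first two plots should look the same; use `gf_col` when someone hands you a table of already computed values, and the counting functions when you have the raw observations.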