For this lab, submit your .Rmd and .docx files on Blackboard by the end of the lab hour.

The dataset mpg (shipped with ggplot2) describes car models through a range of features, including engine volume, cylinder count, drive type, and mileage for city and highway driving.

Two-way and N-way tables

Commands to construct and modify two-way tables were given in the slides from the last lecture. Use them to complete the following tasks.

Task Create a 3-way table describing drv, fl and class.

Task Create a margin table that retains drv and class.

Task Create a proportional table with conditional proportions of drv conditioned on class.
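
As a reminder of the three operations involved, here is a minimal sketch on a different set of variables (cyl, drv and year). It assumes the base R functions table, margin.table and prop.table; the lecture slides may use other commands (for example tally from the mosaic package, which appears later in this lab), and the object names are made up for illustration.

```r
library(ggplot2)  # assumed here only as the source of the mpg dataset

# Three-way contingency table: cylinder count x drive type x model year
tab3 <- table(cyl = mpg$cyl, drv = mpg$drv, year = mpg$year)

# Margin table: sum over year, retaining cyl (dimension 1) and drv (dimension 2)
tab2 <- margin.table(tab3, margin = c(1, 2))

# Conditional proportions of cyl given drv: every drv column sums to 1
cyl_given_drv <- prop.table(tab2, margin = 2)

ftable(tab3)             # flat printout of the three-way table
round(cyl_given_drv, 2)  # proportions rounded for reading
```

The margin argument decides what is kept (in margin.table) or conditioned on (in prop.table), so changing it from 2 to 1 would give proportions of drv within each cyl instead.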

Plotting gallery

We will explore a selection of plot types. For these plots we use a dataset that provides paired data of all three kinds: numeric-numeric, numeric-categorical, and categorical-categorical.

Simple plotting interface: ggformula

We have been using ggformula for plotting. The basic structure of a ggformula command is

command(response ~ predictor | splitter, data=dataset)

Additional features can be connected to data by using color = ~ variable, fill = ~ variable, size = ~ variable, shape = ~ variable, ...
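
To make the template concrete, here is one possible instance. The choice of variables is only an illustration: hwy as response, cty as predictor, drv as splitter, and class connected to color.

```r
library(ggplot2)    # supplies the mpg dataset
library(ggformula)

# response ~ predictor | splitter:
# highway vs. city mileage, drawn in one panel per drive type.
# color = ~ class connects the class variable to point color.
p <- gf_point(hwy ~ cty | drv, data = mpg, color = ~ class)
p
```

Writing color = "blue" without the tilde would instead set every point to a fixed color; the tilde is what marks class as a variable in the dataset.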

Fine-tuning details: ggplot2

The library ggformula is built on top of the plotting library ggplot2. Where ggformula uses %>% to layer plots on top of each other, ggplot2 uses +. Because of this shared foundation, any ggformula plot can be tweaked using ggplot2 commands.
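
A small sketch of how the two styles combine. The variables, the axis labels and the theme are arbitrary choices made for illustration, and gf_lm is just one example of a layer that can be chained on.

```r
library(ggplot2)
library(ggformula)

# ggformula layers are chained with %>%
p <- gf_point(hwy ~ cty, data = mpg) %>%
  gf_lm()

# The result is a ggplot2 object, so ggplot2 components attach with +
p2 <- p +
  labs(x = "City mileage", y = "Highway mileage") +
  theme_minimal()
p2
```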

To start a ggplot2 plot, we use the ggplot command. Both ggplot and the commands added later accept the arguments data, which provides a dataset, and mapping, which provides an aesthetic mapping. Most often you will want to give both the dataset and the aesthetic mapping in the ggplot command itself, so that they are already set for every component you subsequently add to the plot.

Task Try it out by running:

ggplot(mpg)

Describe the result of this command.

Aesthetic Mappings

ggplot2 builds fundamentally on connecting aspects of data to aspects describing a plot. The way to connect data to aspects of the plot is through aesthetic mappings. These are produced with the command aes, whose arguments name plot properties and assign a variable to each. The most commonly used properties include x, y, color, fill, shape, and size.
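
An aesthetic mapping can also be built and inspected on its own, which may help to see that it is only a description and not yet a plot. The variables below (displ, hwy, drv) are picked for illustration.

```r
library(ggplot2)

# aes() only records which variable goes with which plot property;
# no data is looked up and nothing is drawn at this point
m <- aes(x = displ, y = hwy, color = drv)

names(m)  # the mapped properties; ggplot2 stores color under the spelling "colour"
m         # printing shows each property next to its variable
```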

Task Let's add some aesthetic mappings too, by running

ggplot(mpg, aes(x=cty, y=hwy, color=class, shape=drv))

Describe the result of this command.

Adding geometry

The ggplot command, with or without dataset and aesthetic mapping, will not actually draw anything. To put shapes on the plot, we need geometries. All geometry commands start with geom_, and take different aesthetic mappings depending on which geometry you are using.
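
A minimal sketch of adding a geometry, using variables other than those in the tasks. The names base and p are made up for illustration.

```r
library(ggplot2)

# Dataset and mapping set in ggplot() are inherited by every layer added later
base <- ggplot(mpg, aes(x = displ, y = hwy))

# geom_point() adds one point per row of the dataset;
# a mapping given inside the geometry applies to that layer only
p <- base + geom_point(aes(color = drv))
p
```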

You can find out which aesthetics a geometry accepts by looking at the help file for the geometry in question.
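
Besides the help file, the same information can be queried from the console. The sketch below relies on GeomPoint, the object that ggplot2 exports behind geom_point; treat this as a convenience and the help page as the documented route.

```r
library(ggplot2)

# In an interactive session, ?geom_point opens the help page;
# its "Aesthetics" section lists what the geometry understands.

GeomPoint$required_aes  # aesthetics that must be supplied
GeomPoint$aesthetics()  # all aesthetics the geometry understands
```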

Task Let's produce a scatter plot. Use the ggplot command composed in the previous task, and add geom_point(). Describe the resulting plot, and how each of the aesthetic mappings has influenced the plot itself.

Some interesting geometries to use include:

ggformula command | ggplot2 command | Effect
Single variable plots
gf_bar | geom_bar | Bar chart (will count entries)
gf_col | geom_col | Bar chart (will use provided values)
gf_boxplot | geom_boxplot | Box plot
gf_density | geom_density | Density estimate (smooth histogram)
gf_dotplot | geom_dotplot | Dot plot
gf_freqpoly | geom_freqpoly | Frequency curve
gf_histogram | geom_histogram | Histogram
gf_qq | geom_qq | Quantile plot
gf_qqline | geom_qq_line | Quantile plot guide line
gf_rug, gf_rugx, gf_rugy | geom_rug | Rug plot (markers at the bottom for each data point; combine with histogram)
gf_violin | geom_violin | Violin plot (boxplot with full density distribution graph)
Two variable plots
gf_point | geom_point | Scatter plot
gf_count | geom_count | Scatter plot with points scaled by co-occurring values
gf_jitter | geom_jitter | Scatter plot with randomly displaced points
gf_bin2d | geom_bin2d | Heatmap (square bins)
gf_hex | geom_hex | Heatmap (hexagonal bins)
gf_density_2d | geom_density_2d | 2d density estimate (smooth heatmap)
gf_line | geom_line | Line plot
gf_smooth | geom_smooth | Smoothed curve fitted to scatterplot
Multiple variable plots
gf_contour | geom_contour | Contour plot of 3d surface
gf_errorbar | geom_errorbar | Error bar plot
gf_crossbar | geom_crossbar | Error bar plot
gf_linerange | geom_linerange | Error bar plot
gf_pointrange | geom_pointrange | Error bar plot
gf_raster | geom_raster | Pixel grid
gf_tile | geom_tile | Rectangular grid
Utility plots
gf_abline | geom_abline | Straight line
gf_hline | geom_hline | Horizontal line
gf_vline | geom_vline | Vertical line
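
To show how a ggformula command and its ggplot2 counterpart line up, here is the same histogram written both ways. The variable hwy and the value bins = 20 are arbitrary choices for illustration.

```r
library(ggplot2)
library(ggformula)

# ggformula: a one-sided formula, since a histogram involves only one variable
p_gf <- gf_histogram(~ hwy, data = mpg, bins = 20)

# ggplot2: the same variable mapped to x, with the geometry added using +
p_gg <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(bins = 20)

p_gf
p_gg
```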

Adapting the plot: scales

Color schemes, scale adaptations and other transformations can be done using the scale_ commands, and the coordinate system can be changed using the coord_ commands. Some of the most useful include

Command | Effect
scale_x_log10 | X-axis log scale
scale_y_log10 | Y-axis log scale
scale_x_sqrt | X-axis square root transform
scale_y_sqrt | Y-axis square root transform
scale_color_viridis_c | Viridis color scheme for color (numeric data)
scale_color_viridis_d | Viridis color scheme for color (categorical data)
scale_fill_viridis_c | Viridis color scheme for fill (numeric data)
scale_fill_viridis_d | Viridis color scheme for fill (categorical data)
coord_polar | Polar coordinates
coord_flip | Swap x and y axes
coord_equal | Fix aspect ratio (circles are round...)
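
These commands are added to a plot with + like any other component. A minimal sketch, with variables chosen only for illustration:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_x_log10() +          # log scale on the x-axis
  scale_color_viridis_c()    # Viridis color scheme for the numeric variable cty
p
```

The _c and _d variants are not interchangeable: color here follows the numeric variable cty, so the continuous version is the one that fits, while a categorical variable such as drv would call for scale_color_viridis_d.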

Tasks

Visualize two variables

Task Produce two different plots that visualize the relationship between the cty and hwy variables in the dataset mpg.

Task Produce two different plots that visualize the distribution of cty as split into subpopulations by drv.

To visualize joint distributions of categorical variables, two common methods are dodged bar charts and colored grids. The colored grid version could look something like this:

ggplot(tally(cyl~drv, data=mpg) %>% as.data.frame(), aes(x=cyl, y=drv, fill=Freq)) +
  geom_raster()
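
For comparison, the dodged bar chart version of the same two variables could be sketched as follows. This uses plain ggplot2 and lets geom_bar do the counting; the factor() call and the axis label are choices made for this illustration.

```r
library(ggplot2)

# cyl is stored as a number, so factor() turns it into a categorical x-axis;
# position = "dodge" places the bars for each drv value side by side
p <- ggplot(mpg, aes(x = factor(cyl), fill = drv)) +
  geom_bar(position = "dodge") +
  labs(x = "cyl")
p
```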

Task Produce two different plots that visualize the joint distribution of drv and class.

Visualize one variable

Task Produce two different plots that visualize the distribution of cty.

Task Produce a plot that visualizes the distribution of drv.

Modifying plots

Task Modify at least one of your plots to use Viridis in its continuous version.

Task Modify at least one of your plots to use Viridis in its discrete version.