Lab 3

Getting Started

A few quick reminders:

  • The console lets you try commands and see the results right away, but nothing you do there is saved.

  • To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab3.R or something like that. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps
    
    # Exercise 2

  • To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose HTML.

  • If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.

  • To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.

  • You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, type a question mark followed by the name of the command, like ?mean.

  • If you need to look things up from the first two labs, they’re available at https://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/lab1/index.html and https://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/lab2/index.html

  • At the end of class, submit your R file on Blackboard.

Today’s Lab

We’ll be using the same dataset from last time, available at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/bdims.csv.

Exercise 1. Download this dataset and put it into an object called bdims, as we did last time.

ggplot

We will use a library called ggplot2 to make plots.

Exercise 2. Load the ggplot2 library with the command library("ggplot2"). Try to compile your R file. If it gives you an error message about not finding ggplot2, type install.packages("ggplot2") in the console, and then try compiling your R file again. (Note: this is one of the very rare times that doing something in the console will have an effect on your R file.)
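If you'd like to check from the console whether ggplot2 is already installed before you compile, one possible way is sketched below. It uses base R's requireNamespace(), which isn't part of the lab's instructions, and the variable name has_ggplot2 is made up for this example.

```r
# Check whether the ggplot2 package is installed, without attaching it.
# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise.
has_ggplot2 <- requireNamespace("ggplot2", quietly = TRUE)

if (has_ggplot2) {
  message("ggplot2 is installed; library(\"ggplot2\") should work.")
} else {
  message("ggplot2 is missing; run install.packages(\"ggplot2\") in the console.")
}
```

This only reports what it finds. It doesn't install anything, so it's safe to run as many times as you like.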

To make plots, use the ggplot command. This can get very complicated, but the starting point is a command that looks like this, except with … replaced by something else:

ggplot(..., aes(...)) + ...

In place of the first … goes the data frame (for example, bdims). The second … contains the aesthetic mapping, which consists of instructions saying which variables in the dataset should be drawn in which ways. The final … contains instructions for the geometry, i.e., what sort of plot to make (e.g., a Q-Q plot, a box plot, a bar plot, etc.).
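To see how the three pieces fit together, here is a small sketch that uses mtcars, a data frame built into R, instead of bdims. The dataset, the variables (wt and mpg), and the geom_point() geometry are assumptions chosen only for illustration; scatterplots aren't part of this lab.

```r
library("ggplot2")

# Data frame: mtcars (built into R)
# Aesthetic mapping: car weight on the x-axis, miles per gallon on the y-axis
# Geometry: geom_point() draws one point per row, i.e., a scatterplot
p <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
print(p)
```

Swapping out any one of the three pieces (the data frame, the mapping, or the geometry) gives a different plot, which is the pattern the rest of this lab follows.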

Histograms

For a histogram, the aesthetic mapping is aes(x=name.of.variable), with name.of.variable replaced by the actual name of the variable you want to make a histogram of. This says that this variable should be depicted on the x-axis. The geometry should be specified as geom_histogram(). All together, the command should be

ggplot(name.of.dataframe, aes(x=name.of.variable)) + geom_histogram()
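As a concrete instance of this template, the sketch below draws a histogram from faithful, a data frame built into R, using its waiting column. Both are stand-ins chosen for illustration, and p_hist is a made-up name; for the exercises you'd substitute bdims and the variable you're interested in.

```r
library("ggplot2")

# faithful is a data frame built into R; waiting is one of its columns.
# aes(x = waiting) puts that variable on the x-axis,
# and geom_histogram() bins it and draws the bars.
p_hist <- ggplot(faithful, aes(x = waiting)) + geom_histogram()
print(p_hist)
```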

Exercise 3. Make a histogram showing the distribution of the heights of the people in the dataset. The name of your dataframe is bdims (assuming that you named it that in Exercise 1). If you need to look up the name of the variable, here is the documentation for the dataset.

If you did everything correctly, the plot command will draw your histogram and also give you a message that looks like this:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This says that the histogram was drawn with 30 different bins (i.e., 30 bars), but that you should choose a bin size that suits your data better. You can do this by changing the geom_histogram() command either to say geom_histogram(bins=...) or geom_histogram(binwidth=...), with … replaced by numbers. The first specifies the total number of bins in the histogram, while the second specifies the width of each individual bin.
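The two options can be compared side by side in a sketch like the following. It again uses the built-in faithful data frame rather than bdims, and the values 15 and 5 are arbitrary numbers picked for illustration, not recommendations for your own plots.

```r
library("ggplot2")

# Option 1: fix the total number of bins.
p_bins <- ggplot(faithful, aes(x = waiting)) + geom_histogram(bins = 15)

# Option 2: fix the width of each bin, in the units of the variable
# (waiting is measured in minutes, so each bar here spans 5 minutes).
p_width <- ggplot(faithful, aes(x = waiting)) + geom_histogram(binwidth = 5)

print(p_bins)
print(p_width)
```

Specifying either bins or binwidth also makes the stat_bin() message go away, since ggplot no longer has to fall back on its default.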

Exercise 4. Make another histogram with the same data and a more suitable number of bins. It’s up to you to choose a good number of bins, and there’s no single correct answer. Remember that you want to have enough bins that you can understand the distribution, but not so many bins that the picture is dominated by noise.

Exercise 5. Make histograms of the age and bicep girth variables too, choosing a suitable bin size for each plot. In a comment, describe the distributions of the height, age, and bicep girth variables. (Some good words to use: unimodal/bimodal, symmetric/left-skewed/right-skewed.)

Q-Q Plots

Recall that a Q-Q plot is used to judge whether data fits a normal distribution. The closer the points on a Q-Q plot are to a straight line, the closer the data is to the normal distribution.

To make a Q-Q plot, the basic command is

ggplot(..., aes(sample=...)) + stat_qq()

The first … gives the data frame as usual, and the second … should give the variable you’d like to make a Q-Q plot for.
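Here is a sketch of the Q-Q plot command filled in, once more using the built-in faithful data frame as a stand-in for bdims. The stat_qq_line() layer is an optional extra that the basic command above doesn't include; it adds a reference line, which can make it easier to judge how straight the points are.

```r
library("ggplot2")

# aes(sample = waiting) tells ggplot which variable to compare to the normal distribution.
# stat_qq() plots the sample values against theoretical normal quantiles.
# stat_qq_line() (optional) adds a reference line through the first and third quartiles.
p_qq <- ggplot(faithful, aes(sample = waiting)) + stat_qq() + stat_qq_line()
print(p_qq)
```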

Exercise 6. Make Q-Q plots for the height, age, and bicep girth variables. In a comment, say if you think they are good fits to a normal distribution.

Exercise 7. Go through the same procedure for any other variables you’re interested in. Which variables can you find that fit the normal distribution the most closely, and which can you find that fit it the least?