First in any script

First in any script you write, load the libraries we use for easier data handling and plotting.

library(tidyverse)
library(ggformula)

Other sources

The RStudio cheat sheets are an excellent source of summaries of everything important.

For our course, special interest should be paid to the cheat sheets for

R Markdown

R Markdown lets you weave in computations and graphs with the text you write: if you use other systems you will need to save out figures to import them, and copy-and-paste the values you compute. Staying in R Markdown means everything matches up with the data by default.

Text

The basic content of an RMarkdown document is text. Anything you don't specifically mark as something different is taken to be plain and unadorned text.

You break your text up into paragraphs by using two newlines after one another.

Headings

To get a heading, such as you would use to start a chapter or a section, use the # symbol at the front of the line: once for the largest heading, more times for smaller headings:

# First level heading

## Second level heading

### Third level heading

Alternatively, you can underline the heading text using = for a first level heading or - for a second level heading:

Another heading
----------------------

Yet another heading
===============

Emphasis: italic and bold

To emphasize parts of the text you can surround the text you emphasize by * characters. Once for italic and twice for bold:

*This is italic* and **this is bold**

Code blocks

Using the backtick ` character (top left on the keyboard, not the ' character to the right!) you can mark text to be code and not normal text. Code is formatted using a fixed-width font and keeps all the spaces where they were, leaving the internal structure of the code where it is.

You can put code inline in a sequence of text by surrounding it with ` characters, so you can mention commands like sd while writing text about them.

Alternatively, if you start with a line with a sequence of three backticks ```, this starts a separated code block. This block continues until you end it with another line with three backticks ```.

Both the inline and the separated code blocks can be marked as R code and by doing that let RStudio know you want it to run the code in the code block and insert the results when done knitting. You mark an inline code block by adding the letter r after the backtick, producing something like

This inserts a random number in a sequence of text `r rnorm(1)`.

Similarly, you use the letter r enclosed in curly braces at the beginning of a separated code block:


```{r}
plot(x,y)
abline(lm(y~x))
```

Lists

You can create bulleted lists by, after two newlines, starting each line with a * or a - followed by a space, so that:

* This is a
* bulleted list

produces

This is a
bulleted list

and

- this is another
- bulleted list

creates

this is another
bulleted list

By adding spaces in front, you can make sublists:

* outer list
  - inner list
  - inner list
* outer list again

produces

outer list
inner list
inner list
outer list again

You can create numbered lists using numbers followed by . instead of symbols. R Markdown doesn't care which numbers you use after the first one: it numbers the items consecutively, so a list that starts with 1 goes from 1 to however many items you list.

1. This
3. is
2. a list

creates

This
is
a list

R commands

Important note: Everything in R cares about UPPER/lower case. If you get an error, check first that you used the right case.

library: Load extra functionality.

library(ggplot2)

install.packages: Install packages so that library can load them. This only needs to be done once for each package.

install.packages("ggplot2")

#: Commented code. You can add explaining text inside a code block or in a script file by using #. All the text that follows a # character is ignored by R.

$: Column access. To use a column in a dataset, instead of the entire dataset, you use $ to pull out the column:

data$variable

=: Assigning names. By writing something like name = value you put a name to a thing (dataset, number, string, ...). This saves it for further use, and lets you refer to this thing without recomputing it every time.

upper.95.rule = mean(data$variable) + 2*sd(data$variable)

By assigning a value to an existing variable in a dataset, you change the data in that variable. By assigning to a variable that is not contained in a dataset, you create a new variable and fill it with that data. In either of these cases, your new value needs to have as many values as there are observations in the dataset.

c combines any elements together into a vector: a sequence of things of the same kind. c(2,5) contains the numbers 2 and 5, c("ggplot2", "knitr") contains the two strings "ggplot2" and "knitr".

: creates a sequence of values: 4:8 is the same as c(4,5,6,7,8).

data.frame creates a dataset. Its parameter names are the variable names you want, and its parameter values are the vectors of values for each variable. All these vectors have to be of equal length. Example: data = data.frame(V1=c(3,4,5,6), other.variable=5:8).
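As a small sketch of the commands above, the following builds a dataset from two equally long vectors, adds a variable by assignment, and then overwrites an existing one. The names example.data, height, weight and ratio are made up for illustration.

```r
# Two vectors of equal length become the variables of a dataset.
example.data = data.frame(height = c(150, 160, 170, 180),
                          weight = 61:64)

# Assigning to a variable that does not exist yet creates it;
# the new value has one entry per observation.
example.data$ratio = example.data$weight / example.data$height

# Assigning to an existing variable replaces its values
# (here: centimeters converted to meters).
example.data$height = example.data$height / 100

names(example.data)   # the dataset now has three variables
```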

[ and ]: data access. If you have a vector, something like c(1,2,5,7) or 5:10, using name.of.the.vector[5] returns the 5th element in the vector. If you have a dataset, you need to include a comma (,) between the [ and the ]. Anything before the , picks out the observations you want to see, anything after the , picks out variables.

my.vector[5] # the 5th element
my.vector[c(1,3,6)] # the 1st, 3rd and 6th elements
my.vector[2:4] # the 2nd, 3rd and 4th elements
my.data[1:5,] # the first 5 observations of a dataset
my.data[,3] # the 3rd variable of a dataset
my.data[,c("V1","V4")] # the variables named V1 and V4 in a dataset
my.data[5:10, c("V3", "V5")] # the 5th through 10th observations of the variables named V3 and V5 (names and positions cannot be mixed in one c())
my.data[my.data$V1 < 3,] # all the data such that the value of my.data$V1 is less than 3
my.data[my.data$CategoricalVariable == "value",] # all data where CategoricalVariable has the label "value"
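To try the indexing patterns on something concrete, here is a minimal sketch with a made-up dataset. The variable names V1 and Group and the result names small and group.b are assumptions for the example only.

```r
# A made-up dataset with one numeric and one categorical variable.
my.data = data.frame(V1 = c(1, 5, 2, 8, 3, 9),
                     Group = c("a", "b", "a", "b", "a", "b"))

my.data$V1[2:4]        # 2nd, 3rd and 4th values of V1
my.data[1:3, ]         # first 3 observations, all variables
my.data[, "Group"]     # the variable named Group

# A condition before the comma keeps the observations where it is TRUE.
small = my.data[my.data$V1 < 3, ]
group.b = my.data[my.data$Group == "b", ]
```

Note the comma in the last two lines: leaving the part after it empty means "all variables".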

sample: randomly selecting from a vector or from a single variable in a dataset. Takes as first argument the thing you want to sample from, as second the number of samples you want to see. If you sample as many values as your data contains, this shuffles your data into random order. To sample whole observations from a dataset, sample row numbers and use them to pick out the observations.

sample(data$variable, 500) # draw 500 values from a variable in random order
data[sample(nrow(data), 500),] # draw 500 observations from a dataset in random order
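A short sketch of sampling with base R, on made-up data. It assumes you want to draw without repeats (the default) and uses set.seed only to make the example repeatable. To sample whole observations, it draws row numbers and indexes the dataset with them.

```r
set.seed(1)  # fixes the random draws so the example is repeatable

values = 1:20
sample(values, 5)                          # 5 values, without repeats
shuffled = sample(values, length(values))  # all values in random order

# Sampling whole observations: draw row numbers, then index with them.
my.data = data.frame(id = 1:20, x = rnorm(20))
picked = my.data[sample(nrow(my.data), 5), ]
```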

nrow counts the number of observations in a dataset. dim gives you the numbers of observations and variables at the same time.

number.of.observations = nrow(data)
number.of.observations = dim(data)[1]
number.of.variables = dim(data)[2]

Reading data

read.csv, read.delim, read.table: Read data from a file. Use csv if comma-separated, delim if tab-separated, and otherwise use read.table and give it an explicit separator.

my.data = read.csv("datafile.csv")
my.data = read.delim("datafile.txt")
my.data = read.table("datafile.dat", sep=";")

read.csv and read.delim assume your data file has a header: a first row with the names of the variables. If the data starts directly in the first row of the file, add the parameter header=FALSE. R will then name your variables V1, V2, …. Notice that these are uppercase V, not lowercase v. read.table assumes the opposite: it expects no header unless you add header=TRUE.

my.data = read.csv("datafile.csv", header=FALSE)
my.data = read.delim("datafile.txt", header=FALSE)
my.data = read.table("datafile.dat", sep=";", header=FALSE)

If your dataset is very large, you can restrict to loading parts of it using the nrows parameter:

my.data = read.csv("datafile.csv", nrows=1500)
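The effect of header=FALSE and nrows is easiest to see on a file whose contents you know. The sketch below writes a tiny headerless file to a temporary location first; the names example.file, all.rows and first.rows are made up.

```r
# Write a small comma-separated file without a header row,
# so that there is something to read back in.
example.file = tempfile(fileext = ".csv")
writeLines(c("1,a", "2,b", "3,c", "4,d"), example.file)

# Without header=FALSE the first row would be used up as variable names.
all.rows = read.csv(example.file, header = FALSE)
names(all.rows)      # V1 and V2, assigned by R

# nrows limits how many rows of data are read.
first.rows = read.csv(example.file, header = FALSE, nrows = 2)
nrow(first.rows)
```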

Numeric statistics

mean, median, sd, var, quantile, IQR: Compute numeric statistics from data. If you have missing data, these will return NA. You can use the parameter na.rm=TRUE to compute only on the data that is not missing. For the quantile function you have to give which quantile you need. Use 0.5 for median, 0.25, 0.75 for the two quartiles, and whatever fraction is appropriate otherwise.

mean(data$variable)
median(data$variable, na.rm=TRUE)
quantile(data$variable, 0.33)
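To see what na.rm=TRUE changes, here is a minimal sketch on a made-up vector with one missing value.

```r
values = c(2, 4, 4, 5, 7, 9, NA)

mean(values)                     # NA, because one value is missing
mean(values, na.rm = TRUE)       # mean of the six values that are present
median(values, na.rm = TRUE)     # 4.5
quantile(values, c(0.25, 0.75), na.rm = TRUE)  # the two quartiles
IQR(values, na.rm = TRUE)        # distance between the two quartiles
```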

cor: Compute correlation for two variables. These need to be equal length. You can handle missing data using the parameter use="complete.obs":

cor(data$V1, data$V2)
cor(data$V1, data$V3, use="complete.obs")

Relationships between variables

lm: Compute a linear regression for two (or more) variables. Takes a formula written with the ~ symbol and names of variables, as well as the dataset itself. You will usually want to assign the result to give it a name so you can access details later on. You can access coefficients and residuals with their corresponding commands.

model = lm(V1 ~ V3, data)
coefficients(model)
residuals(model)
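As an illustration, the sketch below fits a regression to simulated data where the true line is known, so you can check that the coefficients come out close to it. The data are made up; the variable names V1 and V3 follow the example above.

```r
set.seed(2)

# Made-up data that follow the line V1 = 3 + 2*V3, plus a little noise.
data = data.frame(V3 = 1:50)
data$V1 = 3 + 2 * data$V3 + rnorm(50, sd = 0.5)

model = lm(V1 ~ V3, data)
coefficients(model)       # intercept near 3, slope near 2
head(residuals(model))    # the first few residuals, one per observation
```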

table: Create a two-way (or more-way) table from variables. These need to be equal length. You can extract entries using indexing with [,], compute joint and conditional distribution with prop.table and compute margin counts and distribution with margin.table.

my.table = table(wildfire$GeneralDesc, wildfire$Fuel) 
my.table[1,3] # the value from row 1, column 3
my.table["Lightning",] # all values from the row with the label "Lightning"
my.table[,"X"] # all values from the column with the label "X"
prop.table(my.table) # joint distribution
prop.table(my.table, 1) # conditional distributions conditioned on rows (1 : first variable)
prop.table(my.table, 2) # conditional distributions conditioned on columns
margin.table(my.table) # sum of all entries
margin.table(my.table, 1) # row sums
margin.table(my.table, 2) # column sums
margin.table(my.table, 1)/margin.table(my.table) # row margin distribution
margin.table(my.table, 2)/margin.table(my.table) # column margin distribution
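The same commands on a dataset small enough to check by hand. The dataset fires and its variables Cause and Size are made up for this sketch and are not the wildfire data used above.

```r
# Made-up data: cause and size class of eight fires.
fires = data.frame(
  Cause = c("Lightning", "Human", "Human", "Lightning",
            "Human", "Human", "Lightning", "Human"),
  Size  = c("small", "small", "large", "large",
            "small", "small", "small", "large"))

my.table = table(fires$Cause, fires$Size)
my.table["Lightning", "small"]   # count for one combination
prop.table(my.table, 1)          # conditional on cause: each row sums to 1
margin.table(my.table, 1)        # number of fires per cause
```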

Creating categories

If your dataset does not have more than one categorical variable, or if you find yourself interested in comparing a categorical and a numeric variable with each other, R provides the command cut. In this example, the range of a numeric variable gets separated into 5 intervals of equal width, and each observation is labeled with the interval that its value of NumericVariable belongs to.

data$NewCategoricalVariable = cut(data$NumericVariable, 5)
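A minimal sketch of what cut produces, using made-up values from 0 to 10 so the intervals are easy to read.

```r
data = data.frame(NumericVariable = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Split the range of the variable into 5 intervals of equal width.
data$NewCategoricalVariable = cut(data$NumericVariable, 5)

levels(data$NewCategoricalVariable)  # the interval labels
table(data$NewCategoricalVariable)   # how many observations fall in each
```

The new variable is categorical, so it can be used with table or as a grouping variable in plots.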

Random variables and distributions

All random distributions in R have four different functions defined:

| Function | What it does |
|---|---|
| rdistribution(n, ...) | Generate n random numbers. |
| ddistribution(x, ...) | Compute probability / probability density at x. |
| pdistribution(q, ...) | Compute the probability of the random variable taking a value less than or equal to q. |
| qdistribution(p, ...) | Compute a value q such that the probability of being less than or equal to q is p. |

The ... here stands for the parameters defining each probability distribution, which differ from one distribution to the next.

| Distribution | Functions | Parameters |
|---|---|---|
| Uniform | runif, dunif, punif, qunif | min and max |
| Binomial | rbinom, dbinom, pbinom, qbinom | prob and size |
| Negative binomial | rnbinom, dnbinom, pnbinom, qnbinom | prob and size |
| Multinomial | rmultinom, dmultinom | size (single integer) and prob (vector of probabilities). No p or q functions; for multivariate distributions (vector valued: each outcome is more than one number), these functions are less useful. |
| Poisson | rpois, dpois, ppois, qpois | lambda for the rate |
| Exponential | rexp, dexp, pexp, qexp | rate for the rate |
| Normal | rnorm, dnorm, pnorm, qnorm | mean and sd |
| Chi-square | rchisq, dchisq, pchisq, qchisq | df |
| Student's t | rt, dt, pt, qt | df |
| F | rf, df, pf, qf | df1 and df2 |
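The four function families can be tried out with the normal and binomial distributions. This is only a sketch; the parameter values are arbitrary choices for illustration.

```r
set.seed(3)

draws = rnorm(1000, mean = 10, sd = 2)   # 1000 random numbers
dnorm(10, mean = 10, sd = 2)             # density at the mean
pnorm(12, mean = 10, sd = 2)             # P(X <= 12), about 0.84
qnorm(0.975, mean = 0, sd = 1)           # about 1.96

# p and q undo each other.
q = qnorm(0.3, mean = 10, sd = 2)
pnorm(q, mean = 10, sd = 2)              # gives back 0.3

# For a discrete distribution, d gives a probability.
dbinom(2, size = 10, prob = 0.5)         # P(exactly 2 successes in 10 tries)
```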

Plots

In addition to basic R commands, we will here show how to produce the same plots with ggplot2 and with ggformula and GGally.

We recommend working with ggformula and tidyverse. With these packages all plots take the same general shape:

dataset %>% gf_plottype(response ~ predictor | populations)

Replace dataset, plottype, response and predictor with the specific choices you need. The | populations part is optional, and will usually split a single plot into several side-by-side plots that show the same thing for subgroups of your data.

For many tasks, we like to see different groups or different variables from the data represented in features such as color, size, shape. These connections can be made using =~: by adding color=~variable as a parameter to the function call, the plot is adapted.
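Here is one way the general shape can look when filled in. This sketch assumes the tidyverse and ggformula packages are installed and uses mtcars, a dataset that ships with R; the names cars and p are made up.

```r
library(tidyverse)
library(ggformula)

# mtcars ships with R: mpg = fuel economy, wt = weight,
# cyl = number of cylinders, gear = number of gears.
cars = mtcars
cars$gear = factor(cars$gear)   # treat the number of gears as categories

# Response mpg, predictor wt, one panel per value of cyl,
# and the number of gears shown as color.
p = cars %>% gf_point(mpg ~ wt | cyl, color = ~gear)
p
```

Leaving out | cyl draws everything in a single panel; leaving out color=~gear draws all points in the same color.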

QQ-plot

The quantile-quantile plot (QQplot) is useful to match up distributions; if both datasets follow the same kind of distribution, the QQplot will approximate a straight line.

qqplot(x,y) # plots quantiles of x against quantiles of y
qqnorm(x)  # plots quantiles of x against a normal distribution
qqline(x) # adds a reference line to the plot just created by qqnorm(x)

With ggformula, to plot quantiles of data$x against a normal distribution

data %>% gf_qq(~x) %>% gf_qqline

Scatter plot

Scatter plots through plot(x,y)

With ggformula, to plot data$x against data$y

data %>% gf_point(y ~ x)

Histogram

Histograms through hist(x). Also useful is hist(x, probability=TRUE) to get a rescaled histogram that fits with plotting the ddistribution functions.

With ggformula, to plot data$x

data %>% gf_histogram(~ x)
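To illustrate how a rescaled histogram lines up with a ddistribution function, the sketch below uses base R graphics on simulated normal data. The parameter values and the names values and h are arbitrary choices for the example.

```r
set.seed(4)
values = rnorm(500, mean = 3, sd = 5)

# probability=TRUE rescales the bars so that their total area is 1,
# which puts the histogram on the same scale as a density function.
h = hist(values, probability = TRUE)

# Overlay the normal density with the parameters used to generate the data.
# Inside curve, x stands for the position along the horizontal axis.
curve(dnorm(x, mean = 3, sd = 5), add = TRUE)
```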

ggplot2

The package ggplot2 produces good graphics for a lot of use cases. It is built on the idea of a grammar of graphics: you create a plot by first creating a plotting object associated with a dataset, and then adding features to the object. The dataset has to be an actual dataset, produced by data.frame, read.csv or similar functions; it cannot be a table or a vector. You can convert other things into a dataset using the function as.data.frame.

As features are added, they need a mapping of aesthetic features to data variables. These are created with the aes command inside any of the ggplot2 commands.

Thus, with a command like

ggplot(data, aes(x=V1, y=V2)) + geom_point() + geom_smooth(method="lm")

we first create the plot object, give it a dataset data, and associate the values of the variable V1 to x-positions and the values of V2 to y-positions. Then we add a points feature, telling the system to plot a point for each observation, by adding (with +) the geom_point function. geom_point pays attention to (among other things) the x and y mappings and uses these to position each point. Next we add geom_smooth, which produces a smooth curve approximating the data. This function can produce its curve in many different ways; the one we have been using is chosen with method="lm", which fits and plots a linear regression from the data.

(Some) mappings can be put outside of the aes call to influence the plot without creating new information in a plot legend.

Almost everything takes alpha (transparency), colour and fill (colors; might not both be used, depends on the shape of the point) and size.
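The difference between putting something inside and outside of aes can be sketched as follows. The example assumes ggplot2 is installed and uses mtcars, a dataset that ships with R; the name p is made up.

```r
library(ggplot2)

# Inside aes, colour is tied to a variable and gets a legend.
# Outside aes, alpha and size are fixed values for all points
# and do not show up in the legend.
p = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)), alpha = 0.6, size = 3) +
  geom_smooth(method = "lm")
p
```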

Some useful features are, with some of the aesthetic mappings they care about:

geom_point Requires x and y, also uses shape, size.
geom_jitter Used just like geom_point, but moves all datapoints around a little bit, so that data with many repeated values are easier to visualize. Takes parameters width and height to influence how far points are moved around. Values over 0.5 will make categorical data hard to distinguish.
geom_histogram and geom_freqpoly. Requires x. Takes (outside of aes) the optional parameters bins (number of bars), binwidth (width of the bars), center or boundary (shifts the bars by specifying one center or one boundary for a bar), na.rm (remove warning printout for missing values). Creates the new variables ..density.. and ..count.. that can be used in later features. Set y to one of these to modify whether you generate a density histogram (approximating the ddistribution functions) or count histogram.
geom_qq QQplots, just like qqplot. Takes an optional distribution function (qdistribution) together with its parameters as additional parameters. Data needs to be given in the aesthetic mapping sample. Example:

ggplot(data.frame(sample=rnorm(100)), aes(sample=sample)) + geom_qq(distribution=qnorm, dparams=c(mean=3, sd=5))

geom_boxplot Requires x and y. x should be categorical (splits the data on different boxplots), and y should be numeric (the data for each boxplot). Can be combined with geom_point or geom_jitter to good results. Example:

ggplot(data.frame(xs=c(rnorm(100,mean=-1), rnorm(100,mean=+1)), ls=c(rep("a",100), rep("b",100))), aes(x=ls, y=xs)) + geom_boxplot(notch=TRUE) + geom_jitter(width=0.1, height=0, alpha=0.3)

stat_function Takes parameters fun and args: a function and its additional parameters. A typical example we have used is + stat_function(fun=dunif, args=c(min=2, max=5)). Example:

ggplot(data.frame(x=runif(1000,2,5)), aes(x=x)) + geom_histogram(aes(y=..density..), boundary=2, binwidth = 0.1) + stat_function(fun=dunif, args=c(min=2,max=5))

Plotting cookbook

Here are some suggestions for first tries to visualize your data, broken up into data types and combinations.

Numeric V1

Histogram for the distribution.

data %>% gf_histogram(~ V1)

Frequency curve for the distribution

data %>% gf_freqpoly(~ V1)

Boxplot for a summary (force to one group using an explicit label)

data %>% gf_boxplot(V1 ~ "data")

Categorical V1

Bar chart for the distribution (stat_count has a bar-chart as standard display).

data %>% gf_bar(~ V1)

Stacked bar chart for the distribution. (like with the single variable boxplot, this requires us to force the points to one position)

data %>% gf_bar(~ "data", fill=~V1)

Numeric V1 and numeric V2

Scatterplot

data %>% gf_point(V1 ~ V2)

Illustrate numeric V3 as well using size:

data %>% gf_point(V1 ~ V2, size=~V3)

Illustrate categorical V3 as well using symbols:

data %>% gf_point(V1 ~ V2, shape=~V3)

Numeric V1 and categorical V2

Stacked histograms where the counts for each value of V2 are piled on top of the other values.

data %>% gf_histogram(~ V1, fill=~V2)

Dodged histograms where for each range of values for V1, separate bars are drawn side by side for each value of V2.

data %>% gf_histogram(~ V1, fill=~V2, position="dodge")

Multiple frequency curves

data %>% gf_freqpoly(~ V1, color=~V2)

Boxplots

data %>% gf_boxplot(V1 ~ V2)

Categorical V1 and categorical V2

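2-way table plot where each point is scaled corresponding to that entry in the 2-way table of the data. When a table is converted to a data frame, it gets one variable with the labels for each variable in the table, and a variable Freq containing the joint count of each combination. Here the label variables keep the names V1 and V2 because the table was built from named columns; a table built from unnamed vectors, as in table(data$V1, data$V2), gets Var1 and Var2 instead. We use shape=15 to make squares. The plot benefits from also using color=~Freq.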

data.table = data %>% select(V1, V2) %>% table
data.table %>% as.data.frame %>% gf_point(V1 ~ V2, size=~Freq, shape=15)

Stacked bar charts just like with the stacked histograms, but using stat_count instead of geom_histogram.

data %>% gf_bar(~V1, fill=~V2)

Dodged bar charts just like with the dodged histograms, but using stat_count instead of geom_histogram.

data %>% gf_bar(~V1, fill=~V2, position="dodge")

Stretched bar charts optimized to show the conditional distribution of V1 as V2 is held fixed. Each column in a stacked bar chart is rescaled to have equal height (corresponding to 100%)

data %>% gf_bar(~V1, fill=~V2, position="fill")
