First in any script
First in any script you write, load the libraries we use for easier data handling and plotting.
library(tidyverse)
library(ggformula)
Other sources
The RStudio cheat sheets are an excellent source of summaries of everything important.
For our course, special interest should be paid to the cheat sheets for
R Markdown
R Markdown lets you weave in computations and graphs with the text you write: if you use other systems you will need to save out figures to import them, and copy-and-paste the values you compute. Staying in R Markdown means everything matches up with the data by default.
Text
The basic content of an RMarkdown document is text. Anything you don't specifically mark as something different is taken to be plain and unadorned text.
You break your text up into paragraphs by using two newlines after one another.
Headings
To get a heading, such as you would use to start a chapter or a section, put the # symbol at the front of the line: once for the largest heading, repeatedly to get smaller headings:
# First level heading
## Second level heading
### Third level heading
Alternatively, you can underline the heading text using = (first level heading) or - (second level heading):
Another heading
----------------------
Yet another heading
===============
Emphasis: italic and bold
To emphasize parts of the text you can surround the text you emphasize by *
characters. Once for italic and twice for bold:
*This is italic* and **this is bold**
Code blocks
Using the backtick `
character (top left on the keyboard, not the '
character to the right!) you can mark text to be code and not normal text. Code is formatted using a fixed-width font and keeps all the spaces where they were, leaving the internal structure of the code where it is.
You can either put a code block inline in a sequence of text, by using ` to surround the code you are writing, so you could talk about e.g. commands like sd while writing text about them.
Alternatively, if you start with a line with a sequence of three backticks ```
, this starts a separated code block. This block continues until you end it with another line with three backticks ```
.
Both the inline and the separated code blocks can be marked as R
code and by doing that let RStudio know you want it to run the code in the code block and insert the results when done knitting. You mark an inline code block by adding the letter r
after the backtick, producing something like
This inserts a random number in a sequence of text `r rnorm(1)`.
Similarly, you use the letter r
enclosed in curly braces at the beginning of a separated code block:
```{r}
plot(x,y)
abline(lm(y~x))
```
Lists
You can create bulleted lists by, after two newlines, starting each line with a *
or a -
followed by a space, so that:
* This is a
* bulleted list
produces
- This is a
- bulleted list
and
- this is another
- bulleted list
creates
- this is another
- bulleted list
By adding spaces in front, you can make sublists:
* outer list
    - inner list
    - inner list
* outer list again
produces
- outer list
    - inner list
    - inner list
- outer list again
You can create numbered lists using numbers followed by . instead of symbols. RMarkdown doesn't care about which numbers you use; it will assume you want your list to go from 1 to however many items you list.
1. This
3. is
2. a list
creates
1. This
2. is
3. a list
R commands
Important note: Everything in R
cares about UPPER/lower case. Make sure you use the right one when you get errors.
library
: Load extra functionality.
library(ggplot2)
install.packages
: Ensure library packages are installed.
install.packages("ggplot2")
#
: Commented code. You can add explaining text inside a code block or in a script file by using #
. All the text that follows a #
character is ignored by R
.
$
: Column access. To use a column in a dataset, instead of the entire dataset, you use $
to pull out the column:
data$variable
=
: Assigning names. By writing something like
name = value
you put a name to a thing (dataset, number, string, ...). This saves it for further use, and lets you refer to this thing without recomputing it every time.
upper.95.rule = mean(data$variable) + 2*sd(data$variable)
By assigning a value to an existing variable in a dataset, you change the data in that variable. By assigning to a variable that is not contained in a dataset, you create a new variable and fill it with that data. In either of these cases, your new value needs to have as many values as there are observations in the dataset.
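As a minimal sketch of both kinds of assignment, assuming a small made-up dataset (the names scores, points and percent are invented for illustration and are not part of the course data):

```r
# Hypothetical dataset, invented for illustration only
scores = data.frame(student = c("A", "B", "C"), points = c(12, 18, 15))

# Assigning to an existing variable changes its data
scores$points = scores$points + 1

# Assigning to a new name creates a new variable;
# the right-hand side has one value per observation
scores$percent = 100 * scores$points / 20

scores
```

The dataset still has three observations, now with three variables instead of two.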
c
combines any elements together into a vector: a sequence of things of the same kind. c(2,5)
contains the numbers 2 and 5, c("ggplot2", "knitr")
contains the two strings "ggplot2"
and "knitr"
.
:
creates a sequence of values: 4:8
is the same as c(4,5,6,7,8)
.
data.frame
creates a dataset. This takes parameters whose names are the variable names you want and whose values are the vectors of values for each variable. All these vectors have to be of equal length. Example: data = data.frame(V1=c(3,4,5,6,7), other.variable=4:8)
.
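The three building blocks above can be tried together. This is only a sketch; the names numbers, packages, steps and example.data are made up for the example:

```r
numbers = c(2, 5)                 # numeric vector
packages = c("ggplot2", "knitr")  # character vector
steps = 4:8                       # the integers 4, 5, 6, 7, 8

length(steps)                     # number of elements in the vector
all(steps == c(4, 5, 6, 7, 8))    # element-by-element comparison

# Both variables have five values, so they fit in one dataset
example.data = data.frame(V1 = c(3, 4, 5, 6, 7), other.variable = steps)
dim(example.data)
```

If the two vectors had different lengths, data.frame would stop with an error instead of creating the dataset.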
[
and ]
: data access. If you have a vector, something like c(1,2,5,7)
or 5:10
, using name.of.the.vector[5]
returns the 5th element in the vector. If you have a dataset, you need to include a comma (,
) between the [
and the ]
. Anything before the ,
picks out the observations you want to see, anything after the ,
picks out variables.
my.vector[5] # the 5th element
my.vector[c(1,3,6)] # the 1st, 3rd and 6th elements
my.vector[2:4] # the 2nd, 3rd and 4th elements
my.data[1:5,] # the first 5 observations of a dataset
my.data[,3] # the 3rd variable of a dataset
my.data[,c("V1","V4")] # the variables named V1 and V4 in a dataset
my.data[5:10, c("V3", names(my.data)[5])] # the 5th through 10th observations of the variable named V3 and the 5th variable (names and positions cannot be mixed directly inside c)
my.data[my.data$V1 < 3,] # all the data such that the value of my.data$V1 is less than 3
my.data[my.data$CategoricalVariable == "value",] # all data where CategoricalVariable has the label "value"
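To see these access patterns on concrete values, here is a small sketch. The vector and dataset are invented for illustration, and the names small and group.a are arbitrary:

```r
# Hypothetical vector and dataset, invented for illustration
my.vector = c(10, 20, 30, 40, 50, 60)
my.data = data.frame(V1 = c(1, 5, 2, 8),
                     V2 = c("a", "b", "a", "b"),
                     V3 = c(0.5, 1.5, 2.5, 3.5))

my.vector[c(1, 3, 6)]                   # elements 1, 3 and 6
my.data[1:2, ]                          # first two observations, all variables
my.data[, c("V1", "V3")]                # all observations, two variables by name
small = my.data[my.data$V1 < 3, ]       # observations where V1 is below 3
group.a = my.data[my.data$V2 == "a", ]  # observations labelled "a"
```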
sample
: randomly selecting from a vector or from a single variable in a dataset. Takes as first argument the thing you want to sample from, as second the number of samples you want to see. If you sample as many values as your data contains, this shuffles your data into random order. To sample whole observations from a dataset in base R, sample row numbers and use them as an index.
sample(data$variable, 500) # draw 500 values from one variable in random order
data[sample(nrow(data), 500),] # draw 500 observations from the dataset in random order
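A short sketch of sampling in base R, assuming a made-up dataset. Picking whole observations is done here by sampling row numbers, which is one common approach and not the only one:

```r
set.seed(1)  # fix the random seed so the example is repeatable

# Hypothetical dataset, invented for illustration
my.data = data.frame(id = 1:10, value = (1:10)^2)

three.values = sample(my.data$value, 3)           # 3 values from one variable
shuffled.ids = sample(my.data$id)                 # all values in random order
row.sample = my.data[sample(nrow(my.data), 4), ]  # 4 randomly chosen observations
```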
nrow
counts the number of observations in a dataset. dim
gives you the numbers of observations and variables at the same time.
number.of.observations = nrow(data)
number.of.observations = dim(data)[1]
number.of.variables = dim(data)[2]
Reading data
read.csv
, read.delim
, read.table
: Read data from a file. Use csv
if comma-separated, delim
if tab-separated, and otherwise use read.table
and give it an explicit separator.
my.data = read.csv("datafile.csv")
my.data = read.delim("datafile.txt")
my.data = read.table("datafile.dat", sep=";")
read.csv
and read.delim
assume your data file has a header: a first row with information about the data in it. If data starts directly with the first row in the file, you should add the parameter header=FALSE
. If you do this, R
will name your variables V1, V2, …
. Notice that these are uppercase V
, not lowercase v
.
my.data = read.csv("datafile.csv", header=FALSE)
my.data = read.delim("datafile.txt", header=FALSE)
my.data = read.table("datafile.dat", sep=";", header=FALSE)
If your dataset is very large, you can restrict to loading parts of it using the nrows
parameter:
my.data = read.csv("datafile.csv", nrows=1500)
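The effect of header=FALSE and nrows can be checked without a real data file. The sketch below writes a tiny headerless file to a temporary location first; the file contents and the names path, all.rows and first.two are invented for the example:

```r
# Write a small headerless file to a temporary location (illustration only)
path = tempfile(fileext = ".csv")
writeLines(c("1,a", "2,b", "3,c"), path)

all.rows = read.csv(path, header = FALSE)              # variables become V1, V2
first.two = read.csv(path, header = FALSE, nrows = 2)  # only the first 2 rows

names(all.rows)
nrow(first.two)
```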
Numeric statistics
mean
, median
, sd
, var
, quantile
, IQR
: Compute numeric statistics from data. If you have missing data, these will return NA (quantile and IQR stop with an error instead). You can use the parameter na.rm=TRUE
to compute only on the data that is not missing. For the quantile
function you have to give which quantile you need. Use 0.5
for median, 0.25, 0.75
for the two quartiles, and whatever fraction is appropriate otherwise.
mean(data$variable)
median(data$variable, na.rm=TRUE)
quantile(data$variable, 0.33)
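A quick way to see the effect of missing data is to try these functions on a short made-up vector:

```r
x = c(2, 4, NA, 8, 10)   # made-up values with one missing observation

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # mean of the four observed values
median(x, na.rm = TRUE)  # median of the four observed values
quantile(x, c(0.25, 0.75), na.rm = TRUE)  # both quartiles at once
```

Giving quantile a vector of fractions returns all the requested quantiles in one call.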
cor
: Compute correlation for two variables. These need to be equal length. You can handle missing data using the parameter use="complete.obs"
:
cor(data$V1, data$V2)
cor(data$V1, data$V3, use="complete.obs")
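The same idea for correlation, sketched with two made-up vectors where one pair is incomplete:

```r
v1 = c(1, 2, 3, 4, NA)   # made-up values, last one missing
v2 = c(2, 4, 6, 8, 10)

cor(v1, v2)                        # NA because of the missing value
cor(v1, v2, use = "complete.obs")  # uses only the four complete pairs
```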
Relationships between variables
lm
: Compute a linear regression for two (or more) variables. Takes a formula written with the ~
symbol and names of variables, as well as the dataset itself. You will usually want to assign the result to give it a name so you can access details later on. You can access coefficients and residuals with their corresponding commands.
model = lm(V1 ~ V3, data)
coefficients(model)
residuals(model)
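For a runnable version, the sketch below uses cars, a small dataset that ships with R (stopping distance dist against speed), in place of the course data:

```r
# cars ships with R: stopping distance (dist) against speed
model = lm(dist ~ speed, data = cars)

coefficients(model)      # intercept and slope of the fitted line
head(residuals(model))   # first few residuals, one per observation
```

There is one residual per observation, and for a model with an intercept the residuals sum to zero up to rounding.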
table
: Create a two-way (or more-way) table from variables. These need to be equal length. You can extract entries using indexing with [,]
, compute joint and conditional distribution with prop.table
and compute margin counts and distribution with margin.table
.
my.table = table(wildfire$GeneralDesc, wildfire$Fuel)
my.table[1,3] # the value from row 1, column 3
my.table["Lightning",] # all values from the row with the label "Lightning"
my.table[,"X"] # all values from the column with the label "X"
prop.table(my.table) # joint distribution
prop.table(my.table, 1) # conditional distributions conditioned on rows (1 : first variable)
prop.table(my.table, 2) # conditional distributions conditioned on columns
margin.table(my.table) # sum of all entries
margin.table(my.table, 1) # row sums
margin.table(my.table, 2) # column sums
margin.table(my.table, 1)/margin.table(my.table) # row margin distribution
margin.table(my.table, 2)/margin.table(my.table) # column margin distribution
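The same commands can be tried on mtcars, a dataset that ships with R. Treating cyl and gear as categorical variables is a choice made only for this sketch:

```r
# mtcars ships with R; cyl and gear are used here as categorical variables
my.table = table(mtcars$cyl, mtcars$gear)

my.table["4", ]                     # counts for 4-cylinder cars
row.cond = prop.table(my.table, 1)  # conditional on rows: each row sums to 1
col.cond = prop.table(my.table, 2)  # conditional on columns: each column sums to 1
margin.table(my.table, 1)           # number of cars per cylinder count
```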
Creating categories
If your dataset does not have more than one categorical variable, or if you find yourself interested in comparing a categorical and a numeric variable with each other, R
provides the command cut
. In this example, the range of a numeric variable gets separated into 5 equally wide intervals, and each observation is labeled with the interval that the value of NumericVariable
belongs to.
data$NewCategoricalVariable = cut(data$NumericVariable, 5)
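A small sketch of what cut returns, using made-up values:

```r
values = c(1, 3, 4, 7, 9, 10)  # made-up numeric variable
groups = cut(values, 5)        # 5 intervals of equal width

levels(groups)                 # the interval labels
table(groups)                  # how many values fall in each interval
```

The result is a categorical variable (a factor) with one label per observation, so it can be used with table or as a grouping variable in plots.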
Random variables and distributions
All random distributions in R
have four different functions defined:
- rdistribution(n, ...): Generate n random numbers.
- ddistribution(x, ...): Compute probability / probability density at x.
- pdistribution(q, ...): Compute the probability of the random variable taking a value less than or equal to q.
- qdistribution(p, ...): Compute a value q such that the probability of being less than or equal to q is precisely p.
The ... here stands for the parameters defining each probability distribution, and differs from distribution to distribution.
- Uniform: runif, dunif, punif, qunif. Parameters min and max.
- Binomial: rbinom, dbinom, pbinom, qbinom. Parameters prob and size.
- Negative binomial: rnbinom, dnbinom, pnbinom, qnbinom. Parameters prob and size.
- Multinomial: rmultinom, dmultinom. Parameters size (single integer) and prob (vector of probabilities). No p or q functions; for multivariate (vector valued; give more than one number as an outcome) distributions, these functions are less useful.
- Poisson: rpois, dpois, ppois, qpois. Parameter lambda for the rate.
- Exponential: rexp, dexp, pexp, qexp. Parameter rate for the rate.
- Normal: rnorm, dnorm, pnorm, qnorm. Parameters mean and sd.
- Chi-square: rchisq, dchisq, pchisq, qchisq. Parameter df.
- Student's t: rt, dt, pt, qt. Parameter df.
- F: rf, df, pf, qf. Parameters df1 and df2.
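As a sketch of the four function types, here they are for the normal distribution, plus one binomial probability. The specific numbers are chosen only for illustration:

```r
set.seed(42)                        # make the random draws repeatable

draws = rnorm(5, mean = 0, sd = 1)  # r: five random numbers
dnorm(0)                            # d: density of the standard normal at 0
pnorm(1.96)                         # p: probability of a value below 1.96
qnorm(0.975)                        # q: value with probability 0.975 below it
dbinom(3, size = 10, prob = 0.5)    # probability of exactly 3 successes in 10 tries
```

The p and q functions undo each other: pnorm(qnorm(p)) gives back p.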
Plots
In addition to basic R
commands, we will here show how to produce the same plots with ggplot2
and with ggformula
and GGally
.
We recommend working with ggformula
and tidyverse
. With these packages all plots take the same general shape:
dataset %>% gf_plottype(response ~ predictor | populations)
Replace the placeholders dataset, plottype, response and predictor with the specific choices you need. The | populations part is optional, and will usually split a single plot into several side-by-side plots that show the same thing for subgroups of your data.
For many tasks, we like to see different groups or different variables from the data represented in features such as color, size, shape. These connections can be made using =~
: by adding color=~variable
as a parameter to the function call, the plot is adapted.
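A sketch of this general shape, using mtcars (a dataset that ships with R) in place of the course data. The choice of variables is only for illustration:

```r
library(tidyverse)
library(ggformula)

# mtcars ships with R; the variable choices are only for illustration
p = mtcars %>%
  gf_point(mpg ~ wt | cyl, color = ~ gear)  # one panel per value of cyl

p  # printing the object draws the plot
```

Here mpg is the response, wt the predictor, cyl splits the plot into panels, and gear is shown through color.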
QQ-plot
The quantile-quantile plot (QQplot) is useful to match up distributions; if both datasets follow the same kind of distribution, the QQplot will approximate a straight line.
qqplot(x,y) # plots quantiles of x against quantiles of y
qqnorm(x) # plots quantiles of x against a normal distribution
qqline(x) # adds a reference line through the quartiles to a plot just created with qqnorm(x)
With ggformula
, to plot quantiles of data$x
against a normal distribution
data %>% gf_qq(~x) %>% gf_qqline
Scatter plot
Scatter plots through plot(x,y)
With ggformula
, to plot data$x
against data$y
data %>% gf_point(y ~ x)
Histogram
Histograms through hist(x)
. Also useful is hist(x, probability=TRUE)
to get a rescaled histogram that fits with plotting the d
distribution functions.
With ggformula
, to plot data$x
data %>% gf_histogram(~ x)
ggplot2
The package ggplot2
produces good graphics for a lot of use cases. It uses a basic idea of a graphics grammar: you create a plot by first creating a plotting object associated with a dataset (has to be an actual dataset, produced by data.frame
or read.csv
or similar functions; cannot be a table
or vector. You can force things to be a dataset using the function as.data.frame
.), and then adding features to the object.
As features are added, they need a mapping of aesthetic features to data variables. These are created with the aes
command inside any of the ggplot2
commands.
Thus, with a command like
ggplot(data, aes(x=V1, y=V2)) + geom_point() + geom_smooth(method="lm")
we first create the plot object, and give it a dataset data
, and associate the values of the variable V1
to x
-positions, the values of V2
to y
-positions. Then we add a points feature, telling the system to plot a point for each observation, by adding (with +
) the geom_point
function. geom_point
pays attention to (among other things) the x
and y
mappings and uses these to position each point. Next we add geom_smooth
, which produces a smooth curve approximating the data. This function can produce its curve in many different ways, the one we have been using is chosen using method="lm"
, which fits and plots a linear regression from the data.
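The same command, written out step by step as a sketch on mtcars (a dataset that ships with R), with wt and mpg standing in for V1 and V2:

```r
library(ggplot2)

# mtcars ships with R; wt and mpg stand in for V1 and V2
p = ggplot(mtcars, aes(x = wt, y = mpg)) +  # plot object with dataset and mappings
  geom_point() +                            # one point per observation
  geom_smooth(method = "lm")                # fitted regression line with a band

length(p$layers)  # the number of features that were added
p                 # printing the object draws the plot
```

Storing the plot in a name such as p lets you keep adding features with + before drawing it.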
(Some) mappings can be put outside of the aes
call to influence the plot without creating new information in a plot legend.
Almost everything takes alpha
(transparency), colour
and fill
(colors; might not both be used, depends on the shape of the point) and size
.
Some useful features are, with some of the aesthetic mappings they care about:
- geom_point: Requires x and y, also uses shape, size.
- geom_jitter: Used just like geom_point, but moves all datapoints around a little bit, so that data with many repeated values are easier to visualize. Takes parameters width and height to influence how far points are moved around. Values over 0.5 will make categorical data hard to distinguish.
- geom_histogram and geom_freqpoly: Require x. Take (outside of aes) the optional parameters bins (number of bars), binwidth (width of the bars), center or boundary (shifts the bars by specifying one center or one boundary for a bar), na.rm (remove warning printout for missing values). Create the new variables ..density.. and ..count.. that can be used in later features. Set y to one of these to modify whether you generate a density histogram (approximating the d distribution functions) or count histogram.
- geom_qq: QQplots, just like qqplot. Takes an optional distribution function (q distribution) together with its parameters as additional parameters. Data needs to be given in the aesthetic mapping sample. Example:
ggplot(data.frame(sample=rnorm(100)), aes(sample=sample)) + geom_qq(distribution=qnorm, dparams=c(mean=3, sd=5))
- geom_boxplot: Requires x and y. x should be categorical (splits the data on different boxplots), and y should be numeric (the data for each boxplot). Can be combined with geom_point or geom_jitter to good results. Example:
ggplot(data.frame(xs=c(rnorm(100,mean=-1), rnorm(100,mean=+1)), ls=c(rep("a",100), rep("b",100))), aes(x=ls, y=xs)) + geom_boxplot(notch=TRUE) + geom_jitter(width=0.1, height=0, alpha=0.3)
- stat_function: Takes parameters fun and args with a function and its additional parameters. A typical example we have used would be +stat_function(fun=dunif, args=c(min=2, max=5)). Example:
ggplot(data.frame(x=runif(1000,2,5)), aes(x=x)) + geom_histogram(aes(y=..density..), boundary=2, binwidth = 0.1) + stat_function(fun=dunif, args=c(min=2,max=5))
Plotting cookbook
Here are some suggestions for first tries to visualize your data, broken up into data types and combinations.
Numeric V1
Histogram for the distribution.
data %>% gf_histogram(~ V1)
Frequency curve for the distribution
data %>% gf_freqpoly(~ V1)
Boxplot for a summary (force to one group using an explicit label)
data %>% gf_boxplot(V1 ~ "data")
Categorical V1
Bar chart for the distribution (stat_count
has a bar-chart as standard display).
data %>% gf_bar(~ V1)
Stacked bar chart for the distribution. (Like with the single variable boxplot, this requires us to force the bars to one position.)
data %>% gf_bar(~ "data", fill=~V1)
Numeric V1
and numeric V2
Scatterplot
data %>% gf_point(V1 ~ V2)
Illustrate numeric V3
as well using size:
data %>% gf_point(V1 ~ V2, size=~V3)
Illustrate categorical V3
as well using symbols:
data %>% gf_point(V1 ~ V2, shape=~V3)
Numeric V1
and categorical V2
Stacked histograms where the counts for each value of V2
are piled on top of the other values.
data %>% gf_histogram(~ V1, fill=~V2)
Dodged histograms where for each range of values for V1
, separate bars are drawn side by side for each value of V2
.
data %>% gf_histogram(~ V1, fill=~V2, position="dodge")
Multiple frequency curves
data %>% gf_freqpoly(~ V1, color=~V2)
Boxplots
data %>% gf_boxplot(V1 ~ V2)
Categorical V1
and categorical V2
2-way table plot where each point is scaled corresponding to that entry in the 2-way table of the data. When a table built from the selected variables is converted to a data-frame, it keeps the variable names V1 and V2 for the labels (a table built without variable names, as with table(data$V1, data$V2), gets Var1 and Var2 instead), and adds Freq containing the joint count of each combination. We use shape=15 to make squares. The plot benefits from also using color=~Freq.
data.table = data %>% select(V1, V2) %>% table
data.table %>% as.data.frame %>% gf_point(V1 ~ V2, size=~Freq, shape=15)
Stacked bar charts just like with the stacked histograms, but using stat_count
instead of geom_histogram
.
data %>% gf_bar(~V1, fill=~V2)
Dodged bar charts just like with the dodged histograms, but using stat_count
instead of geom_histogram
.
data %>% gf_bar(~V1, fill=~V2, position="dodge")
Stretched bar charts optimized to show the conditional distribution of V1
as V2
is held fixed. Each column in a stacked bar chart is rescaled to have equal height (corresponding to 100%)
data %>% gf_bar(~V1, fill=~V2, position="fill")