This page is not quite done yet: more to come about how to use R Markdown.
R Markdown
R Markdown lets you weave in computations and graphs with the text you write: if you use other systems you will need to save out figures to import them, and copy-and-paste the values you compute. Staying in R Markdown means everything matches up with the data by default.
Text
The basic content of an R Markdown document is text. Anything you don’t specifically mark as something different is taken to be plain and unadorned text.
You break your text up into paragraphs by leaving a blank line between them, that is, two newlines in a row.
Headings
To get a heading, such as you would use to start a chapter or a section, put the # symbol at the front of the line: once for the largest heading, and repeatedly to get smaller headings:
# First level heading
## Second level heading
### Third level heading
Alternatively, you can underline the heading text using - or = (a line of = gives a first-level heading, a line of - a second-level heading):
Another heading
----------------------
Yet another heading
===============
Emphasis: italic and bold
To emphasize parts of the text, surround them with * characters: once for italic and twice for bold:
*This is italic* and **this is bold**
Code blocks
Using the backtick ` character (top left on the keyboard, not the ' character to the right!) you can mark text as code rather than normal text. Code is formatted in a fixed-width font and keeps all its spaces, so the internal layout of the code is preserved.
You can put code inline in a sequence of text by surrounding it with ` characters, so you can mention e.g. a command like sd while writing text about it.
Alternatively, a line consisting of three backticks ``` starts a separate code block. This block continues until you end it with another line of three backticks ```.
Both inline and separate code blocks can be marked as R code. Doing so tells RStudio to run the code in the block and insert the results when knitting. You mark an inline code block by adding the letter r after the opening backtick, producing something like
This inserts a random number in a sequence of text `r rnorm(1)`.
Similarly, you put the letter r enclosed in curly braces at the beginning of a separate code block:
```{r}
plot(x,y)
abline(lm(y~x))
```
Lists
You can create bulleted lists by, after two newlines, starting each line with a * or a - followed by a space, so that:
* This is a
* bulleted list
produces
- This is a
- bulleted list
and
- this is another
- bulleted list
creates
- this is another
- bulleted list
By adding spaces in front, you can make sublists:
* outer list
    - inner list
    - inner list
* outer list again
produces
- outer list
    - inner list
    - inner list
- outer list again
You can create numbered lists using numbers followed by . instead of symbols. R Markdown doesn’t care which numbers you use; it will assume you want your list to go from 1 to however many items you list.
1. This
3. is
2. a list
creates
1. This
2. is
3. a list
R commands
Important note: Everything in R cares about UPPER/lower case. When you get errors, check that you used the right case.
library: Load extra functionality.
library(ggplot2)
install.packages: Ensure library packages are installed.
install.packages("ggplot2")
#: Commented code. You can add explanatory text inside a code block or in a script file by using #. All the text that follows a # character on the same line is ignored by R.
$: Column access. To use a single column of a dataset instead of the entire dataset, use $ to pull out the column:
data$variable
=: Assigning names. By writing something like
name = value
you put a name to a thing (dataset, number, string, …). This saves it for further use, and lets you refer to this thing without recomputing it every time.
upper.95.rule = mean(data$variable) + 2*sd(data$variable)
By assigning a value to an existing variable in a dataset, you change the data in that variable. By assigning to a variable that is not contained in a dataset, you create a new variable and fill it with that data. In either of these cases, your new value needs to have as many values as there are observations in the dataset.
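As a minimal sketch of both cases, the following creates a small dataset and then assigns to a new and to an existing variable. The dataset and the names height.cm and height.m are made up for illustration.

```r
# Hypothetical example data; the variable names are invented for this sketch.
my.data = data.frame(height.cm = c(150.2, 160.7, 170.1, 180.9))

# Assigning to a name that is not yet in the dataset creates a new variable.
my.data$height.m = my.data$height.cm / 100

# Assigning to an existing variable replaces the data stored in it.
my.data$height.cm = round(my.data$height.cm)

names(my.data)  # the dataset now has two variables
```

Both assignments supply one value per observation, which is why they fit into the dataset.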
c: combines elements into a vector: a sequence of things of the same kind. c(2,5) contains the numbers 2 and 5, and c("ggplot2", "knitr") contains the two strings "ggplot2" and "knitr".
: creates a sequence of values: 4:8 is the same as c(4,5,6,7,8).
data.frame: creates a dataset. Its parameter names are the variable names you want, and its parameter values are the vectors of values for each variable. All these vectors have to be the same length. Example: data = data.frame(V1=c(3,4,5,6,7), other.variable=4:8).
[ and ]: data access. If you have a vector, something like c(1,2,5,7,9) or 5:10, using name.of.the.vector[5] returns the 5th element in the vector. If you have a dataset, you need to include a comma (,) between the [ and the ]. Anything before the , picks out the observations you want to see; anything after the , picks out variables.
my.vector[5] # the 5th element
my.vector[c(1,3,6)] # the 1st, 3rd and 6th elements
my.vector[2:4] # the 2nd, 3rd and 4th elements
my.data[1:5,] # the first 5 observations of a dataset
my.data[,3] # the 3rd variable of a dataset
my.data[,c("V1","V4")] # the variables named V1 and V4 in a dataset
my.data[5:10, c(3, 5)] # the 5th through 10th observations of the 3rd and 5th variables (do not mix names and positions in one c(): the positions would be turned into strings)
my.data[my.data$V1 < 3,] # all the data such that the value of my.data$V1 is less than 3
my.data[my.data$CategoricalVariable == "value",] # all data where CategoricalVariable has the label "value"
sample: randomly select from a vector or from a single variable in a dataset. Takes as first argument the thing you want to sample from, and as second the number of samples you want to see. If you sample as many values as your data contains, this shuffles your data into random order. To draw more values than your data contains, add replace=TRUE. To sample whole observations from a dataset, sample row numbers and use them with [ and ].
sample(data$variable, 500) # draw 500 values from one variable in random order
data[sample(nrow(data), 500),] # draw 500 observations (rows) from the whole dataset
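To sample whole observations from a dataset, one approach is to sample row numbers and then index the dataset with them. The sketch below uses a small made-up dataset; set.seed is only there so that the example gives the same result each time it is run.

```r
set.seed(1)  # only so the example is repeatable
my.data = data.frame(V1 = 1:10, V2 = letters[1:10])

# Pick 3 row numbers at random, then use them to select observations.
picked = sample(nrow(my.data), 3)
my.sample = my.data[picked, ]

# Sampling all the row numbers shuffles the dataset into random order.
shuffled = my.data[sample(nrow(my.data)), ]
```

Here sample(nrow(my.data), 3) draws from the numbers 1 to 10, so every selected row keeps its V1 and V2 values together.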
nrow: counts the number of observations in a dataset. dim gives you the numbers of observations and variables at the same time.
number.of.observations = nrow(data)
number.of.observations = dim(data)[1]
number.of.variables = dim(data)[2]
Reading data
read.csv, read.delim, read.table: Read data from a file. Use read.csv if the file is comma-separated, read.delim if it is tab-separated, and otherwise use read.table and give it an explicit separator.
my.data = read.csv("datafile.csv")
my.data = read.delim("datafile.txt")
my.data = read.table("datafile.dat", sep=";")
read.csv and read.delim assume your data file has a header: a first row with the names of the variables. read.table, in contrast, generally does not treat the first row as a header unless you add header=TRUE. If the data starts directly on the first row of the file, add the parameter header=FALSE. If you do this, R will name your variables V1, V2, …. Notice that these are uppercase V, not lowercase v.
my.data = read.csv("datafile.csv", header=FALSE)
my.data = read.delim("datafile.txt", header=FALSE)
my.data = read.table("datafile.dat", sep=";", header=FALSE)
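The effect of header=FALSE can be checked on a small file. This sketch writes three made-up rows to a temporary file (the path comes from tempfile, so nothing in your working directory is touched) and reads them back.

```r
# Write a small comma-separated file without a header row.
path = tempfile(fileext = ".csv")
writeLines(c("1,2.5", "2,3.5", "3,4.5"), path)

# Without a header, R names the variables itself.
my.data = read.csv(path, header = FALSE)
names(my.data)  # "V1" "V2"
dim(my.data)    # 3 observations, 2 variables
```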
Numeric statistics
mean, median, sd, var, quantile, IQR: Compute numeric statistics from data. If you have missing data, mean, median, sd and var will return NA, while quantile and IQR stop with an error. Use the parameter na.rm=TRUE to compute only on the data that is not missing. For the quantile function you give which quantile you need: use 0.5 for the median, 0.25 and 0.75 for the two quartiles, and whatever fraction is appropriate otherwise.
mean(data$variable)
median(data$variable, na.rm=TRUE)
quantile(data$variable, 0.33)
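A short sketch of how missing values affect these functions, using a made-up vector with one NA:

```r
x = c(2, 4, NA, 8)

mean.with.na = mean(x)               # NA, because one value is missing
mean.clean = mean(x, na.rm = TRUE)   # mean of 2, 4 and 8
median.clean = median(x, na.rm = TRUE)

# Several quantiles can be requested at once with a vector of fractions.
quartiles = quantile(x, c(0.25, 0.75), na.rm = TRUE)
```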
cor: Compute the correlation of two variables. These need to be of equal length. You can handle missing data using the parameter use="complete.obs":
cor(data$V1, data$V2)
cor(data$V1, data$V3, use="complete.obs")
Relationships between variables
lm: Compute a linear regression for two (or more) variables. Takes a formula written with the ~ symbol and names of variables, with the response on the left and the predictors on the right, as well as the dataset itself. You will usually want to assign the result to a name so you can access details later on. You can access coefficients and residuals with their corresponding commands.
model = lm(V1 ~ V3, data)
coefficients(model)
residuals(model)
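For a runnable version, the sketch below uses cars, a small dataset that ships with R (variables speed and dist), in place of the placeholder names above. It also checks the link to cor: for a regression with a single predictor, the R-squared of the model equals the squared correlation.

```r
# cars is a built-in dataset; dist is the response, speed the predictor.
model = lm(dist ~ speed, data = cars)

model.coef = coefficients(model)  # intercept and slope
model.res = residuals(model)      # one residual per observation

# With one predictor, R-squared is the squared correlation.
r.squared = summary(model)$r.squared
cor.squared = cor(cars$speed, cars$dist)^2
```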
table: Create a two-way (or more-way) table from variables. These need to be of equal length. You can extract entries using indexing with [,], compute joint and conditional distributions with prop.table, and compute margin counts and distributions with margin.table.
my.table = table(wildfire$GeneralDesc, wildfire$Fuel)
my.table[1,3] # the value from row 1, column 3
my.table["Lightning",] # all values from the row with the label "Lightning"
my.table[,"X"] # all values from the column with the label "X"
prop.table(my.table) # joint distribution
prop.table(my.table, 1) # conditional distributions conditioned on rows (1 : first variable)
prop.table(my.table, 2) # conditional distributions conditioned on columns
margin.table(my.table) # sum of all entries
margin.table(my.table, 1) # row sums
margin.table(my.table, 2) # column sums
margin.table(my.table, 1)/margin.table(my.table) # row margin distribution
margin.table(my.table, 2)/margin.table(my.table) # column margin distribution
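The wildfire dataset above is not included here, so this sketch runs the same commands on mtcars, a dataset that ships with R, tabulating number of cylinders against number of gears. It illustrates that each conditional distribution and the joint distribution sum to 1.

```r
# mtcars is built in; cyl and gear are used here as categorical variables.
my.table = table(mtcars$cyl, mtcars$gear)

joint = prop.table(my.table)        # joint distribution, sums to 1
row.cond = prop.table(my.table, 1)  # each row sums to 1
col.cond = prop.table(my.table, 2)  # each column sums to 1

row.counts = margin.table(my.table, 1)  # number of cars per cylinder count
```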
Creating categories
If your dataset does not have more than one categorical variable, or if you want to compare a categorical and a numeric variable with each other, R provides the command cut. In this example, the range of a numeric variable is split into 5 equally wide intervals, and each observation is labeled with the interval that its value of NumericVariable belongs to.
data$NewCategoricalVariable = cut(data$NumericVariable, 5)
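A small sketch of cut on a made-up vector. Besides a number of intervals, cut also accepts explicit break points through its breaks parameter and names for the intervals through labels; the values and labels below are invented for illustration.

```r
x = c(1, 3, 5, 7, 9, 10)

# Split the range of x into 3 equally wide intervals.
groups = cut(x, 3)
table(groups)  # how many values fall into each interval

# Explicit break points: (0,5] is labeled "low", (5,10] is labeled "high".
named.groups = cut(x, breaks = c(0, 5, 10), labels = c("low", "high"))
```

Each interval includes its upper end point by default, which is why the value 5 ends up in "low".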
Random variables and distributions
All random distributions in R have four different functions defined, where distribution stands for the abbreviated name of the distribution:
- rdistribution(n, ...): Generate n random numbers.
- ddistribution(x, ...): Compute probability / probability density at x.
- pdistribution(q, ...): Compute the probability of the random variable taking a value less than or equal to q.
- qdistribution(p, ...): Compute a value q such that the probability of being less than or equal to q is precisely p.
The ... here stands for the parameters defining each probability distribution, which are different for each distribution.
- Uniform: runif, dunif, punif, qunif. Parameters min and max.
- Binomial: rbinom, dbinom, pbinom, qbinom. Parameters prob and size.
- Negative binomial: rnbinom, dnbinom, pnbinom, qnbinom. Parameters prob and size.
- Multinomial: rmultinom, dmultinom. Parameters size (single integer) and prob (vector of probabilities). No p or q functions; for multivariate (vector valued; give more than one number as an outcome) distributions, these functions are less useful.
- Poisson: rpois, dpois, ppois, qpois. Parameter lambda for the rate.
- Exponential: rexp, dexp, pexp, qexp. Parameter rate for the rate.
- Normal: rnorm, dnorm, pnorm, qnorm. Parameters mean and sd.
- Chi-square: rchisq, dchisq, pchisq, qchisq. Parameter df.
- Student’s t: rt, dt, pt, qt. Parameter df.
- F: rf, df, pf, qf. Parameters df1 and df2.
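As a sketch of how the four kinds of functions fit together, here they are for the normal and the binomial distribution. The parameter values are arbitrary choices for illustration.

```r
set.seed(42)  # only so the random draws are repeatable
draws = rnorm(5, mean = 10, sd = 2)  # 5 random numbers

density.at.mean = dnorm(10, mean = 10, sd = 2)  # density at x = 10
prob.below = pnorm(12, mean = 10, sd = 2)       # P(X <= 12), about 0.84
upper.cutoff = qnorm(0.975)                     # about 1.96 for mean 0, sd 1

# q and p functions undo each other.
round.trip = qnorm(pnorm(12, mean = 10, sd = 2), mean = 10, sd = 2)

# For a discrete distribution, d gives the probability of exactly x.
prob.exactly.3 = dbinom(3, size = 10, prob = 0.5)
prob.at.most.3 = pbinom(3, size = 10, prob = 0.5)
```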
Plots
The quantile-quantile plot (QQplot) is useful to match up distributions; if both datasets follow the same kind of distribution, the QQplot will approximate a straight line.
qqplot(x,y) # plots quantiles of x against quantiles of y
qqnorm(x) # plots quantiles of x against a normal distribution
qqline(x) # adds a reference line through the quartiles of x to a recently created qqnorm plot
Scatter plots through plot(x,y).
Histograms through hist(x). Also useful is hist(x, probability=TRUE) to get a rescaled histogram that fits with plotting the ddistribution functions.
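A rescaled histogram has total bar area 1, so a density function can be drawn on top of it on the same scale. The sketch below uses made-up normal data and overlays the matching dnorm curve with the base R function curve; the mean and sd values are arbitrary.

```r
set.seed(1)  # only so the example is repeatable
x = rnorm(500, mean = 3, sd = 5)

# probability=TRUE rescales the bars so that their total area is 1.
h = hist(x, probability = TRUE)

# Draw the density of the distribution the data came from on top.
curve(dnorm(x, mean = 3, sd = 5), add = TRUE)

# The bar areas (height times width) add up to 1.
total.area = sum(h$density * diff(h$breaks))
```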
ggplot2
The package ggplot2 produces good graphics for a lot of use cases. It is built on the idea of a graphics grammar: you create a plot by first creating a plotting object associated with a dataset, and then adding features to the object. The dataset has to be an actual dataset, produced by data.frame, read.csv or similar functions; it cannot be a table or vector. You can force things to be a dataset using the function as.data.frame.
As features are added, they need a mapping of aesthetic features to data variables. These are created with the aes command inside any of the ggplot2 commands.
Thus, with a command like
ggplot(data, aes(x=V1, y=V2)) + geom_point() + geom_smooth(method="lm")
we first create the plot object, give it the dataset data, and associate the values of the variable V1 with x-positions and the values of V2 with y-positions. Then we add a points feature, telling the system to plot a point for each observation, by adding (with +) the geom_point function. geom_point pays attention to (among other things) the x and y mappings and uses these to position each point. Next we add geom_smooth, which produces a smooth curve approximating the data. This function can produce its curve in many different ways; the one we have been using is chosen with method="lm", which fits and plots a linear regression from the data.
(Some) mappings can be put outside of the aes call to influence the plot without creating new information in a plot legend.
Almost everything takes alpha (transparency), colour and fill (colors; might not both be used, depending on the shape of the point) and size.
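The difference between putting something inside and outside of aes can be sketched with two plots of mtcars, a dataset that ships with R. This assumes ggplot2 is installed; the choice of variables and of the colour "blue" is only for illustration.

```r
library(ggplot2)

# Inside aes: colour depends on a data variable, and a legend is created.
p.mapped = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(aes(colour = factor(cyl)))

# Outside aes: every point gets the same fixed look, and no legend is added.
p.fixed = ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(colour = "blue", alpha = 0.5, size = 3)
```

Printing p.mapped or p.fixed, or ending a code block with either name, draws the plot.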
Some useful features are, with some of the aesthetic mappings they care about:
- geom_point: Requires x and y, also uses shape, size.
- geom_jitter: Used just like geom_point, but moves all datapoints around a little bit, so that data with many repeated values is easier to visualize. Takes parameters width and height to influence how far points are moved around. Values over 0.5 will make categorical data hard to distinguish.
- geom_histogram and geom_freqpoly: Require x. Take (outside of aes) the optional parameters bins (number of bars), binwidth (width of the bars), center or boundary (shifts the bars by specifying one center or one boundary for a bar), na.rm (remove warning printout for missing values). Create the new variables ..density.. and ..count.. that can be used in later features. Set y to one of these to choose whether you generate a density histogram (approximating the ddistribution functions) or a count histogram.
- geom_qq: QQplots, just like qqplot. Takes an optional distribution function (qdistribution) together with its parameters as additional parameters. Data needs to be given in the aesthetic mapping sample. Example:
ggplot(data.frame(sample=rnorm(100)), aes(sample=sample)) + geom_qq(distribution=qnorm, dparams=c(mean=3, sd=5))
- geom_boxplot: Requires x and y. x should be categorical (splits the data into different boxplots), and y should be numeric (the data for each boxplot). Can be combined with geom_point or geom_jitter to good results. Example:
ggplot(data.frame(xs=c(rnorm(100,mean=-1), rnorm(100,mean=+1)), ls=c(rep("a",100), rep("b",100))), aes(x=ls, y=xs)) + geom_boxplot(notch=TRUE) + geom_jitter(width=0.1, height=0, alpha=0.3)
- stat_function: Takes parameters fun and args with a function and its additional parameters. A typical example we have used would be + stat_function(fun=dunif, args=c(min=2, max=5)). Example:
ggplot(data.frame(x=runif(1000,2,5)), aes(x=x)) + geom_histogram(aes(y=..density..), boundary=2, binwidth = 0.1) + stat_function(fun=dunif, args=c(min=2,max=5))