This page is not quite done yet: more to come about how to use R Markdown.

R Markdown

R Markdown lets you weave computations and graphs into the text you write. With other systems you need to save out figures to import them, and copy and paste the values you compute. Staying in R Markdown means everything matches up with the data by default.


The basic content of an R Markdown document is text. Anything you don’t specifically mark as something different is taken to be plain, unadorned text.

You break your text up into paragraphs by using two newlines after one another.


To get a heading, such as you would use to start a chapter or a section, you can use the # symbol at the front of the line: once for the largest heading, more times for smaller headings:

# First level heading

## Second level heading

### Third level heading

Alternatively, you can underline the heading text using = (for a first level heading) or - (for a second level heading):

Another heading
===============

Yet another heading
-------------------

Emphasis: italic and bold

To emphasize parts of the text you can surround the text you emphasize by * characters. Once for italic and twice for bold:

*This is italic* and **this is bold**

Code blocks

Using the backtick ` character (top left on the keyboard, not the ' character to the right!) you can mark text to be code and not normal text. Code is formatted using a fixed-width font and keeps all the spaces where they were, leaving the internal structure of the code where it is.

You can put a code block inline in a sequence of text by using ` to surround the code you are writing, so you could mention commands like `sd` while writing text about them.

Alternatively, if you start with a line with a sequence of three backticks ```, this starts a separated code block. This block continues until you end it with another line with three backticks ```.

Both inline and separated code blocks can be marked as R code. Doing so tells RStudio that you want it to run the code in the code block and insert the results when knitting. You mark an inline code block by adding the letter r after the opening backtick, producing something like

This inserts a random number in a sequence of text `r rnorm(1)`.

Similarly, you put the letter r enclosed in curly braces at the beginning of a separated code block, directly after the three opening backticks, so that the opening line reads ```{r}.



To create a bulleted list, leave a blank line and then start each line with a * or a - followed by a space, so that:

* This is a
* bulleted list



- this is another
- bulleted list


By adding spaces in front, you can make sublists:

* outer list
  - inner list
  - inner list
* outer list again


You can create numbered lists using numbers followed by . instead of symbols. R Markdown doesn’t care which numbers you use: it will assume you want your list to go from 1 to however many items you list.

1. This
3. is
2. a list


  1. This
  2. is
  3. a list

R commands

Important note: Everything in R cares about UPPER/lower case. If you get an error, check that you have used the right case.

library: Load extra functionality.


install.packages: Ensure library packages are installed.


#: Commented code. You can add explanatory text inside a code block or in a script file by using #. All the text that follows a # character on the same line is ignored by R.

$: Column access. To use a column in a dataset, instead of the entire dataset, you use $ to pull out the column:
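A minimal sketch of column access, using a small made-up dataset (the names data, V1 and V2 are only for illustration):

```r
# a small illustrative dataset with two variables
data = data.frame(V1=c(3,4,5,6), V2=c("a","b","a","b"))

data$V1        # just the column V1, as a vector
mean(data$V1)  # use the column in a computation
```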


=: Assigning names. By writing something like name = value you put a name to a thing (dataset, number, string, …). This saves it for further use, and lets you refer to this thing without recomputing it every time.

upper.95.rule = mean(data$variable) + 2*sd(data$variable)

By assigning a value to an existing variable in a dataset, you change the data in that variable. By assigning to a variable that is not contained in the dataset, you create a new variable and fill it with that data. In either of these cases, your new value needs to have as many values as there are observations in the dataset, or be a single value, which R then repeats for every observation.
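As a sketch of assigning to variables inside a dataset (the dataset and the names V1, V2 and group are made up for illustration):

```r
data = data.frame(V1=c(3,4,5,6))

data$V1 = data$V1 * 10   # overwrite an existing variable
data$V2 = c(1,0,1,0)     # create a new variable, one value per observation
data$group = "control"   # a single value is repeated for every observation
```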

c combines any elements together into a vector: a sequence of things of the same kind. c(2,5) contains the numbers 2 and 5, c("ggplot2", "knitr") contains the two strings "ggplot2" and "knitr".

: creates a sequence of values: 4:8 is the same as c(4,5,6,7,8).

data.frame creates a dataset. Its parameter names are the variable names you want, and its parameter values are the vectors of values for each variable. All these vectors have to be the same length. Example: data = data.frame(V1=c(3,4,5,6), other.variable=4:7).

[ and ]: data access. If you have a vector, something like c(1,2,5,7) or 5:10, using name.of.the.vector[5] returns the 5th element in the vector. If you have a dataset, you need to include a comma (,) between the [ and the ]. Anything before the , picks out the observations you want to see, anything after the , picks out variables.

my.vector[5] # the 5th element
my.vector[c(1,3,6)] # the 1st, 3rd and 6th elements
my.vector[2:4] # the 2nd, 3rd and 4th elements
data[1:5,] # the first 5 observations of a dataset
data[,3] # the 3rd variable of a dataset
data[,c("V1","V4")] # the variables named V1 and V4 in a dataset
data[5:10, c(3, 5)] # the 5th through 10th observations of the 3rd and 5th variables (names and positions cannot be mixed in one c())
data[data$V1 < 3,] # all the data such that the value of data$V1 is less than 3
data[data$CategoricalVariable == "value",] # all data where CategoricalVariable has the label "value"
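To try the dataset forms of indexing on something concrete, here is a sketch with a small made-up dataset (the values and the names V1, V2, V3 are illustrative only):

```r
data = data.frame(V1=1:6,
                  V2=c("x","y","x","y","x","y"),
                  V3=c(2.5,3.5,1.0,4.0,0.5,6.0))

data[2:3,]                  # observations 2 and 3, all variables
data[,c("V1","V3")]         # variables V1 and V3, all observations
data[data$V3 < 3,]          # observations where V3 is below 3
data[data$V2 == "x", "V1"]  # V1 for the observations labeled "x"
```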

sample: Randomly select from a vector or from a single variable in a dataset. Takes as first argument the thing you want to sample from, as second the number of samples you want to see. If you sample as many values as your data contains, this shuffles your data into random order. To sample whole observations from a dataset, sample row numbers and use them to index the dataset; calling sample directly on a dataset picks variables, not observations.

sample(data$variable, 500) # draw 500 values of one variable in random order
data[sample(nrow(data), 500),] # draw 500 random observations from the dataset
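A small sketch of sampling values and sampling whole observations, on a made-up dataset (set.seed is only there so the example is repeatable):

```r
set.seed(1)  # fix the random draws so the example is repeatable
data = data.frame(V1=1:10, V2=letters[1:10])

sample(data$V1, 3)                   # 3 values of V1, without repeats
shuffled = sample(data$V1)           # all values of V1 in random order
rows = data[sample(nrow(data), 4),]  # 4 random observations, variables kept together
```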

nrow counts the number of observations in a dataset. dim gives you the numbers of observations and variables at the same time.

number.of.observations = nrow(data)
number.of.observations = dim(data)[1]
number.of.variables = dim(data)[2]

Reading data

read.csv, read.delim, read.table: Read data from a file. Use read.csv if comma-separated, read.delim if tab-separated, and otherwise use read.table and give it an explicit separator.

data = read.csv("datafile.csv")
data = read.delim("datafile.txt")
data = read.table("datafile.dat", sep=";")

read.csv and read.delim assume your data file has a header: a first row with information about the data in it. If data starts directly with the first row in the file, you should add the parameter header=FALSE. If you do this, R will name your variables V1, V2, …. Notice that these are uppercase V, not lowercase v.

data = read.csv("datafile.csv", header=FALSE)
data = read.delim("datafile.txt", header=FALSE)
data = read.table("datafile.dat", sep=";", header=FALSE)

If your dataset is very large, you can load only the first rows of it using the nrows parameter:

data = read.csv("datafile.csv", nrows=1500)
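To see header=FALSE and nrows in action without needing a data file of your own, this sketch first writes a tiny comma-separated file to a temporary location (the file contents are made up):

```r
# write a tiny comma-separated file without a header row
path = tempfile(fileext=".csv")
writeLines(c("1,a", "2,b", "3,c"), path)

data = read.csv(path, header=FALSE)                # variables are named V1 and V2
first.two = read.csv(path, header=FALSE, nrows=2)  # only the first 2 rows
```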

Numeric statistics

mean, median, sd, var, quantile, IQR: Compute numeric statistics from data. If you have missing data, these will return NA. You can use the parameter na.rm=TRUE to compute only on the data that is not missing. For the quantile function you have to give which quantile you need. Use 0.5 for median, 0.25, 0.75 for the two quartiles, and whatever fraction is appropriate otherwise.

median(data$variable, na.rm=TRUE)
quantile(data$variable, 0.33)
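The effect of na.rm is easiest to see on a short made-up vector with one missing value:

```r
x = c(2, 4, NA, 10)

mean(x)                                 # NA, because one value is missing
mean(x, na.rm=TRUE)                     # mean of 2, 4 and 10
quantile(x, c(0.25, 0.75), na.rm=TRUE)  # both quartiles at once
```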

cor: Compute correlation for two variables. These need to be equal length. You can handle missing data using the parameter use="complete.obs":

cor(data$V1, data$V2)
cor(data$V1, data$V3, use="complete.obs")

Relationships between variables

lm: Compute a linear regression for two (or more) variables. Takes a formula written with the ~ symbol and names of variables, as well as the dataset itself. The variable to the left of ~ is the one being predicted. You will usually want to assign the result to a name so you can access details later on. You can access the coefficients and residuals with the commands coefficients(model) and residuals(model).

model = lm(V1 ~ V3, data)
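A sketch of fitting a model and pulling out its details, on made-up data that lies exactly on a line so the result is easy to check by eye:

```r
# made-up data lying exactly on the line V1 = 1 + 2*V3
data = data.frame(V3=c(1,2,3,4,5))
data$V1 = 1 + 2*data$V3

model = lm(V1 ~ V3, data)
coefficients(model)  # intercept and slope
residuals(model)     # one residual per observation
```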

table: Create a two-way (or more-way) table from variables. These need to be equal length. You can extract entries using indexing with [,], compute joint and conditional distribution with prop.table and compute margin counts and distribution with margin.table.

my.table = table(wildfire$GeneralDesc, wildfire$Fuel) 
my.table[1,3] # the value from row 1, column 3
my.table["Lightning",] # all values from the row with the label "Lightning"
my.table[,"X"] # all values from the column with the label "X"
prop.table(my.table) # joint distribution
prop.table(my.table, 1) # conditional distributions conditioned on rows (1 : first variable)
prop.table(my.table, 2) # conditional distributions conditioned on columns
margin.table(my.table) # sum of all entries
margin.table(my.table, 1) # row sums
margin.table(my.table, 2) # column sums
margin.table(my.table, 1)/margin.table(my.table) # row margin distribution
margin.table(my.table, 2)/margin.table(my.table) # column margin distribution
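The wildfire dataset is not included here, so as a sketch the same commands can be tried on a small made-up dataset with two categorical variables:

```r
data = data.frame(V1=c("a","a","b","b","b"),
                  V2=c("x","y","x","x","y"))
my.table = table(data$V1, data$V2)

my.table["b","x"]          # number of observations with V1 "b" and V2 "x"
prop.table(my.table, 1)    # conditional on rows: each row sums to 1
margin.table(my.table, 2)  # column sums
```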

Creating categories

If your dataset does not have more than one categorical variable, or if you find yourself interested in comparing a categorical and a numeric variable with each other, R provides the command cut, which turns a numeric variable into a categorical one. In this example, the range of a numeric variable is split into 5 intervals of equal width, and each observation is labeled with the interval that its value of NumericVariable belongs to.

data$NewCategoricalVariable = cut(data$NumericVariable, 5)
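To inspect what cut produced, you can look at the interval labels and count the observations in each one. A sketch on made-up numbers:

```r
data = data.frame(NumericVariable=c(1, 2, 5, 7, 10, 3, 8))
data$NewCategoricalVariable = cut(data$NumericVariable, 5)

levels(data$NewCategoricalVariable)  # the 5 interval labels
table(data$NewCategoricalVariable)   # number of observations per interval
```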

Random variables and distributions

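All random distributions in R have four different functions defined, named by putting d, p, q or r in front of the name of the distribution:

* ddistribution(x, ...): the density (or probability) at the value x
* pdistribution(q, ...): the probability of getting a value of at most q
* qdistribution(p, ...): the quantile, that is, the value below which a fraction p of the distribution lies
* rdistribution(n, ...): n random values drawn from the distribution

The ... here stands for the parameters defining each probability distribution, and differs between distributions.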
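As a sketch of how the four kinds of functions are used, here they are for the normal distribution (dnorm, pnorm, qnorm and rnorm), whose parameters are mean and sd:

```r
dnorm(0, mean=0, sd=1)      # density of the standard normal at 0
pnorm(1.96, mean=0, sd=1)   # probability of a value below 1.96
qnorm(0.975, mean=0, sd=1)  # the value that 97.5% of the distribution lies below
rnorm(5, mean=3, sd=5)      # 5 random draws with mean 3 and standard deviation 5
```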


The quantile-quantile plot (QQplot) is useful to match up distributions; if both datasets follow the same kind of distribution, the QQplot will approximate a straight line.

qqplot(x,y) # plots quantiles of x against quantiles of y
qqnorm(x)  # plots quantiles of x against a normal distribution
qqline(x) # adds a reference line through the quartiles to a plot just created with qqnorm(x)

Scatter plots through plot(x,y)

Histograms through hist(x). Also useful is hist(x, probability=TRUE), which rescales the histogram so that the bar areas sum to 1 and it can be drawn together with a density function such as dnorm.
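A sketch of a rescaled histogram with a density curve drawn on top, using made-up normal data (curve is the base R command for plotting a function of x):

```r
x = rnorm(1000, mean=3, sd=5)

h = hist(x, probability=TRUE)            # bar areas sum to 1
curve(dnorm(x, mean=3, sd=5), add=TRUE)  # overlay the normal density
```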


The package ggplot2 produces good graphics for a lot of use cases. It uses a basic idea of a graphics grammar: you create a plot by first creating a plotting object associated with a dataset, and then adding features to the object. The dataset has to be an actual dataset, produced by data.frame or read.csv or similar functions; it cannot be a table or a vector. You can convert such things to a dataset using the function as.data.frame.

As features are added, they need a mapping of aesthetic features to data variables. These are created with the aes command inside any of the ggplot2 commands.

Thus, with a command like

ggplot(data, aes(x=V1, y=V2)) + geom_point() + geom_smooth(method="lm")

we first create the plot object, give it a dataset data, and associate the values of the variable V1 with x-positions and the values of V2 with y-positions. Then we add a points feature, telling the system to plot a point for each observation, by adding (with +) the geom_point function. geom_point pays attention to (among other things) the x and y mappings and uses these to position each point. Next we add geom_smooth, which produces a smooth curve approximating the data. This function can produce its curve in many different ways; the one we have been using is chosen with method="lm", which fits and plots a linear regression from the data.

(Some) mappings can be put outside of the aes call to influence the plot without creating new information in a plot legend.

Almost everything takes alpha (transparency), colour and fill (colors; might not both be used, depends on the shape of the point) and size.
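The difference between a mapping inside aes and a fixed setting outside it can be sketched like this, on made-up data (the names mapped and fixed are only for illustration):

```r
library(ggplot2)
data = data.frame(V1=rnorm(50), V2=rnorm(50), V3=rep(c("a","b"), 25))

# colour mapped inside aes: one colour per value of V3, with a legend
mapped = ggplot(data, aes(x=V1, y=V2, colour=V3)) + geom_point()

# colour and alpha set outside aes: all points look the same, no legend
fixed = ggplot(data, aes(x=V1, y=V2)) + geom_point(colour="blue", alpha=0.5)
```

Typing mapped or fixed on a line of its own draws the corresponding plot.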

Some useful features are, with some of the aesthetic mappings they care about:

ggplot(data.frame(sample=rnorm(100)), aes(sample=sample)) + geom_qq(distribution=qnorm, dparams=c(mean=3, sd=5))

ggplot(data.frame(xs=c(rnorm(100,mean=-1), rnorm(100,mean=+1)), ls=c(rep("a",100), rep("b",100))), aes(x=ls, y=xs)) + geom_boxplot(notch=TRUE) + geom_jitter(width=0.1, height=0, alpha=0.3)

ggplot(data.frame(x=runif(1000,2,5)), aes(x=x)) + geom_histogram(aes(y=..density..), boundary=2, binwidth = 0.1) + stat_function(fun=dunif, args=c(min=2,max=5))

Plotting cookbook

Here are some suggestions for first tries to visualize your data, broken up into data types and combinations.

Numeric V1

Histogram for the distribution.

ggplot(data, aes(x=V1)) + geom_histogram()

Frequency curve for the distribution

ggplot(data, aes(x=V1)) + geom_freqpoly()

Boxplot for a summary (force to one group using an explicit label)

ggplot(data, aes(x="data", y=V1)) + geom_boxplot()

Categorical V1

Bar chart for the distribution (stat_count has a bar-chart as standard display).

ggplot(data, aes(x=V1)) + stat_count()

Stacked bar chart for the distribution. (like with the single variable boxplot, this requires us to force the points to one position)

ggplot(data, aes(x="data", fill=V1)) + stat_count()

Numeric V1 and numeric V2

Scatter plot of V2 against V1.


ggplot(data, aes(x=V1, y=V2)) + geom_point()

Illustrate numeric V3 as well using size:

ggplot(data, aes(x=V1, y=V2, size=V3)) + geom_point()

Illustrate categorical V3 as well using symbols:

ggplot(data, aes(x=V1, y=V2, shape=V3)) + geom_point()

Numeric V1 and categorical V2

Stacked histograms where the counts for each value of V2 are piled on top of the other values.

ggplot(data, aes(x=V1, fill=V2)) + geom_histogram()

Dodged histograms where for each range of values for V1, separate bars are drawn side by side for each value of V2.

ggplot(data, aes(x=V1, fill=V2)) + geom_histogram(position='dodge')

Multiple frequency curves

ggplot(data, aes(x=V1, color=V2)) + geom_freqpoly()

Boxplots of V1, one for each value of V2.


ggplot(data, aes(x=V2, y=V1)) + geom_boxplot()

Categorical V1 and categorical V2

2-way table plot where each point is scaled corresponding to that entry in the 2-way table of the data. When a table is converted to a data-frame, it has variables Var1 and Var2 containing the labels of V1 and V2 respectively; and Freq containing the joint count of each combination. We use shape=15 to make squares. The plot benefits from also using color=Freq.

ggplot(as.data.frame(table(data$V1, data$V2)), aes(x=Var1, y=Var2, size=Freq)) + geom_point(shape=15)

Stacked bar charts just like with the stacked histograms, but using stat_count instead of geom_histogram.

ggplot(data, aes(x=V2, fill=V1)) + stat_count(position='stack')

Dodged bar charts just like with the dodged histograms, but using stat_count instead of geom_histogram.

ggplot(data, aes(x=V2, fill=V1)) + stat_count(position='dodge')

Stretched bar charts optimized to show the conditional distribution of V1 as V2 is held fixed. Each column in a stacked bar chart is rescaled to have equal height (corresponding to 100%)

ggplot(data, aes(x=V2, fill=V1)) + stat_count(position='fill')