This page is not quite done yet: more to come about how to use R Markdown.

# R Markdown

R Markdown lets you weave in computations and graphs with the text you write: if you use other systems you will need to save out figures to import them, and copy-and-paste the values you compute. Staying in R Markdown means everything matches up with the data by default.

## Text

The basic content of an R Markdown document is text. Anything you don’t specifically mark as something different is taken to be plain and unadorned text.

You break your text up into paragraphs by using two newlines after one another.

## Headings

To get a heading, such as you would use to start a chapter or a section, you can use the `#` symbol at the front of the line: once for the largest heading, repeatedly to get smaller headings:

```
# First level heading
## Second level heading
### Third level heading
```

Alternatively, you can underline the heading text using `=` (largest heading) or `-` (second-level heading):

```
Another heading
----------------------
Yet another heading
===============
```

## Emphasis: *italic* and **bold**

To emphasize parts of the text you can surround the text you emphasize by `*` characters. Once for *italic* and twice for **bold**:

`*This is italic* and **this is bold**`

## Code blocks

Using the backtick `` ` `` character (top left on the keyboard, **not** the `'` character to the right!) you can mark text to be code and not normal text. Code is formatted using a fixed-width font and keeps all the spaces where they were, leaving the internal structure of the code where it is.

You can either put a code block *inline* in a sequence of text, by using `` ` `` to surround the code you are writing, so you could talk about e.g. commands like `sd` while writing text about them.

Alternatively, if you start with a line with a sequence of three backticks ```` ``` ````, this starts a separated code block. This block continues until you end it with another line with three backticks ```` ``` ````.

Both the inline and the separated code blocks can be marked as `R` code. Doing that lets RStudio know you want it to **run** the code in the code block and insert the results when done knitting. You mark an inline code block by adding the letter `r` after the backtick, producing something like

``This inserts a random number in a sequence of text `r rnorm(1)`.``

Similarly, you use the letter `r` enclosed in *curly braces* at the beginning of a separated code block:

````
```{r}
plot(x,y)
abline(lm(y~x))
```
````

## Lists

You can create bulleted lists by, after two newlines, starting each line with a `*` or a `-` followed by a space, so that:

```
* This is a
* bulleted list
```

produces

- This is a
- bulleted list

and

```
- this is another
- bulleted list
```

creates

- this is another
- bulleted list

By adding spaces in front, you can make sublists:

```
* outer list
    - inner list
    - inner list
* outer list again
```

produces

- outer list
    - inner list
    - inner list
- outer list again

You can create numbered lists using numbers followed by `.` instead of symbols. R Markdown doesn’t care which numbers you use: it numbers the items consecutively, so the list below goes from 1 to 3.

```
1. This
3. is
2. a list
```

creates

1. This
2. is
3. a list

# R commands

**Important note:** Everything in `R` cares about UPPER/lower case. Make sure you use the right one when you get errors.

`library`

: Load extra functionality.

`library(ggplot2)`

`install.packages`

: Ensure library packages are installed.

`install.packages("ggplot2")`

`#`

: Comments. You can add explanatory text inside a code block or in a script file by using `#`. All the text that follows a `#` character on a line is ignored by `R`.

`$`

: Column access. To use a single column of a dataset, instead of the entire dataset, you use `$` to pull out the column:

`data$variable`

`=`

: Assigning names. By writing something like

`name = value`

you put a name to a thing (dataset, number, string, …). This saves it for further use, and lets you refer to this thing without recomputing it every time.

`upper.95.rule = mean(data$variable) + 2*sd(data$variable)`

By assigning a value to an existing variable in a dataset, you change the data in that variable. By assigning to a variable that is not contained in a dataset, you create a new variable and fill it with that data. In either of these cases, your new value needs to have as many values as there are observations in the dataset.
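As a minimal sketch of these kinds of assignment, the example below uses a small made-up dataset. The names `my.data`, `height` and `height.m` are hypothetical and chosen only for illustration.

```r
# A small made-up dataset with one variable (the values are invented)
my.data = data.frame(height = c(150, 160, 170, 180))

# Assigning to a plain name saves a computed value for later use
upper.95.rule = mean(my.data$height) + 2*sd(my.data$height)

# Assigning to a variable that is not in the dataset yet creates it
my.data$height.m = my.data$height / 100

# Assigning to an existing variable replaces its values
my.data$height = my.data$height + 1
```

Both assignments into `my.data` supply one value per observation, so the dataset still has 4 observations afterwards.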

`c` *combines* any elements together into a *vector*: a sequence of things of the same kind. `c(2,5)` contains the numbers 2 and 5, `c("ggplot2", "knitr")` contains the two strings `"ggplot2"` and `"knitr"`.

`:` creates a sequence of values: `4:8` is the same as `c(4,5,6,7,8)`.

`data.frame` creates a dataset. It takes one parameter per variable: the parameter name is the variable name you want, and the parameter value is the vector of values for that variable. All these vectors have to be the same length. Example: `data = data.frame(V1=c(3,4,5,6,7), other.variable=4:8)`.

`[` and `]`

: Data access. If you have a *vector*, something like `c(1,2,5,7)` or `5:10`, using `name.of.the.vector[5]` returns the 5th element in the vector. If you have a dataset, you need to include a comma (`,`) between the `[` and the `]`. Anything before the `,` picks out the observations you want to see, anything after the `,` picks out variables.

```
my.vector[5] # the 5th element
my.vector[c(1,3,6)] # the 1st, 3rd and 6th elements
my.vector[2:4] # the 2nd, 3rd and 4th elements
my.data[1:5,] # the first 5 observations of a dataset
my.data[,3] # the 3rd variable of a dataset
my.data[,c("V1","V4")] # the variables named V1 and V4 in a dataset
my.data[5:10, c("V3", "V5")] # the 5th through 10th observations of the variables named V3 and V5 (names and positions cannot be mixed in one c())
my.data[my.data$V1 < 3,] # all the data such that the value of my.data$V1 is less than 3
my.data[my.data$CategoricalVariable == "value",] # all data where CategoricalVariable has the label "value"
```
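To try these access patterns without a data file, here is a small sketch on made-up data. The variable names `V1`, `V2` and `Group` and all the values are invented for this example.

```r
# Made-up data: a vector and a dataset with 6 observations of 3 variables
my.vector = c(10, 20, 30, 40, 50, 60)
my.data = data.frame(V1 = 1:6,
                     V2 = c(2.5, 3.5, 1.0, 4.0, 0.5, 6.0),
                     Group = c("a", "b", "a", "b", "a", "b"))

picked = my.vector[c(1,3,6)]              # the 1st, 3rd and 6th elements: 10 30 60
first.rows = my.data[1:2,]                # observations 1 and 2, all variables
small = my.data[my.data$V1 < 3,]          # observations where V1 is less than 3
group.a = my.data[my.data$Group == "a",]  # observations labeled "a"
```

Note the trailing comma in the last three lines: it is what tells `R` you are picking observations rather than variables.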

`sample`

: Randomly select from a vector, such as a single variable in a dataset. Takes as first argument the thing you want to sample from, and as second the number of samples you want to see. If you sample as many values as your data contains, this shuffles your data into random order. In base `R`, calling `sample` on an entire dataset picks among its variables rather than its observations; to sample whole observations, sample row numbers and use them for indexing, as in `data[sample(nrow(data), 500),]`.

`sample(data$variable, 500) # draw 500 values from a variable in random order`
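A minimal sketch of sampling in base `R`, using a made-up dataset. The call to `set.seed` is an addition for this example that makes the random draws repeatable. Sampling row numbers and indexing with them is one way to draw whole observations.

```r
set.seed(1)  # fix the random numbers so the example is repeatable

# Made-up dataset with 10 observations
my.data = data.frame(V1 = 1:10, V2 = (1:10)^2)

# Draw 4 values from a single variable
some.values = sample(my.data$V1, 4)

# Draw 4 whole observations: sample row numbers, then index with them
some.rows = my.data[sample(nrow(my.data), 4),]

# Sampling as many rows as there are shuffles the dataset
shuffled = my.data[sample(nrow(my.data), nrow(my.data)),]
```

Because whole rows are picked, the values of `V1` and `V2` in `some.rows` and `shuffled` still belong together.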

`nrow` counts the number of observations in a dataset. `dim` gives you the numbers of observations and variables at the same time.

```
number.of.observations = nrow(data)
number.of.observations = dim(data)[1]
number.of.variables = dim(data)[2]
```

### Reading data

`read.csv`, `read.delim`, `read.table`

: Read data from a file. Use `read.csv` if the file is comma-separated, `read.delim` if it is tab-separated, and otherwise use `read.table` and give it an explicit separator.

```
my.data = read.csv("datafile.csv")
my.data = read.delim("datafile.txt")
my.data = read.table("datafile.dat", sep=";")
```

`read.csv` and `read.delim` assume your data file has a header: a first row with the names of the variables. (`read.table` by default assumes there is no header; add `header=TRUE` if there is one.) If the data starts directly with the first row in the file, you should add the parameter `header=FALSE`. If you do this, `R` will name your variables `V1, V2, …`. Notice that these are uppercase `V`, not lowercase `v`.

```
my.data = read.csv("datafile.csv", header=FALSE)
my.data = read.delim("datafile.txt", header=FALSE)
my.data = read.table("datafile.dat", sep=";", header=FALSE)
```

If your dataset is **very** large, you can load only the first part of it using the `nrows` parameter:

`my.data = read.csv("datafile.csv", nrows=1500)`
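The effect of `header` and `nrows` can be checked with a sketch like the following. It writes a tiny made-up file to a temporary location first, so it runs on its own; the file contents and the names `my.file`, `with.header`, `no.header` and `first.two` are invented for illustration.

```r
# Write a tiny made-up file so the example runs on its own
my.file = tempfile(fileext = ".csv")
writeLines(c("name,score", "a,1", "b,2", "c,3"), my.file)

with.header = read.csv(my.file)                # first row becomes the variable names
no.header   = read.csv(my.file, header=FALSE)  # first row is treated as data
first.two   = read.csv(my.file, nrows=2)       # only the first 2 observations

names(with.header)  # "name" "score"
names(no.header)    # "V1" "V2"

unlink(my.file)     # remove the temporary file again
```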

### Numeric statistics

`mean`, `median`, `sd`, `var`, `quantile`, `IQR`

: Compute numeric statistics from data. If you have missing data, these will return `NA` (`quantile` and `IQR` stop with an error instead). You can use the parameter `na.rm=TRUE` to compute only on the data that is not missing. For the `quantile` function you say which quantile you need. Use `0.5` for the median, `0.25` and `0.75` for the two quartiles, and whatever fraction is appropriate otherwise.

```
mean(data$variable)
median(data$variable, na.rm=TRUE)
quantile(data$variable, 0.33)
```
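To see what missing data does to these statistics, here is a short sketch on a made-up vector with one missing value. The name `values` and the numbers are invented for this example.

```r
values = c(2, 4, 4, 5, NA, 9)   # made-up data with one missing value

mean(values)                    # NA, because of the missing value
mean(values, na.rm=TRUE)        # 4.8
median(values, na.rm=TRUE)      # 4
quantile(values, c(0.25, 0.75), na.rm=TRUE)  # the two quartiles
IQR(values, na.rm=TRUE)         # distance between the two quartiles
```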

`cor`

: Compute correlation for two variables. These need to be equal length. You can handle missing data using the parameter `use="complete.obs"`:

```
cor(data$V1, data$V2)
cor(data$V1, data$V3, use="complete.obs")
```

### Relationships between variables

`lm`

: Compute a linear regression for two (or more) variables. Takes a *formula* written with the `~` symbol and names of variables, as well as the dataset itself. You will usually want to assign the result to a name so you can access details later on. You can access coefficients and residuals with their corresponding commands.

```
model = lm(V1 ~ V3, data)
coefficients(model)
residuals(model)
```
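As a sketch of what the results look like, the example below fits a regression to made-up data that lies exactly on the line `V1 = 1 + 2*V3`, so the expected coefficients are known in advance. The data values are invented for illustration.

```r
# Made-up data that follows V1 = 1 + 2*V3 exactly
data = data.frame(V3 = c(1, 2, 3, 4, 5))
data$V1 = 1 + 2*data$V3

model = lm(V1 ~ V3, data)
coefficients(model)   # intercept 1 and slope 2
residuals(model)      # all (numerically) zero, since the data lies on a line
```

With real data the residuals will not be zero; looking at them is one way to judge how well the line fits.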

`table`

: Create a two-way (or more-way) table from variables. These need to be equal length. You can extract entries using indexing with `[,]`, compute joint and conditional distributions with `prop.table`, and compute margin counts and distributions with `margin.table`.

```
my.table = table(wildfire$GeneralDesc, wildfire$Fuel)
my.table[1,3] # the value from row 1, column 3
my.table["Lightning",] # all values from the row with the label "Lightning"
my.table[,"X"] # all values from the column with the label "X"
prop.table(my.table) # joint distribution
prop.table(my.table, 1) # conditional distributions conditioned on rows (1 : first variable)
prop.table(my.table, 2) # conditional distributions conditioned on columns
margin.table(my.table) # sum of all entries
margin.table(my.table, 1) # row sums
margin.table(my.table, 2) # column sums
margin.table(my.table, 1)/margin.table(my.table) # row margin distribution
margin.table(my.table, 2)/margin.table(my.table) # column margin distribution
```
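The `wildfire` dataset is not included here, so the following sketch uses two short made-up vectors instead. The names `cause` and `fuel` and their values are invented for illustration.

```r
# Made-up categorical data (not the wildfire dataset used above)
cause = c("Lightning", "Human", "Human", "Lightning", "Human", "Human")
fuel  = c("Grass", "Grass", "Timber", "Timber", "Grass", "Timber")

my.table = table(cause, fuel)
my.table["Human", "Grass"]    # 2 observations with this combination
margin.table(my.table, 1)     # row sums: Human 4, Lightning 2
prop.table(my.table, 1)       # conditional distributions: each row sums to 1
```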

### Creating categories

If your dataset does not *have* more than one categorical variable, or if you want to compare a categorical and a numeric variable with each other, `R` provides the command `cut`. In this example, the range of a numeric variable gets separated into 5 equally wide intervals, and each observation is labeled with the interval that its value of `NumericVariable` belongs to.

`data$NewCategoricalVariable = cut(data$NumericVariable, 5)`
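A small sketch of `cut` on made-up data, so you can see the interval labels it produces. The values are invented for this example.

```r
# Made-up numeric data
data = data.frame(NumericVariable = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

data$NewCategoricalVariable = cut(data$NumericVariable, 5)
levels(data$NewCategoricalVariable)   # the labels of the 5 intervals
table(data$NewCategoricalVariable)    # how many observations fall in each interval
```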

### Random variables and distributions

All random distributions in `R` have four different functions defined:

`r`*distribution*`(n, ...)`

: Generate `n` random numbers.

`d`*distribution*`(x, ...)`

: Compute the probability / probability density at `x`.

`p`*distribution*`(q, ...)`

: Compute the probability of the random variable taking a value less than or equal to `q`.

`q`*distribution*`(p, ...)`

: Compute a value `q` such that the probability of being less than or equal to `q` is `p`.

The `...` here stands for the parameters defining each probability distribution, which are different for each distribution.

*Uniform*

: `runif`, `dunif`, `punif`, `qunif`. Parameters `min` and `max`.

*Binomial*

: `rbinom`, `dbinom`, `pbinom`, `qbinom`. Parameters `prob` and `size`.

*Negative binomial*

: `rnbinom`, `dnbinom`, `pnbinom`, `qnbinom`. Parameters `prob` and `size`.

*Multinomial*

: `rmultinom`, `dmultinom`. Parameters `size` (single integer) and `prob` (vector of probabilities). No `p` or `q` functions; for multivariate (vector valued; give more than one number as an outcome) distributions, these functions are less useful.

*Poisson*

: `rpois`, `dpois`, `ppois`, `qpois`. Parameter `lambda` for the rate.

*Exponential*

: `rexp`, `dexp`, `pexp`, `qexp`. Parameter `rate` for the rate.

*Normal*

: `rnorm`, `dnorm`, `pnorm`, `qnorm`. Parameters `mean` and `sd`.

*Chi-square*

: `rchisq`, `dchisq`, `pchisq`, `qchisq`. Parameter `df`.

*Student’s t*

: `rt`, `dt`, `pt`, `qt`. Parameter `df`.

*F*

: `rf`, `df`, `pf`, `qf`. Parameters `df1` and `df2`.
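To make the four kinds of functions concrete, here is a short sketch using the normal and binomial distributions. The parameter values are arbitrary choices for this example, and `set.seed` is only there to make the random numbers repeatable.

```r
set.seed(42)   # fix the random numbers so the example is repeatable

x = rnorm(5, mean=10, sd=2)   # 5 random numbers from a normal distribution
dnorm(0)                      # density of the standard normal at 0, about 0.399
pnorm(1.96)                   # probability of a value below 1.96, about 0.975
qnorm(0.975)                  # the value with probability 0.975 below it, about 1.96

dbinom(3, size=10, prob=0.5)  # probability of exactly 3 successes in 10 tries
pbinom(3, size=10, prob=0.5)  # probability of at most 3 successes in 10 tries
```

The `p` and `q` functions undo each other: `pnorm(qnorm(0.975))` gives back `0.975`.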

### Plots

The *quantile-quantile plot* (QQplot) is useful to match up distributions; if both datasets follow the same kind of distribution, the QQplot will approximate a straight line.

```
qqplot(x,y) # plots quantiles of x against quantiles of y
qqnorm(x) # plots quantiles of x against a normal distribution
qqline(x) # adds a reference line through the quartiles to a plot created by qqnorm(x)
```

*Scatter plots* through `plot(x,y)`.

*Histograms* through `hist(x)`. Also useful is `hist(x, probability=TRUE)` to get a rescaled histogram that fits with plotting the `d`*distribution* functions.
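As a sketch of how a rescaled histogram and a `d`*distribution* function fit together, the example below draws a made-up normal sample and overlays the matching density using `curve`, a base `R` function not otherwise covered on this page. The sample size and parameters are arbitrary.

```r
set.seed(7)
x = rnorm(200, mean=3, sd=5)              # made-up sample

h = hist(x, probability=TRUE)             # rescaled so the bar areas add up to 1
curve(dnorm(x, mean=3, sd=5), add=TRUE)   # overlay the normal density on the histogram

qqnorm(x)   # quantiles of x against a normal distribution
qqline(x)   # reference line through the quartiles
```

Since the sample really is normal, the points in the QQ plot should lie close to the reference line.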

### ggplot2

The package `ggplot2` produces good graphics for a lot of use cases. It is built on the idea of a *graphics grammar*: you create a plot by first creating a plotting object associated with a dataset, and then adding features to the object. The dataset has to be an actual dataset, produced by `data.frame`, `read.csv` or similar functions; it cannot be a `table` or a vector. You can force things to be a dataset using the function `as.data.frame`.

As features are added, they need a mapping of aesthetic features to data variables. These are created with the `aes` command inside any of the `ggplot2` commands.

Thus, with a command like

`ggplot(data, aes(x=V1, y=V2)) + geom_point() + geom_smooth(method="lm")`

we first create the plot object, give it the dataset `data`, and associate the values of the variable `V1` with `x`-positions and the values of `V2` with `y`-positions. Then we add a *points* feature, telling the system to plot a point for each observation, by adding (with `+`) the `geom_point` function. `geom_point` pays attention to (among other things) the `x` and `y` mappings and uses these to position each point. Next we add `geom_smooth`, which produces a smooth curve approximating the data. This function can produce its curve in many different ways; the one we have been using is chosen using `method="lm"`, which fits and plots a linear regression from the data.
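The same plot can be built up one step at a time, which makes the "object plus features" idea visible. This is a sketch on made-up data and assumes the `ggplot2` package is installed; the name `p` for the plot object is a hypothetical choice.

```r
library(ggplot2)

# Made-up dataset with a roughly linear relationship
set.seed(3)
data = data.frame(V1 = 1:50)
data$V2 = 2*data$V1 + rnorm(50, sd=5)

# Create the plot object, then add features one at a time
p = ggplot(data, aes(x=V1, y=V2))
p = p + geom_point()
p = p + geom_smooth(method="lm")

print(p)   # the plot is drawn when the plot object is printed
```

Keeping the plot in a variable like this lets you try out different features without retyping the first part of the command.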

(Some) aesthetics can be set to a fixed value outside of the `aes` call, which influences the plot without creating new information in a plot legend.

Almost everything takes `alpha` (transparency), `colour` and `fill` (colors; might not both be used, depending on the shape of the point) and `size`.

Some useful features are, with some of the aesthetic mappings they care about:

`geom_point`

: Requires `x` and `y`, also uses `shape`, `size`.

`geom_jitter`

: Used just like `geom_point`, but moves all datapoints around a little bit, so that data with many repeated values are easier to visualize. Takes parameters `width` and `height` to influence how far points are moved around. Values over `0.5` will make categorical data hard to distinguish.

`geom_histogram` and `geom_freqpoly`

: Requires `x`. Takes (outside of `aes`) the optional parameters `bins` (number of bars), `binwidth` (width of the bars), `center` or `boundary` (shifts the bars by specifying one center or one boundary for a bar), `na.rm` (remove warning printout for missing values). Creates the new variables `..density..` and `..count..` that can be used in later features. Set `y` to one of these to modify whether you generate a density histogram (approximating the `d`*distribution* functions) or count histogram.

`geom_qq`

: QQplots, just like `qqplot`. Takes an optional distribution function (`q`*distribution*) together with its parameters as additional parameters. Data needs to be given in the aesthetic mapping `sample`. Example:

`ggplot(data.frame(sample=rnorm(100)), aes(sample=sample)) + geom_qq(distribution=qnorm, dparams=c(mean=3, sd=5))`

`geom_boxplot`

: Requires `x` and `y`. `x` should be categorical (splits the data on different boxplots), and `y` should be numeric (the data for each boxplot). Can be combined with `geom_point` or `geom_jitter` to good results. Example:

`ggplot(data.frame(xs=c(rnorm(100,mean=-1), rnorm(100,mean=+1)), ls=c(rep("a",100), rep("b", 100))), aes(x=ls, y=xs)) + geom_boxplot(notch=TRUE) + geom_jitter(width=0.1, height=0, alpha=0.3)`

`stat_function`

: Takes parameters `fun` and `args` with a function and its additional parameters. A typical example we have used would be `+ stat_function(fun=dunif, args=c(min=2, max=5))`. Example:

`ggplot(data.frame(x=runif(1000,2,5)), aes(x=x)) + geom_histogram(aes(y=..density..), boundary=2, binwidth = 0.1) + stat_function(fun=dunif, args=c(min=2,max=5))`

# Plotting cookbook

Here are some suggestions for first tries to visualize your data, broken up into data types and combinations.

### Numeric `V1`

**Histogram** for the distribution.

`ggplot(data, aes(x=V1)) + geom_histogram()`

**Frequency curve** for the distribution

`ggplot(data, aes(x=V1)) + geom_freqpoly()`

**Boxplot** for a summary (force to one group using an explicit label)

`ggplot(data, aes(x="data", y=V1)) + geom_boxplot()`

### Categorical `V1`

**Bar chart** for the distribution (`stat_count` has a bar chart as standard display).

`ggplot(data, aes(x=V1)) + stat_count()`

**Stacked bar chart** for the distribution. (As with the single-variable boxplot, this requires us to force everything to one position.)

`ggplot(data, aes(x="data", fill=V1)) + stat_count()`

### Numeric `V1` and numeric `V2`

**Scatterplot**

`ggplot(data, aes(x=V1, y=V2)) + geom_point()`

Illustrate **numeric V3** as well using size:

`ggplot(data, aes(x=V1, y=V2, size=V3)) + geom_point()`

Illustrate **categorical V3** as well using symbols:

`ggplot(data, aes(x=V1, y=V2, shape=V3)) + geom_point()`

### Numeric `V1` and categorical `V2`

**Stacked histograms** where the counts for each value of `V2` are piled on top of the other values.

`ggplot(data, aes(x=V1, fill=V2)) + geom_histogram()`

**Dodged histograms** where for each range of values for `V1`, separate bars are drawn side by side for each value of `V2`.

`ggplot(data, aes(x=V1, fill=V2)) + geom_histogram(position='dodge')`

**Multiple frequency curves**

`ggplot(data, aes(x=V1, color=V2)) + geom_freqpoly()`

**Boxplots**

`ggplot(data, aes(x=V2, y=V1)) + geom_boxplot()`

### Categorical `V1` and categorical `V2`

**2-way table plot** where each point is scaled corresponding to that entry in the 2-way table of the data. When a table is converted to a data frame, it has variables `Var1` and `Var2` containing the labels of `V1` and `V2` respectively, and `Freq` containing the joint count of each combination. We use `shape=15` to make squares. The plot benefits from also using `color=Freq`.

`ggplot(as.data.frame(table(data$V1, data$V2)), aes(x=Var1, y=Var2, size=Freq)) + geom_point(shape=15)`

**Stacked bar charts** just like with the stacked histograms, but using `stat_count` instead of `geom_histogram`.

`ggplot(data, aes(x=V2, fill=V1)) + stat_count(position='stack')`

**Dodged bar charts** just like with the dodged histograms, but using `stat_count` instead of `geom_histogram`.

`ggplot(data, aes(x=V2, fill=V1)) + stat_count(position='dodge')`

**Stretched bar charts** optimized to show the **conditional distribution** of `V1` as `V2` is held fixed. Each column in a stacked bar chart is rescaled to have equal height (corresponding to 100%).

`ggplot(data, aes(x=V2, fill=V1)) + stat_count(position='fill')`