For this lab, submit your .Rmd and .docx files on Blackboard by the end of the lab hour.

Remember to describe and discuss all the tasks and not just show the results in isolation.

R implements a set of functions for each probability distribution. Each function name starts with one of r, d, p or q, followed by an abbreviation of the distribution's name:

| Function name | Effect | Input | Output |
|---|---|---|---|
| rdist | Generate random numbers | Number of random numbers | Vector of random numbers |
| ddist | Density function (PDF) | \(x\) to evaluate at | \(\mathbb{P}(X=x)\) for discrete distributions; the density \(f(x)\) for continuous ones |
| pdist | Distribution function (CDF) | \(x\) to evaluate at | \(\mathbb{P}(X\leq x)\) |
| qdist | Quantile function (inverse CDF) | \(q\), a probability between 0 and 1 | \(x\) such that pdist(x) = q |
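
As a quick illustration of the four prefixes, the sketch below applies them to the standard normal distribution. The choice of distribution, the seed and the variable names are arbitrary and made only for this example.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility

x <- rnorm(5)                    # r: five random draws from N(0, 1)
density_at_0 <- dnorm(0)         # d: height of the density curve at x = 0
prob_below <- pnorm(1.96)        # p: P(X <= 1.96), roughly 0.975
cutoff <- qnorm(0.975)           # q: the x with P(X <= x) = 0.975, roughly 1.96

# q and p undo each other
round_trip <- pnorm(qnorm(0.3))  # gives back 0.3 (up to rounding error)
```

Note that dnorm returns a density, not a probability: for a continuous distribution the probability of any single value is zero.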

The most common distributions we will encounter are:

| Distribution name | Distribution abbreviation | Parameters | Where it shows up |
|---|---|---|---|
| Normal | norm | mean, sd | Limit distribution for means and sums of independent, identically distributed random variables; approximates many natural distributions |
| Student's t | t | df | Used in tests about means |
| Uniform | unif | min, max | Continuous uniform distribution |
| Discrete uniform | dunif | a, b | Discrete uniform distribution |
| Binomial | binom | size (\(n\)), prob (\(p\)) | Number of successes in \(n\) independent trials, each with success probability \(p\) |
| Poisson | pois | lambda | Number of observations per time unit, arriving at an average rate of lambda per time unit |
| Exponential | exp | rate | Time to the next observation when observations arrive as a Poisson process |
| Data / Empirical | data | data: the dataset; formula: which variable in the dataset to use (including splitting by and predicted by variables) | |
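
To show how a prefix combines with an abbreviation from the table, here is a small sketch using base R. The particular numbers and variable names are arbitrary illustrations. In R the binomial parameters \(n\) and \(p\) are passed as the arguments size and prob.

```r
# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
p_three <- dbinom(3, size = 10, prob = 0.5)      # equals choose(10, 3) / 2^10

# Poisson: probability of at most 2 observations when lambda = 4
p_at_most_two <- ppois(2, lambda = 4)

# Exponential: median waiting time when the rate is 4 per time unit
median_wait <- qexp(0.5, rate = 4)               # equals log(2) / 4

# Continuous uniform: the density is constant, 1 / (max - min)
unif_height <- dunif(2, min = 1, max = 5)        # equals 0.25
```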

Plotting distributions

ggformula makes it easy to plot distributions - including ranges within a distribution.

The command gf_dist takes the name of a distribution (such as norm, unif, pois, ...), a list of parameters for the distribution, and the kind of plot (density curve, CDF curve, quantile plot or histogram). Adding fill=~ followed by some expression in x, together with geom="area", fills in regions under the curve.

Task Try out the following example code to see the command in action:

gf_dist("norm", geom="area", fill=~(abs(x) < 0.5), params=list(mean=0.5, sd=1.5))

Task Plot the uniform distribution between 1 and 5 as a density curve.

Advanced task Highlight the region between 2 and 3.

Random numbers

The discrete uniform distribution is useful for generating dice rolls, coin flips and other outcomes that are all equally likely.
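
A minimal sketch of equally likely outcomes, here a run of coin flips. It uses base R's sample function, which is an assumption of this example and not the discrete uniform function listed in the table; the seed and the variable names are arbitrary.

```r
set.seed(2024)                                  # arbitrary seed for reproducibility

# 20 fair coin flips: each face is equally likely on every draw
flips <- sample(c("H", "T"), size = 20, replace = TRUE)

# Count how often each face came up
flip_counts <- table(flips)
```

Setting replace = TRUE matters: without it, sample would draw each face at most once.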

Task Generate 1000 rolls of a 6-sided die. Plot a bar chart of the result.

Task Generate another 1000 rolls of a 6-sided die. Sum the two vectors of random numbers, and plot a bar chart of the sum.

Task Pick a mean and a standard deviation. Generate 1000 normally distributed numbers with those parameters. Plot a density histogram (for instance with gf_dhistogram) and, in the same plot, the normal density with the same mean and standard deviation.

The Bootstrap

The data distribution can be used for an advanced statistical technique that gives detailed information about sample statistics. The sample mean, for instance, approximates the true mean better and better as the sample grows. The sample mean is itself a random variable, and its distribution can be approximated by repeatedly drawing samples from the data itself.

This technique, called the bootstrap, is a powerful tool that can give a lot more information than the hypothesis tests we will be learning.

This is easiest to do with the mosaic command do. You write do(n), where n is the number of times you want to repeat the computation, multiplied by the computation itself.

One example, using the iris dataset and drawing 30 values each time, can look like this:

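library(mosaic)  # provides do() and rdata(); also loads ggformula
# 500 times: draw 30 values of Sepal.Length and record their mean
bootstrap.means = do(500)*mean(rdata(~Sepal.Length, 30, data=iris))
gf_dhistogram(~mean, data=bootstrap.means)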
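
To make the repetition explicit, here is a rough base R analogue of the same computation. It is a sketch that assumes each repetition draws its 30 values with replacement; replicate and sample stand in for do and rdata, and the name boot_means and the seed are chosen only for this example.

```r
set.seed(42)                                     # arbitrary seed for reproducibility

# One repetition: draw 30 values of Sepal.Length with replacement, take the mean.
# replicate() evaluates that expression 500 times and collects the results.
boot_means <- replicate(
  500,
  mean(sample(iris$Sepal.Length, size = 30, replace = TRUE))
)

# Spread of the resampled means, and a rough 95% interval from their quantiles
boot_sd <- sd(boot_means)
boot_interval <- quantile(boot_means, c(0.025, 0.975))
```

The result is a plain numeric vector, not a data frame, so it would need to be wrapped in a data frame before being passed to a formula-based plotting command.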

A more precise estimate can be obtained by drawing more values each time.

Task Repeat the computation, drawing 100 values each time. Plot a histogram or frequency curve of both results. What changes when the sample size increases?

Task Use the bootstrap to show the distribution of the sample standard deviation for Sepal.Length.