For this lab you should submit, on Blackboard, your .Rmd and .docx files at the end of the lab hour.
Remember to describe and discuss all the tasks and not just show the results in isolation.
R implements a number of functions for each probability distribution.
These have the general pattern of starting with one of r, d, p or q, followed by an abbreviation of the name of the distribution:
Function name | Effect | Input | Output |
---|---|---|---|
r dist | Generate random numbers | Number of random numbers | Vector of random numbers |
d dist | Density function (PDF) | \(x\) to evaluate at | \(\mathbb{P}(X=x)\) for discrete distributions; the density at \(x\) for continuous ones |
p dist | Distribution function (CDF) | \(x\) to evaluate at | \(\mathbb{P}(X\leq x)\) |
q dist | Quantile function (inverse CDF) | \(q\), the quantile to find | \(x\) such that p dist(x) = q |
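As a minimal sketch of this naming pattern, the four functions for the normal distribution in base R can be called as follows. The specific values (mean 0, standard deviation 1, and the seed) are chosen here purely for illustration.

```r
set.seed(1)                          # illustrative seed, only for reproducibility

x <- rnorm(5, mean = 0, sd = 1)      # r: five random draws
d <- dnorm(0, mean = 0, sd = 1)      # d: density at x = 0, i.e. 1/sqrt(2*pi)
p <- pnorm(1.96, mean = 0, sd = 1)   # p: P(X <= 1.96), roughly 0.975
q <- qnorm(0.975, mean = 0, sd = 1)  # q: the x with P(X <= x) = 0.975, roughly 1.96

# The q and p functions undo each other
pnorm(q)
```

The same four prefixes work with any of the distribution abbreviations; only the parameter names change.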
The most common distributions we will encounter are:
Distribution name | Distribution abbreviation | Parameters | Where it shows up |
---|---|---|---|
Normal | norm | mean and sd | Limit distribution for means and sums of independent, identically distributed random variables; approximates many natural distributions |
Student's T | t | df | Used in tests for means |
Uniform | unif | min , max | Continuous uniform distribution |
Discrete Uniform | dunif | a , b | Discrete uniform distribution |
Binomial | binom | size (\(n\)) , prob (\(p\)) | Number of successes in \(n\) repeated trials with success probability \(p\) |
Poisson | pois | lambda | Number of observations per time unit, arriving at an average rate of lambda per time unit |
Exponential | exp | rate | Time to the next observation when observations arrive according to a Poisson distribution |
Data / Empirical | data | dataset data ; formula for what variable in the dataset to use (including splitting by and predicted by variables) | |
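To illustrate how the abbreviations and parameters in the table combine with the r, d, p, q prefixes, here is a small sketch using base R. The numbers are arbitrary examples; note that the binomial parameters \(n\) and \(p\) are passed to R under the argument names size and prob.

```r
# Binomial: P(X = 3) for 10 trials with success probability 0.5
p_three <- dbinom(3, size = 10, prob = 0.5)

# Poisson: P(X <= 2) when observations arrive at an average rate of 4 per time unit
p_atmost2 <- ppois(2, lambda = 4)

# Exponential: median waiting time to the next observation at rate 4
med_wait <- qexp(0.5, rate = 4)

# Student's t: the 97.5% quantile with 10 degrees of freedom
t_crit <- qt(0.975, df = 10)

c(p_three, p_atmost2, med_wait, t_crit)
```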
Plotting distributions
ggformula makes it easy to plot distributions, including ranges within a distribution.
The command gf_dist takes the name of a distribution (such as norm, unif, pois, ...), a list of parameters for the distribution, and a kind of plot (density curve, CDF curve, quantile plot, histogram).
By adding fill=~ followed by some expression in x, together with geom="area", you can fill in regions under the curve.
Task Try out the following example code to see the command in action:
gf_dist("norm", geom="area", fill=~(abs(x) < 0.5), params=list(mean=0.5, sd=1.5))
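Beyond the filled density above, the kind of plot can be switched as well. The following is a sketch, assuming the ggformula package is installed; the Poisson rate of 3 is an arbitrary choice for illustration.

```r
library(ggformula)  # gf_dist() comes from ggformula

# CDF curve for the same normal distribution as in the example above
cdf_plot <- gf_dist("norm", kind = "cdf", params = list(mean = 0.5, sd = 1.5))

# A discrete distribution: Poisson with an (arbitrarily chosen) rate of 3
pois_plot <- gf_dist("pois", params = list(lambda = 3))

cdf_plot
pois_plot
```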
Task Plot the uniform distribution between 1 and 5 as a density curve.
Advanced task Highlight the region between 2 and 3.
Random numbers
The discrete uniform distribution is helpful to generate dice rolls or coin flips etc.
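As a small illustration of discrete uniform sampling, coin flips can be sketched with base R's sample(). This is a stand-in chosen here because it needs no extra packages; it is not necessarily the function the course setup intends for the dunif abbreviation.

```r
set.seed(2)  # illustrative seed, only for reproducibility

# 20 fair coin flips: each outcome is equally likely on every draw
flips <- sample(c("H", "T"), size = 20, replace = TRUE)
table(flips)  # counts of heads and tails

# The same idea with numeric outcomes, here a four-sided spinner
spins <- sample(1:4, size = 20, replace = TRUE)
table(spins)
```

Setting replace = TRUE matters: without it, sample() would draw each outcome at most once.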
Task Generate 1000 rolls of a 6-sided die. Plot a bar chart of the result.
Task Generate another 1000 rolls of a 6-sided die. Sum the two vectors of random numbers, and plot a bar chart of the sum.
Task Pick a mean and a standard deviation. Generate 1000 normally distributed numbers. Plot a density histogram (with for instance gf_dhistogram) and in the same plot also plot the normal density with the same mean and standard deviation.
The Bootstrap
The data distribution can be used for an advanced statistical technique for gaining detailed information about sample statistics. The sample mean, for instance, approximates the true mean better and better with larger and larger samples. The actual distribution of the sample mean itself as a random variable can be approximated by taking repeated samples from the data itself.
This technique, called the bootstrap, is a powerful tool that can give a lot more information than the hypothesis tests we will be learning.
This is easier to do using the mosaic command do.
This command works as do(n) [where n is the number of times you want to repeat the thing you are doing] multiplied by the thing you actually are doing.
One example, using the iris dataset and with 30 samples each time, can look like this:
bootstrap.means = do(500)*mean(rdata(~Sepal.Length, 30, data=iris))
gf_dhistogram(~mean, data=bootstrap.means)
A more precise estimate could be created using more samples.
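To make clear what the do(500) line is repeating, here is a rough base R equivalent. It is a sketch under the assumption that rdata draws values with replacement from the data; replicate() and sample() stand in for do() and rdata().

```r
set.seed(3)  # illustrative seed, only for reproducibility

# One repetition: draw 30 values with replacement from the data, then average them
one_mean <- function() {
  mean(sample(iris$Sepal.Length, size = 30, replace = TRUE))
}

# Repeat 500 times, which is the role do(500) plays
boot_means <- replicate(500, one_mean())

mean(boot_means)  # centre of the bootstrap distribution
sd(boot_means)    # spread of the sample mean at this sample size
```

Unlike do(), which returns a data frame, replicate() returns a plain numeric vector here.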
Task Repeat the computation with 100 samples each time. Plot a histogram or frequency curve of both. What's the difference when increasing sample size?
Task Use the bootstrap to show the distribution of the sample standard deviation for Sepal.Length.