There are 2 big theorems from probability that are used in statistics:
The law of large numbers: Assume a population with mean \(\mu\). Let \(x_1\), \(x_2\), … be a random sample from this population. Then, as the sample size \(n\) grows, the sample mean \(\bar{x}\) of the first \(n\) values “approaches” the population mean \(\mu\).
To visualize this (in the 4th project), we took \(n\) to be ever increasing and plotted the sample mean for different \(n\) and saw that despite the randomness, this eventually came to hug the line specified by \(y=\mu\).
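The running-mean picture described above can be sketched as follows. This is not the code from the 4th project; it assumes the exponential population used later in this section (so \(\mu=1\)), and the value of N and the seed are illustrative choices.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
N  = 5000                          # largest sample size considered (an assumption)
mu = 1                             # mean of the rexp population (rate 1)
xs = rexp(N)                       # x_1, x_2, ..., x_N
running_mean = cumsum(xs) / (1:N)  # sample mean of the first n values, n = 1..N
plot(1:N, running_mean, type="l", xlab="n", ylab="sample mean")
abline(h = mu, col="red")          # the line y = mu
```

Rerunning without the seed gives a different path each time, but each one settles near the red line as n grows.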
The central limit theorem: Let the population have mean \(\mu\) and finite standard deviation \(\sigma\). Then when \(n\) is large, the sampling distribution of the mean \(\bar{x}\) of a random sample of size \(n\) is approximately normal with mean \(\mu\) and standard deviation \(\sigma/\sqrt{n}\).
We can visualize this using the following code snippet as a template. In the following, the population is specified through dexp and random samples come from the function rexp:
M = 1000 # number of sample means to simulate
n = 10 # size of each sample
cols = rainbow(5) # one color per displayed sample
curve(dexp, -1, 4, ylim=1.1 * c(0, dnorm(0, sd=1/sqrt(n)))) # draw population
for (i in 1:5) { # show a few samples
  xs = rexp(n)
  points(xs, (i-1)/10 + 0*xs, col=cols[i]) # draw a sample of size n
  abline(v = mean(xs), col=cols[i]) # mark the sample mean with a vertical line
}
xbars = replicate(M, {
  xs = rexp(n)
  mean(xs)
})
lines(density(xbars), lwd=5, col="blue") # estimated sampling distribution of the sample mean
We want to understand this diagram and how it illustrates the central limit theorem.
In the example above, we have M=1000. This means that a sample mean is computed 1000 times, each from a fresh random sample of size n (the xbars), and the results are visualized through a density plot (the last line).
The moral here is that a large sample can inform us about the shape of the underlying population the sample is from.
Let’s see how large:
QUESTION: Take \(M=3\) and compare the two graphs produced below (not shown):
M = 3
curve(dexp, -1, 4)
xs = rexp(M)
lines(density(xs))
Does the density plot look like the theoretical distribution?
QUESTION: Repeat with \(M=20\).
QUESTION: Repeat with \(M=1000\).
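To make the three comparisons side by side, the snippet above can be wrapped in a small function. This is only a sketch: compare_density is a hypothetical helper name, not something defined elsewhere in these notes, and the three-panel layout is one possible choice.

```r
# hypothetical helper: overlay the density estimate from M draws on the population
compare_density = function(M) {
  xs = rexp(M)
  curve(dexp, -1, 4, main=paste("M =", M))  # theoretical population density
  lines(density(xs), col="blue")            # density estimate from the sample
  invisible(xs)                             # return the sample without printing it
}
par(mfrow=c(1, 3))                          # three panels side by side
for (M in c(3, 20, 1000)) compare_density(M)
par(mfrow=c(1, 1))                          # restore the default layout
```

Because the samples are random, each run draws different blue curves; the question is how their agreement with the black curve changes with M.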
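QUESTION: Refer to the main figure. The central limit theorem says that the sampling distribution of \(\bar{x}\) has mean \(\mu\), the same mean as the population. What part(s) of the figure represent this statement?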
QUESTION: Change to n=2. (Then the sampling distribution will not be bell-shaped.) Is it still the case that the two means (that of the population and that of the sampling distribution) are equal?
QUESTION: Take n=10 and n=160 and make the main figure for each. Estimate the standard deviation of the sampling distribution (by eye: the distance between the inflection points is 2 standard deviations). It should be that the standard deviation for n=160 is smaller. Exactly how much smaller is it (as a ratio)?
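An eyeballed estimate can be checked numerically by taking the standard deviation of the simulated sample means directly. The following is a minimal sketch; sd_xbar is a hypothetical helper name, and the seed is arbitrary.

```r
set.seed(1)   # arbitrary seed, only for reproducibility
M = 1000
# hypothetical helper: standard deviation of M simulated sample means of size n
sd_xbar = function(n, M) sd(replicate(M, mean(rexp(n))))
s10  = sd_xbar(10, M)
s160 = sd_xbar(160, M)
c(n10 = s10, n160 = s160, ratio = s10 / s160)
```

The printed ratio is itself a random quantity, so compare it with the value that \(\sigma/\sqrt{n}\) predicts rather than expecting an exact match.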
QUESTION: Estimate \(\sigma\) by taking sd(rexp(M)). (Round to the nearest integer.) What is the ratio between this value and your estimate for the standard deviation of \(\bar{x}\) when n=160?
The dlnorm and rlnorm functions can replace dexp and rexp in the main example, so that a different population is used. This population is more skewed than the exponential, so plot the population over \([-1,7]\) with curve(dlnorm, -1, 7, ylim=1.1*c(0, dnorm(0, sd=1/sqrt(n)))).
QUESTION: Make the main figure using n=10. Is the sampling distribution approximately normal? (You can eyeball, or investigate via qqnorm(xbars).)
QUESTION: Make the main figure using n=100. Is the sampling distribution approximately normal? (You can eyeball, or investigate via qqnorm(xbars).)
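For the quantile-quantile check, a reference line makes departures from normality easier to judge. A minimal sketch, assuming the same M as in the main example; adding qqline and fixing the seed are choices made here, not part of the original template.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
M = 1000
n = 10                                 # change to 100 for the second question
xbars = replicate(M, mean(rlnorm(n)))  # sample means from the lognormal population
qqnorm(xbars)                          # points near a straight line suggest normality
qqline(xbars, col="red")               # reference line through the quartiles
```

Points that bend away from the red line in the upper tail indicate that the sampling distribution is still right-skewed.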