MVJ
12 September, 2018
Histogram: count observations in bins
Density histogram: count proportion out of all observations in bins
Frequency curve: Line graph of a histogram
Density curve: Line graph of a density histogram
gf_histogram(~x, data=pts) %>% gf_freqpoly(~x, data=pts, color="blue")
gf_histogram(..density.. ~ x, data=pts) %>%
gf_freqpoly(color="blue")
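The difference between the two kinds of histogram can also be checked numerically. The following is a minimal base-R sketch; the sample `x` is simulated purely for illustration and is not the `pts` data plotted above. It shows that the bars of a frequency histogram add up to the number of observations, while the bar areas of a density histogram add up to 1.

```r
# Illustrative data only (not the pts data set from the slides)
set.seed(1)
x <- rnorm(500)

# hist() computes the bins without drawing anything when plot = FALSE
h <- hist(x, plot = FALSE)
widths <- diff(h$breaks)                 # bin widths

total_count <- sum(h$counts)             # frequency histogram: bars sum to n
total_area  <- sum(h$density * widths)   # density histogram: areas sum to 1

total_count
total_area
```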
We can distinguish between
Density curves are line graphs that…
The reason is that a density curve illustrates a probability distribution. More about these later in the course.
Mean: the balance point or center of gravity for the shape
Median: the halfway point that splits the area under the curve in equal parts
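A small sketch of how these two centers react to skew, using made-up numbers (the vector `incomes` is hypothetical): one large value drags the balance point to the right, while the halfway point stays where it was.

```r
# Hypothetical, right-skewed values chosen for illustration
incomes <- c(20, 25, 30, 35, 40, 45, 400)

mean(incomes)    # balance point, pulled toward the long right tail
median(incomes)  # halfway point, not affected by how extreme the tail is
```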
Our first probability distribution is the normal distribution.
Repeated measurements of the same quantity often follow a normal distribution.
Means or sums of sufficiently large samples are approximately normal. (Central limit theorem)
Income tends to be skewed.
Many phenomena follow power laws - earthquakes, city sizes, …
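The central limit theorem claim can be checked by simulation. The sketch below runs under assumed settings (exponential data, samples of size 50, 2000 repetitions; none of these come from the slides). Individual values are strongly right-skewed, yet their sample means should cluster around the population mean 1 with a spread close to \(1/\sqrt{50}\).

```r
set.seed(42)
n_obs  <- 50     # observations per sample (assumed)
n_reps <- 2000   # number of repeated samples (assumed)

# Each replicate draws a skewed sample and records its mean
sample_means <- replicate(n_reps, mean(rexp(n_obs, rate = 1)))

mean(sample_means)   # close to the population mean, 1
sd(sample_means)     # close to 1 / sqrt(n_obs)
1 / sqrt(n_obs)
```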
The normal distribution is completely determined by its (population) mean \(\mu\) and its (population) standard deviation \(\sigma\). We write \(\mathcal{N}(\mu,\sigma)\) for the normal distribution with mean \(\mu\) and standard deviation \(\sigma\).
The density curve is given by the function
\[ d_{\mathcal N(\mu,\sigma)}(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
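As a sanity check, the formula can be typed out directly and compared with R's built-in `dnorm`. This is only a sketch; the function name `normal_density` is invented here for illustration.

```r
# Density of N(mu, sigma), written out term by term from the formula
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- seq(-3, 3, by = 0.5)
all.equal(normal_density(x), dnorm(x))                            # standard normal
all.equal(normal_density(900, 1000, 200), dnorm(900, 1000, 200))  # other mu, sigma

# The total area under the curve is 1 (up to numerical error)
integrate(normal_density, -Inf, Inf)$value
```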
Remember the effect of \(x\to a+bx\) on means and standard deviations.
If \(x\sim\mathcal N(\mu,\sigma)\), then \(x-\mu\) will have mean \(0\).
If \(x-\mu\sim\mathcal N(0,\sigma)\), then \(\frac{x-\mu}{\sigma}\) will have standard deviation \(1\).
We call \[ z = \frac{x-\mu}{\sigma} \] the standardized value of \(x\) or the z-score of \(x\).
If \(x\sim\mathcal N(\mu,\sigma)\) then \(z\sim\mathcal N(0,1)\).
We call \(\mathcal N(0,1)\) the standard normal distribution.
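A short sketch of standardization in practice. The values of `mu`, `sigma` and `x` are arbitrary picks for illustration. It computes z-scores by hand and checks that the area to the left of a value is the same whether it is measured on the original scale or on the standard normal scale.

```r
mu    <- 1000   # assumed population mean
sigma <- 200    # assumed population standard deviation
x <- c(400, 800, 1000, 1300, 1600)

z <- (x - mu) / sigma   # z-scores
z

# Areas to the left agree before and after standardizing
p_original     <- pnorm(x, mean = mu, sd = sigma)
p_standardized <- pnorm(z)   # defaults: mean = 0, sd = 1
all.equal(p_original, p_standardized)
```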
rnorm(n): Generate n random numbers from the distribution
dnorm(x): Evaluate the density at x
pnorm(x): The area under the curve to the left of x
qnorm(p): The x value such that the area under the curve to the left of x is p
gf_dist("norm"): Plot the theoretical density curve for the normal distribution
All of these take parameters mean and sd to determine \(\mu\) and \(\sigma\). mean has default value 0 and sd has default value 1; not setting either gives the standard normal distribution.
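A few calls that illustrate how these functions relate to each other. This is a sketch with arbitrarily chosen inputs; the seed and sample size are assumptions made only so the example is reproducible.

```r
# Area and quantile are inverse operations
p <- pnorm(1.5)   # area to the left of 1.5 under N(0, 1)
qnorm(p)          # recovers 1.5

# mean and sd select a different normal distribution
pnorm(1000, mean = 1000, sd = 200)   # half the area lies left of the mean
qnorm(0.5,  mean = 1000, sd = 200)   # the median equals the mean

# The density is highest at the mean
dnorm(0) > dnorm(1)

# Random draws have roughly the requested mean and sd
set.seed(7)
draws <- rnorm(10000, mean = 1000, sd = 200)
c(mean = mean(draws), sd = sd(draws))
```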
scale(x): Compute z-scores from data.
The combined SAT is roughly normally distributed, and raw scores are rescaled to have mean at approximately 1000, standard deviation at approximately 200 and then capped to the range 400 - 1600.
Question How much influence does the SAT capping at the extreme ends have?
Let’s calculate the “lost” area under the curve to see the proportion we can expect to have scores below 400 or above 1600.
In percent:
c(lo=pnorm(400, mean=1000, sd=200) * 100,
hi=100-pnorm(1600, mean=1000, sd=200) * 100) %>% kable
lo | 0.1349898 |
hi | 0.1349898 |
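The same numbers follow from z-scores: 400 and 1600 lie exactly three standard deviations from the mean, so by symmetry the two tails are equal. A short base-R check, written without piping or `kable` so that it runs on its own:

```r
mu <- 1000
sigma <- 200

z_lo <- (400  - mu) / sigma   # -3
z_hi <- (1600 - mu) / sigma   #  3

lost_lo <- pnorm(z_lo)                       # area below 400
lost_hi <- pnorm(z_hi, lower.tail = FALSE)   # area above 1600

# Both tails and their total, in percent
100 * c(lo = lost_lo, hi = lost_hi, total = lost_lo + lost_hi)
```

Under this model only about 0.27% of scores would be affected by the capping.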
Using a graph:
Question The NCAA requires students to have a 3.0 GPA and SAT at least 800 to compete in their first year. What proportion of students with high enough GPA is this?
In percent:
pnorm(800, mean=1000, sd=200, lower.tail=FALSE) * 100
## [1] 84.13447
With a graph:
gf_dist("norm", mean=1000, sd=200, geom="area", fill=~(x > 800))
Question The NCAA allows students with SAT at least 620 to practice and receive an athletic scholarship, but not compete. What proportion is this?
With a graph:
gf_dist("norm", mean=1000, sd=200, geom="area", fill=~cut(x,c(400,620,800,1600))) %>%
gf_labs(fill="SAT range")
In percent:
100*(pnorm(800, mean=1000, sd=200) - pnorm(620, mean=1000, sd=200))
## [1] 12.99387
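All three SAT bands can be computed in one step by differencing the cumulative areas at the cutoffs. This sketch uses the normal model with mean 1000 and standard deviation 200 and ignores the capping at 400 and 1600; the names `cutoffs` and `bands` are chosen here for illustration.

```r
cutoffs <- c(-Inf, 620, 800, Inf)

# pnorm gives the area to the left of each cutoff; diff gives the area between
bands <- diff(pnorm(cutoffs, mean = 1000, sd = 200))
names(bands) <- c("below 620", "620 to 800", "800 and above")

round(100 * bands, 2)   # in percent
sum(bands)              # the three bands cover the whole distribution
```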
Whether a particular data set fits a normal distribution is going to be an important question.
To evaluate the goodness of fit for a normal distribution to a data set, several options are available:
The first of these is often too hard to read accurately; the last is too ambitious for our needs.
Here are two datasets plotted against a normal density curve. Which is normal, which is not? Why?
The Quantile-Quantile or QQ-plot is a scatter plot where each data point \(x_i\) is plotted against the theoretical value for its quantile in a theoretical distribution.
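Concretely: the QQ-plot is done by plotting the sorted data values against the theoretical quantiles at equally spaced probabilities, for example
qnorm(ppoints(nrow(data)))
As a result, we get a plot that looks close to a straight line if the distribution fits, and bends if the distribution does not fit.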
QQ-plots are often plotted together with a straight line that goes through the first and third quartiles, as a reference for what good fit looks like.
In R:
gf_qq(~var, data=dataset) %>% gf_qqline()
ggmatrix(list(
gf_qq(~pts.1) %>% gf_qqline(),
gf_qq(~pts.2) %>% gf_qqline()
), nrow=1, ncol=2)
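The construction of a QQ-plot can also be reproduced by hand in base R. This is a sketch under assumptions: both samples are simulated, and `ppoints()` is used to pick the probabilities because it avoids 0 and 1, whose normal quantiles are infinite. The correlation between sorted data and theoretical quantiles is used here only as a rough numerical stand-in for "looks like a straight line".

```r
set.seed(3)
normal_sample <- rnorm(200, mean = 10, sd = 2)   # simulated, for illustration
skewed_sample <- rexp(200)                       # right-skewed comparison

# Theoretical N(0, 1) quantiles at evenly spaced probabilities
theoretical <- qnorm(ppoints(200))

# QQ-plot coordinates: sorted data against theoretical quantiles
qq_normal <- sort(normal_sample)
qq_skewed <- sort(skewed_sample)

# Straightness of each point cloud, summarized as a correlation
cor(theoretical, qq_normal)
cor(theoretical, qq_skewed)

# Reference line through the first and third quartiles
probs <- c(0.25, 0.75)
slope <- diff(quantile(normal_sample, probs)) / diff(qnorm(probs))
intercept <- quantile(normal_sample, 0.25) - slope * qnorm(0.25)
c(intercept = unname(intercept), slope = unname(slope))

# plot(theoretical, qq_normal); abline(intercept, slope)   # base-R drawing
```

For normal data the slope of the reference line estimates \(\sigma\) and the intercept estimates \(\mu\).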
Distribution has… | QQ-plot
---|---
…normality | (figure)
…heavy tails | (figure)
…left skew | (figure)
…light tails | (figure)
…right skew | (figure)