Lecture 4

MVJ

12 September, 2018

Density curves

Histogram: count observations in bins

Density histogram: rescale the bars so that the area of each bar is the proportion of all observations that fall in its bin

Frequency curve: Line graph of a histogram

Density curve: Line graph of a density histogram
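A small sketch of the difference between the two kinds of histogram, using base R's hist() on simulated data (the sample is an assumption made only for illustration). The counts add up to the number of observations, while the density bars have total area 1.

```r
# Simulated data; the sample is an assumption for illustration only.
set.seed(1)
x <- rnorm(200, mean = 10, sd = 2)

h <- hist(x, plot = FALSE)   # bin the data without drawing anything

# Count histogram: bar heights are counts and sum to the sample size.
total_count <- sum(h$counts)

# Density histogram: bar heights are rescaled so the bar areas sum to 1.
bin_widths <- diff(h$breaks)
total_area <- sum(h$density * bin_widths)

# The two scales differ by the factor (sample size * bin width).
rescaled <- h$counts / (length(x) * bin_widths)

total_count
total_area
```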

gf_histogram(~x, data=pts) %>% gf_freqpoly(~x, data=pts, color="blue")

gf_histogram(..density.. ~ x, data=pts) %>% 
  gf_freqpoly(color="blue")

Effect of bin size / bin count
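One way to explore the effect of bin count is to bin the same data coarsely and finely. The sketch below uses base R's hist() on simulated data (an assumption for illustration). Note that a single number passed as breaks is only a suggestion to hist(), so the actual bin counts may differ from the requested ones.

```r
# Simulated data; the sample is an assumption for illustration only.
set.seed(2)
x <- rnorm(500)

coarse <- hist(x, breaks = 5,  plot = FALSE)   # few, wide bins
fine   <- hist(x, breaks = 50, plot = FALSE)   # many, narrow bins

# Number of bins actually used in each case
c(coarse = length(coarse$counts), fine = length(fine$counts))

# Every observation is still counted exactly once in both versions.
c(coarse = sum(coarse$counts), fine = sum(fine$counts))
```

Few bins hide the shape of the data; very many bins make the histogram noisy.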

We can distinguish between empirical and theoretical densities.
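To make the distinction concrete, the sketch below compares an empirical density (bar heights of a density histogram of simulated data) with a theoretical density (the standard normal curve evaluated at the bin midpoints). The sample size and bin count are assumptions chosen for illustration.

```r
# Simulated data; sample size and bin count are assumptions.
set.seed(3)
x <- rnorm(100000)

h <- hist(x, breaks = 40, plot = FALSE)

empirical   <- h$density       # computed from the data
theoretical <- dnorm(h$mids)   # computed from a formula

# Largest disagreement between the two across all bins
max_gap <- max(abs(empirical - theoretical))
max_gap
```

With a large sample the empirical density stays close to the theoretical one; with a small sample the gap is typically larger.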

Density curve properties

Density curves are line graphs that…

These properties hold because a density curve illustrates a probability distribution. More about these later in the course.

Summary statistics of densities

Mean: the balance point or center of gravity for the shape

Median: the halfway point that splits the area under the curve in equal parts
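Both summaries can be computed numerically for a theoretical density. The sketch below uses the exponential density with rate 1 as an example of a right-skewed shape (this choice of density is an assumption for illustration) and base R's integrate().

```r
# Exponential density with rate 1; chosen only as a skewed example.
dens <- function(x) dexp(x, rate = 1)

# The total area under a density curve is 1.
total_area <- integrate(dens, 0, Inf)$value

# Mean: the balance point, computed as the integral of x * density.
mean_val <- integrate(function(x) x * dens(x), 0, Inf)$value

# Median: the point with half of the area on each side.
median_val <- qexp(0.5, rate = 1)
area_left  <- integrate(dens, 0, median_val)$value

c(mean = mean_val, median = median_val, area_left_of_median = area_left)
```

For this right-skewed density the mean is pulled toward the long tail and is larger than the median.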

Normal distribution

Our first probability distribution is the normal distribution.

Repeated measurements of the same quantity often follow a normal distribution.

Means or sums of sufficiently large samples are approximately normal. (Central limit theorem)

Not everything is normal: income tends to be skewed.

Many phenomena follow power laws: earthquakes, city sizes, …
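The central limit theorem statement above can be checked by simulation. The sketch below draws many samples from a skewed (exponential) distribution and looks at their means; the sample size, number of repetitions, and choice of distribution are all assumptions made for illustration.

```r
# All settings here are assumptions for illustration.
set.seed(4)
n_samples   <- 5000   # number of repeated samples
sample_size <- 50     # observations per sample

# Each sample comes from a skewed distribution with mean 1 and sd 1.
means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# Theory predicts the sample means have mean 1 and sd 1 / sqrt(sample_size).
c(mean      = mean(means),
  sd        = sd(means),
  theory_sd = 1 / sqrt(sample_size))

# A histogram of the means looks far more symmetric than the raw data.
hist(means, breaks = 40)
```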

The normal distribution is completely determined by its (population) mean \(\mu\) and its (population) standard deviation \(\sigma\). We write \(\mathcal{N}(\mu,\sigma)\) for the normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

The density curve is given by the function

\[ d_{\mathcal N(\mu,\sigma)}(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]
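As a sanity check, the formula can be typed in directly and compared with R's built-in dnorm(). The function name normal_density and the evaluation points are hypothetical, introduced only for this sketch.

```r
# normal_density is a hypothetical helper that transcribes the formula above.
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

x <- seq(400, 1600, by = 100)   # arbitrary evaluation points

by_hand  <- normal_density(x, mu = 1000, sigma = 200)
built_in <- dnorm(x, mean = 1000, sd = 200)

# TRUE if the two agree up to floating point error
all.equal(by_hand, built_in)
```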

Standardization

Remember the effect of \(x\to a+bx\) on means and standard deviations.

If \(x\sim\mathcal N(\mu,\sigma)\), then \(x-\mu\) will have mean \(0\).

If \(x-\mu\sim\mathcal N(0,\sigma)\), then \(\frac{x-\mu}{\sigma}\) will have standard deviation \(1\).

We call \[ z = \frac{x-\mu}{\sigma} \] the standardized value of \(x\) or the z-score of \(x\).

If \(x\sim\mathcal N(\mu,\sigma)\) then \(z\sim\mathcal N(0,1)\).

We call \(\mathcal N(0,1)\) the standard normal distribution.
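The steps above can be sketched in R. The first part checks the effect of \(x\to a+bx\) on the mean and standard deviation of a simulated sample; the second part standardizes a few values and shows that areas under \(\mathcal N(\mu,\sigma)\) match areas under \(\mathcal N(0,1)\). The sample, the constants a and b, and the example values are assumptions for illustration.

```r
# Part 1: effect of a linear transformation (a, b and the sample are assumptions)
set.seed(5)
x <- rnorm(100, mean = 50, sd = 10)
a <- 3
b <- -2
y <- a + b * x

c(mean(y), a + b * mean(x))   # the mean is transformed like the data
c(sd(y), abs(b) * sd(x))      # the sd is scaled by |b|; a has no effect

# Part 2: standardization with mu = 1000 and sigma = 200
mu    <- 1000
sigma <- 200
values <- c(400, 620, 800, 1000, 1600)
z <- (values - mu) / sigma    # z-scores: -3, -1.9, -1, 0, 3

p_original <- pnorm(values, mean = mu, sd = sigma)  # area left of each value
p_standard <- pnorm(z)                              # same area under N(0, 1)

all.equal(p_original, p_standard)
```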

In R

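R provides four functions for the normal distribution: dnorm (density), pnorm (area to the left of a value), qnorm (quantiles) and rnorm (random samples). All of these take parameters mean and sd to determine \(\mu\) and \(\sigma\). mean has default value 0 and sd has default value 1; not setting either gives the standard normal distribution.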
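A brief tour of R's normal distribution functions dnorm, pnorm, qnorm and rnorm. The particular arguments are arbitrary choices for illustration.

```r
# With no mean or sd given, these use the standard normal distribution.
height <- dnorm(0)        # height of the density curve at 0
area   <- pnorm(1.96)     # area under the curve to the left of 1.96
cutoff <- qnorm(0.975)    # value with 97.5% of the area to its left

c(height = height, area = area, cutoff = cutoff)

# pnorm and qnorm are inverses of each other.
qnorm(pnorm(1.96))

# Setting mean and sd selects another normal distribution.
set.seed(6)
draws <- rnorm(3, mean = 1000, sd = 200)   # three random values
draws
```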

Calculate areas under curves

The combined SAT is roughly normally distributed. Raw scores are rescaled to have a mean of approximately 1000 and a standard deviation of approximately 200, and are then capped to the range 400 to 1600.

Question: How much influence does the SAT capping at the extreme ends have?

Let’s calculate the “lost” area under the curve to see the proportion we can expect to have scores below 400 or above 1600.

In percent:

c(lo=pnorm(400, mean=1000, sd=200) * 100,
  hi=100-pnorm(1600, mean=1000, sd=200) * 100) %>% kable
lo 0.1349898
hi 0.1349898

The same “lost” area can also be shown as shaded tails on a graph of the density curve.
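A possible way to draw the “lost” tails, modeled on the gf_dist() calls used elsewhere in this lecture. This is a sketch, not necessarily the plot shown in class: the cut points passed to cut() are an assumption, and the ggformula package is assumed to be installed.

```r
library(ggformula)   # assumed to be installed, as for the other plots here

# Total percentage expected outside the range 400 to 1600
lost_percent <- 100 * (pnorm(400, mean = 1000, sd = 200) +
                       pnorm(1600, mean = 1000, sd = 200, lower.tail = FALSE))
lost_percent

# Color the curve by region: below 400, within range, above 1600.
gf_dist("norm", mean = 1000, sd = 200, geom = "area",
        fill = ~cut(x, c(-Inf, 400, 1600, Inf))) %>%
  gf_labs(fill = "SAT range")
```

The two tails together hold well under one percent of the area, so the outer regions are barely visible in the plot.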

Question: The NCAA requires students to have a 3.0 GPA and SAT at least 800 to compete in their first year. What proportion of students with high enough GPA is this?

In percent:

pnorm(800, mean=1000, sd=200, lower.tail=FALSE) * 100
## [1] 84.13447
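The same answer can be reached through the z-score, which connects this calculation to standardization. A short check:

```r
# z-score of an SAT of 800 under mean 1000 and standard deviation 200
z <- (800 - 1000) / 200          # equals -1

# Area to the right of z under the standard normal curve, in percent
via_z <- 100 * pnorm(z, lower.tail = FALSE)

# The direct calculation with mean and sd supplied
direct <- 100 * pnorm(800, mean = 1000, sd = 200, lower.tail = FALSE)

c(z = z, via_z = via_z, direct = direct)
```

By symmetry of the normal curve, the area to the right of -1 equals the area to the left of 1, so 100 * pnorm(1) gives the same percentage.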

With a graph:

gf_dist("norm", mean=1000, sd=200, geom="area", fill=~(x > 800))

Question: The NCAA allows students with SAT at least 620 to practice and receive an athletic scholarship, but not compete. What proportion is this?

With a graph:

gf_dist("norm", mean=1000, sd=200, geom="area", fill=~cut(x,c(400,620,800,1600))) %>%
  gf_labs(fill="SAT range")

In percent:

100*(pnorm(800, mean=1000, sd=200) - pnorm(620, mean=1000, sd=200))
## [1] 12.99387
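All three groups can be computed in one step by taking differences of pnorm() at the cut points. The group labels are introduced here for illustration; the capping at 400 and 1600 is ignored, so the outer groups extend to infinity.

```r
# Cut points separating the three groups
cuts <- c(-Inf, 620, 800, Inf)

# Area between consecutive cut points, in percent
percent <- 100 * diff(pnorm(cuts, mean = 1000, sd = 200))
names(percent) <- c("below 620", "620 to 800", "800 and above")

percent
sum(percent)   # the three groups together cover everyone
```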

Evaluating goodness of fit

Whether a particular data set fits a normal distribution is going to be an important question.

To evaluate the goodness of fit for a normal distribution to a data set, several options are available:

The first of these is often too hard to read accurately; the last is too ambitious for our needs.

Frequency and density curve

Here are two datasets plotted against a normal density curve. Which is normal, which is not? Why?

Quantile plots

The Quantile-Quantile or QQ-plot is a scatter plot where each data point \(x_i\) is plotted against the theoretical value for its quantile in a theoretical distribution.

Concretely: the QQ-plot is done by

  1. Sorting the data.
  2. Calculating the matching theoretical quantiles, qnorm(ppoints(nrow(data))).
  3. Plotting these two sequences against each other.

As a result, we get a plot that looks close to a straight line if the distribution fits, and bends if the distribution does not fit.
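The three steps can be sketched by hand in base R. The simulated data are an assumption for illustration. The probabilities come from ppoints(), which spaces them evenly strictly inside the interval from 0 to 1; the endpoints themselves cannot be used because qnorm(0) and qnorm(1) are infinite.

```r
# Simulated data; an assumption for illustration only.
set.seed(7)
x <- rnorm(50, mean = 1000, sd = 200)
n <- length(x)

sample_q <- sort(x)              # step 1: sort the data
theory_q <- qnorm(ppoints(n))    # step 2: theoretical quantiles
plot(theory_q, sample_q,         # step 3: plot one against the other
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# Base R's qqnorm() computes the same coordinates.
qq <- qqnorm(x, plot.it = FALSE)
all.equal(sort(qq$x), theory_q)
```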

QQ-plots are often plotted together with a straight line that goes through the first and third quartiles, as a reference for what good fit looks like.
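The reference line can be computed by hand as the line through the first and third quartile points. The sketch below does this in base R for simulated, deliberately skewed data (an assumption for illustration).

```r
# Simulated right-skewed data; an assumption for illustration only.
set.seed(8)
x <- rexp(100)

y_q <- quantile(x, c(0.25, 0.75), names = FALSE)  # sample quartiles
x_q <- qnorm(c(0.25, 0.75))                       # theoretical quartiles

# Line through the two quartile points
slope     <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]

qqnorm(x)
abline(intercept, slope, col = "blue")
```

Because the line is anchored at the quartiles, it describes the middle half of the data, and departures in the tails stand out against it.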

In R:

gf_qq(~var, data=dataset) %>% gf_qqline()

ggmatrix(list(
  gf_qq(~pts.1) %>% gf_qqline(),
  gf_qq(~pts.2) %>% gf_qqline()
), nrow=1, ncol=2)

QQ-plots when the distribution has…

…normality
…heavy tails
…light tails
…left skew
…right skew
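The five cases can be reproduced by simulation. In the sketch below, the distributions standing in for each case (t with 3 degrees of freedom for heavy tails, uniform for light tails, exponential for skew) and the sample size are assumptions chosen for illustration, and base R plotting is used instead of gf_qq().

```r
# All distribution choices and the sample size are assumptions.
set.seed(9)
n <- 500
samples <- list(
  "normality"   = rnorm(n),
  "heavy tails" = rt(n, df = 3),
  "light tails" = runif(n, min = -1, max = 1),
  "right skew"  = rexp(n),
  "left skew"   = -rexp(n)
)

# One QQ-plot with reference line per case
old_par <- par(mfrow = c(2, 3))
for (name in names(samples)) {
  qqnorm(samples[[name]], main = name)
  qqline(samples[[name]])
}
par(old_par)
```

In the normal case the points hug the reference line; in the other four the points bend away from it, in the tails or along the whole curve.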