Lecture 4

MVJ

12 September, 2018

Density curves

Histogram: count observations in bins

Density histogram: proportion of all observations in each bin, divided by the bin width, so that the bars have total area 1

Frequency curve: Line graph of a histogram

Density curve: Line graph of a density histogram
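As a minimal sketch of the difference between counts and densities, base R's hist() can return both for the same bins. The data vector x below is made up for illustration; it is not the pts data used in the lecture.

```r
# Hypothetical data, for illustration only
set.seed(1)
x <- rnorm(200)

# Bin the data without drawing anything
h <- hist(x, breaks = 10, plot = FALSE)
bin_width <- diff(h$breaks)

# Histogram: the counts add up to the number of observations
sum(h$counts)

# Density histogram: proportion in each bin divided by the bin width
proportion <- h$counts / sum(h$counts)
all.equal(h$density, proportion / bin_width)

# The bars of the density histogram have total area 1
sum(h$density * bin_width)
```

Dividing by the bin width is what makes the total area, rather than the total height, equal to 1.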

Density curves

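library(ggformula)  # provides the gf_* plotting functions

# Histogram of counts, with the frequency curve drawn on top
gf_histogram(~x, data=pts) %>%
  gf_freqpoly(~x, data=pts, color="blue")

# Density histogram, with the density curve drawn on top
gf_histogram(..density.. ~ x, data=pts) %>% 
  gf_freqpoly(color="blue")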

Effect of bin size / bin count

Density curves

We can distinguish between

empirical density curves – given from data
theoretical density curves – describing an ideal shape

Empirical and theoretical densities

Density curve properties

Density curves are line graphs that…

…depict a quantity that is never negative
…bound an area of exactly 1
…depict a distribution: the area under the curve over a range of values is the proportion of observations falling in that range

This is because a density curve illustrates a probability distribution. More about these later in the course.

Summary statistics of densities

Mean: the balance point or center of gravity for the shape

Median: the halfway point that splits the area under the curve into two equal parts
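A small sketch of these properties and summary statistics, computed numerically for one density curve. The exponential density dexp() is an assumption made here only because it is skewed, which pulls the mean and the median apart; it is not otherwise part of this lecture.

```r
# Assumed example density: exponential with rate 1 (right-skewed)
f <- function(x) dexp(x, rate = 1)

# The total area under the curve is 1
total_area <- integrate(f, 0, Inf)$value

# Area over a range = proportion of observations in that range
area_0_to_1 <- integrate(f, 0, 1)$value

# Mean: the balance point, computed as the integral of x * f(x)
mean_f <- integrate(function(x) x * f(x), 0, Inf)$value

# Median: the point m that has area 0.5 to its left
median_f <- uniroot(function(m) integrate(f, 0, m)$value - 0.5,
                    interval = c(0.01, 10))$root

c(total = total_area, area_0_to_1 = area_0_to_1,
  mean = mean_f, median = median_f)
```

For this right-skewed curve the mean lands to the right of the median, because the long tail pulls the balance point outward.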

Normal distribution

Our first probability distribution is the normal distribution.

Normal Distribution

The Bell curve
Commonly occurring distribution

Repeated measurements of the same quantity often follow a normal distribution.

Means or sums of sufficiently many independent observations are close to normally distributed. (Central limit theorem)

But not always

Income tends to be skewed.

Many phenomena follow power laws: earthquakes, city sizes, …
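The central limit theorem claim above can be illustrated with a short simulation. This is only a sketch: the exponential data, the sample size of 40, and the 2000 repetitions are all arbitrary choices made for this example.

```r
set.seed(123)         # arbitrary seed, for reproducibility

n_per_sample <- 40    # observations averaged in each sample
n_samples <- 2000     # number of sample means to simulate

# Each row is one sample of skewed (exponential) data
draws <- matrix(rexp(n_samples * n_per_sample, rate = 1), nrow = n_samples)
sample_means <- rowMeans(draws)

# The means cluster around 1, with spread close to 1 / sqrt(n_per_sample)
c(mean = mean(sample_means),
  sd = sd(sample_means),
  expected_sd = 1 / sqrt(n_per_sample))

# Density histogram of the means with a normal density curve on top
hist(sample_means, breaks = 30, freq = FALSE)
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n_per_sample)), add = TRUE, col = "blue")
```

The individual observations are strongly skewed, yet the histogram of their means should look close to the bell curve.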

Normal Distribution

The normal distribution is completely determined by its (population) mean \(\mu\) and its (population) standard deviation \(\sigma\). We write \(\mathcal{N}(\mu,\sigma)\) for the normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

The density curve is given by the function

\[ d_{\mathcal N(\mu,\sigma)}(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{\frac{-1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \]

About 68% of the distribution is within \(\mu\pm\sigma\)
About 95% of the distribution is within \(\mu\pm2\sigma\)
About 99.7% of the distribution is within \(\mu\pm3\sigma\)
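As a sketch, the density formula and the three percentages above can be checked numerically. The function name normal_density and the values mu = 10 and sigma = 2 are assumptions made for this example; dnorm() and integrate() are base R.

```r
# Density of N(mu, sigma), written out from the formula above
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sigma * sqrt(2 * pi)) * exp(-0.5 * ((x - mu) / sigma)^2)
}

mu <- 10     # arbitrary example mean
sigma <- 2   # arbitrary example standard deviation

# The hand-written formula agrees with R's built-in dnorm()
x <- seq(mu - 3 * sigma, mu + 3 * sigma, by = 1)
all.equal(normal_density(x, mu, sigma), dnorm(x, mean = mu, sd = sigma))

# Area under the curve within k standard deviations of the mean
area_within <- function(k) {
  integrate(function(x) normal_density(x, mu, sigma),
            lower = mu - k * sigma, upper = mu + k * sigma)$value
}
areas <- sapply(1:3, area_within)

# Roughly 68, 95 and 99.7 percent
round(100 * areas, 1)
```

The same percentages come out for any choice of mu and sigma, which is why the rule can be stated in units of \(\sigma\).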

Standardization

Remember the effect of \(x\to a+bx\) on means and standard deviations: the mean \(\bar x\) becomes \(a+b\bar x\), and the standard deviation \(s\) becomes \(|b|s\).

If \(x\sim\mathcal N(\mu,\sigma)\), then \(x-\mu\) will have mean \(0\).

If \(x-\mu\sim\mathcal N(0,\sigma)\), then \(\frac{x-\mu}{\sigma}\) will have standard deviation \(1\).

We call \[ z = \frac{x-\mu}{\sigma} \] the standardized value of \(x\), or the z-score of \(x\).

If \(x\sim\mathcal N(\mu,\sigma)\) then \(z\sim\mathcal N(0,1)\).

We call \(\mathcal N(0,1)\) the standard normal distribution.
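A minimal sketch of standardization in practice, assuming example values mu = 10 and sigma = 2 chosen only for illustration: areas computed on the original scale match areas computed from the z-scores on the standard normal curve.

```r
mu <- 10      # arbitrary example mean
sigma <- 2    # arbitrary example standard deviation

x <- c(4, 8, 10, 13)

# z-scores: how many standard deviations each value lies from the mean
z <- (x - mu) / sigma
z

# Area to the left of x under N(mu, sigma) equals
# area to the left of z under the standard normal N(0, 1)
all.equal(pnorm(x, mean = mu, sd = sigma), pnorm(z))
```

This is why a single table (or a single function call) for \(\mathcal N(0,1)\) is enough to answer questions about any normal distribution.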

In R

rnorm(n) : Generate n random numbers from the distribution
dnorm(x) : Evaluate the density at x
pnorm(x) : The area under the curve to the left of x
qnorm(p) : The x value such that the area under the curve to the left of x is p
gf_dist("norm") : Plot the theoretical density curve for the normal distribution

All of these take parameters mean and sd to determine \(\mu\) and \(\sigma\). mean has default value 0 and sd has default value 1; not setting either gives the standard normal distribution.

scale(x) : Compute z-scores from data, using the sample mean and sample standard deviation.
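A short tour of these functions, as a sketch. The values mean = 10 and sd = 2 and the sample size of 1000 are arbitrary choices for this example.

```r
set.seed(42)                           # arbitrary seed, for reproducibility

x <- rnorm(1000, mean = 10, sd = 2)    # 1000 random draws from N(10, 2)

dnorm(10, mean = 10, sd = 2)           # height of the density curve at the mean

p <- pnorm(12, mean = 10, sd = 2)      # area under the curve to the left of 12
p

qnorm(p, mean = 10, sd = 2)            # qnorm undoes pnorm: gives back 12

z <- scale(x)                          # z-scores from the sample mean and sd
c(mean = mean(z), sd = sd(z))          # the z-scores have mean 0 and sd 1
```

Note that scale() standardizes with the sample mean and standard deviation of x, not with the population values 10 and 2 used to generate the data.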

Calculate areas under curves

The combined SAT score is roughly normally distributed: raw scores are rescaled to have a mean of approximately 1000 and a standard deviation of approximately 200, and are then capped to the range 400 to 1600.

Question: How much influence does capping the SAT at the extreme ends have?

Let’s calculate the “lost” area under the curve to see what proportion of test takers we can expect to score below 400 or above 1600. In percent:

c(lo=pnorm(400, mean=1000, sd=200) * 100,
  hi=100-pnorm(1600, mean=1000, sd=200) * 100) %>% kable

lo	0.1349898
hi	0.1349898
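The two tails can also be read off from z-scores, as a sketch: with the approximate mean 1000 and standard deviation 200 stated above, the caps 400 and 1600 sit exactly three standard deviations from the mean, so the total lost area is what the 99.7% rule leaves out.

```r
mu <- 1000    # approximate SAT mean, from the text above
sigma <- 200  # approximate SAT standard deviation, from the text above

# z-scores of the two caps
z_lo <- (400 - mu) / sigma
z_hi <- (1600 - mu) / sigma
c(z_lo = z_lo, z_hi = z_hi)

# Total area outside the caps, on the standard normal curve
lost <- pnorm(z_lo) + pnorm(z_hi, lower.tail = FALSE)

# In percent: about 0.27
100 * lost
```

By symmetry the two tails are equal, which is why lo and hi in the table above have the same value.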

Question: The NCAA requires students to have a 3.0 GPA and an SAT score of at least 800 to compete in their first year. What proportion of students with a high enough GPA meets the SAT requirement?

In percent:

pnorm(800, mean=1000, sd=200, lower.tail=FALSE) * 100

## [1] 84.13447

With a graph:

gf_dist("norm", mean=1000, sd=200, geom="area", fill=~(x > 800))

Question: The NCAA allows students with an SAT score of at least 620 to practice and receive an athletic scholarship, but not compete. What proportion of students falls in this range, with an SAT score of at least 620 but below 800?

With a graph:

gf_dist("norm", mean=1000, sd=200, geom="area", fill=~cut(x,c(400,620,800,1600))) %>%
  gf_labs(fill="SAT range")

In percent:

100*(pnorm(800, mean=1000, sd=200) - pnorm(620, mean=1000, sd=200))

## [1] 12.99387
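As a sanity check, the three SAT ranges from the NCAA questions should account for the whole distribution. A minimal sketch, using the approximate mean 1000 and standard deviation 200 from the text (the variable names are made up for this example):

```r
mu <- 1000    # approximate SAT mean, from the text above
sigma <- 200  # approximate SAT standard deviation, from the text above

# Areas under the curve for the three ranges
below_620    <- pnorm(620, mean = mu, sd = sigma)
from_620_800 <- pnorm(800, mean = mu, sd = sigma) - pnorm(620, mean = mu, sd = sigma)
at_least_800 <- pnorm(800, mean = mu, sd = sigma, lower.tail = FALSE)

# In percent
shares <- 100 * c(below_620 = below_620,
                  from_620_800 = from_620_800,
                  at_least_800 = at_least_800)
shares

# The three ranges cover everything, so the shares add up to 100
sum(shares)
```

An area between two values is always a difference of two pnorm() calls: the area to the left of the upper value minus the area to the left of the lower value.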

Evaluating goodness of fit

Whether a particular data set fits a normal distribution will be an important question.

To evaluate the goodness of fit for a normal distribution to a data set, several options are available:

Plot frequency curve and closest fitting normal density curve
Plot a Quantile-Quantile plot (or QQ-plot)
Use a formal statistical test for goodness of fit

The first of these is often too hard to read accurately; the last is too ambitious for our needs.

Frequency and density curve

Here are two datasets plotted against a normal density curve. Which is normal, which is not? Why?

Quantile plots

The Quantile-Quantile or QQ-plot is a scatter plot where each data point \(x_i\) is plotted against the theoretical value for its quantile in a theoretical distribution.

Concretely: the QQ-plot is done by

Sorting the data.
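Calculating the corresponding normal quantiles at evenly spaced probabilities strictly between 0 and 1, for example qnorm(ppoints(nrow(data))). (The probabilities 0 and 1 themselves would give infinite quantiles.)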
Plotting these two sequences against each other.

As a result, we get a plot that looks close to a straight line if the distribution fits, and bends if the distribution does not fit.

QQ-plots are often plotted together with a straight line that goes through the first and third quartiles, as a reference for what good fit looks like.

In R:

gf_qq(~var, data=dataset) %>% gf_qqline()
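The construction behind these plots can be sketched by hand in base R. This is an illustration under assumptions: the data are made-up right-skewed values, and ppoints() is used to pick probabilities strictly between 0 and 1, which is the choice base R's qqnorm() makes.

```r
set.seed(7)
x <- rexp(50)   # hypothetical right-skewed data, so the plot should bend

# Step 1: sort the data
sample_q <- sort(x)

# Step 2: matching quantiles of the standard normal distribution
theoretical_q <- qnorm(ppoints(length(x)))

# Step 3: plot the two sequences against each other
plot(theoretical_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")

# Reference line through the first and third quartiles
y_q <- unname(quantile(x, c(0.25, 0.75)))
x_q <- qnorm(c(0.25, 0.75))
slope <- diff(y_q) / diff(x_q)
intercept <- y_q[1] - slope * x_q[1]
abline(a = intercept, b = slope, col = "blue")

# Same theoretical quantiles as base R's qqnorm()
all.equal(theoretical_q, sort(qqnorm(x, plot.it = FALSE)$x))
```

Because the data here are skewed rather than normal, the points should curve away from the reference line instead of following it.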

Quantile plots

ggmatrix(list(
  gf_qq(~pts.1) %>% gf_qqline(),
  gf_qq(~pts.2) %>% gf_qqline()
), nrow=1, ncol=2)

Quantile plots

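Distribution has…	QQ-plot (theoretical quantiles on the horizontal axis)
…normality	points follow the reference line
…heavy tails	left end falls below the line, right end rises above it
…light tails	left end rises above the line, right end falls below it
…left skew	both ends fall below the line
…right skew	both ends rise above the line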