Lecture 15

MVJ

12 April, 2018

The essence of parametric statistics

In parametric statistics, we use statistics calculated from samples to understand parameters that specify the distribution of a population.

The most common distributions we will be interested in are the normal distribution for numeric data, and the binomial distribution for categorical data.

Definition

A parameter is a number that describes some characteristic of a population. Usually we model populations with probability distributions, and the distributions are determined by a collection of parameters.

A statistic is a number that can be computed directly from a sample. We will be using statistics to estimate parameters.

Parameters are usually not available: we usually cannot measure the entire population.

Estimation and Inference

The process of statistical inference uses information from a sample to draw conclusions about a population. Statistical estimation is when the conclusions are proposed values for a specific parameter.

Different samples yield different statistics.

Here are two samples of ten numbers drawn from the same binomial distribution, each followed by its sample mean:

x: 2 2 0 2 1 1 1 0 1 1 (mean 1.1)
y: 1 1 2 0 1 2 3 0 1 1 (mean 1.2)

Since the sample mean yields uncertain individual outcomes, but has a *regular distribution in large numbers of repetitions*, the sample mean is a random phenomenon.

Since the sample mean is a random phenomenon whose outcome is a number, it is a random variable.

This holds true for all sample statistics.

Sample statistics

Since sample statistics are random variables, we can study the distribution of each such random variable. This distribution is called the sampling distribution of the statistic.

One example is the bootstrap from Lab 5; you calculated sampling distributions of means and standard deviations by repeated sampling from a data set.

The Central Limit Theorem allows us to draw conclusions about population parameters from the sampling distribution of specific sample statistics.

Sampling distribution

In this example, I sample 25 values from a normal distribution with \(\mu=1\) and \(\sigma=0.5\). I repeat this sampling 1000 times.
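Repeated sampling like this is easy to simulate. A minimal sketch (in Python rather than the lab's R, purely as an illustration):

```python
import random
import statistics

random.seed(1)

# Draw 25 values from N(mu=1, sigma=0.5), 1000 times, recording each sample mean.
means = [
    statistics.mean(random.gauss(1, 0.5) for _ in range(25))
    for _ in range(1000)
]

# The sampling distribution of the mean centers on mu = 1, with spread
# sigma / sqrt(25) = 0.1 (by the Central Limit Theorem, discussed below).
print(statistics.mean(means))   # close to 1
print(statistics.stdev(means))  # close to 0.1
```

The 1000 sample means cluster around \(\mu = 1\), with far less spread than the individual observations.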

Reading the sampling distribution

The sampling distribution both suggests an estimate of the population parameter and demonstrates the sampling variability inherent in the statistic we are using.

Simulation and bootstrapping

In practice, repeated sampling is often difficult and expensive. Instead, we can

  1. Use theoretical results and calculus to estimate the sampling distribution from a sample
  2. Simulate what the sampling distribution would look like using a theoretical model for the simulation
  3. Use the bootstrap to create empirical sampling distributions from the data

Bias & Variability

Bias measures whether the expected value of the statistic is the true value of the parameter.

A statistic is unbiased if the expected value of its sampling distribution equals the true value of the parameter.

Variability measures the spread of the sampling distribution. It is usually determined by sample size, with smaller spreads from larger samples.

Bias

Consider variance. The average squared deviation from the sample mean gives the formula on the left, which is a biased estimator; we usually use the unbiased formula on the right. \[ \hat\sigma^2 = \frac{1}{N}\sum(x_i-\overline x)^2 \qquad s^2 = \frac{1}{N-1}\sum(x_i-\overline x)^2 \]
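The bias is easy to see in simulation: with samples of size \(N=5\) from a population with variance 1, the divide-by-\(N\) formula averages to about \((N-1)/N = 0.8\), while dividing by \(N-1\) averages to 1. A Python sketch (the sample size and counts are made up for illustration):

```python
import random
import statistics

random.seed(2)

def var_biased(xs):
    """Average squared deviation from the sample mean (divide by N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def var_unbiased(xs):
    """The usual sample variance (divide by N - 1)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

# True distribution: N(0, 1), so the true variance is 1.
# Average each estimator over many small samples of size N = 5.
samples = [[random.gauss(0, 1) for _ in range(5)] for _ in range(20000)]
biased = statistics.mean(var_biased(s) for s in samples)
unbiased = statistics.mean(var_unbiased(s) for s in samples)
print(biased)    # about 0.8 = (N-1)/N
print(unbiased)  # about 1
```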

How do we reduce bias & variability?

Bias is reduced by using random sampling: by randomizing over- and under-estimates tend to balance out.

Variability is reduced by using a larger sample.

Note: as long as the population is at least 20 times larger than the sample, variability does not depend on the population size. Variability depends only on the sample size (and on properties of the true distribution).

Population Distribution

Regardless of the actual population, we use the term population distribution to refer to the distribution of the random variable of whatever property we are interested in measuring.

The population might not necessarily concretely exist:

Course grades in MTH214 have as their population a hypothetical collection of all students who will take the course in the future.

Sampling distribution of the sample mean

The sample mean \[ \overline x = \frac{\sum x_i}{N} \] is an unbiased estimate of the population mean \(\mu\).

The Central Limit Theorem tells us that for a population distribution with mean \(\mu\) and standard deviation \(\sigma\), \[ \overline x \sim \mathcal N\left(\mu, \frac{\sigma}{\sqrt{N}}\right) \]

The power of the Central Limit Theorem is that the distribution is approximately normal even if the population is not.

Note: averages vary less than observations.

Sampling distribution of the mean: example

The work time for a task follows a distribution with

\[ \mu = 1 \qquad \sigma = 1 \]

Your manager allocates 1.1 hours per task to perform 70 tasks in two weeks: 80 hours of work time. As long as the mean time for these 70 tasks is less than \(80/70\approx1.143\) hours, you will be able to finish without overtime.

The Central Limit Theorem tells us that \(\overline x\sim\mathcal N(1, 1/\sqrt{70})\approx\mathcal N(1, 0.12)\)

The probability of exceeding the time can be calculated using pnorm:

pnorm(80/70, 1, 1/sqrt(70), lower.tail=FALSE)*100
## [1] 11.59989

There is about an 11.6% chance that the allotted time isn’t enough.

Sampling distributions for counts and proportions

For discrete data, we distinguish between two main ways of generating counts:

A binomial setting is when we perform several independent trials of the same process and record the number of times a particular outcome occurs.

A Poisson setting is when we consider the number of successes that occur in a fixed unit of measure. (time, region of space, …)

The difference is that for the binomial case, you specify how many trials you check; for Poisson you specify how long you watch.

Binomial counts

To check that the binomial setting applies, we can use a mnemonic device: BINS — Binary outcomes, Independent trials, fixed Number of trials, and the Same probability of success on each trial.

Is this a binomial setting? If not, what fails?

  1. I count whether my students are Freshmen, Sophomores, Juniors or Seniors.
  2. I check whether the body temperature is higher than 100ºF each hour for a day in flu patients given aspirin.
  3. I count how many times I win before my money runs out at a casino visit.
  4. I draw 15 cards from a deck of cards, one after another, and count black cards.
  5. I grow bacterial cultures in 15 petri dishes and count the number that cover at least half the dish after a week.

Binomial distribution

Binomial counts follow a binomial distribution. The binomial distribution is determined by the number of trials \(n\) and the probability of success \(p\).

The probability function is \[ \mathbb{P}(m) = {n\choose m}p^m(1-p)^{n-m} = \frac{n!}{m!(n-m)!}p^m(1-p)^{n-m} \]

Example: My production process has a failure rate of 5%. I pull out 15 randomly chosen products and count the number \(m\) of broken products.

What is \(\mathbb{P}(m=0)\)?

dbinom(0, 15, 0.05)
## [1] 0.4632912
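The same probability can be computed from the formula by hand. A Python sketch (the slide uses R's dbinom):

```python
from math import comb

def binom_pmf(m, n, p):
    """P(m successes in n trials): the binomial probability function."""
    return comb(n, m) * p**m * (1 - p) ** (n - m)

# P(no broken products among 15, with a 5% failure rate); for m = 0 this
# reduces to (1 - p)^n = 0.95^15.
print(binom_pmf(0, 15, 0.05))  # 0.4632..., matching dbinom(0, 15, 0.05)
```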

Binomial mean and standard deviation

The binomial distribution on \(n\) trials with probability \(p\) has mean and standard deviation:

\[ \mu = np \qquad \sigma = \sqrt{np(1-p)} \]

(these formulas only work for the binomial distribution)

Sample mean of binomial samples

From the Central Limit Theorem we can calculate the expected distribution from taking means of several binomial samples.

In this setting each experiment consists of \(n\) trials; we perform \(n\) such experiments and average the counts.

The Central Limit Theorem mean is the same as the population mean: \(np\).

The Central Limit Theorem standard deviation is \(\sigma/\sqrt{n} = \sqrt{np(1-p)}/\sqrt{n} = \sqrt{p(1-p)}\)

Normal approximation of the Binomial distribution

As \(n\) gets large, the binomial distribution gets more and more similar to a normal distribution.

Normal approximation of the Binomial distribution

As \(n\) gets large, the binomial distribution gets more and more similar to a normal distribution.

As a rule of thumb, we require

\[ np \geq 10 \qquad n(1-p)\geq 10 \]

If this is true, then \[ \text{Binomial}(n,p)\approx\mathcal{N}(np, \sqrt{np(1-p)}) \]
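A quick numerical check of the approximation, here with \(n=100\) and \(p=0.5\) (values chosen to satisfy the rule of thumb; Python sketch):

```python
from math import comb, erf, sqrt

n, p = 100, 0.5  # rule of thumb holds: np = n(1-p) = 50 >= 10

# Exact binomial P(X <= 60), summing the probability function.
exact = sum(comb(n, m) * p**m * (1 - p) ** (n - m) for m in range(61))

# Normal approximation with mu = np, sigma = sqrt(np(1-p)).
mu, sigma = n * p, sqrt(n * p * (1 - p))
z = (60 - mu) / sigma
approx = (1 + erf(z / sqrt(2))) / 2

print(exact, approx)  # exact about 0.982, approximation about 0.977
```

The two tail probabilities agree to within about half a percentage point, and the agreement improves as \(n\) grows.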

Binomial proportions

The sample proportion is the proportion of successes in the sample:

\[ \hat{p} = \frac{\text{number of successes}}{n} = \frac{X}{n} \]

The sample proportion is an unbiased estimator of the success probability \(p\).

The sample proportion distribution has mean and standard deviation: \[ \mu_{\hat{p}} = p \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]
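These formulas can be checked by simulation. A Python sketch with assumed values \(n=50\) and \(p=0.3\):

```python
import random
import statistics
from math import sqrt

random.seed(3)
n, p = 50, 0.3

# Simulate many sample proportions p-hat = X / n, where X counts successes.
phats = [
    sum(random.random() < p for _ in range(n)) / n
    for _ in range(10000)
]

print(statistics.mean(phats))   # close to p = 0.3
print(statistics.stdev(phats))  # close to sqrt(p(1-p)/n)
print(sqrt(p * (1 - p) / n))    # 0.0648...
```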

Poisson counts

The Poisson setting requires:

  1. The number of successes in two non-overlapping units of measure are independent.
  2. The probability of a success occurring in a unit of measure is the same for all units of equal size, and is proportional to the size of the unit.
  3. Events don’t coincide.

Is this a Poisson setting? If not, what fails?

  1. I count the number of games I win in an hour at a casino.
  2. I count the number of ice creams sold in a month.
  3. I count the number of buses arriving in an hour.
  4. I count the number of customers at the Doner truck between noon and 1pm.

The Poisson Distribution

The Poisson distribution counts events in a Poisson setting. It is determined by the average rate of events per unit \(\lambda\).

\[ \mathbb{P}(m) = \frac{e^{-\lambda}\lambda^m}{m!} \]

The Poisson distribution has

\[ \mu = \lambda \qquad \sigma = \sqrt{\lambda} \]
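A Python sketch of the probability function, numerically verifying the mean and standard deviation for an assumed rate \(\lambda = 3\):

```python
from math import exp, factorial, sqrt

def pois_pmf(m, lam):
    """P(m events) for a Poisson distribution with rate lam."""
    return exp(-lam) * lam**m / factorial(m)

lam = 3.0
ms = range(60)  # P(m >= 60) is negligible for lam = 3
probs = [pois_pmf(m, lam) for m in ms]

total = sum(probs)
mean = sum(m * q for m, q in zip(ms, probs))
var = sum((m - mean) ** 2 * q for m, q in zip(ms, probs))

print(total)      # 1.0 (probabilities sum to one)
print(mean)       # lam = 3
print(sqrt(var))  # sqrt(lam) = 1.732...
```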