MVJ
12 April, 2018
In parametric statistics, we use statistics calculated from samples to understand parameters that specify the distribution of a population.
The most common distributions we will be interested in are the normal distribution for numeric data, and the binomial distribution for categorical data.
Definition
A parameter is a number that describes some characteristic of a population. Usually we model populations with probability distributions, and the distributions are determined by a collection of parameters.
A statistic is a number that can be computed directly from a sample. We will be using statistics to estimate parameters.
Parameters are usually not available: we usually cannot measure the entire population.
The process of statistical inference uses information from a sample to draw conclusions about a population. Statistical estimation is when the conclusions are proposed values for a specific parameter.
Different samples yield different statistics.
Here are two collections of numbers drawn from the same binomial distribution; the rightmost column is the sample mean of each:
x | 2 | 2 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 1.1 |
y | 1 | 1 | 2 | 0 | 1 | 2 | 3 | 0 | 1 | 1 | 1.2 |
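The two sample means can be reproduced directly in R (the language used throughout these notes):

```r
x <- c(2, 2, 0, 2, 1, 1, 1, 0, 1, 1)
y <- c(1, 1, 2, 0, 1, 2, 3, 0, 1, 1)
mean(x)  # 1.1
mean(y)  # 1.2
```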
Since the sample mean yields uncertain individual outcomes, but with a **regular distribution in large numbers of repetitions**, the sample mean is a random phenomenon.
Since each outcome of the sample mean is a number, the sample mean is a random variable.
This holds true for all sample statistics.
Since sample statistics are random values, we can study the distribution of that random variable. This is called the sampling distribution.
One example is the bootstrap from Lab 5; you calculated sampling distributions of means and standard deviations by repeated sampling from a data set.
The Central Limit Theorem allows us to draw conclusions about population parameters from the sampling distribution of specific sample statistics.
In this example, I sample 25 values from a normal distribution with \(\mu=1\) and \(\sigma=0.5\). I repeat this sampling 1000 times.
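This simulation can be sketched in R; `rnorm` draws each sample and `replicate` repeats the sampling (the seed is an arbitrary choice for reproducibility):

```r
set.seed(42)  # arbitrary seed, for reproducibility
# 1000 repetitions of: sample 25 values from N(mu = 1, sigma = 0.5), take the mean
means <- replicate(1000, mean(rnorm(25, mean = 1, sd = 0.5)))
mean(means)  # close to mu = 1
sd(means)    # close to sigma / sqrt(25) = 0.1
```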
The sampling distribution both suggests an estimate of the population parameter and demonstrates the sampling variability inherent in the statistic we are using.
In practice, repeated sampling is often difficult and expensive. Instead, we can study the sampling distribution theoretically.
Bias measures whether the expected value of the statistic is the true value of the parameter.
An unbiased statistic has an expected value of the sampling distribution equal to the true value of the parameter.
Variability measures the spread of the sampling distribution. It is usually determined by sample size, with smaller spreads from larger samples.
Consider variance. The average squared deviation from the sample average gives the formula on the left, which is biased: it systematically underestimates the population variance. We therefore usually use the unbiased formula on the right. \[ \hat\sigma^2 = \frac{1}{N}\sum(x_i-\overline x)^2 \qquad s^2 = \frac{1}{N-1}\sum(x_i-\overline x)^2 \]
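The bias of the \(1/N\) formula shows up in a short simulation (a sketch; the exponential population, which has variance 1, and the sample size are arbitrary choices):

```r
set.seed(7)  # arbitrary seed
N <- 5
reps <- replicate(20000, {
  x <- rexp(N)  # population variance is 1
  c(biased   = mean((x - mean(x))^2),  # divide by N
    unbiased = var(x))                 # R's var() divides by N - 1
})
rowMeans(reps)  # biased ~ 0.8, unbiased ~ 1
```

The biased estimator averages to \((N-1)/N \cdot \sigma^2 = 0.8\), exactly the deficit the \(N-1\) denominator corrects.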
Bias is reduced by using random sampling: by randomizing, over- and under-estimates tend to balance out.
Variability is reduced by using a larger sample.
Note: as long as the population is at least 20x larger than the sample, variability does not depend on population size. Variability depends only on sample size (and on properties of the true distribution).
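A simulation sketch illustrates this (the normal populations and the sizes are arbitrary choices): sampling 100 values from a population of 10,000 and from one of 1,000,000 gives essentially the same spread of sample means.

```r
set.seed(1)  # arbitrary seed
small_pop <- rnorm(10000)    # population 100x the sample size
big_pop   <- rnorm(1000000)  # population 10000x the sample size
sd_small <- sd(replicate(2000, mean(sample(small_pop, 100))))
sd_big   <- sd(replicate(2000, mean(sample(big_pop, 100))))
c(sd_small, sd_big)  # both close to 1 / sqrt(100) = 0.1
```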
No matter what the actual population, we use population distribution to refer to the distribution of the random variable of whatever property we are interested in measuring.
The population need not concretely exist:
Course grades in MTH214 have as their population a hypothetical collection of all students that will take the course in the future.
The sample mean \[ \overline x = \frac{\sum x_i}{N} \] is an unbiased estimate of the population mean \(\mu\).
The Central Limit Theorem tells us that for a population distribution with mean \(\mu\) and standard deviation \(\sigma\), \[ \overline x \sim \mathcal N\left(\mu, \frac{\sigma}{\sqrt{N}}\right) \]
The power of the Central Limit Theorem is that the distribution is approximately normal even if the population is not.
Note: averages vary less than observations.
A task has work times following a distribution with
\[ \mu = 1 \qquad \sigma = 1 \]
Your manager allocates 1.1 hours per task to perform 70 tasks in two weeks: 80 hours of work time. As long as the mean time for these 70 tasks is less than \(80/70\approx 1.143\), you will be able to do it without overtime.
The Central Limit Theorem tells us that \(\overline x\sim\mathcal N(1, 1/\sqrt{70})\approx\mathcal N(1, 0.12)\)
The probability of exceeding the time can be calculated using pnorm:
pnorm(80/70, 1, 1/sqrt(70), lower.tail=FALSE)*100
## [1] 11.59989
There is an 11.6% chance that the allocated time isn’t enough.
For discrete data, we distinguish between two main ways of generating counts:
A binomial setting is when we perform several independent trials of the same process and record the number of times a particular outcome occurs.
A Poisson setting is when we consider the number of successes that occur in a fixed unit of measure. (time, region of space, …)
The difference is that for the binomial case, you specify how many trials you check; for Poisson you specify how long you watch.
To check that the binomial setting applies, we can use a mnemonic device: BINS
Binary: each trial has exactly two outcomes, success and failure.
Independent: the trials are independent of each other.
Number: the number of trials is fixed in advance.
Success: the probability of success is the same for every trial.
Is this a binomial setting? If not, what fails?:
I count whether my students are Freshmen, Sophomores, Juniors or Seniors.
Is this a binomial setting? If not, what fails?:
I check whether the body temperature is higher than 100ºF each hour for a day on flu patients given aspirin.
Is this a binomial setting? If not, what fails?:
I count how many times I win before my money runs out at a casino visit.
Is this a binomial setting? If not, what fails?:
I draw 15 cards from a deck of cards, one after another, and count black cards.
Is this a binomial setting? If not, what fails?:
I grow bacterial cultures in 15 petri dishes and count the number that cover at least half the dish after a week.
Binomial counts follow a binomial distribution. The binomial distribution is determined by the number of trials \(n\) and the probability of success \(p\).
The probability function is \[ \mathbb{P}(m) = {n\choose m}p^m(1-p)^{n-m} = \frac{n!}{m!(n-m)!}p^m(1-p)^{n-m} \]
Example My production process has a failure rate of 5%. I pull out 15 randomly chosen products and count the number \(m\) of broken products.
What is \(\mathbb{P}(m=0)\)?
dbinom(0, 15, 0.05)
## [1] 0.4632912
The binomial distribution on \(n\) trials with probability \(p\) has mean and standard deviation:
\[ \mu = np \qquad \sigma = \sqrt{np(1-p)} \]
(these formulas only work for the binomial distribution)
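These formulas can be checked against the probability function itself; for the production example (\(n=15\), \(p=0.05\)), the weighted sums over all possible counts recover \(np\) and \(\sqrt{np(1-p)}\):

```r
n <- 15; p <- 0.05
m <- 0:n  # all possible counts
mu    <- sum(m * dbinom(m, n, p))                 # mean as a weighted sum
sigma <- sqrt(sum((m - mu)^2 * dbinom(m, n, p)))  # sd as a weighted sum
c(mu, n * p)                       # both 0.75
c(sigma, sqrt(n * p * (1 - p)))    # both ~0.844
```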
From the Central Limit Theorem we can calculate the expected distribution from taking means of several binomial samples.
In this setting we repeat the \(n\)-trial experiment \(n\) times, and then average the counts from each repetition.
The Central Limit Theorem mean is the same as the population mean: \(np\).
The Central Limit Theorem standard deviation is \(\sigma/\sqrt{n} = \sqrt{np(1-p)}/\sqrt{n} = \sqrt{p(1-p)}\)
As \(n\) gets large, the binomial distribution gets more and more similar to a normal distribution.
As a rule of thumb, we require
\[ np \geq 10 \qquad n(1-p)\geq 10 \]
If this is true, then \[ \text{Binomial}(n,p)\approx\mathcal{N}(np, \sqrt{np(1-p)}) \]
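A sketch comparing the exact and approximate probabilities (the choice \(n=100\), \(p=0.5\), which satisfies the rule of thumb, is arbitrary):

```r
n <- 100; p <- 0.5
# P(X <= 55): exactly, and via the normal approximation
pbinom(55, n, p)                          # exact, about 0.86
pnorm(55, n * p, sqrt(n * p * (1 - p)))   # approximate, about 0.84
```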
The sample proportion is the proportion of successes in the sample:
\[ \hat{p} = \frac{\text{number of successes}}{n} = \frac{X}{n} \]
The sample proportion is an unbiased estimator of the binomial probability.
The sample proportion distribution has mean and standard deviation: \[ \mu_{\hat{p}} = p \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \]
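A simulation sketch of these facts (the values \(n=50\), \(p=0.3\) are arbitrary; `rbinom` generates the success counts):

```r
set.seed(3)  # arbitrary seed
n <- 50; p <- 0.3
phat <- rbinom(5000, n, p) / n  # 5000 sample proportions
mean(phat)  # close to p = 0.3
sd(phat)    # close to sqrt(p * (1 - p) / n) ~ 0.065
```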
The Poisson setting requires:
Events occur independently of each other.
Events occur one at a time, not in simultaneous batches.
Events occur at a constant average rate per unit of measure.
Is this a Poisson setting? If not, what fails?
I count the number of games I win in an hour at a casino.
Is this a Poisson setting? If not, what fails?
I count the number of ice creams sold in a month.
Is this a Poisson setting? If not, what fails?
I count the number of buses arriving in an hour.
Is this a Poisson setting? If not, what fails?
I count the number of customers at the Doner truck between noon and 1pm.
The Poisson distribution counts events in a Poisson setting. It is determined by the average rate of events per unit \(\lambda\).
\[ \mathbb{P}(m) = \frac{e^{-\lambda}\lambda^m}{m!} \]
The Poisson distribution has
\[ \mu = \lambda \qquad \sigma = \sqrt{\lambda} \]
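In R, `dpois` evaluates this probability function. As a sketch (\(\lambda = 4\) is an arbitrary choice), the mean and standard deviation formulas can be checked by weighted sums over a long enough range of counts:

```r
lambda <- 4
m <- 0:200  # effectively the whole support; the tail beyond is negligible
mu    <- sum(m * dpois(m, lambda))
sigma <- sqrt(sum((m - mu)^2 * dpois(m, lambda)))
c(mu, sigma)      # lambda = 4 and sqrt(lambda) = 2
dpois(2, lambda)  # P(2) = e^-4 * 4^2 / 2! ~ 0.147
```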