Toward inference

There are four key terms in statistics that are related but distinct: the population, a parameter, the sample, and a statistic.

The basic idea is that knowledge of the population is desired. That knowledge is quantified by a parameter, so the value of the parameter is what we want to know.

The population is typically too large to take a complete census of, so a sample is used instead. The sample is summarized with a statistic.

The basic question of statistical inference then is:

How much does the statistic tell us about the unknown parameter?

Populations can describe many different scenarios; for the purposes of this lab we visualize our population as balls colored either red or black. Suppose there are \(R\) red balls, \(B\) black balls, and \(N = R + B\) balls altogether. The unknown proportion of red balls (a parameter) is \(p = R/N = R/(R+B)\).

Our sample will consist of \(n\) balls chosen somehow. In a sample there are \(r\) red balls and \(b\) black balls with \(r+b = n\). The sample proportion is \(\hat{p} = r/n\).
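
For concreteness, here is a small sketch of how the sample proportion is computed; the counts \(r\) and \(b\) below are made up for illustration:

r = 7; b = 3    # hypothetical counts of red and black balls in a sample
n = r + b       # sample size
phat = r / n    # the sample proportion of red
phat
## [1] 0.7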

To see how much \(\hat{p}\) can inform us about \(p\), we will use simulation. In simulation, unlike real life, we know the values of \(p\), \(R\), \(B\), and \(N\).

Here we set some values and create a “population”:

R = 6321
B = 3216
N = R + B
p = R/N

pop = factor(c(rep("red", R), rep("black", B)))   # the population: N balls, R red and B black

c(p=p, N=N) # show population parameters
##           p           N 
##    0.662787 9537.000000

There are many values in pop, as is typical of a large population with an unknown parameter, so we usually take a sample. Different sampling methods include simple random sampling (without replacement), sampling with replacement, and convenience sampling. Here we take a simple random sample of size 10:

srs = sample(pop, 10)
srs
##  [1] red   red   red   red   red   black red   red   red   black
## Levels: black red

A sample can be summarized with the table function:

table(srs)
## srs
## black   red 
##     2     8

Or prop.table to get proportions:

prop.table(table(srs))   # or divide table by n
## srs
## black   red 
##   0.2   0.8
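
As the comment above suggests, dividing the table by the sample size gives the same proportions:

table(srs) / length(srs)   # equivalent to prop.table(table(srs))
## srs
## black   red 
##   0.2   0.8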

The proportion of red, \(\hat{p}\), is random; it depends on the sample. The goal is to see what it can tell us about \(p\).
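
To see this randomness directly (just an illustration; your values will differ from run to run), compute \(\hat{p}\) for two different samples:

prop.table(table(sample(pop, 10)))[2]   # proportion of red in one sample...
prop.table(table(sample(pop, 10)))[2]   # ...and in another; the two will usually differ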

sample(pop, 10, replace=TRUE)   # sampling with replacement
##  [1] red   black red   red   red   black red   red   red   red  
## Levels: black red
pop[1:10]                       # a convenience sample: just the first 10 values
##  [1] red red red red red red red red red red
## Levels: black red

QUESTION: Explain why a convenience sample might not reflect the underlying population.

QUESTION: Look at the output of both sample(pop, 10) and sample(pop,10, replace=TRUE) above. Could you tell which sample came from which methodology, if you didn’t know?

Our goal is to investigate a few questions: Is the sampling distribution of \(\hat{p}\) centered at the parameter \(p\)? How does its spread depend on the sample size \(n\)? And does the size of the population, \(N\), matter?

We will do so through simulation. The replicate function is one way to perform a simulation in R.
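
If replicate is new to you, here is a minimal illustration (a small sketch, not part of the lab itself): it evaluates an expression repeatedly and collects the results in a vector:

replicate(3, mean(sample(pop, 5) == "red"))   # 3 tiny simulations of the proportion of red in samples of size 5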

The following command creates 1000 simulations of sampling 10 values from pop and computing the proportion of red ones:

M = 1000   # number of simulations
n = 10     # sample size
reds <- replicate(M, prop.table(table(sample(pop, n)))[2])   # proportion of red in each sample

QUESTION: Perform the simulation (copy and paste!). View reds with a histogram. Estimate its center visually.

Your histogram will be different, but from the simulation above, we have:

hist(reds)

When \(M\) is large, this histogram essentially shows the sampling distribution of the statistic.
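
One way to compare the histogram's center with the population parameter (a suggestion, assuming reds and p from above are still defined) is to add a reference line at \(p\):

hist(reds)
abline(v = p, col = "red", lwd = 2)   # vertical line at the population parameter p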

QUESTION: Does the “center” of your histogram have any relation to the population parameter \(p\)? Explain.

QUESTION: In the above, change \(n=10\) to \(n=2500\). Repeat the last two questions for this new data set. (That is, estimate the center, and compare this estimate to \(p\).)

QUESTION: Were you to generalize, would you say this statement is TRUE or FALSE: the center of the simulated values is basically the population parameter, regardless of sample size.

When investigating the results of a simulation, we generally take many runs (\(M\) is large), as this allows us to view the variability or spread in the simulation. In real life, we only take one run.

QUESTION: Look at a histogram when \(n=10\) and another when \(n=1000\). Estimate the spread for the data in each histogram. What are your values, and what is your methodology?

QUESTION: Based on your estimates, does this statement seem TRUE or FALSE: the variability of the simulated data does not depend on the sample size.

We can be more systematic than eyeballing. The following commands find the standard deviation for different sample sizes using a simulation of \(M\) runs:

M = 1000
sd10 = sd(replicate(M, prop.table(table(sample(pop, 10)))[2]))
sd40 = sd(replicate(M, prop.table(table(sample(pop, 40)))[2]))
sd160 = sd(replicate(M, prop.table(table(sample(pop, 160)))[2]))
sd640 = sd(replicate(M, prop.table(table(sample(pop, 640)))[2]))
sd2560 = sd(replicate(M, prop.table(table(sample(pop, 2560)))[2]))
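
To compare the five values at a glance (a suggested follow-up, not part of the original commands), collect them into a named vector and look at the ratios of successive values:

sds = c(sd10=sd10, sd40=sd40, sd160=sd160, sd640=sd640, sd2560=sd2560)
sds                   # the five standard deviations
sds[-5] / sds[-1]     # each value divided by the next (n increases by a factor of 4 each step)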

QUESTION: Compute these 5 values and compare them. Which of these seems the more apt description: the standard deviation stays about the same as the sample size increases, or the standard deviation shrinks as the sample size increases (roughly halving each time \(n\) is multiplied by 4)?

QUESTION: If you had only one value (and not \(M=1000\)) from a sample of size 10 or a sample of size 2560, which would you expect to be closer to \(p\)? (The margin of error is a measure of the spread of the sampling distribution.)

QUESTION: In your answer above, MUST your choice be closer to \(p\)?

The mathematical theory and accompanying formulas are easier if sampling with replacement is used. The difference between the two methods is not usually significant when \(n\) is much smaller than \(N\).
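
To see this numerically, here is a sketch (using the pop defined earlier and an arbitrary \(n = 100\), which is much smaller than \(N\)) comparing the spread of \(\hat{p}\) under the two sampling methods:

M = 1000; n = 100
sd_without = sd(replicate(M, prop.table(table(sample(pop, n)))[2]))
sd_with    = sd(replicate(M, prop.table(table(sample(pop, n, replace=TRUE)))[2]))
c(without = sd_without, with = sd_with)   # the two values should be quite close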

QUESTION: When sampling without replacement, can \(n > N\)? (When \(n=N\), the sample is called a “census.”) When sampling with replacement, can \(n > N\)?

We wish to check whether the population size has any bearing on how well the sample proportion, \(\hat{p}\), tracks the parameter \(p\). We do so by creating two populations of very different sizes but with the same proportion of red:

p = 0.6
pop10 = factor(c(rep("red", p*10), rep("black", (1-p)*10)))
pop10000 = factor(c(rep("red", p*10000), rep("black", (1-p)*10000)))
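
A quick sanity check (optional) confirms that the two populations have the same proportion of red even though their sizes differ greatly:

prop.table(table(pop10))      # proportion of red in the population of size 10
prop.table(table(pop10000))   # proportion of red in the population of size 10000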

Now generate the following two simulations:

M = 1000
n = 200
sim10 = replicate(M, prop.table(table(sample(pop10, n, replace=TRUE)))[2])
sim10000 = replicate(M, prop.table(table(sample(pop10000, n, replace=TRUE)))[2])

QUESTION: Make histograms for each simulation. Does the center, spread, or shape differ between the two?
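
If it helps with the comparison, one way (a suggestion only; the xlim values are arbitrary) to put the two histograms side by side on a common scale is:

par(mfrow = c(1, 2))                   # two plots in one row
hist(sim10,    xlim = c(0.4, 0.8))     # sample proportions from the size-10 population
hist(sim10000, xlim = c(0.4, 0.8))     # sample proportions from the size-10000 population
par(mfrow = c(1, 1))                   # restore the default layout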