Confidence Interval Estimation

## 9  Confidence Interval Estimation

In statistics one often would like to estimate unknown parameters for a known distribution. For example, you may think that your parent population is normal, but the mean is unknown, or both the mean and standard deviation are unknown. From a data set you can't hope to know the exact values of the parameters, but the data should give you a good idea what they are. For the mean, we expect that the sample mean or average of our data will be a good choice for the population mean, and intuitively, we understand that the more data we have the better this should be. How do we quantify this?

Statistical theory is based on knowing the sampling distribution of some statistic such as the mean. This allows us to make probability statements about the value of the parameters. Such as we are 95 percent certain the parameter is in some range of values.

In this section, we describe the R functions prop.test, t.test, and wilcox.test used to facilitate the calculations.

### 9.1  Population proportion theory

The most widely seen use of confidence intervals is the estimation of population proportion through surveys or polls. For example, suppose it is reported that 100 people were surveyed and 42 of them liked brand X. How do you see this in the media?

Depending on the sophistication of the reporter, you might see the claim that 42% of the population reports they like brand X. Or, you might see a statement like the survey indicates that 42% of people like brand X, this has a margin of error of 9 percentage points.'' Or, if you find an extra careful reporter you will see a summary such as the survey indicates that 42% of people like brand X, this has a margin of error of 9 percentage points. This is a 95% confidence level.''

Why all the different answers? Well, the idea that we can infer anything about the population based on a survey of just 100 people is founded on probability theory. If the sample is a random sample then we know the sampling distribution of p^ the sample proportion. It is approximately normal.

Let's fix the notation. Suppose we let p be the true population proportion, which is of course
p =
Number who agree
Size of population
and let
 ^ p
=
Number surveyed who agree
size of survey
.

We could say more. If the sampled answers are recorded as Xi where Xi = 1 if it was yes'' and Xi = 0 if no'', then our sample is {X1,X2,...,Xn} where n is the size of the sample and we get
 ^ p
=
X1 + X2 + ··· + Xn
n
.
Which looks quite a lot like an average (X--).

Now if we satisfy the assumptions that each Xi is i.i.d. then p^ has a known distribution, and if n is large enough we can say the following is approximately normal with mean 0 and variance 1:
z =
p -
 ^ p
( p(1 -p) )
 1/2
/ ( n )
 1/2
=
 -- X
- µ
s/ ( n )
 1/2

If we know this, then we can say how close z is to zero by specifying a confidence. For example, from the known properties of the normal, we know that
• z is in (-1,1) with probability approximately 0.68
• z is in (-2,2) with probability approximately 0.95
• z is in (-3,3) with probability approximately 0.998
We can solve algebraically for p as it is quadratic, but the discussion is simplified and still quite accurate if we approximate the denominator by SE=
( p(1-p)/n )
 1/2
(about 0.049 in our example) then we have
P(-1 <
p -
 ^ p
SE
< 1) = .68,   P(-2 <
p -
 ^ p
SE
< 2) = .95,   P(-3 <
p -
 ^ p
SE
< 3) = .998,

Or in particular, on average 95% of the time the interval (p^ - 2SE,p^+2SE) contains the true value of p. In the words of a reporter this would be a 95% confidence level, an answer'' of p^=.42 with a margin of error of 9 percentage points (2*SE in percents).

More generally, we can find the values for any confidence level. This is usually denoted in reverse by calling it a (1-a)100% confidence level. Where for any a in (0,1) we can find a z* with
P(-z* < z < z*) = 1-a
Often such a z* is called z1-a/2 from how it is found. For R this can be found with the qnorm function

> alpha = c(0.2,0.1,0.05,0.001)
> zstar = qnorm(1 - alpha/2)
> zstar
[1] 1.281552 1.644854 1.959964 3.290527

Notice the value z* = 1.96 corresponds to a=.05 or a 95% confidence interval. The reverse is done with the pnorm function:

> 2*(1-pnorm(zstar))
[1] 0.200 0.100 0.050 0.001

In general then, a (1-a)100% confidence interval is then given by
 ^ p
± z* SE

Does this agree with intuition? We should expect that as n gets bigger we have more confidence. This is so because as n gets bigger, the SE gets smaller as there is a
( n )
 1/2
in its denominator. As well, we expect if we want more confidence in our answer, we will need to have a bigger interval. Again this is so, as a smaller a leads to a bigger z*.

Some Extra Insight: Confidence interval isn't always right
The fact that not all confidence intervals contain the true value of the parameter is often illustrated by plotting a number of random confidence intervals at once and observing. This is done in figure 43.

Figure 43: How many 80% confidence intervals contain p?

This was quite simply generated using the command matplot

> m = 50; n=20; p = .5;         # toss 20 coins 50 times
> phat = rbinom(m,n,p)/n        # divide by n for proportions
> SE = sqrt(phat*(1-phat)/n)    # compute SE
> alpha = 0.10;zstar = qnorm(1-alpha/2)
> matplot(rbind(phat - zstar*SE, phat + zstar*SE),
+ rbind(1:m,1:m),type="l",lty=1)
> abline(v=p)                  # draw line for p=0.5


Many other tests follow a similar pattern:
• One finds a good'' statistic that involves the unknown parameter (a pivotal quantity).
• One uses the known distribution of the statistic to make a probabilistic statement.
• One unwraps things to form a confidence interval. This is often of the form the statistic plus or minus a multiple of the standard error although this depends on the good'' statistic.
As a user the important thing is to become knowledgeable about the assumptions that are made to know'' the distribution of the statistic. In the example above we need the individual Xi to be i.i.d. This is assured if we take care to randomly sample the data from the target population.

### 9.2  Proportion test

Let's use R to find the above confidence level10. The main R command for this is prop.test (proportion test). To use it to find the 95% confidence interval we do

> prop.test(42,100)

1-sample proportions test with continuity correction

data:  42 out of 100, null probability 0.5
X-squared = 2.25, df = 1, p-value = 0.1336
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.3233236 0.5228954
sample estimates:
p
0.42

Notice, in particular, we get the 95% confidence interval (0.32,0.52) by default.

If we want a 90% confidence interval we need to ask for it:


>  prop.test(42,100,conf.level=0.90)

1-sample proportions test with continuity correction

data:  42 out of 100, null probability 0.5
X-squared = 2.25, df = 1, p-value = 0.1336
alternative hypothesis: true p is not equal to 0.5
90 percent confidence interval:
0.3372368 0.5072341
sample estimates:
p
0.42

Which gives the interval (0.33,0.50). Notice this is smaller as we are now less confident.

Some Extra Insight: prop.test is more accurate
The results of prop.test will differ slightly than the results found as described previously. The prop.test function actually starts from
|
p -
 ^ p
( p(1-p)/n )
 1/2
| < z*
and then solves for an interval for p. This is more complicated algebraically, but more correct, as the central limit theorem approximation for the binomial is better for this expression.

### 9.3  The z-test

As above, we can test for the mean in a similar way, provided the statistic
 -- X
- µ
s/ ( n )
 1/2
is normally distributed. This can happen if either
• s is known, and the Xi's are normally distributed.
• s is known, and n is large enough to apply the CLT.
Suppose a person weighs himself on a regular basis and finds his weight to be
175    176   173   175   174   173   173   176   173   179

Suppose that s=1.5 and the error in weighing is normally distributed. (That is Xi = µ + ei where ei is normal with mean 0 and standard deviation 1.5). Rather than use a built-in test, we illustrate how we can create our own:


## define a function
> simple.z.test = function(x,sigma,conf.level=0.95) {
+ n = length(x);xbar=mean(x)
+ alpha = 1 - conf.level
+ zstar = qnorm(1-alpha/2)
+ SE = sigma/sqrt(n)
+ xbar + c(-zstar*SE,zstar*SE)
+ }
## now try it
> simple.z.test(x,1.5)
[1] 173.7703 175.6297

Notice we get the 95% confidence interval of (173.7703, 175.6297)

### 9.4  The t-test

More realistically, you may not know the standard deviation. To work around this we use the t-statistic, which is given by
t =
 -- X
s/ ( n )
 1/2
where s, the sample standard deviation, replaces s, the population standard deviation. One needs to know that the distribution of t is known if
• The Xi are normal and n is small then this has the t-distribution with n-1 degrees of freedom.
• If n is large then the CLT applies and it is approximately normal. (In most cases.)
(Actually, the t-test is more forgiving (robust) than this implies.)

Lets suppose in our weight example, we don't assume the standard deviation is 1.5, but rather let the data decide it for us. We then would use the t-test provided the data is normal (Or approximately normal.). To quickly investigate this assumption we look at the qqnorm plot and others

Figure 44: Plot of weights to assess normality

Things pass for normal (although they look a bit truncated on the left end) so we apply the t-test. To compare, we will do a 95% confidence interval (the default)

> t.test(x)

One Sample t-test

data:  x
t = 283.8161, df = 9, p-value = < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
173.3076 176.0924
sample estimates:
mean of x
174.7

Notice we get a different confidence interval.

Some Extra Insight: Comparing p-values from t and z
One may be tempted to think that the confidence interval based on the t statistic would always be larger than that based on the z statistic as always t* > z* . However, the standard error SE for the t also depends on s which is variable and can sometimes be small enough to offset the difference.

To see why t* is always larger than z*, we can compare side-by-side boxplots of two random sets of data with these distributions.

> x=rnorm(100);y=rt(100,9)
> boxplot(x,y)
> qqnorm(x);qqline(x)
> qqnorm(y);qqline(y)

which gives (notice the symmetry of both, but the larger variance of the t distribution).

Figure 45: Plot of random normal data and random t-distributed data

And for completeness, this creates a graph with several theoretical densities.

> xvals=seq(-4,4,.01)
> plot(xvals,dnorm(xvals),type="l")
> for(i in c(2,5,10,20,50)) points(xvals,dt(xvals,df=i),type="l",lty=i)


Figure 46: Normal density and the t-density for several degrees of freedom

### 9.5  Confidence interval for the median

Confidence intervals for the median are important too. They are different mathematically than the ones above, but in R these differences aren't noticed. The R function wilcox.test performs a non-parametric test for the median.

Suppose the following data is pay of CEO's in America in 2001 dollars11, then the following creates a test for the median

> x = c(110, 12, 2.5, 98, 1017, 540, 54, 4.3, 150, 432)
> wilcox.test(x,conf.int=TRUE)

Wilcoxon signed rank test

data:  x
V = 55, p-value = 0.001953
alternative hypothesis: true mu is not equal to 0
95 percent confidence interval:
33.0 514.5

Notice a few things:
• Unlike prop.test and t.test, we needed to specify that we wanted a confidence interval computed.
• For this data, the confidence interval is enormous as the size of the sample is small and the range is huge.
• We couldn't have used a t-test as the data isn't even close to normal.

### 9.6  Problems

9.1
Create 15 random numbers that are normally distributed with mean 10 and s.d. 5. Find a 1-sample z-test at the 95% level. Did it get it right?
9.2
Do the above 100 times. Compute what percentage is in a 95% confidence interval. Hint: The following might prove useful

> f=function () mean(rnorm(15,mean=10,sd=5))
> SE = 5/sqrt(15)
> xbar = simple.sim(100,f)
> alpha = 0.1;zstar = qnorm(1-alpha/2);sum(abs(xbar-10) < zstar*SE)
[1] 87
> alpha = 0.05;zstar = qnorm(1-alpha/2);sum(abs(xbar-10) < zstar*SE)
[1] 92
> alpha = 0.01;zstar = qnorm(1-alpha/2);sum(abs(xbar-10) < zstar*SE)
[1] 98

9.3
The t-test is just as easy to do. Do a t-test on the same data. Is it correct now? Comment on the relationship between the confidence intervals.
9.4
Find an 80% and 95% confidence interval for the median for the exec.pay dataset.
9.5
For the Simple data set rat do a t-test for mean if the data suggests it is appropriate. If not, say why not. (This records survival times for rats.)
9.6
Repeat the previous for the Simple data set puerto (weekly incomes of Puerto Ricans in Miami.).
9.7
The median may be the appropriate measure of center. If so, you might want to have a confidence interval for it too. Find a 90% confidence interval for the median for the Simple data set malpract (on the size of malpractice awards). Comment why this distribution doesn't lend itself to the z-test or t-test.
9.8
The t-statistic has the t-distribution if the Xi's are normally distributed. What if they are not? Investigate the distribution of the t-statistic if the Xi's have different distributions. Try short-tailed ones (uniform), long-tailed ones (t-distributed to begin with), Uniform (exponential or log-normal).

(For example, If the Xi are nearly normal, but there is a chance of some errors introducing outliers. This can be modeled with
Xi = z(µ + s Z) + (1-z)Y
where z is 1 with high probability and 0 otherwise and Y is of a different distribution. For concreteness, suppose µ=0, s=1 and Y is normal with mean 0, but standard deviation 10 and P(z=1) = .9. Here is some R code to simulate and investigate. (Please note, the simulations for the suggested distributions should be much simpler.)

> f =  function(n=10,p=0.95) {
+    y = rnorm(n,mean=0,sd=1+9*rbinom(n,1,1-p))
+    t = (mean(y) - 0) / (sqrt(var(y))/sqrt(n))
+ }
> sample = simple.sim(100,f)
> qqplot(sample,rt(100,df=9),main="sample vs. t");qqline(sample)
> qqnorm(sample,main="sample vs. normal");qqline(sample)
> hist(sample)

The resulting graphs are shown. First, the graph shows the sample against the t-quantiles. A bad, fit. The normal plot is better but we still see a skew in the histogram due to a single large outlier.)

Figure 47: t-statistic for contaminated normal data