Confidence Intervals

R has many functions for finding confidence intervals, though none (built-in) for the case of unknown \(\mu\) but known \(\sigma\).

Here is one, borrowed (and simplified) from the BSDA package:

z.test <- function(x, mu = 0, sigma = NULL,
    conf.level = 0.95, alternative="two.sided") {

    choices <- c("two.sided", "greater", "less")
    alt <- pmatch(alternative, choices)
    alternative <- choices[alt]

    dname <- deparse(substitute(x))

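    # sigma has no usable default: stop early rather than return an empty result
    if (is.null(sigma)) stop("sigma (the population standard deviation) must be specified")

    x <- x[!is.na(x)]                 # drop missing values
    estimate <- mean(x)
    SE <- sigma/sqrt(length(x))       # standard error of the sample mean
    zobs <- (estimate - mu)/SE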

    names(mu) <- "mean"
    names(zobs) = "z"
    names(estimate) <- c("sample mean of x")
    method <- c("One-sample z-Test")

    if(alternative == "less") {
        pval <- pnorm(zobs)
        zstar = qnorm(conf.level)
        MOE = zstar * SE
        cint <- c(NA, (estimate-mu) +  MOE)
    } else if(alternative == "greater") {
        pval <- 1 - pnorm(zobs)
        zstar = qnorm(conf.level)
        MOE = zstar*SE
        cint <- c( (estimate-mu)  - MOE, NA)
    } else {
        pval <- 2 * pnorm(- abs(zobs))
        alpha <- 1 - conf.level
        zstar <- qnorm((1 - alpha/2))
        MOE = zstar * SE
        cint <- (estimate-mu) + c(-MOE, MOE)
    }
    cint <- cint + mu

    attr(cint, "conf.level") <- conf.level

    rval <- list(statistic = zobs, p.value = pval, conf.int = cint,
        estimate = estimate, null.value = mu, alternative =
                 alternative, method = method, data.name = dname )

    attr(rval, "class") <- "htest"
    return(rval)

}

You need to copy-and-paste this into your R session for it to work. (So go ahead, copy it over into your R Script.) This function is used like the other R functions. It computes both the confidence interval and performs a test of significance. Here we focus on finding confidence intervals.

To find a confidence interval using z.test, we need to specify a data set and a population standard deviation. Optionally, we can specify a confidence level; the default is the usual \(0.95 \cdot 100\)% confidence level.
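The two-sided interval that z.test reports is the sample mean plus or minus a margin of error, \(z^* \cdot \sigma/\sqrt{n}\). As a minimal sketch, here is the same computation done by hand. The data are simulated, and the values of the mean, sigma, and sample size are made up for illustration; they are not part of the babies data.

```r
# Simulated data: 50 draws from a normal population (hypothetical values)
set.seed(1)
x <- rnorm(50, mean = 10, sd = 2)

sigma <- 2            # population standard deviation, assumed known
conf.level <- 0.95    # the default confidence level

alpha <- 1 - conf.level
zstar <- qnorm(1 - alpha/2)        # critical value, about 1.96
SE <- sigma / sqrt(length(x))      # standard error of the sample mean
MOE <- zstar * SE                  # margin of error
ci <- mean(x) + c(-MOE, MOE)       # two-sided confidence interval
ci
```

Calling z.test(x, sigma=2) on the same x should report the same two endpoints.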

Let’s see this work on some data. Copy and paste this command to download a data set we call babies:

babies = read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/nc.csv")

This data set contains values from 1000 births. One of the variables is weeks, which gives the gestation period for each birth. We will treat this data as though it were a random sample from all births in a geographic region, which we informally refer to as the population.

QUESTION: Make a histogram of babies$weeks. Describe the distribution using the terms learned earlier this semester (modes, shape, tails, …)

QUESTION: In a minute, we will use z.test to compute a confidence interval for the mean gestation time for the population from this sample, but first indicate whether the data satisfies any of the assumptions made to ensure that the sample mean has a normal sampling distribution.

The \(z\)-test is special in that we assume the population standard deviation is known. For this data, assume it is \(3\) weeks (for now).

A \(0.95\cdot 100\)% confidence interval is computed by z.test simply by specifying the data and the value of sigma:

z.test(babies$weeks, sigma=3)
## 
##  One-sample z-Test
## 
## data:  babies$weeks
## z = 403.68, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  38.14854 38.52079
## sample estimates:
## sample mean of x 
##         38.33467

The confidence interval is included among other output, so we need to look for it. It looks like:

## 95 percent confidence interval:
##  38.14854 38.52079
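As a rough check, the interval can be recovered from the other numbers in the z.test output. The sketch below uses only the reported sample mean and z statistic. Because the null value is the default mu = 0, the standard error is the sample mean divided by z. The variable names are our own.

```r
estimate <- 38.33467     # sample mean reported by z.test
zobs <- 403.68           # z statistic reported by z.test (null value mu = 0)

SE <- estimate / zobs    # since z = (estimate - 0) / SE
MOE <- qnorm(0.975) * SE # margin of error for a 95% interval
ci <- estimate + c(-MOE, MOE)
round(ci, 2)
# 38.15 38.52
```

The small differences in the later decimal places come from the rounding of z in the printed output.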

To change the confidence level we specify it with the conf.level argument, as in:

z.test(babies$weeks, sigma=3, conf.level=0.90)
## 
##  One-sample z-Test
## 
## data:  babies$weeks
## z = 403.68, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  38.17847 38.49087
## sample estimates:
## sample mean of x 
##         38.33467

(We use 0.90 to specify the \(0.90 \cdot 100\)% confidence level, which is why we write \(90\)% in this awkward way.)
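Changing conf.level changes only the critical value \(z^*\) that multiplies the standard error. The short sketch below computes \(z^*\) for three common levels, using the same qnorm call as the two-sided branch of z.test.

```r
conf.level <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf.level
zstar <- qnorm(1 - alpha/2)   # two-sided critical values
round(zstar, 3)
# 1.645 1.960 2.576
```

A larger confidence level gives a larger \(z^*\), and so a wider interval for the same data and sigma.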

QUESTION: Find a \(0.99\cdot 100\)% confidence interval for the gestation period assuming \(\sigma=3\).

Typically a baby is considered late if it is more than 2 weeks past its expected delivery date. Let’s assume now that this is because the population standard deviation is really \(2\) weeks (not \(3\) as above).

QUESTION: With this assumption (\(\sigma=2\)), compute a \(0.95\cdot 100\)% confidence interval for the data. What is its value? Is it wider or narrower than the one computed above under the assumption that \(\sigma=3\)?

QUESTION: With \(\sigma=2\), being late (over 2 weeks past the expected due date) would apply to roughly what percent of all births?

The variable weight (or babies$weight) contains the birthweight (in pounds) recorded for each birth.

QUESTION: Is this data normally distributed? (Say how you know.)

QUESTION: What does a boxplot of the data show, and can you explain why you might have anticipated its shape?

QUESTION: Assume the \(z\)-test will apply to this data with a population standard deviation of \(1.5\). Compute a \(0.95\cdot 100\)% confidence interval for the mean birthweight. What is the interval? Does it contain \(7\) pounds?

QUESTION: What is the margin of error in the confidence interval computed above?

This data set contains a habit variable indicating whether the mother was a smoker or non-smoker.

QUESTION: Make side-by-side boxplots of the birthweight for both cohorts using this command: plot(weight ~ habit, babies). Does it appear the medians are the same or different?

The following command will return the data for the babies whose mothers are classified as “smoker”:

mom_smoked = subset(babies, habit=="smoker")
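To see what subset does, here is a small sketch on a made-up data frame. The name toy and its values are hypothetical and stand in for babies. Rows where the condition is TRUE are kept; rows where it is FALSE or missing are dropped.

```r
# A tiny, made-up stand-in for the babies data
toy <- data.frame(
  weight = c(7.1, 6.4, 8.0, 5.9, 7.5),
  habit  = c("nonsmoker", "smoker", "nonsmoker", "smoker", NA)
)

smokers <- subset(toy, habit == "smoker")
nrow(smokers)     # 2 rows are kept
smokers$weight    # 6.4 5.9
```

The result is itself a data frame, so its columns are accessed with $ just as with babies.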

QUESTION: Compute a \(0.95\cdot 100\)% confidence interval for the data mom_smoked$weight. What is it? Does it contain \(7\) pounds?

QUESTION: There are 126 mothers who smoked in this data set. Explain (as precisely as you can) the difference between the margins of error for a \(0.95\cdot 100\)% confidence interval for the whole sample of size 1000 and for the smaller sample of size 126.