R has many functions for finding confidence intervals, though none (built-in) for the case of unknown \(\mu\) but known \(\sigma\).
Here is one, borrowed (and simplified) from the BSDA package:
z.test <- function(x, mu = 0, sigma = NULL,
                   conf.level = 0.95, alternative = "two.sided") {
  # sigma (the known population standard deviation) has no usable default
  if (is.null(sigma)) stop("you must supply sigma, the population standard deviation")
  choices <- c("two.sided", "greater", "less")
  alt <- pmatch(alternative, choices)
  alternative <- choices[alt]
  dname <- deparse(substitute(x))
  x <- x[!is.na(x)]                    # drop missing values
  estimate <- mean(x)
  SE <- sigma / sqrt(length(x))        # standard error of the sample mean
  zobs <- (estimate - mu) / SE         # observed z statistic
  names(mu) <- "mean"
  names(zobs) <- "z"
  names(estimate) <- "sample mean of x"
  method <- "One-sample z-Test"
  if (alternative == "less") {
    pval <- pnorm(zobs)
    zstar <- qnorm(conf.level)         # one-sided critical value
    MOE <- zstar * SE
    cint <- c(NA, (estimate - mu) + MOE)
  } else if (alternative == "greater") {
    pval <- 1 - pnorm(zobs)
    zstar <- qnorm(conf.level)         # one-sided critical value
    MOE <- zstar * SE
    cint <- c((estimate - mu) - MOE, NA)
  } else {
    pval <- 2 * pnorm(-abs(zobs))
    alpha <- 1 - conf.level
    zstar <- qnorm(1 - alpha / 2)      # two-sided critical value
    MOE <- zstar * SE
    cint <- (estimate - mu) + c(-MOE, MOE)
  }
  cint <- cint + mu
  attr(cint, "conf.level") <- conf.level
  rval <- list(statistic = zobs, p.value = pval, conf.int = cint,
               estimate = estimate, null.value = mu,
               alternative = alternative, method = method, data.name = dname)
  attr(rval, "class") <- "htest"
  return(rval)
}
You need to copy-and-paste this into your R session for it to work. (So go ahead, copy it over into your R Script.) This function is used like the other R functions. It both computes a confidence interval and performs a test of significance. Here we focus on finding confidence intervals.
To find a confidence interval using z.test we need to specify a data set and a population standard deviation. Optionally, we can specify a confidence level, the default being the usual \(0.95 \cdot 100\)% confidence level.
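For intuition, the two-sided interval that z.test reports is the sample mean plus or minus \(z^* \cdot \sigma/\sqrt{n}\). Here is a minimal sketch of that computation done by hand. The vector x and the value sigma = 2 are made up for illustration and are not from any real data set.

```r
# Made-up sample and an assumed known population standard deviation
x <- c(39, 41, NA, 38, 40, 37, 39)
sigma <- 2
conf.level <- 0.95

x <- x[!is.na(x)]                         # drop missing values, as z.test does
n <- length(x)
SE <- sigma / sqrt(n)                     # standard error of the sample mean
zstar <- qnorm(1 - (1 - conf.level) / 2)  # two-sided critical value, about 1.96
MOE <- zstar * SE                         # margin of error
ci <- mean(x) + c(-MOE, MOE)              # lower and upper limits
ci
```

The interval is centered at the sample mean, and its half-width is the margin of error.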
Let’s see this work on some data. Copy and paste this command to download a data set we call babies:
babies = read.csv("http://www.math.csi.cuny.edu/~tobiasljohnson/214/nc.csv")
This data set contains values from 1000 births. One of the variables is weeks, giving the gestation period for each birth. We will treat this data as though it were a random sample from all births in a geographic region, which we informally refer to as the population.
QUESTION: Make a histogram of babies$weeks. Describe the distribution using the terms learned earlier this semester (modes, shape, tails, …)
QUESTION: In a minute, we will use z.test to compute a confidence interval for the mean gestation time for the population from this sample, but first indicate whether the data satisfies the assumptions made to ensure that the sample mean has a normal sampling distribution.
The \(z\)-test is special in that we assume the population standard deviation is known. For this data, assume it is \(3\) weeks (for now).
A \(0.95\cdot 100\)% confidence interval is computed by z.test simply by specifying the data and \(\sigma\):
z.test(babies$weeks, sigma=3)
##
## One-sample z-Test
##
## data: babies$weeks
## z = 403.68, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 38.14854 38.52079
## sample estimates:
## sample mean of x
## 38.33467
The confidence interval is included among other output, so we need to look for it. It looks like:
## 95 percent confidence interval:
## 38.14854 38.52079
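Because z.test returns a list of class "htest" (the same class that built-in tests such as t.test return), the interval can also be pulled out by name rather than read off the printout. Here is a sketch that uses t.test on simulated data as a stand-in, so that it runs without first defining z.test. The simulated values are an assumption for illustration only.

```r
set.seed(1)                          # make the simulated data reproducible
x <- rnorm(50, mean = 38, sd = 3)    # simulated stand-in data, not the babies data
res <- t.test(x)                     # any "htest" object is accessed the same way
ci <- res$conf.int                   # numeric vector: lower and upper limits
ci[1]                                # lower limit
ci[2]                                # upper limit
attr(ci, "conf.level")               # the confidence level that was used
```

Once z.test is defined as above, z.test(babies$weeks, sigma=3)$conf.int should give the interval in the same way.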
To change the confidence level we specify it with the conf.level argument, as in:
z.test(babies$weeks, sigma=3, conf.level=0.90)
##
## One-sample z-Test
##
## data: babies$weeks
## z = 403.68, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
## 38.17847 38.49087
## sample estimates:
## sample mean of x
## 38.33467
(We use 0.90 to specify the \(0.90 \cdot 100\)% confidence level, hence our awkward way of writing \(90\)%.)
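The confidence level enters the computation only through the critical value \(z^*\), which the two-sided branch of z.test finds with qnorm. A quick sketch of how that value changes with the level:

```r
conf.levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf.levels
zstar <- qnorm(1 - alpha / 2)        # two-sided critical values
round(zstar, 3)                      # 1.645 1.960 2.576
```

A larger critical value gives a wider interval for the same data and the same \(\sigma\).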
QUESTION: Find a \(0.99\cdot 100\)% confidence interval for the mean gestation period assuming \(\sigma=3\).
Typically a baby is considered late if it is more than 2 weeks past its expected delivery date. Let’s assume now that this is because the population standard deviation is really \(2\) weeks (not \(3\) as above).
QUESTION: With this assumption (\(\sigma=2\)), compute a \(0.95\cdot 100\)% confidence interval for the data. What is its value? Is it wider or narrower than the one computed above under the assumption that \(\sigma=3\)?
QUESTION: With \(\sigma=2\), being late (over 2 weeks past the expected due date) would apply to roughly what percent of all births?
The variable weight (or babies$weight) contains the birthweight (in pounds) recorded for each birth.
QUESTION: Is this data normally distributed? (Say how you know.)
QUESTION: What does a boxplot of the data show, and can you explain why you might have anticipated its shape?
QUESTION: Assume the \(z\)-test will apply to this data with a population standard deviation of \(1.5\). Compute a \(0.95\cdot 100\)% confidence interval for the mean birthweight. What is the interval? Does it contain \(7\) pounds?
QUESTION: What is the margin of error in the confidence interval computed above?
This data set contains a habit variable indicating whether the mother was a smoker or non-smoker.
QUESTION: Make side-by-side boxplots of the birthweight for both cohorts using this command:
plot(weight ~ habit, babies)
Does it appear the medians are the same or different?
The following command will return the data for the babies whose mothers are classified as “smoker”:
mom_smoked = subset(babies, habit=="smoker")
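To see what subset is doing, here is a sketch on a tiny made-up data frame. The data frame toy and its values are hypothetical; only the column names mirror the babies data.

```r
# Hypothetical data: five births with made-up weights (in pounds)
toy <- data.frame(
  weight = c(7.1, 6.2, 8.0, 5.9, 7.4),
  habit  = c("nonsmoker", "smoker", "nonsmoker", "smoker", "nonsmoker")
)
toy_smoked <- subset(toy, habit == "smoker")  # keep only rows where habit is "smoker"
nrow(toy_smoked)      # number of rows kept
toy_smoked$weight     # the weights for just those rows
```

The result is itself a data frame, so its columns are accessed with $ just as for babies.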
QUESTION: Compute a \(0.95\cdot 100\)% confidence interval for the data mom_smoked$weight. What is it? Does it contain \(7\) pounds?
QUESTION: There are 126 mothers who smoked in this data set. Explain (as precisely as you can) the difference in the margins of error for a \(0.95\cdot 100\)% confidence interval for the whole sample of size 1000 and the smaller sample of size 126.