Significance testing

R has many functions for carrying out significance tests, though none (built-in) for the case of unknown $\mu$ but known $\sigma$.

Here is one, borrowed (and simplified) from the BSDA package:

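z.test <- function(x, mu = 0, sigma = NULL,
    conf.level = 0.95, alternative = "two.sided") {

    # sigma is assumed known, so it must be supplied by the caller
    if (is.null(sigma)) stop("'sigma' (the known standard deviation) must be specified")

    # allow partial matching, e.g. alternative = "g" for "greater"
    choices <- c("two.sided", "greater", "less")
    alt <- pmatch(alternative, choices)
    if (is.na(alt)) stop("'alternative' must be one of: ", paste(choices, collapse = ", "))
    alternative <- choices[alt]

    dname <- deparse(substitute(x))

    x <- x[!is.na(x)]                # drop missing values
    estimate <- mean(x)
    SE <- sigma / sqrt(length(x))    # standard error of the sample mean
    zobs <- (estimate - mu) / SE     # observed value of the test statistic

    names(mu) <- "mean"
    names(zobs) <- "z"
    names(estimate) <- "sample mean of x"
    method <- "One-sample z-Test"

    if (alternative == "less") {
        pval <- pnorm(zobs)                       # lower tail area
        zstar <- qnorm(conf.level)
        MOE <- zstar * SE                         # margin of error
        cint <- c(NA, (estimate - mu) + MOE)      # upper bound only
    } else if (alternative == "greater") {
        pval <- 1 - pnorm(zobs)                   # upper tail area
        zstar <- qnorm(conf.level)
        MOE <- zstar * SE
        cint <- c((estimate - mu) - MOE, NA)      # lower bound only
    } else {
        pval <- 2 * pnorm(-abs(zobs))             # both tails
        alpha <- 1 - conf.level
        zstar <- qnorm(1 - alpha/2)
        MOE <- zstar * SE
        cint <- (estimate - mu) + c(-MOE, MOE)
    }
    cint <- cint + mu                # shift the interval back to the scale of the data

    attr(cint, "conf.level") <- conf.level

    rval <- list(statistic = zobs, p.value = pval, conf.int = cint,
        estimate = estimate, null.value = mu, alternative = alternative,
        method = method, data.name = dname)

    # class "htest" makes the result print like R's built-in tests
    attr(rval, "class") <- "htest"
    return(rval)

}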

You need to copy-and-paste this into your R session for it to work. (So go ahead, copy it over into your R Script.)

The observant student will note this is the exact same function used for finding confidence intervals under similar assumptions!

This function is used like the other R functions. It both computes a confidence interval and performs a test of significance. Here we focus on using it to carry out a test of significance.
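Because z.test returns an object of class "htest", individual pieces of the result can be pulled out by name, just as with R's built-in tests. The sketch below uses the built-in t.test as a stand-in so that it runs on its own; the data vector x is made up purely for illustration.

```r
# Hypothetical measurements, made up purely for illustration
x <- c(5.1, 4.8, 5.6, 5.3, 4.9)

# t.test() is built in; like z.test it returns an object of class "htest"
res <- t.test(x, mu = 5, alternative = "greater")

class(res)       # the class of the returned object
res$p.value      # the p-value on its own
res$conf.int     # the confidence interval (with its conf.level attribute)
res$statistic    # the observed test statistic
```

The same component names (p.value, conf.int, statistic) appear in the list that z.test builds.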

Example: (6.17 of book) Water quality testing. The setup is that water should have a lead level of no more than 15 ppb. Assuming the amount of lead measured in a sample is random, but follows a normal distribution with mean $\mu$ and standard deviation $\sigma = 0.25$, perform a one-sided test of significance of whether the sampled lead indicates a mean greater than $\mu=15$, given a sample with values 15.84, 15.33, and 15.58.

We first specify the null and alternative. Here we have a one-sided alternative:

$H_0: \mu = 15$, against
$H_A: \mu > 15$.

The test statistic will be $Z = (\bar{x}-\mu)/(\sigma/\sqrt{n})$, where $\mu$ is the value specified by $H_0$. It is important to know that under our assumptions, and when $H_0$ is true, this has a standard normal distribution.

QUESTION: What assumption(s) ensure that $Z$ will have a standard normal distribution?

Once that is confirmed, the z.test function will carry out the work. We need to specify:

the data: in this case, both the sample and the value of $\sigma$ to be used
the null hypothesis (through mu=...)
the alternative hypothesis (through alternative="greater", alternative="less", or alternative="two.sided")

water_sample = c(15.84, 15.33, 15.58)
sigma = 0.25
z.test(water_sample, sigma=sigma, mu=15, alternative="greater")

## 
##  One-sample z-Test
## 
## data:  water_sample
## z = 4.0415, p-value = 2.656e-05
## alternative hypothesis: true mean is greater than 15
## 95 percent confidence interval:
##  15.34592       NA
## sample estimates:
## sample mean of x 
##         15.58333

A $p$-value is returned.
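As a check on the output above, the same numbers can be reproduced by hand from the formula for $Z$. This is a minimal sketch using only built-in functions; the names mu0, xbar, SE, zobs, pval and lower are chosen here for illustration.

```r
water_sample <- c(15.84, 15.33, 15.58)
sigma <- 0.25    # known population standard deviation
mu0 <- 15        # value of the mean under the null hypothesis

xbar <- mean(water_sample)
SE <- sigma / sqrt(length(water_sample))   # standard error of the sample mean
zobs <- (xbar - mu0) / SE                  # observed test statistic

# one-sided p-value for the alternative "greater": upper tail area
pval <- pnorm(zobs, lower.tail = FALSE)

# lower bound of the one-sided 95% confidence interval
lower <- xbar - qnorm(0.95) * SE

c(z = zobs, p.value = pval, lower = lower)
```

These should agree with the z, p-value and lower confidence limit printed by z.test.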

QUESTION: What is the $p$-value?

QUESTION: Using $\alpha=0.10$, is this difference (between the observed and expected values) statistically significant at the $\alpha$ significance level?

Example: Average number of skittles in a bag.

Suppose the distribution of skittles in a bag has unknown $\mu$, but known variance $\sigma^2 = 10$. A student wishes to test if the average number of skittles is different from a value of $20$ (a number presumably read on the internet). To investigate, they take a Halloween size bag of candy holding 10 bags of skittles and find these values:

19 14 23 17 15 21 24 18 16 20
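Note that z.test takes the standard deviation through sigma, while this example states the variance. A minimal sketch of entering the data and making the conversion (the variable names are chosen here for illustration):

```r
skittles <- c(19, 14, 23, 17, 15, 21, 24, 18, 16, 20)

sigma <- sqrt(10)        # the known variance is 10, so the standard deviation is sqrt(10)
n <- length(skittles)
SE <- sigma / sqrt(n)    # standard error of the sample mean

c(n = n, sigma = sigma, SE = SE)
```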

QUESTION: The central limit theorem states that if $n$ is large enough, $\bar{x}$ (and hence $Z$) will have an approximately normal distribution. The value of $n=10$ is usually large enough if the population is not too skewed or too long tailed. Make a graphic to investigate whether the sample indicates the population is skewed and/or long tailed. What do you conclude?
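One possible way to make such a graphic is sketched below with base R plotting functions. A boxplot and a normal quantile plot are only two of several reasonable choices, and the interpretation is left to you.

```r
skittles <- c(19, 14, 23, 17, 15, 21, 24, 18, 16, 20)

# side-by-side panels: a boxplot (symmetry, outliers) and a normal
# quantile plot (tail behaviour relative to a normal distribution)
op <- par(mfrow = c(1, 2))
boxplot(skittles, horizontal = TRUE, main = "Boxplot")
qqnorm(skittles)
qqline(skittles)
par(op)    # restore the previous plot layout
```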

QUESTION: State the null and alternative hypotheses that match this example.

QUESTION: Use z.test to compute the $p$-value. Is the difference statistically significant at the $\alpha=0.10$ significance level?

Example: Are students having less fun?

A researcher wants to know if students are having less fun. To get a sense, she uses a validated test that was widely distributed amongst students in 2015 and had an average value of $82$ with a standard deviation of $5$. The researcher gives the test to 8 students and gets these values back:

85 75 81 70 78 76 80 85

The researcher assumes the population is normally distributed, so for a sample of any size $\bar{x}$ (and hence $Z$) will have a normal distribution. She assumes the population has an unknown mean $\mu$ but a known standard deviation $\sigma=5$.

QUESTION: What are the null and one-sided alternative hypotheses for this example? (Are students having less fun than in 2015, when a large sample established that the average for this test was 82?)

QUESTION: What $p$-value did the researcher find, under her assumptions? Is the difference statistically significant at the $\alpha=0.05$ significance level?

QUESTION: Were a two-sided test computed, would she have found that the difference is statistically significant at the $\alpha=0.05$ significance level?
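The three choices for alternative differ only in which tail area of the standard normal distribution is reported. A small sketch with a hypothetical observed value, $z = -2.2$ (made up for illustration, not the value from this example):

```r
zobs <- -2.2    # hypothetical observed test statistic

p_less    <- pnorm(zobs)                       # alternative = "less"
p_greater <- pnorm(zobs, lower.tail = FALSE)   # alternative = "greater"
p_two     <- 2 * pnorm(-abs(zobs))             # alternative = "two.sided"

round(c(less = p_less, greater = p_greater, two.sided = p_two), 4)
```

For a negative $z$, the two-sided $p$-value is twice the "less" one, so the two tests need not lead to the same decision at a given $\alpha$.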

Example: why the fuss?

Why the fuss about assuming $Z$ has a normal distribution? If it isn’t so, then the computed $p$-value might be off, and hence decisions might be made incorrectly. To illustrate (maybe), we use an exponential population with mean $1$ and standard deviation $1$. In a test with $n=3$, if the normal distribution applies, we expect the observed value of $Z$ to be greater than 1.96 only 2.5% of the time.

Let’s see if this is the case. Copy and paste these commands into your R session:

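M = 1        # number of simulated samples
n = 3        # size of each sample
mu = 1       # mean of the exponential population
sigma = 1    # standard deviation of the exponential population

replicate(M, {
    xs = rexp(n)                   # one sample of size n (rate 1, so mean 1)
    SE = sigma/sqrt(n)             # standard error of the sample mean
    zobs = (mean(xs) - mu)/SE      # observed value of Z
    zobs
})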

## [1] -0.8236451

As written (with M=1) this computes just 1 sample and from that 1 observed value of $Z$.

QUESTION: Change to M=20. What proportion of values are greater than $1.96$? (Count by hand.)

QUESTION: Change to M=10000. Save the observed values as zobs and use sum(zobs > 1.96)/M to find the proportion greater. Does this suggest a probability of $0.025$ for the event?
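For comparison, here is a sketch of the same simulation with a normal population having the same mean and standard deviation, the case where $Z$ is exactly standard normal. The seed and the name zobs_normal are choices made here so that the run is reproducible; they are not part of the exercise.

```r
set.seed(1)    # arbitrary seed, only for reproducibility

M <- 10000
n <- 3
mu <- 1
sigma <- 1

zobs_normal <- replicate(M, {
    xs <- rnorm(n, mean = mu, sd = sigma)   # normal rather than exponential
    SE <- sigma / sqrt(n)                   # standard error of the sample mean
    (mean(xs) - mu) / SE                    # observed value of Z
})

sum(zobs_normal > 1.96) / M         # simulated proportion
pnorm(1.96, lower.tail = FALSE)     # theoretical value, about 0.025
```

Comparing this proportion with the one from the exponential population shows how much the normality assumption matters when $n$ is small.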