Hypothesis Testing: p-values, Bootstrap

Mikael Vejdemo-Johansson

P-Values - Attained Significance Levels

A sample of 51 AAA batteries from Panasonic gave a sample mean zinc mass of 2.06g and a sample standard deviation of 0.141g. Does this data provide compelling evidence for concluding that the population mean zinc mass exceeds 2.0g?

Let’s test this. The question asks whether the mean mass exceeds 2.0g, so this is an upper tail T-test. We have:

\[ H_0: \mu=2.0 \qquad n=51 \qquad \overline{x} = 2.06 \qquad s = 0.141 \\ t = \frac{\overline{x}-\mu_0}{s/\sqrt{n}} = \frac{2.06-2.0}{0.141/\sqrt{51}} \approx 3.039 \]

At a significance level of 5%, for an upper-tail test, and with 51 samples, we need to compare to \(CDF_{T(50)}^{-1}(0.95)\approx1.676\), and are able to reject the null hypothesis and conclude a significant excess.

At a significance level of 1%, for an upper-tail test, and with 51 samples, we need to compare to \(CDF_{T(50)}^{-1}(0.99)\approx2.403\), and are able to reject the null hypothesis and conclude a significant excess.

At a significance level of 0.5%, for an upper-tail test, and with 51 samples, we need to compare to \(CDF_{T(50)}^{-1}(0.995)\approx2.678\), and are able to reject the null hypothesis and conclude a significant excess.

At a significance level of 0.1%, for an upper-tail test, and with 51 samples, we need to compare to \(CDF_{T(50)}^{-1}(0.999)\approx3.261\), and are unable to reject the null hypothesis. We do not have statistical support to consider this a significant excess at a significance level of 0.1%.
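To check these cutoffs numerically, here is a small sketch using scipy (the summary statistics are the ones quoted above):

Code
import numpy as np
import scipy.stats

# Summary statistics from the battery example
n, xbar, s, mu0 = 51, 2.06, 0.141, 2.0
t_stat = (xbar - mu0) / (s / np.sqrt(n))           # ≈ 3.039

for alpha in [0.05, 0.01, 0.005, 0.001]:
    crit = scipy.stats.t.ppf(1 - alpha, df=n - 1)  # upper-tail critical value
    print(f"alpha={alpha}: critical value {crit:.3f}, reject: {t_stat > crit}")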

P-Values - Attained Significance Levels

For many tests, we can extract a single statistic from the test procedure that lets us conclude at a glance what significance levels do and do not allow us to reject the null hypothesis.

Definition

The p-value of a hypothesis test is the probability, conditional on the null hypothesis being assumed true, of obtaining a value of the test statistic at least as contradictory to \(H_0\) as the value calculated from the sample.

In other words, \(p = \PP(T\text{ is at least as extreme as }t | H_0)\)

A small p-value means that, were the null hypothesis true, the observed value of the test statistic (or anything even more extreme) would be very unlikely; a large p-value means that the observed value would be unsurprising under the null hypothesis.

We can generate a decision rule for concluding results of a hypothesis test based on an observed p-value:

  1. Choose a significance level \(\alpha\)
  2. Reject the null hypothesis if \(p<\alpha\)

This decision rule is equivalent to deciding based on rejection regions.

P-Values - Attained Significance Levels

Theorem

The p-value is the smallest significance level \(\alpha\) at which the null hypothesis can be rejected.

Because of this, the p-value can be referred to as the observed significance level or the attained significance level.

Theorem

If the test statistic distribution is continuous, and the null hypothesis is simple, then the distribution of the p-value is uniform on \([0,1]\).
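A quick simulation illustrates the theorem; this is a sketch using a two-sided z-test of \(H_0:\mu=0\) with known \(\sigma=1\), so the null hypothesis is simple and the test statistic is continuous:

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
n, reps = 30, 10000
pvals = np.empty(reps)
for i in range(reps):
    x = rng.normal(0.0, 1.0, size=n)          # the simple null hypothesis is true
    z = x.mean() / (1.0 / np.sqrt(n))         # z-statistic with known sigma = 1
    pvals[i] = 2 * min(scipy.stats.norm.cdf(z), scipy.stats.norm.sf(z))
print(scipy.stats.kstest(pvals, "uniform"))   # large p-value: consistent with Uniform(0,1)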

Computing p-values

For upper-tailed, lower-tailed or two-tailed rejection regions with a known sampling distribution of the test statistic, the p-values are easy to compute. Suppose the test statistic \(T\) follows a distribution \(\mathcal{D}\) when the null hypothesis is true, and that we observe a value \(t\):

Upper-tailed
\(p = \PP(T > t) = 1-CDF_\mathcal{D}(t)\)
Two-tailed
\(p = 2\min\{\PP(T > t), \PP(T < t)\} = 2\min\{CDF_\mathcal{D}(t), 1-CDF_\mathcal{D}(t)\}\)
For \(\mathcal{D}\) symmetric around 0, this is equivalent to \(\PP(|T| > |t|)\)
Lower-tailed
\(p = \PP(T < t) = CDF_\mathcal{D}(t)\)
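Applied to the battery example from earlier (a sketch with scipy; \(\mathcal{D}=T(50)\) and \(t\approx3.039\)), the upper-tailed p-value of roughly 0.002 explains the earlier pattern: we could reject at 0.5% but not at 0.1%.

Code
import scipy.stats

D = scipy.stats.t(df=50)                       # null distribution of the test statistic
t_obs = 3.039
p_upper = D.sf(t_obs)                          # 1 - CDF(t), ≈ 0.002
p_lower = D.cdf(t_obs)                         # CDF(t)
p_two = 2 * min(D.cdf(t_obs), D.sf(t_obs))     # two-tailed
print(p_upper, p_lower, p_two)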

P-values are uniform - use in meta-analysis

Since p-values are uniform on \([0,1]\) under the null hypothesis, p-values from independent studies can be combined in a meta-analysis: Fisher's method, for instance, uses that \(-2\sum_{i=1}^k\log p_i\sim\chi^2_{2k}\) when all \(k\) null hypotheses are true.

Computational Hypothesis Testing - Bootstrap

The Bootstrap can be used for hypothesis testing - Efron and Tibshirani suggest the following generic scheme for a bootstrap hypothesis test:

Bootstrap Hypothesis Test, general form

Test Statistic
\(t(x)\)
Null Hypothesis
Data is distributed as \(\hat{\mathcal{D}}_0\)

Generate \(B\) bootstrap values of \(t(x^*)\), where \(x^*\sim\hat{\mathcal{D}}_0\).

Achieved Significance Level
\(\widehat{ASL}_{boot} = \#\left\{b\in[1,B] \mid t(x^{*b}) \geq t(x)\right\}/B\)
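A minimal generic skeleton of this scheme (a sketch; sample_from_null is a hypothetical callable that draws one dataset from \(\hat{\mathcal{D}}_0\)):

Code
import numpy as np

def bootstrap_asl(x, t, sample_from_null, B=10000, rng=None):
    # sample_from_null(rng) is assumed to return one dataset x* drawn from D0-hat
    rng = rng if rng is not None else np.random.default_rng()
    t_obs = t(x)
    t_star = np.array([t(sample_from_null(rng)) for _ in range(B)])
    return (t_star >= t_obs).sum() / B   # the estimated ASL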

Phipson and Smyth (2016) point out that for a range of Monte Carlo tests (such as permutation tests, simulation tests, or bootstrap tests), while the ideal p-value would be \(p=\PP(t_{simulated}\geq t_{observed})\), the unbiased estimator \(\hat{p} = \frac{\#\{t_{sim} \geq t_{obs}\}}{m}\) (with \(m\) simulated statistics) can produce p-values exactly equal to 0.

Instead, Phipson and Smyth suggest using \(b = \#\{t_{sim} \geq t_{obs}\}\) as the test statistic (instead of \(t_{obs}\)). With this choice, the distribution of the test statistic under the null hypothesis is discrete uniform on the integers \(0,\dots,m\), and a slightly biased estimator of the p-value with better theoretical underpinnings is: \[ p_u = \PP(B \leq b) = \frac{b+1}{m+1} \]
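As a sketch, the Phipson-Smyth correction amounts to a one-line change of the naive estimator:

Code
import numpy as np

def phipson_smyth_pvalue(t_sim, t_obs):
    # (b + 1) / (m + 1), with b exceedances among m simulated statistics
    t_sim = np.asarray(t_sim)
    b = (t_sim >= t_obs).sum()
    m = len(t_sim)
    return (b + 1) / (m + 1)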

Computational Hypothesis Testing - Bootstrap

Example: (Exercise 9:81)

Flame time for cloth used for children’s nightwear was measured, and an average flame time of at most 9.75 seconds has been mandated.

Code
import pandas as pd
import numpy as np
import scipy as sp
import scipy.stats
data = pd.DataFrame({"flame": [9.85,9.93,9.94,9.85,9.88,9.95,9.75,9.77,9.67,9.87,
                               9.67,9.75,9.83,9.92,9.74,9.99,9.95,9.93,9.92,9.89]})

The data itself does not match \(\mathcal{N}(9.75, s^2)\):

Code
from matplotlib import pyplot
p = scipy.stats.probplot(data.flame, sparams=(9.75, data.flame.std()), plot=pyplot)
xmin, xmax, ymin, ymax = pyplot.axis()
p = pyplot.plot([xmin,xmax], [xmin,xmax], 'k-')
p = pyplot.axis([xmin, xmax, ymin, ymax])
p = pyplot.title("Distribution fit against N(9.75, S^2)")
pyplot.show()

However, the slope seems to fit. So if we translate the data, computing \(y_i = x_i - \overline{x} + \mu_0\), we might get a better-fitting null distribution to work with.

Code
data["y"] = data.flame - data.flame.mean() + 9.75
p = scipy.stats.probplot(data.y, sparams=(9.75, data.y.std()), plot=pyplot)
xmin, xmax, ymin, ymax = pyplot.axis()
p = pyplot.plot([xmin,xmax], [xmin,xmax], 'k-')
p = pyplot.axis([xmin, xmax, ymin, ymax])
p = pyplot.title("Distribution fit against N(9.75, S^2)")
pyplot.show()

Computational Hypothesis Testing - Bootstrap

Example: (Exercise 9:81)

Since the transformed data matches the null hypothesis assumption, we can sample from the transformed data to yield bootstrap samples.

For our test statistic, we use: \(t(x) = \frac{\overline{x}-\mu_0}{s/\sqrt{n}}\), where each time we re-estimate \(\overline{x}\) and \(s\).

Code
def t(s): return (s.mean() - 9.75)/(s.std()/np.sqrt(len(s)))
boots = [data.sample(frac=1.0, replace=True).y for _ in range(10000)]
bootT = np.array([t(b) for b in boots])
obsT = t(data.flame)
pu = (bootT >= obsT).sum()/(len(bootT))
pb = ((bootT >= obsT).sum()+1)/(len(bootT)+1)
print(f"""
Unbiased p-value:
: {pu:0.2%}

Phipson-Smyth p-value:
: {pb:0.2%}""")
Unbiased p-value:
0.11%
Phipson-Smyth p-value:
0.12%

With a p-value below our significance threshold, we conclude that the condition has not been met: the flame times of these cloth samples are too long for the mandated safety criterion.

Cautions about p-values

P-values are widely misrepresented and misunderstood, to the point where professional research associations and journals have released high-profile statements requiring information beyond p-values and reducing the emphasis placed on p-values.

The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.
These misconceptions mistake \(\PP(\text{observation} | H_0)\) for \(\PP(H_0 | \text{observation})\).
In frequentist statistics, hypotheses are either true or not, they do not have probabilities attached.
The p-value is not the probability that the observed effects were produced by random chance alone.
The p-value is a statement about how the observed effects relate to a specific hypothesis about how the data behaves. Nothing says that the null hypothesis is going to be an accurate model for “random chance”.
The 0.05 significance level is merely a convention.
0.05 stuck with the community after Fisher’s handbooks used it, but is just a convention with no fundamental motivation.
Benjamin et al. (2017) proposed that the scientific community change the convention from 0.05 to 0.005.
The p-value does not indicate the size or importance of the observed effect.
You can usually drive the p-value down almost arbitrarily by collecting more data (see the sketch after this list).
Pair your observation of a p-value with
  1. A measure of effect size. Most effect size measures deal with two-sample tests - we will return to this topic.
  2. Domain expertise judging the impact of the significant observations.
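A sketch of the last caution (the numbers here are invented for illustration): a mean shift of 0.001, practically negligible against \(\sigma=0.2\), becomes statistically significant once \(n\) is large enough.

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
for n in [100, 10000, 10000000]:
    x = rng.normal(10.001, 0.2, size=n)       # true effect: a shift of +0.001
    res = scipy.stats.ttest_1samp(x, popmean=10.0)
    print(f"n={n:>10}: p = {res.pvalue:.3g}")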

Your Homework

9:4 - Nicholas Basile

Since we are dealing with radioactivity levels in water, it is safer to falsely reject than to falsely accept something harmful. Therefore we choose option 2 to guarantee minimal radioactivity levels: falsely rejecting safe water is better than falsely accepting unsafe water.

MVJ: But the usual probabilities we tend to aim for are \(\PP(\text{false reject}) = 0.05\) and \(\PP(\text{false accept}) = 0.20\) (power 80%).

So with option (1) we would have probability 20% of falsely claiming low radioactivity and 5% of falsely claiming high radioactivity, while with option (2) we would have probability 20% of falsely claiming high radioactivity and 5% of falsely claiming low radioactivity.

The way it is set up, unless we have a surprisingly powerful test, it tends to be easier to claim that the null hypothesis is true by accident than that it is false by accident.

9:11 - Maxim Kleyer

\[n=25 \qquad \mu=10 \qquad \sigma=0.2\]

  1. \(H_0: \mu=10\qquad H_a: \mu\neq 10\)

  2.   \[ \PP(recalibration) \\ \begin{align*} \alpha &= \PP(\overline{x}\geq 10.1032) + \PP(\overline{x}\leq 9.8968) \\ &= 1-\PP\left(z \leq \frac{10.1032-10}{0.2/\sqrt{25}}\right) + \PP\left(z \leq \frac{9.8968-10}{0.2/\sqrt{25}}\right) \\ &= 1-\Phi\left(\frac{0.1032\cdot 5}{0.2}\right)+\Phi\left(\frac{-0.1032\cdot 5}{0.2}\right) \\ &= 1-\Phi(2.58) + \Phi(-2.58) \\ &\approx 1 - 0.9951 + 0.0049 \approx 0.0098 \end{align*} \]
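MVJ: as a quick numerical check of this \(\alpha\) (a sketch, using the cutoff \(z=2.58\) from above):

Code
import scipy.stats

alpha = scipy.stats.norm.sf(2.58) + scipy.stats.norm.cdf(-2.58)
print(f"{alpha:.4f}")   # 0.0099; the table values above round this to 0.0098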

9:11 - Maxim Kleyer

  1.   \[ \begin{align*} \PP(9.8968\leq \overline{x}\leq 10.1032 \mid \mu = 10.1) &= \PP\left(\frac{9.8968-10.1}{0.2/\sqrt{25}} \leq z \leq \frac{10.1032-10.1}{0.2/\sqrt{25}}\right) \\ &= \PP(-5.08\leq z \leq 0.08) \\ &= \Phi(0.08)-\Phi(-5.08) \approx \boxed{0.5319}\\ \\ \PP(9.8968\leq \overline{x}\leq 10.1032 \mid \mu = 9.8) &= \PP\left(\frac{9.8968-9.8}{0.2/\sqrt{25}} \leq z \leq \frac{10.1032-9.8}{0.2/\sqrt{25}}\right) \\ &= \PP(2.42 \leq z \leq 7.58) \\ &= \Phi(7.58) - \Phi(2.42) \approx 1.0 - 0.9922 \approx \boxed{0.0078} \end{align*} \]

  2.   \[ z = \frac{\overline{x}-10}{\sigma/\sqrt{n}} \\ \overline{x}=9.8968 \longrightarrow \frac{9.8968-10}{0.2/\sqrt{25}} = -2.58 \\ \overline{x}=10.1032 \longrightarrow \frac{10.1032-10}{0.2/\sqrt{25}} = 2.58 \\ \boxed{c = 2.58} \]

9:11 - Maxim Kleyer

  1.   \[ z_{0.05/2} = z_{0.025} \approx 1.96 \\ \begin{align*} \frac{\overline{x}-10}{0.2/\sqrt{10}} &\geq 1.96 & \frac{\overline{x}-10}{0.2/\sqrt{10}} &\leq -1.96 \\ \overline{x}-10 &\geq 1.96\cdot0.2/\sqrt{10} & \overline{x}-10 &\leq -1.96\cdot0.2/\sqrt{10} \\ \overline{x}-10 &\geq 0.1239613 & \overline{x}-10 &\leq -0.1239613 \\ \overline{x} &\geq 10.1239613 & \overline{x} &\leq 9.876039 \end{align*} \]

  2. Mean \(\overline{x} = \frac{\sum x_i}{n} = \frac{100.203}{10} = \boxed{10.0203}\)

We do NOT reject \(H_0\).

  1. MVJ: Reject when \(|z| > 2.58\).

9:14 - Nicholas Basile

  1.   \[ \begin{align*} z \geq 2.51 \Rightarrow \PP(x>z) &= 0.0060 \\ z \leq -2.65 \Rightarrow \PP(x<z) &= 0.0040 \\ 0.0060 + 0.0040 &= 0.01 \end{align*} \]

  2.   \[ \begin{align*} \beta(\mu) &= CDF_{\mathcal{N}(0,1)}\left(z_\alpha+\frac{\mu_0-\mu}{\sigma/\sqrt{n}}\right) && \color{Maroon}{\text{This is the one-sided }\beta\text{ formula. /MVJ}} \\ \beta(10.1) &= CDF_{\mathcal{N}(0,1)}\left(2.51+\frac{10-10.1}{0.2/\sqrt{25}}\right) \\ &- CDF_{\mathcal{N}(0,1)}\left(-2.65+\frac{10-10.1}{0.2/\sqrt{25}}\right) \\ &= CDF_{\mathcal{N}(0,1)}(0.01)-CDF_{\mathcal{N}(0,1)}(-5.15) \\ &\approx 0.5040 \\ \beta(9.9) &= CDF_{\mathcal{N}(0,1)}\left(2.51+\frac{10-9.9}{0.2/\sqrt{25}}\right) \\ &- CDF_{\mathcal{N}(0,1)}\left(-2.65+\frac{10-9.9}{0.2/\sqrt{25}}\right) \\ &= CDF_{\mathcal{N}(0,1)}(5.01)-CDF_{\mathcal{N}(0,1)}(-0.15) \\ &\approx 0.5596 \\ \end{align*} \]
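MVJ: the two \(\beta\) values check out numerically (a sketch):

Code
import scipy.stats

Phi = scipy.stats.norm.cdf
print(Phi(0.01) - Phi(-5.15))   # ≈ 0.5040
print(Phi(5.01) - Phi(-0.15))   # ≈ 0.5596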

9:24 - Maxim Kleyer

\[ \begin{align*} H_0: \mu &= 3\,000 & \overline{x} &= \frac{\sum x_i}{n} = \frac{14\,438}{5} = 2\,887.6 \\ H_a: \mu &\neq 3\,000 & df &= n-1 = 4 \end{align*} \\ s = \sqrt{\frac{\sum(x_i-\overline{x})^2}{n-1}} = \sqrt{\frac{28\,241.2}{4}} = \sqrt{7\,060.3} \approx 84.0256 \\ t = \frac{2\,887.6-3000}{84.0256/\sqrt{5}} \approx -2.99 \]

From a t-curve table we may read off, for \(df=4\) and \(|t|=2.99\), a one-tail area of 0.02. For a two-tailed test, we double this value:

\(\boxed{p = 0.04}\)

Since \(0.04 \leq 0.05\), \(H_0\) will be rejected.
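As a sketch, the whole computation can be checked with scipy from the summary statistics alone:

Code
import numpy as np
import scipy.stats

n, xbar, s, mu0 = 5, 2887.6, 84.0256, 3000.0
t_stat = (xbar - mu0) / (s / np.sqrt(n))         # ≈ -2.99
p = 2 * scipy.stats.t.sf(abs(t_stat), df=n - 1)  # two-tailed p-value
print(f"t = {t_stat:.2f}, p = {p:.3f}")          # p ≈ 0.040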

9:33 - MVJ

Notice that \(\mu_0 - (\mu_0+\Delta) = -\Delta\) and \(\mu_0 - (\mu_0 - \Delta) = \Delta\).

We compute

\[ \begin{align*} \beta(\mu_0-\Delta) &= \Phi\left(z_{\alpha/2}+\frac{\mu_0-(\mu_0-\Delta)}{\sigma/\sqrt{n}}\right) & \beta(\mu_0+\Delta) &= \Phi\left(z_{\alpha/2}+\frac{\mu_0-(\mu_0+\Delta)}{\sigma/\sqrt{n}}\right) \\ &-\Phi\left(-z_{\alpha/2}+\frac{\mu_0-(\mu_0-\Delta)}{\sigma/\sqrt{n}}\right) & &-\Phi\left(-z_{\alpha/2}+\frac{\mu_0-(\mu_0+\Delta)}{\sigma/\sqrt{n}}\right) \\ &= \Phi\left(z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right) -\Phi\left(-z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right) & &= \Phi\left(z_{\alpha/2}+\frac{-\Delta}{\sigma/\sqrt{n}}\right) -\Phi\left(-z_{\alpha/2}+\frac{-\Delta}{\sigma/\sqrt{n}}\right) \\ && &= \Phi\left(-\left(-z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right)\right) -\Phi\left(-\left(z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right)\right) \\ &\text{and because }\Phi(-z) = 1-\Phi(z)& &= 1-\Phi\left(-z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right) -(1-\Phi\left(z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right)) \\ && &= \Phi\left(z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right) -\Phi\left(-z_{\alpha/2}+\frac{\Delta}{\sigma/\sqrt{n}}\right) \\ \end{align*} \]

These two expressions are equal, which concludes the proof.
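A numeric sanity check of the symmetry (a sketch; the parameter values are arbitrary):

Code
import numpy as np
import scipy.stats

Phi = scipy.stats.norm.cdf
mu0, sigma, n, Delta, z_a2 = 10.0, 0.2, 25, 0.07, 1.96

def beta(mu):
    shift = (mu0 - mu) / (sigma / np.sqrt(n))
    return Phi(z_a2 + shift) - Phi(-z_a2 + shift)

print(beta(mu0 - Delta), beta(mu0 + Delta))   # the two values agree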

9:34 - MVJ

Rewrite

\[ \beta_u(\mu') = \Phi\left(z_\alpha+\frac{\mu_0-\mu'}{\sigma/\sqrt{n}}\right) = \Phi\left(z_\alpha+\sqrt{n}\frac{\mu_0-\mu'}{\sigma}\right) \\ \beta_2(\mu') = \Phi\left(z_{\alpha/2}+\sqrt{n}\frac{\mu_0-\mu'}{\sigma}\right) - \Phi\left(-z_{\alpha/2}+\sqrt{n}\frac{\mu_0-\mu'}{\sigma}\right) \\ \beta_\ell(\mu') = 1-\Phi\left(-z_\alpha+\frac{\mu_0-\mu'}{\sigma/\sqrt{n}}\right) = 1-\Phi\left(-z_\alpha+\sqrt{n}\frac{\mu_0-\mu'}{\sigma}\right) \]

Since everything else remains fixed as \(n\to\infty\), the factor \(\sqrt{n}\) will end up dominating all other terms within the expression.

For an upper-tail power computation, the alternatives are to the right of \(\mu_0\), so \(\mu_0-\mu'<0\). Hence the entire quantity within the brackets grows more and more negative as \(n\) grows, and \(\lim_{z'\to-\infty}\Phi(z') = 0\).

For a lower-tail power computation, the alternatives are to the left of \(\mu_0\), so \(\mu_0-\mu'>0\). The entire quantity within the bracket grows more and more positive as \(n\) grows, and \(\lim_{z'\to\infty}1-\Phi(z') = 0\).

For a two-tailed power computation, the alternatives are either to the left or the right of \(\mu_0\). At any fixed value of \(n\), \(\beta_2(\mu')\) measures probability mass in a sliver of the standard normal distribution of width \(2z_{\alpha/2}\), but depending on the sign of \(\mu_0-\mu'\), this sliver sits either far out in the left tail or far out in the right tail, moving further and further away as \(n\) grows. Either way, the probability mass captured goes to 0.

9:34 - MVJ

An illustration. Set \(\mu_0 = 1\), \(\mu' = 1.001\) (or \(\mu'=0.999\) for the lower-tail case), \(\sigma=2\), \(\alpha=0.05\). For the two-tailed case, set \(\alpha=0.25\) (to emphasize the sliver’s behavior).

Code
library(tidyverse)
library(patchwork)
data = tibble(
  x = seq(-10, 10, by=0.01),
  ynorm = dnorm(x),
  yu.10 = ifelse(x<qnorm(0.95)+sqrt(10)*(1-1.001)/2,dnorm(x),0),
  yu.10e4 = ifelse(x<qnorm(0.95)+sqrt(10e4)*(1-1.001)/2,dnorm(x),0),
  yu.10e5 = ifelse(x<qnorm(0.95)+sqrt(10e5)*(1-1.001)/2,dnorm(x),0),
  yu.10e6 = ifelse(x<qnorm(0.95)+sqrt(10e6)*(1-1.001)/2,dnorm(x),0),
  yl.10 = ifelse(x>qnorm(0.05)+sqrt(10)*(1-0.999)/2,dnorm(x),0),
  yl.10e4 = ifelse(x>qnorm(0.05)+sqrt(10e4)*(1-0.999)/2,dnorm(x),0),
  yl.10e5 = ifelse(x>qnorm(0.05)+sqrt(10e5)*(1-0.999)/2,dnorm(x),0),
  yl.10e6 = ifelse(x>qnorm(0.05)+sqrt(10e6)*(1-0.999)/2,dnorm(x),0),
  y2.10 = ifelse(qnorm(0.25)+sqrt(10)*(1-1.001)/2 < x & 
                   x<qnorm(0.75)+sqrt(10)*(1-1.001)/2,dnorm(x),0),
  y2.10e4 = ifelse(qnorm(0.25)+sqrt(10e4)*(1-1.001)/2 < x & 
                   x<qnorm(0.75)+sqrt(10e4)*(1-1.001)/2,dnorm(x),0),
  y2.10e5 = ifelse(qnorm(0.25)+sqrt(10e5)*(1-1.001)/2 < x & 
                   x<qnorm(0.75)+sqrt(10e5)*(1-1.001)/2,dnorm(x),0),
  y2.10e6 = ifelse(qnorm(0.25)+sqrt(10e6)*(1-1.001)/2 < x & 
                   x<qnorm(0.75)+sqrt(10e6)*(1-1.001)/2,dnorm(x),0)
)

plots = lapply(names(data)[3:length(data)], function(name) {
  ggplot(data, aes(x=x)) +
    geom_line(aes(y=ynorm)) +
    geom_area(aes(y=.data[[name]])) +
    coord_fixed(ratio=2, xlim=c(-5, 5)) +
    scale_x_continuous(breaks=c(-5,0,5)) +
    scale_y_continuous(breaks=c(0,0.5)) +
    labs(x="", y="")
})

((wrap_plots(plots[1:4], ncol=1, byrow=FALSE) + 
    plot_layout(tag_level="new")) | 
 (wrap_plots(plots[5:8], ncol=1, byrow=FALSE) + 
   plot_layout(tag_level="new")) | 
 (wrap_plots(plots[9:12], ncol=1, byrow=FALSE) + 
   plot_layout(tag_level="new"))) + 
  plot_annotation(tag_levels=list(c("Upper: ", "Lower: ", "2-tailed: "), c("n=10", "n=10e4", "n=10e5", "n=10e6")))