7 February, 2018

Hypothesis testing

Null and alternative hypotheses divide a model space (or parameter space) \(\Omega\) into two parts: the null model \(\Omega_0\) and the alternative model \(\Omega_1\).

A statistical test decides between the null and the alternative by calculating a test statistic \(T\) and checking whether \(T\) falls in the region of acceptance or the region of rejection (the critical region).

Historical Roots

Ronald Fisher: created significance testing – given a null hypothesis and a sample, determine whether the sample is significantly different from what the null would suggest. No alternative hypothesis.

Jerzy Neyman and Egon Pearson: created hypothesis testing – given two simple hypotheses pick the more likely hypothesis given data.

Bitter conflict between Fisher and Neyman/Pearson. Fisher believed pre-determined rigid reject/accept decisions to be incompatible with the practice of scientific research.

Modern day mainstream testing: a hybrid method, using rejection of the null to test against an explicitly stated alternative.

Example of Fisher's test

The lady tasting tea experiment.

Muriel Bristol claimed to be able to tell whether milk or tea was added first. Fisher presented 8 cups, 4 prepared each way, in random order.

\[ {8\choose 4} = \frac{8!}{4!4!} = 70 \text{ possible combinations} \]

\(\mathbb{P}(\text{at most }1\text{ wrong}) = 17/70 \approx 0.24\), \(\mathbb{P}(0\text{ wrong}) = 1/70 \approx 0.01\)
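These probabilities are quick to verify numerically; a minimal sketch in Python (standard library only):

```python
from math import comb

total = comb(8, 4)  # 70 equally likely ways to pick which 4 cups had milk first

# Exactly j wrong among her 4 "milk first" picks: choose 4-j of the true
# milk-first cups and j of the tea-first cups.
p_none_wrong = comb(4, 4) * comb(4, 0) / total   # 1/70  ~ 0.014
p_one_wrong = comb(4, 3) * comb(4, 1) / total    # 16/70 ~ 0.229

print(total)                            # 70
print(p_none_wrong)                     # ~0.0143
print(p_none_wrong + p_one_wrong)       # ~0.2429, i.e. 17/70: at most one wrong
```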

Fisher set out

  • Null hypothesis: Lady Bristol unable to tell
  • Rejection region: All cups correctly identified

Components of a statistical test

  • Model class with a model parameter \(\theta\) in some parameter space \(\Omega\). Picking \(\theta\) should completely determine a probability distribution.
  • Decomposition \(\Omega = \Omega_0\cup\Omega_1\) into a null region and an alternative region.
  • Test statistic \(T\in\mathcal{T}\).
  • Decomposition of the space \(\mathcal{T}\) of valid values of \(T\) into \(\mathcal{T}=AR\cup RR\), an acceptance region and a rejection region.

Error types

                    null true                        alternative true
null rejected       Type I error (False positive)    True positive
null not rejected   True negative                    Type II error (False negative)

Alternative setup (Gelman): compare two parameters \(\theta_1, \theta_2\), using some statistic related to \(\Delta\theta = \theta_1-\theta_2\).

  • Type S error Claim \(\theta_2>\theta_1\) when \(\theta_2 < \theta_1\). (ie wrong Sign on \(\Delta\theta\))
  • Type M error Claim \(\theta_1\approx\theta_2\) when they are significantly different (ie wrong Magnitude on \(\Delta\theta\))

Quality of a test

We associate to a test two quantities:

  • \(\alpha\) the level or significance level of the test. \(\alpha=\mathbb{P}(\text{False reject})\)
  • \(\beta(\theta)\) the power function of the test. A function of the assumed true value \(\theta\), we define \(\beta(\theta)=\mathbb{P}(\text{Reject}|\theta)\).

Note \(\alpha=\sup_{\theta\in\Omega_0}\beta(\theta)\).

Lady tasting tea

Suppose that Lady Bristol identifies the composition of each cup correctly and independently with constant probability \(p\). Then the number of correct identifications is Binomial\((8,p)\), the test rejects the null at 8 successes, and the power function is \(\beta(p)=p^8\). Under this independent-guess model the level is \(\beta(1/2)=2^{-8}\approx0.004\); under Fisher's actual design (exactly 4 cups of each kind must be named) the null probability of a perfect score is \(1/70\approx0.014\), as computed above.
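A short numeric sketch of the power function under the independent-guess model (so \(\beta(p)=p^8\); the chosen values of \(p\) are just illustrative):

```python
# Power of the "reject only on 8/8 correct" test when each cup is identified
# independently with probability p: beta(p) = p**8.
for p in (0.5, 0.7, 0.9, 0.99):
    print(p, p**8)
# 0.5  0.0039...   (the binomial-model level)
# 0.7  0.0576...
# 0.9  0.4305...
# 0.99 0.9227...
```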

Randomized testing

A test could be structured with a soft membership in the rejection region. We will describe a statistical test by a test function \(\phi:X\to[0,1]\), where \(\phi(X)\) is the chance of rejecting \(H_0\). A specific test evaluates \(\phi(x)\) on data, and randomly rejects \(H_0\) with probability \(\phi(x)\).

These tests are called randomized tests.

Non-randomized tests fit in this framework: a non-randomized test with rejection region \(RR\) has test function \(\phi=\mathbb 1_{RR}\).

Conversely, a test function taking only the values 0 and 1 corresponds to a non-randomized test with rejection region \(RR=\{x:\phi(x)=1\}\).
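A minimal sketch of this framework in Python; the particular test function `phi` below is an arbitrary illustration, not tied to any example in these notes:

```python
import random

def phi(x):
    """Test function: the probability of rejecting H0 after observing x.
    Illustrative choice: reject outright for large x, randomize in a middle band."""
    if x > 2:
        return 1.0
    if x > 1:
        return 0.3
    return 0.0

def run_test(x, rng=random):
    """Carry out the randomized test: reject H0 with probability phi(x)."""
    return rng.random() < phi(x)

print(run_test(2.5))   # always True (reject)
print(run_test(1.5))   # True with probability 0.3
```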

Simple vs Simple testing

A hypothesis is simple if \(|\Omega_j|=1\), ie the distribution is completely determined.

Example I have two coins. One has \(\mathbb{P}(H) = 0.5\) the other has \(\mathbb{P}(H) = 0.75\). I have \(\Omega=\{0.5,0.75\}\), and choosing between them is a test between two simple hypotheses.

We like our tests to have a low level and high power. For a simple vs. simple test \(\phi\) this means \[ \alpha=\mathbb{E}_0\phi = \int\phi d\mathbb{P}_0 \text{ minimized} \\ \beta(1) = \mathbb{E}_1\phi = \int\phi d\mathbb{P}_1 \text{ maximized} \]

Simple vs Simple testing

Claim Suppose \(k\geq 0\) and \(\phi^*\) maximizes \(\mathbb{E}_1\phi-k\mathbb{E}_0\phi\), and \(\mathbb{E}_0\phi^*=\alpha\). Then \(\phi^*\) maximizes \(\beta(1)\) over all \(\phi\) with level at most \(\alpha\).

Proof Suppose \(\phi\) has level at most \(\alpha\), ie \(\mathbb{E}_0\phi\leq\alpha\). Then \[ \mathbb{E}_1\phi \leq \mathbb{E}_1\phi-k(\mathbb{E}_0\phi-\alpha) \leq \mathbb{E}_1\phi^*-k\mathbb{E}_0\phi^*+k\alpha = \mathbb{E}_1\phi^* \] where the first inequality uses \(k\geq0\) and \(\mathbb{E}_0\phi\leq\alpha\), the second uses the maximality of \(\phi^*\), and the final equality uses \(\mathbb{E}_0\phi^*=\alpha\).

Simple vs Simple testing

We can find the maximizer \(\phi^*\) easily because \(\mathbb E_1\phi-k\mathbb E_0\phi =\) \[ \int\phi d\mathbb{P}_1-k\int\phi d\mathbb{P}_0 = \int(p_1(x)-kp_0(x))\phi(x)d\mu(x) =\\ \int_{p_1(x)>kp_0(x)}|p_1(x)-kp_0(x)|\phi(x)d\mu(x) - \\ \int_{p_1(x)<kp_0(x)}|p_1(x)-kp_0(x)|\phi(x)d\mu(x) \] To maximize this, we must have \[ \phi^*(x) = \begin{cases} 1 & p_1(x) > kp_0(x) \\ 0 & p_1(x) < kp_0(x) \end{cases} \]

Simple vs Simple testing

If \(p_0(x)=0\) we can reject \(\Omega_0\) outright for a maximally powerful test. Wherever \(p_0(x)\neq 0\), consider the likelihood ratio \(L(x)=p_1(x)/p_0(x)\). Let \[ \phi(x) = \begin{cases} 1 & p_0(x) = 0 \\ 1 & L(x) > k \\ 0 & L(x) < k \\ p & L(x) = k \text{ for any } p\in[0,1] \end{cases} \]

Any such test is called a likelihood ratio test. The limit case \(k=\infty\) is the test with \(RR=\{x:p_0(x)=0\}\). \(k\) is a parameter used to achieve the desired level.
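A sketch of the likelihood ratio test as a higher-order function; `p0` and `p1` stand for the null and alternative densities (or mass functions), and `k`, `p` are the threshold and boundary probability from the definition above. The example call uses the two coins from before with an arbitrarily chosen \(k=1\):

```python
def likelihood_ratio_test(p0, p1, k, p=0.5):
    """Build the test function phi of a likelihood ratio test with threshold k."""
    def phi(x):
        if p0(x) == 0:
            return 1.0              # null impossible at x: always reject
        ratio = p1(x) / p0(x)
        if ratio > k:
            return 1.0
        if ratio < k:
            return 0.0
        return p                    # boundary case L(x) == k: reject with probability p
    return phi

# Example: the two coins, a single flip (x = 1 means heads).
phi = likelihood_ratio_test(p0=lambda x: 0.5,
                            p1=lambda x: 0.75 if x == 1 else 0.25,
                            k=1.0)
print(phi(1), phi(0))   # 1.0 (ratio 1.5 > 1), 0.0 (ratio 0.5 < 1)
```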

Neyman-Pearson's Lemma

Theorem Given any level \(\alpha\in[0,1]\), there is a likelihood ratio test \(\phi_\alpha\) with level \(\alpha\). Any likelihood ratio test with level \(\alpha\) maximizes \(\mathbb E_1\phi\) among all tests with level at most \(\alpha\).

Proof Write \(R_{NP}\) for the Neyman-Pearson rejection region \(R_{NP}=\{x:p_1(x)/p_0(x)>k\}\), with \(k\) chosen s.t. \(\mathbb{P}(R_{NP}|\theta_0)=\alpha\).

Consider any other test with level at most \(\alpha\) and rejection region \(R_A\), where \(\mathbb{P}(R|\theta)=\int_Rp(x|\theta)dx\).

\(\alpha = \mathbb{P}(R_{NP}|\theta_0)\geq\mathbb{P}(R_A|\theta_0)\). Split these two probabilities: \[ \mathbb P(R_{NP}|\theta_0) = \mathbb P(R_{NP}\cap R_A|\theta_0) + \mathbb P(R_{NP}\setminus R_A|\theta_0) \\ \mathbb P(R_{A}|\theta_0) = \mathbb P(R_{NP}\cap R_A|\theta_0) + \mathbb P(R_{A}\setminus R_{NP}|\theta_0) \\ \]

Neyman-Pearson's Lemma

Theorem There is a likelihood ratio test \(\phi_\alpha\) with level \(\alpha\). Any likelihood ratio test with level \(\alpha\) maximizes \(\mathbb E_1\phi\).

Proof

\[ \mathbb P(R_{NP}|\theta_0) = \color{red}{\mathbb P(R_{NP}\cap R_A|\theta_0)} + \mathbb P(R_{NP}\setminus R_A|\theta_0) \\ \mathbb P(R_{A}|\theta_0) = \color{red}{\mathbb P(R_{NP}\cap R_A|\theta_0)} + \mathbb P(R_{A}\setminus R_{NP}|\theta_0) \text{ so}\\ \mathbb P(R_{NP}\setminus R_A|\theta_0) \geq \mathbb P(R_{A}\setminus R_{NP}|\theta_0) \]

In the same way, the desired conclusion \(\mathbb P(R_{NP}|\theta_1) \geq \mathbb P(R_A|\theta_1)\) is equivalent to \(\mathbb P(R_{NP}\setminus R_A|\theta_1) \geq \mathbb P(R_{A}\setminus R_{NP}|\theta_1)\), which we verify next.

Neyman-Pearson's Lemma

Theorem There is a likelihood ratio test \(\phi_\alpha\) with level \(\alpha\). Any likelihood ratio test with level \(\alpha\) maximizes \(\mathbb E_1\phi\).

Proof… \[ \mathbb{P}(R_{NP}\setminus R_A|\theta_1) = \int_{R_{NP}\setminus R_A}p_1(x)dx \geq k\int_{R_{NP}\setminus R_A}p_0(x)dx \\ \text{[because in $R_{NP}$, $p_1(x)/p_0(x) \geq k$]} \\ = k\mathbb{P}(R_{NP}\setminus R_A | \theta_0) \geq k\mathbb{P}(R_{A}\setminus R_{NP}|\theta_0) = k\int_{R_A\setminus R_{NP}}p_0(x)dx \\ \geq\int_{R_{A}\setminus R_{NP}}p_1(x)dx \text{ [because outside $R_{NP}$, $p_1(x)/p_0(x)\leq k$]} \]

Example

I draw one sample uniformly at random either from \((0,2)\) or from \((1,3)\).

What is the most powerful test to distinguish, with \(\alpha=0.05\)?

Example

I draw one sample uniformly at random either from \((0,2)\) or from \((1,3)\).

What is the most powerful test with \(\alpha=0.05\)?

Let \(\theta_0=(0,2)\) and \(\theta_1=(1,3)\). The likelihood is constant for a uniform distribution, so the likelihood ratio test takes the form \[ \phi(x) = \begin{cases} 0 & \text{whenever $p_1(x)=0$, ie when $x < 1$}\\ p & \text{whenever $p_1(x)=p_0(x)$, ie when $x\in[1,2]$}\\ 1 & \text{whenever $p_0(x) = 0$, ie when $x>2$}\\ \end{cases} \] The level is \(\mathbb E_0\phi = p\cdot\mathbb P_0(x\in[1,2]) = p/2\), so to attain \(\alpha=0.05\) we pick \(p=2\alpha=0.1\).

Alternatively, the non-randomized test whose rejection region is exactly where \(\theta_0\) is impossible, ie \(x\in(2,3)\), has level 0 and power \(1/2\) (slightly below the \(0.55\) of the randomized test above).
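A Monte Carlo sanity check of the level and power of this test, assuming the randomization probability \(p=2\alpha=0.1\) derived above (a sketch):

```python
import random

ALPHA = 0.05

def phi(x):
    """Most powerful test of Uniform(0,2) against Uniform(1,3) at level ALPHA."""
    if x > 2:
        return 1.0          # p0(x) = 0: reject
    if x >= 1:
        return 2 * ALPHA    # likelihood ratio equals 1 here: randomize
    return 0.0

def rejection_rate(low, high, n=200_000, rng=random.Random(0)):
    """Estimate P(reject) when the sample is Uniform(low, high)."""
    return sum(rng.random() < phi(rng.uniform(low, high)) for _ in range(n)) / n

print(rejection_rate(0, 2))   # level: close to 0.05
print(rejection_rate(1, 3))   # power: close to 0.5 + 0.1 * 0.5 = 0.55
```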

Uniqueness of the Neyman-Pearson test

Claim Let \(\alpha\in[0,1]\) and let \(k\) be picked for \(\phi_\alpha\) to achieve level \(\alpha\).

If \(\phi^*\) maximizes \(\mathbb{E}_1\phi\) among all tests of level at most \(\alpha\), then \(\phi^*(x)=\phi_\alpha(x)\) for almost every \(x\) with \(p_1(x)\neq kp_0(x)\).

Proof If \(k=0\), then \(\phi_\alpha=1\) whenever \(p_1>0\), and \(\mathbb{E}_1\phi_\alpha=1\). If \(\phi^*\) attains the maximum, then \(\mathbb E_1\phi^*=1\) too, and since \(\phi_\alpha=1\) almost everywhere under \(\mathbb P_1\), \[ 0 = 1-\mathbb E_1\phi^* = \mathbb E_1|\phi_\alpha-\phi^*| = \int_{p_1(x)\neq kp_0(x)}|\phi_\alpha-\phi^*|p_1dx \] so \(\phi_\alpha=\phi^*\) almost everywhere on \(\{p_1\neq kp_0\}\).

Uniqueness of the Neyman-Pearson test

Claim If \(\phi^*\) maximizes \(\mathbb{E}_1\phi\), then \(\phi^*(x)=\phi_\alpha(x)\) for almost every \(x\) with \(p_1(x)\neq kp_0(x)\).

Proof If \(k=\infty\), then \(\phi_\alpha=0\) wherever \(p_0>0\) (and \(\phi_\alpha=1\) wherever \(p_0=0\)), so \(\alpha=0\). For \(\phi^*\) with level 0 we have \(\mathbb E_0\phi^*=0\), and since \(\phi_\alpha=0\) wherever \(p_0>0\), \[ 0 = \mathbb{E}_0\phi^* = \mathbb{E}_0|\phi^*-\phi_\alpha| = \int_{p_0(x)>0} |\phi_\alpha-\phi^*|p_0dx \] so the two tests agree almost everywhere on \(\{p_0>0\}\); on \(\{p_0=0,\,p_1>0\}\) maximal power forces \(\phi^*=1=\phi_\alpha\) almost everywhere as well.

So \(\phi_\alpha=\phi^*\) almost everywhere.

Uniqueness of the Neyman-Pearson test

Claim If \(\phi^*\) maximizes \(\mathbb{E}_1\phi\), then \(\phi^*(x)=\phi_\alpha(x)\) for almost every \(x\) with \(p_1(x)\neq kp_0(x)\).

Proof Let \(0<k<\infty\). Define regions \(B_1=\{x:p_1(x) > kp_0(x)\}\) and \(B_2=\{x:p_1(x) < kp_0(x)\}\).

Both \(\phi_\alpha\) and \(\phi^*\) maximize power, so \(\mathbb E_1\phi_\alpha=\mathbb E_1\phi^*\). Since \(\phi_\alpha\) maximizes \(\mathbb E_1\phi-k\mathbb E_0\phi\), it follows that \(k\mathbb{E}_0\phi^* \geq k\mathbb{E}_0\phi_\alpha = k\alpha\); combined with the level constraint \(k\mathbb{E}_0\phi^*\leq k\alpha\), this gives \(\mathbb{E}_0\phi^*=\alpha\).

Uniqueness of the Neyman-Pearson test

Claim If \(\phi^*\) maximizes \(\mathbb{E}_1\phi\), then \(\phi^*(x)=\phi_\alpha(x)\) for almost every \(x\) with \(p_1(x)\neq kp_0(x)\).

Proof We know \(\phi_\alpha(x)=1\) on \(B_1\) and \(=0\) on \(B_2\). Since \(\mathbb E_1\phi_\alpha=\mathbb E_1\phi^*\) and \(\mathbb E_0\phi_\alpha=\mathbb E_0\phi^*=\alpha\): \[ 0 =\mathbb{E}_1(\phi_\alpha-\phi^*)-k\mathbb{E}_0(\phi_\alpha-\phi^*) = \\ \int_{p_1(x)>kp_0(x)}|p_1-kp_0|(1-\phi^*)d\mu + \\ \int_{p_1(x)<kp_0(x)}|p_1-kp_0|\phi^*d\mu \]

Both integrands are non-negative and the weight \(|p_1-kp_0|\) is strictly positive on \(B_1\) and \(B_2\), so both integrals vanish: \(\phi^*=1\) almost everywhere on \(B_1\) and \(\phi^*=0\) almost everywhere on \(B_2\), ie \(\phi_\alpha=\phi^*\) almost everywhere off \(\{p_1=kp_0\}\).

Power of the Neyman-Pearson test

Claim If the two simple hypotheses are different and \(0<\alpha<1\), then \(\mathbb E_1\phi_\alpha>\alpha\).

Proof Let \(\phi'(x)=\alpha\). Then \(\mathbb{E}_1\phi'=\alpha\), so since \(\phi_\alpha\) maximizes \(\mathbb E_1\phi\) among tests with level at most \(\alpha\), we get \(\mathbb{E}_1\phi_\alpha\geq\mathbb E_1\phi' = \alpha\).

It remains to exclude \(\mathbb{E}_1\phi_\alpha=\alpha\). Suppose equality holds. Then \(\phi'\) also maximizes power, and so by the uniqueness theorem, \(\phi_\alpha=\phi'\) for almost every \(x\) with \(p_1(x)\neq kp_0(x)\). However, \(\phi_\alpha\in\{0,1\}\) off \(\{p_1=kp_0\}\), while \(0<\alpha<1\). So \(p_1(x)=kp_0(x)\) almost everywhere. Therefore \(\int p_1dx = k\int p_0dx\), so \(k=1\) and the density functions agree almost everywhere.

Hence, equality implies the hypotheses are equal.

Example

Two coins, \(\mathbb P=1/2\) and \(\mathbb P = 3/4\). Flip one twice, and test against the null hypothesis of the coin being fair. Find the maximal power test at \(\alpha=0.05\).

\[ L(X) = \frac{\mathcal L(X|3/4)}{\mathcal L(X|1/2)} = \frac{{2\choose X}(3/4)^X(1/4)^{2-X}}{{2\choose X}1/2^2} = \frac{3^X}{4} \]

Under the null hypothesis, the distribution of \(L(X)\) is \[ L(X) = \begin{cases} 1/4 & \text{with probability } 1/4 \ (X=0) \\ 3/4 & \text{with probability } 2/4 \ (X=1) \\ 9/4 & \text{with probability } 1/4 \ (X=2) \\ \end{cases} \]

Example

Two coins, \(\mathbb P=1/2\) and \(\mathbb P = 3/4\). Flip one twice, and test against the null hypothesis of the coin being fair. Find the maximal power test at \(\alpha=0.05\).

\[ L(X) = \begin{cases} 1/4 & \text{with probability } 1/4 \ (X=0) \\ 3/4 & \text{with probability } 2/4 \ (X=1) \\ 9/4 & \text{with probability } 1/4 \ (X=2) \\ \end{cases} \]

If we pick \(k<9/4\) then \(L(2)=9/4>k\) and \(\phi(2)=1\). Then \(\mathbb E_0\phi(X) \geq \phi(2)\mathbb{P}_0(X=2)=1/4\), so \(k<9/4\) gives too large an \(\alpha\).

If \(k>9/4\) then \(L(X)<k\) for every \(X\), so \(\phi\equiv 0\) and \(\alpha=0\), smaller than desired.

Example

Two coins, \(\mathbb P=1/2\) and \(\mathbb P = 3/4\). Flip one twice, and test against the null hypothesis of the coin being fair. Find the maximal power test at \(\alpha=0.05\).

\[ L(X) = \begin{cases} 1/4 & \text{with probability } 1/4 \ (X=0) \\ 3/4 & \text{with probability } 2/4 \ (X=1) \\ 9/4 & \text{with probability } 1/4 \ (X=2) \\ \end{cases} \]

The way to get the exact level desired is to use a randomized test, and let \(k=9/4\) so that we can pick \(\phi(2)\) freely: \[ \alpha=\mathbb E_0\phi(X) = \phi(2)\mathbb{P}\left(2\middle|\frac12\right) = \phi(2)/4 \] so pick \(\phi(2)=4\alpha=0.2\).
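A quick numeric check of the randomized test just derived (two flips, \(\phi(2)=0.2\), all other values 0), a sketch:

```python
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p**x * (1 - p)**(n - x)

phi = {0: 0.0, 1: 0.0, 2: 0.2}   # reject only at X = 2, and then with probability 0.2

level = sum(phi[x] * binom_pmf(x, 2, 0.5) for x in phi)    # 0.2 * 1/4  = 0.05
power = sum(phi[x] * binom_pmf(x, 2, 0.75) for x in phi)   # 0.2 * 9/16 = 0.1125
print(level, power)
```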

Example

Two coins, \(\mathbb P=1/2\) and \(\mathbb P = 3/4\). Flip one 10 times, and test against the null hypothesis of the coin being fair. Find the maximal power test at \(\alpha=0.05\).

\[ L(X) = \frac{\mathcal L(X|3/4)}{\mathcal L(X|1/2)} = \frac{{10\choose X}(3/4)^X(1/4)^{10-X}}{{10\choose X}1/2^{10}} = \frac{3^X}{4^5} \]

x                0      1      2      3      4      5      6      7      8      9      10
L(x)             0.001  0.003  0.009  0.026  0.079  0.237  0.712  2.136  6.407  19.222 57.665
P(X=x | p=1/2)   0.001  0.010  0.044  0.117  0.205  0.246  0.205  0.117  0.044  0.010  0.001
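The table can be reproduced directly (a sketch):

```python
from math import comb

n = 10
for x in range(n + 1):
    ratio = 3**x / 4**5           # L(x) = (3/4)^x (1/4)^(n-x) / (1/2)^n = 3^x / 4^5
    null_pmf = comb(n, x) / 2**n  # Binomial(10, 1/2) probability of x successes
    print(x, round(ratio, 3), round(null_pmf, 3))
```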

Example

Two coins, \(\mathbb P=1/2\) and \(\mathbb P = 3/4\). Flip one 10 times, and test against the null hypothesis of the coin being fair. Find the maximal power test at \(\alpha=0.05\). \(L(X)=3^X/4^5\).

By examination, we see that if the threshold value \(k=6\) then \(\phi_\alpha(8)=\dots=\phi_\alpha(10)=1\), and \(\mathbb{E}_0\phi\approx0.055\).

With a threshold of \(k=7\), the rejection region instead contains only \(9,10\) and \(\mathbb{E}_0\phi\approx0.011\).

To get \(\alpha=0.05\) we need to set \(k=3^8/4^5\) and adjust \(\phi(8)\); \[ \alpha = \mathbb{E}_0\phi = \phi(8)\mathbb{P}(X=8) + \mathbb{P}(X>8) = \\ 2^{-10}\left(\phi(8){10\choose 8} + {10\choose 9} + {10\choose 10} \right) \]

Example

Two coins, \(\mathbb P=1/2\) and \(\mathbb P = 3/4\). Flip one 10 times, and test against the null hypothesis of the coin being fair. Find the maximal power test at \(\alpha=0.05\). \(L(X)=3^X/4^5\). \(k=3^8/4^5\) and \[ 2^{10}\alpha - {10\choose 9} - {10\choose 10} = \phi(8){10\choose 8} \\ \phi(8) = \frac{2^{10}\alpha- {10\choose 9} - {10\choose 10}}{10 \choose 8} \approx 0.89 \]
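Numerically (a sketch), the boundary value \(\phi(8)\) and the rejection probabilities of the resulting test are:

```python
from math import comb

alpha = 0.05
phi8 = (2**10 * alpha - comb(10, 9) - comb(10, 10)) / comb(10, 8)   # ~0.893

def reject_prob(p):
    """Probability that the test rejects when the coin has heads-probability p."""
    pmf = lambda x: comb(10, x) * p**x * (1 - p)**(10 - x)
    return phi8 * pmf(8) + pmf(9) + pmf(10)

print(phi8)               # ~0.893
print(reject_prob(0.5))   # 0.05, the level
print(reject_prob(0.75))  # ~0.50, the power against the 3/4 coin
```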

Composite hypotheses

As we recall, a simple hypothesis \(\Omega_j\) is one where \(|\Omega_j|=1\). By contrast, a composite hypothesis has more than one candidate distribution in its family.

When \(\Omega\subset\mathbb R\), we distinguish between

  • Lower tail: \(\Omega_1 = \{\theta: \theta<\theta_0\}\)
  • Upper tail: \(\Omega_1 = \{\theta: \theta>\theta_0\}\)
  • Two tailed: \(\Omega_1 = \{\theta: \theta\neq\theta_0\}\)

Composite tests

Hypothesis space \(\Omega=\Omega_0\cup\Omega_1\). The Neyman-Pearson test generalizes to the likelihood ratio test.

Define \[ \lambda = \lambda(X) = \frac{\mathcal L(\Omega_0|X)}{\mathcal L(\Omega|X)} = \frac{\sup_{\theta\in\Omega_0}\mathcal L(\theta|X)} {\sup_{\theta\in\Omega}\mathcal L(\theta|X)} \\ \phi(X) = \begin{cases} 1 & \lambda(X) < k \\ p & \lambda(X) = k \\ 0 & \lambda(X) > k \end{cases} \]

Since the overall most likely model is at least as likely as the most likely null model, \(\lambda\in[0,1]\). High values indicate that the null hypothesis explains the data well.
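As an illustrative sketch, assume a single Binomial(10, \(\theta\)) observation (this setup is an assumption for the example) and test \(\Omega_0=\{1/2\}\) against \(\Omega=[0,1]\); the unrestricted supremum in the denominator is attained at the MLE \(\hat\theta=X/10\):

```python
from math import comb

def lik(theta, x, n=10):
    """Binomial(n, theta) likelihood of observing x successes."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def lam(x, n=10):
    """Generalized likelihood ratio for H0: theta = 1/2 against theta in [0, 1].
    The denominator's supremum is attained at the MLE theta_hat = x / n."""
    theta_hat = x / n
    return lik(0.5, x, n) / lik(theta_hat, x, n)

for x in range(11):
    print(x, round(lam(x), 3))   # lambda equals 1 at x = 5 and shrinks towards the tails
```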