3/23/2020

Neyman-Pearson’s Lemma

Most powerful?

Recall that the power of a test is the complement of the false acceptance error rate: \(\text{power}(\theta) = 1-\beta(\theta)\)

Just like the false acceptance rate, power is a function of the parameter: it measures the probability of (correctly) rejecting \(H_0\) when the parameter takes a particular “true value”.

So what is the most powerful test possible?

Neyman-Pearson’s lemma provides the answer in a particular constrained setting: the simple vs. simple hypothesis test.

Simple and Composite Hypotheses

Definition

A hypothesis is simple if it completely determines a single probability distribution. A hypothesis that is not simple is said to be composite.

Example

\(X\sim\mathcal{N}(\mu,\sigma^2)\), \(H_0: \mu=\mu_0\) is

  • composite if \(\sigma^2\) is unknown
  • simple if \(\sigma^2\) is known.

Neyman-Pearson’s Lemma

A simple vs. simple hypothesis test has a simple null \(H_0:\theta=\theta_0\) and a simple alternative \(H_A:\theta=\theta_A\) hypothesis.

Theorem (Neyman-Pearson)

For a given \(\alpha=\text{power}(\theta_0)\), the test that maximizes \(\text{power}(\theta_A)\) has a rejection region determined by: \[ \frac{\mathcal{L}(\theta_0)}{\mathcal{L}(\theta_A)} < k \] for some \(k\) chosen so that the test has level \(\alpha\).

Example

Exponential Distribution

\(X\sim\text{Exponential}(\theta)\), with density \(f(x) = \begin{cases} \theta\exp[-\theta x] & x > 0 \\ 0 & \text{otherwise}\end{cases}\)

\(H_0:\theta=1\) and \(H_A:\theta=\theta_1\) for some specific \(\theta_1>1\).

Neyman-Pearson’s lemma tells us to use the rejection region

\[ k > \frac{\mathcal{L}(1)}{\mathcal{L}(\theta_1)} = \frac{\exp[-x]}{\theta_1\exp[-\theta_1x]} \]

Solving for \(x\) we get

\[ \begin{aligned} -x-(-\theta_1x) &< \log(k\theta_1) \\ x &< \log(k\theta_1)/(\theta_1-1) = k' \end{aligned} \]

Rejection Region

To pick the \(k'\) in the rejection region \(x<k'\), we calculate the level: \[ \alpha = \mathbb{P}(x<k' | \theta=1) = \int_0^{k'}\exp[-x]dx = 1-\exp[-k'] \]

Solving for \(k'\) we get \(k' = -\log(1-\alpha)\).

This rejection region does not depend on \(\theta_1\)! It is an example of a more general phenomenon that generates uniformly most powerful (UMP) tests.
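As a quick numerical check (a sketch, not part of the derivation above), we can compute \(k'\) for a chosen \(\alpha\) and verify that the region \(x<k'\) has level \(\alpha\) under \(H_0\) and higher power at an arbitrary \(\theta_1>1\); the values of \(\alpha\) and \(\theta_1\) below are purely illustrative.

```python
# Illustrative sketch: level and power of the rejection region x < k'
# for a single observation, H_0: theta = 1 vs H_A: theta = theta_1 > 1.
import numpy as np
from scipy import stats

alpha = 0.05
k_prime = -np.log(1 - alpha)                       # k' = -log(1 - alpha)

level = stats.expon(scale=1.0).cdf(k_prime)        # P(X < k' | theta = 1)
theta_1 = 3.0
power = stats.expon(scale=1/theta_1).cdf(k_prime)  # P(X < k' | theta = theta_1)

print(level, power)                                # ~0.05, and a larger power
```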

One-sided Most Powerful Tests

Composite vs Composite

Let the setup for a one-dimensional parameter \(\theta\) be \(H_0:\theta\leq\theta_0\) vs \(H_A:\theta>\theta_0\). For nice enough underlying probability densities, the likelihood ratio construction carries over particularly nicely.

One way this can happen is if the likelihood ratio varies monotonically with some statistic, in which case any threshold for the likelihood ratio is passed exactly once.

Monotone Likelihood Ratios

Definition

A family \(p_\theta(x)\) of probability densities has monotone likelihood ratios if there is some statistic \(T(x)\) such that if \(\theta_1<\theta_2\) then \(p_{\theta_2}(x)/p_{\theta_1}(x)\) is a non-decreasing function of \(T\).

Example

Exponential families have monotone likelihood ratios if \(\eta(\theta)\) is increasing.

If \(\eta(\theta)\) is decreasing, we can replace \(\eta'=-\eta; T'=-T\) to rewrite the exponential family to one with monotone likelihood ratios.
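To see why exponential families with increasing \(\eta\) have monotone likelihood ratios, write the one-parameter exponential family in the standard form \(p_\theta(x)=h(x)\exp[\eta(\theta)T(x)-A(\theta)]\) (the form assumed here). Then for \(\theta_1<\theta_2\): \[ \frac{p_{\theta_2}(x)}{p_{\theta_1}(x)} = \exp\Big[\big(\eta(\theta_2)-\eta(\theta_1)\big)T(x) - \big(A(\theta_2)-A(\theta_1)\big)\Big] \] which is a non-decreasing function of \(T(x)\) whenever \(\eta(\theta_2)\geq\eta(\theta_1)\), i.e. whenever \(\eta\) is increasing.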

Uniformly Most Powerful Tests

Definition

A uniformly most powerful (UMP) test maximizes \(\text{power}(\theta_A)\), among all tests of the same level \(\alpha\), simultaneously for every \(\theta_A\) in the alternative hypothesis.

Criterion

If the likelihood ratio rejection region is independent of \(\theta_A\), then the test is UMP.

Theorem

For a family of densities with monotone likelihood ratios, a UMP test is generated by the rejection region \(\{T>c\}\) for some constant \(c\).

Example

\(\mathcal{N}(\mu,\sigma^2)\) with known variance is an exponential family with \(\eta(\mu) = \mu/\sigma^2\) and \(T(x) = \sum X_i\).

Since \(\eta(\mu)\) is increasing, this family has monotone likelihood ratios. Therefore a UMP test has rejection region \(\{\sum X_i > c\}\) for some \(c\).

Note that this agrees (up to adding and multiplying by constants) with the \(z\)-score test we met earlier. So the one-sided normal distribution test for the mean is uniformly most powerful!
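As an illustration (with made-up values of \(\mu_0\), \(\sigma\), \(n\), and \(\alpha\)), the cutoff \(c\) for \(\{\sum X_i > c\}\) can be computed from the null distribution of \(\sum X_i\), and it coincides with the usual one-sided \(z\)-test cutoff:

```python
# Illustrative sketch: UMP cutoff for {sum(X_i) > c} with known sigma,
# compared to the equivalent one-sided z-test cutoff.
import numpy as np
from scipy import stats

mu_0, sigma, n, alpha = 0.0, 2.0, 25, 0.05

# Under H_0, sum(X_i) ~ N(n*mu_0, n*sigma^2).
c = stats.norm(loc=n*mu_0, scale=sigma*np.sqrt(n)).ppf(1 - alpha)

# The same region written via the z-score: z > z_{1-alpha}.
z_crit = stats.norm.ppf(1 - alpha)
print(c, n*mu_0 + z_crit*sigma*np.sqrt(n))      # identical cutoffs
```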

Example (10.94)

Normal distribution

\(Y_1,\dots,Y_n\sim\mathcal{N}(\mu,\sigma^2)\) iid with known \(\mu\).

\(H_0:\sigma^2=\sigma_0^2\) vs. \(H_A:\sigma^2=\sigma_1^2\).

Neyman-Pearson gives us a most powerful test as \[ k > \frac{\mathcal{L}(\sigma_0^2)}{\mathcal{L}(\sigma_1^2)} = \frac {\left(\frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{\sigma_0^2}}\right)^n \exp\left[-\frac{1}{2\sigma_0^2}\sum(Y_i-\mu)^2\right]} {\left(\frac{1}{\sqrt{2\pi}}\frac{1}{\sqrt{\sigma_1^2}}\right)^n \exp\left[-\frac{1}{2\sigma_1^2}\sum(Y_i-\mu)^2\right]} = \\ = \left(\frac{\sigma_1^2}{\sigma_0^2}\right)^{n/2}\exp\left[\left(\frac{1}{2\sigma_1^2}-\frac{1}{2\sigma_0^2}\right)\sum(Y_i-\mu)^2\right] \]

Rejection Region

We can take the logarithm, and move any summands and factors involving known quantities over to the threshold constant to generate a new rejection region inequality

\[ \sum(Y_i-\mu)^2 < k' \]

Note: this assumes that we have \(\sigma_1^2<\sigma_0^2\), so that the factor \(\left(\frac{1}{2\sigma_1^2}-\frac{1}{2\sigma_0^2}\right)\) is positive. For the opposite order, the inequality flips, since we then divide by a negative quantity.

To find a concrete value, it helps finding a distribution that describes \(\sum(Y_i-\mu)^2\) (or some function of this statistic) - and squaring normal variables does bring a construction to mind!

Rejection Threshold

If \(\sum(Y_i-\mu)^2 < k'\) then \(\sum\left(\frac{Y_i-\mu}{\sigma_0}\right)^2 < k'/\sigma_0^2\). This latter expression is a sum of squared standard normal variables (under \(H_0\)) - \(\chi^2\)-distributed with \(n\) degrees of freedom.

Hence, we can calculate an explicit expression for \(\alpha\) as: \[ \alpha = \mathbb{P}\left(\sum(Y_i-\mu)^2 < k' | \sigma^2=\sigma_0^2\right) = \\ \mathbb{P}\left(\sum\left(\frac{Y_i-\mu}{\sigma_0}\right)^2 < k'/\sigma_0^2 | \sigma^2=\sigma_0^2\right) = F_{\chi^2(n)}(k'/\sigma_0^2) \]

A good cutoff for the rejection region is \(k' = \sigma_0^2\,F_{\chi^2(n)}^{-1}(\alpha)\).

This is independent of \(\sigma_1^2\), so it produces a test that is UMP against the one-sided alternative \(\sigma^2<\sigma_0^2\). The rejection region is identical to that of the lower tail 1-sample variance test.
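A small simulation sketch (with illustrative values for \(\mu\), \(\sigma_0\), \(n\), and \(\alpha\)) confirms that \(k' = \sigma_0^2\,F_{\chi^2(n)}^{-1}(\alpha)\) gives a test of approximately the nominal level:

```python
# Illustrative sketch: cutoff k' = sigma_0^2 * F_{chi2(n)}^{-1}(alpha) for the
# known-mean variance test, checked by simulating data under H_0.
import numpy as np
from scipy import stats

mu, sigma_0, n, alpha = 0.0, 1.5, 20, 0.05       # sigma_0 is the null std dev
k_prime = sigma_0**2 * stats.chi2(df=n).ppf(alpha)

rng = np.random.default_rng(0)
Y = rng.normal(mu, sigma_0, size=(100_000, n))   # samples drawn under H_0
stat = ((Y - mu)**2).sum(axis=1)
print((stat < k_prime).mean())                   # rejection rate, close to alpha
```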

Example (10.102)

Bernoulli Trials

\(Y_1,\dots,Y_n\sim\text{Bernoulli}(p)\)

\(H_0:p=p_0\) vs \(H_A:p>p_0\).

The joint likelihood is \[ \mathcal{L}(Y_1,\dots,Y_n|p) = p^{\sum Y_i}(1-p)^{n-\sum Y_i} = \left(\frac{p}{1-p}\right)^{\sum Y_i}(1-p)^n = \\ \exp\left[\log\frac{p}{1-p}\cdot\sum Y_i + n\log(1-p)\right] \]

Since \(\log\frac{p}{1-p}\) is increasing, this is an exponential family with monotone likelihood ratios - so the rejection region will be \(\{\sum Y_i > c\}\) for some \(c\).

Rejection Cutoff

\(\sum Y_i \sim \text{Binomial}(n,p)\), which we can use to determine \(c\) in the rejection region. We calculate the level \[ \alpha = \mathbb{P}\left(\sum Y_i > c | p=p_0\right) = 1-F_{\text{Binomial}(n,p_0)}(c) \\ c = F^{-1}_{\text{Binomial}(n,p_0)}(1-\alpha) \]

Since the exponential family has monotone likelihood ratios, this is a uniformly most powerful test.
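In practice \(\sum Y_i\) is discrete, so the level \(\alpha\) usually cannot be hit exactly; here is a sketch (with made-up \(n\), \(p_0\), \(\alpha\)) of finding the smallest cutoff whose level does not exceed \(\alpha\):

```python
# Illustrative sketch: binomial cutoff for the region {sum(Y_i) > c}.
# ppf gives the smallest c with P(sum(Y_i) > c | p_0) <= alpha.
from scipy import stats

n, p_0, alpha = 30, 0.4, 0.05
c = stats.binom(n, p_0).ppf(1 - alpha)
achieved_level = stats.binom(n, p_0).sf(c)    # P(sum(Y_i) > c | p = p_0)
print(c, achieved_level)
```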

Example (10.103)

Uniform Distribution

\(Y_1,\dots,Y_n\sim\text{Uniform}(0,\theta)\)

\(H_0:\theta=\theta_0\) vs \(H_A:\theta=\theta_A\) with \(\theta_A < \theta_0\).

The likelihood ratio works out to \[ \frac{\mathcal{L}(\theta_0)}{\mathcal{L}(\theta_A)} = \begin{cases} 0 & \min Y_i < 0 \\ \left(\frac{\theta_A}{\theta_0}\right)^n & 0 \leq \min Y_i \leq \max Y_i \leq \theta_A \\ \infty & \theta_A < \max Y_i \leq \theta_0 \\ 0 & \theta_0 < \max Y_i \end{cases} \]

For any data that does not violate the support of either distribution, the likelihood ratio depends only on \(\max Y_i\): lower values favor \(\theta_A\), so the rejection region takes the form \(\{\max Y_i < c\}\).

Rejection Threshold

\(\max Y_i\) has PDF \(f(y) = ny^{n-1}\theta^{-n}\). We need to pick a \(c\) such that \[ \alpha = \mathbb{P}(\max Y_i < c | \theta_0) = \int_0^cny^{n-1}\theta_0^{-n}dy = \frac{c^n}{\theta_0^n} \]

We can pick \(c = \theta_0\sqrt[n]{\alpha}\) for a most powerful test. Since this does not depend on \(\theta_A\), we get a uniformly most powerful lower tail test.
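A simulation sketch (with illustrative \(\theta_0\), \(n\), and \(\alpha\)) of the cutoff \(c=\theta_0\sqrt[n]{\alpha}\):

```python
# Illustrative sketch: cutoff c = theta_0 * alpha**(1/n) for {max(Y_i) < c},
# checked by simulating samples under H_0.
import numpy as np

theta_0, n, alpha = 5.0, 10, 0.05
c = theta_0 * alpha**(1/n)

rng = np.random.default_rng(1)
Y = rng.uniform(0, theta_0, size=(100_000, n))
print((Y.max(axis=1) < c).mean())             # rejection rate, close to alpha
```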

Generic Likelihood Ratio Test

Hypothesis Spaces

Recall that we defined a hypothesis as some subset of a parameter space. We can talk about subsets \(\Omega_0\) and \(\Omega_A\) such that \(H_0:\theta\in\Omega_0\) and \(H_A:\theta\in\Omega_A\).

This type of formulation is very generic - no requirement on scalar or vector valued parameters, no requirement on the subsets being easy or complex to express, just a pair of sets of possible parameter values.

Write \(\Omega = \Omega_0\cup\Omega_A\) for the total parameter space induced by \(\Omega_0\) and \(\Omega_A\).

Following the philosophy of the Neyman-Pearson Lemma, it would make sense to use a ratio of maximum likelihoods as a generic test statistic.

Test Statistic and Rejection Region

Using the maximum likelihoods, we could set as a test statistic \[ \lambda = \frac{\mathcal{L}(\Omega_0)}{\mathcal{L}(\Omega)} = \frac{\max_{\theta_0\in\Omega_0}\mathcal{L}(\theta_0)}{\max_{\theta\in\Omega}\mathcal{L}(\theta)} \]

Here, we measure how much better we can do by maximizing over all of \(\Omega\) compared to maximizing over \(\Omega_0\) alone. If the unrestricted maximum is attained inside \(\Omega_0\), the ratio equals 1; small values of \(\lambda\) indicate evidence against \(H_0\).

This would produce the rejection region \(\{\lambda < k\}\).

It is difficult to pick a good \(k\) without knowing the sampling distribution for \(\lambda\). For large sample sizes, we can approximate this.

Distribution of the Likelihood Ratio

Theorem

Let \(r_0\) be the number of free parameters that get fixed to specify \(\Omega_0\) and \(r\) be the number of free parameters that get fixed to specify \(\Omega\).

Then for large \(n\): \(-2\ln\lambda\sim\chi^2(r_0-r)\)

This somewhat handwavy formulation hides technicalities about Degrees of Freedom and differentiability of the likelihood ratio with respect to specific parameters.
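As a concrete sketch (simulated data, with \(\mu_0\) and a known \(\sigma\) chosen for illustration), consider \(H_0:\mu=\mu_0\) against an unrestricted mean in a normal model: here \(r_0=1\) and \(r=0\), so \(-2\ln\lambda\) is compared against \(\chi^2(1)\), and in this particular model the identity \(-2\ln\lambda = n(\overline X-\mu_0)^2/\sigma^2\) is exact.

```python
# Illustrative sketch: generic likelihood ratio test for H_0: mu = mu_0 vs an
# unrestricted mean, with sigma known; -2*ln(lambda) is compared to chi2(1).
import numpy as np
from scipy import stats

mu_0, sigma = 0.0, 1.0
rng = np.random.default_rng(4)
x = rng.normal(0.4, sigma, size=30)         # simulated data; true mean != mu_0

def loglik(mu):
    return stats.norm(mu, sigma).logpdf(x).sum()

xbar, n = x.mean(), len(x)
stat = -2 * (loglik(mu_0) - loglik(xbar))   # -2 ln(lambda); MLE over Omega is xbar
p_value = stats.chi2(df=1).sf(stat)
print(stat, n*(xbar - mu_0)**2 / sigma**2, p_value)   # first two values agree
```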

Example (10.112-114)

Question 10.112

Suppose we draw independent random samples \[ X_1,\dots,X_{n_X} \sim \mathcal{N}(\mu_X,\sigma^2) \qquad Y_1,\dots,Y_{n_Y} \sim \mathcal{N}(\mu_Y,\sigma^2) \] with unknown means and unknown (common!) variance.

For \(H_0: \mu_X = \mu_Y\) vs. \(H_A: \mu_X > \mu_Y\), show that the likelihood ratio test is equivalent to the upper tail two-sample t-test.

Maximum Likelihoods

Write \(n=n_X+n_Y\).

First, assume \(\mu_X\geq\mu_Y\) for \(\Omega\). The MLEs for \(\mu_X\), \(\mu_Y\), and \(\sigma^2\) work out to:

\[ \hat\mu_X = \overline{X} \quad \hat\mu_Y = \overline{Y} \quad \hat\sigma^2 = \frac{1}{n}\left(\sum(X_i-\overline X)^2 + \sum(Y_i-\overline Y)^2\right) \]

Next, assume \(\mu_X=\mu_Y\) for \(\Omega_0\). The MLEs work out to:

\[ \hat\mu = \frac{\sum X_i + \sum Y_i}{n} \qquad \hat\sigma_0^2 = \frac{1}{n}\left(\sum(X_i-\hat\mu)^2 + \sum(Y_i-\hat\mu)^2\right) \]

Likelihood Ratio

\[\tiny\begin{aligned} \frac{\mathcal{L}(\Omega_0)}{\mathcal{L}(\Omega)} &= \frac{\mathcal{L}(\hat\mu, \hat\sigma_0^2)}{\mathcal{L}(\hat\mu_X,\hat\mu_Y,\hat\sigma^2)} \\ &= \frac {\exp\left[\color{blue}{-\frac n2\log(2\pi)} -\frac n2\log\hat\sigma_0^2 -\frac{1}{2\hat\sigma_0^2}\left(\sum(X_i-\hat\mu)^2+\sum(Y_i-\hat\mu)^2\right)\right]} {\exp\left[\color{blue}{-\frac n2\log(2\pi)} -\frac n2\log\hat\sigma^2 -\frac{1}{2\hat\sigma^2}\left(\sum(X_i-\overline X)^2+\sum(Y_i-\overline Y)^2\right)\right]} \\ &= \frac {\exp\left[-\frac n2\log\hat\sigma_0^2 -\frac{1}{2\color{purple}{\hat\sigma_0^2}}\cdot\color{purple}{n\hat\sigma_0^2}\right]} {\exp\left[-\frac n2\log\hat\sigma^2 -\frac{1}{2\color{green}{\hat\sigma^2}}\cdot\color{green}{n\hat\sigma^2}\right]} \\ &= \frac{\exp\left[-\frac n2\log\hat\sigma_0^2-\frac{n}{2}\right]}{\exp\left[-\frac n2\log\hat\sigma^2-\frac{n}{2}\right]} = \left(\frac{\hat\sigma^2}{\hat\sigma_0^2}\right)^{n/2} \end{aligned}\]

By absorbing the constant, and taking an appropriate power, it is enough to reject if \(\hat\sigma_0^2/\hat\sigma^2\geq k\).

Likelihood Ratio

We can write \[ \sum(X_i-\hat\mu)^2 = \sum(X_i-\overline X)^2 + n_X(\overline X-\hat\mu)^2 \\ \sum(Y_i-\hat\mu)^2 = \sum(Y_i-\overline Y)^2 + n_Y(\overline Y-\hat\mu)^2 \]

Since \(\hat\mu=\frac{n_X}{n}\overline X + \frac{n_Y}{n}\overline Y\), we get \[ \hat\sigma_0^2 = \frac1n\left(\sum(X_i-\overline X)^2 + \sum(Y_i-\overline Y)^2 + \frac{n_Xn_Y}{n}(\overline X-\overline Y)^2\right) \]

Rejection Region

Inserting this form of the estimate, we get rejection when \[ k \leq \frac{1/n}{1/n}\frac{\sum(X_i-\overline X)^2+\sum(Y_i-\overline Y)^2 + \frac{n_Xn_Y}{n}(\overline X-\overline Y)^2}{\sum(X_i-\overline X)^2+\sum(Y_i-\overline Y)^2} \\ = 1 + \frac{\frac{n_Xn_Y}{n}(\overline X-\overline Y)^2}{\sum(X_i-\overline X)^2+\sum(Y_i-\overline Y)^2} \]

The \(1+\) can be absorbed into the constant. The remainder is - up to additional constants that do not depend on the data - the square of the two-sample T-value with pooled variance estimate.

With the assumption \(\mu_X \geq \mu_Y\), only the positive square root is relevant - we reject for large values of \(\overline X - \overline Y\) - and we recover the upper tail two-sample T-test.

What about \(\mu_X \neq \mu_Y\)? (10.113)

Everything up until now carries over unchanged to testing \(H_A:\mu_X\neq\mu_Y\). The one place where a difference occurs at all is in taking the square root of \[ \frac{(\overline X-\overline Y)^2}{\sum(X_i-\overline X)^2+\sum(Y_i-\overline Y)^2} \]

If \(\mu_X < \mu_Y\) is also possible, the square root and the corresponding rejection region become \[ \frac{|\overline X-\overline Y|}{\sqrt{\sum(X_i-\overline X)^2+\sum(Y_i-\overline Y)^2}} \geq k \] for some \(k\). We recognize this as the two-tailed T-test (up to multiplication by constants that do not depend on the data).
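A numerical sketch (simulated data with made-up group sizes) of the algebra above: the likelihood ratio quantity \(\hat\sigma_0^2/\hat\sigma^2\) equals \(1 + t^2/(n-2)\), where \(t\) is the pooled two-sample T statistic, so thresholding one is equivalent to thresholding the other.

```python
# Illustrative sketch: sigma0_hat^2 / sigma_hat^2 = 1 + t^2/(n-2), where t is
# the pooled-variance two-sample t statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=12)
Y = rng.normal(0.3, 1.0, size=15)
n = len(X) + len(Y)

ssw = ((X - X.mean())**2).sum() + ((Y - Y.mean())**2).sum()
mu_hat = (X.sum() + Y.sum()) / n
sst = ((X - mu_hat)**2).sum() + ((Y - mu_hat)**2).sum()
ratio = sst / ssw                          # = sigma0_hat^2 / sigma_hat^2

t = stats.ttest_ind(X, Y).statistic        # pooled variance by default
print(ratio, 1 + t**2 / (n - 2))           # numerically equal
```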

What about three samples? (10.114)

Suppose we draw independent random samples \[ X_{1,1},\dots,X_{1,n_1} \sim \mathcal{N}(\mu_1,\sigma^2) \\ X_{2,1},\dots,X_{2,n_2} \sim \mathcal{N}(\mu_2,\sigma^2) \\ X_{3,1},\dots,X_{3,n_3} \sim \mathcal{N}(\mu_3,\sigma^2) \] with unknown means and unknown (common!) variance.

\(H_0: \mu_1 = \mu_2 = \mu_3\). \(H_A:\) at least one inequality.

Maximum Likelihood Estimation

Write \(n=n_1+n_2+n_3\). Under \(\Omega\) (no restrictions on the means), the MLEs are: \[ \hat\mu_1=\overline X_{1,*} \qquad \hat\mu_2=\overline X_{2,*} \qquad \hat\mu_3=\overline X_{3,*} \\ \hat\sigma^2 = \frac{1}{n}\left(\sum_i\sum_j(X_{i,j}-\overline X_{i,*})^2\right) \]

Under \(H_0\), they are: \[ \hat\mu = \frac{1}{n}\sum_i\sum_jX_{i,j} = \frac{n_1\overline X_{1,*}+n_2\overline X_{2,*} + n_3\overline X_{3,*}}{n} \\ \hat\sigma_0^2 = \frac{1}{n}\left(\sum_i\sum_j(X_{i,j}-\hat\mu)^2\right) \]

Test Statistic

By the same process of cancellation as before, the likelihood ratio reduces to \[ \lambda = \left(\frac{\hat\sigma^2}{\hat\sigma_0^2}\right)^{n/2} \quad\text{producing a rejection region:} \quad \frac{\hat\sigma_0^2}{\hat\sigma^2} \geq k \]

We will return to this setup in greater detail when studying ANOVA; for now notice that, up to constants, \(\hat\sigma_0^2/\hat\sigma^2 - 1\) is a ratio of two independent sums of squares - which, suitably rescaled, is under \(H_0\) distributed as the \(F\) distribution with 2 numerator and \(n-3\) denominator degrees of freedom.
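As a sketch (simulated data, arbitrary group sizes and means), we can check numerically how \(\hat\sigma_0^2/\hat\sigma^2\) relates to the usual one-way ANOVA \(F\) statistic: \(\hat\sigma_0^2/\hat\sigma^2 = 1 + \frac{2}{n-3}F\).

```python
# Illustrative sketch: sigma0_hat^2 / sigma_hat^2 = 1 + 2F/(n-3) for three
# groups, where F is the one-way ANOVA F statistic.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
groups = [rng.normal(m, 1.0, size=s) for m, s in [(0.0, 10), (0.2, 12), (0.5, 9)]]
n = sum(len(g) for g in groups)

ssw = sum(((g - g.mean())**2).sum() for g in groups)
grand_mean = np.concatenate(groups).mean()
sst = sum(((g - grand_mean)**2).sum() for g in groups)
ratio = sst / ssw                              # = sigma0_hat^2 / sigma_hat^2

F = stats.f_oneway(*groups).statistic          # one-way ANOVA F statistic
print(ratio, 1 + 2*F/(n - 3))                  # numerically equal
```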