Background reading is section 6.4.
Repeating:
Good experimental designs limit the impact of variability and reduce sample-size requirements.
How to reduce sample-size requirements?
Suppose a level \(\alpha\) significance test is performed. We compute a \(p\)-value:
\(p\)-value: the probability that the test statistic will be as or more extreme than the observed value, assuming \(H_0\) is true.
If we “reject” \(H_0\) when the \(p\)-value is less than or equal to \(\alpha\), and “accept” \(H_0\) when the \(p\)-value is greater than \(\alpha\), we have 4 situations: we may reject a true \(H_0\) (a type-I error), reject a false \(H_0\) (correct), accept a true \(H_0\) (correct), or accept a false \(H_0\) (a type-II error).
QUESTION:
In the trial scenario, is letting a guilty man go free a type-I or type-II error? What about finding an innocent man guilty?
By specifying a value for \(\alpha\) we specify how often we will make a type-I error. How does one control type-II errors?
The book has this picture:
[Figure “power”: the sampling distribution under \(H_0\) (top, \(\mu=0\)) with the rejection region shaded, and the distribution when \(\mu=1\) (bottom)]
The top curve assumes \(\mu=0\) and \(\sigma=2/\sqrt{25}\) and shades the region corresponding to a one-sided \(z\) test with \(\alpha=0.05\). Any observed value more than \(0.658\), the critical value, would lead to “rejection” of the null hypothesis; any observed value less than \(0.658\) would lead to “accepting” the null hypothesis.
If \(H_0\) is true, then the observed value has the distribution shown, and so we would reject a true statement with probability \(0.05\). That is, \(\alpha\) is the probability of a type-I error.
Now suppose \(H_0\) is false; moreover, assume \(\mu=1\). The bottom curve is a normal curve with mean \(1\) and standard deviation \(2/\sqrt{25}\).
If we assume \(H_0\) is true (though it is really false, as \(\mu=1\)), we reject if the observed value is more than \(0.658\); for the distribution shown (with \(\mu=1\)) this happens with probability \(0.80\). So we would falsely “accept” when the observed value is less than \(0.658\), and this would occur with probability \(0.20\).
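A rough way to re-draw this picture in R (a sketch, not the book's exact code; the layout is assumed):
# two normal curves with sd = 2/sqrt(25) = 0.4
curve(dnorm(x, mean=0, sd=0.4), from=-1.5, to=2.5, ylab="density")   # top curve (H_0)
curve(dnorm(x, mean=1, sd=0.4), add=TRUE, lty=2)                     # bottom curve (mu = 1)
abline(v=0.658)   # critical value; the area to its right under each curve
                  # gives alpha (solid curve) and the power (dashed curve)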
We call \(0.80\) the power of this test. This value depends on the significance level \(\alpha\), the effect size, the standard deviation \(\sigma\), and the sample size \(n\).
To compute the power we first find the critical value under \(H_0\), then compute the probability, under the true distribution, that the observed value lands in the rejection region. For this scenario, the R code would be:
mu = 0                    # mean under H_0
sigma = 2                 # population standard deviation
n = 25                    # sample size
SD = sigma/sqrt(n)        # standard error of the sample mean
alpha = 0.05
zstar = qnorm(1 - alpha, mean=mu, sd=SD)    # critical value for the right-tailed test
zstar
## [1] 0.6579415
eff_size = 1              # the true mean under the alternative
power = 1 - pnorm(zstar, mean=mu + eff_size, sd=SD)    # P(observed > zstar | mu = 1)
power
## [1] 0.8037649
QUESTION:
Again, consider a right-tailed test. The effect size is now \(2\) and the standard deviation is \(3\). If \(n=30\), compute the power when \(\alpha=0.05\).
QUESTION:
What would need to change if \(H_a\) were a left-tailed test? Is the power more or less than in the previous answer? Explain how that could have been anticipated.
R includes a function that can compute the power for us, though the answer differs slightly, as the function uses \(t^*\) rather than \(z^*\). The above example is done with:
power.t.test(n = 25, delta=1, sd=2, sig.level=0.05, type="one.sample", alternative="one.sided")
##
## One-sample t test power calculation
##
## n = 25
## delta = 1
## sd = 2
## sig.level = 0.05
## power = 0.7833861
## alternative = one.sided
The difference is due to using the \(t_{24}\) table and not the \(z\) table. Notice how many arguments must be specified: quite a lot.
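The value \(0.7834\) can be reproduced by hand by swapping the normal distribution for the noncentral \(t\) distribution, which is essentially what power.t.test computes:
alpha = 0.05; n = 25; delta = 1; s = 2
tstar = qt(1 - alpha, df = n - 1)       # t critical value replaces zstar
ncp = sqrt(n) * delta / s               # noncentrality parameter
1 - pt(tstar, df = n - 1, ncp = ncp)    # power under the alternative
## [1] 0.7833861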
QUESTION:
Again, consider a right-tailed test with an effect size of \(2\) and a standard deviation of \(3\). If \(n=30\), compute the power when \(\alpha=0.05\) using power.t.test. What power do you find? Compare to your calculation by hand.
Usually we use a power analysis to reduce sample-size requirements. That is, we specify the power, an effect size, and a standard deviation, and then compute the sample size needed to achieve this power, so that we do not collect a larger sample than necessary.
Before computing \(n\), note that in the original picture above, larger \(n\) values make both normal curves “skinnier”, since the standard error \(\sigma/\sqrt{n}\) shrinks. That means the curves overlap less: the critical value moves toward \(0\) while the bottom curve concentrates around \(1\), so more of it falls in the rejection region. From this observation, we know bigger \(n\) means more power to detect an effect size. But how much bigger?
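We can check this numerically before doing any algebra. Reusing the setup of the first example, a quick sketch of power as \(n\) grows:
sapply(c(10, 25, 50, 100), function(n) {
  SD = 2/sqrt(n)                         # standard error shrinks with n
  zstar = qnorm(0.95, mean=0, sd=SD)     # critical value moves toward 0
  1 - pnorm(zstar, mean=1, sd=SD)        # power: area of the mu=1 curve beyond zstar
})
## roughly 0.47 0.80 0.97 1.00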
The mathematics is tractable (it involves solving two equations in two unknowns), but we defer to the power.t.test function to do the work for us.
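For the \(z\)-test version the two equations can be solved by hand. Writing \(c\) for the critical value, \(\Delta\) for the effect size, and \(z_{1-\alpha}\), \(z_{1-\beta}\) for the normal quantiles, requiring a type-I error rate of \(\alpha\) and a power of \(1-\beta\) gives
\[ c = \mu_0 + z_{1-\alpha}\frac{\sigma}{\sqrt{n}}, \qquad c = \mu_0 + \Delta - z_{1-\beta}\frac{\sigma}{\sqrt{n}}, \]
and solving the pair for \(n\) yields
\[ n = \left( \frac{(z_{1-\alpha} + z_{1-\beta})\,\sigma}{\Delta} \right)^2. \]
With \(\Delta=1\), \(\sigma=2\), \(\alpha=0.05\), and power \(0.80\), this gives \(n \approx ((1.645 + 0.841)\cdot 2)^2 \approx 24.7\), slightly smaller than the \(t\)-based answer computed below.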
Using this function is pretty easy: rather than specifying n, we specify power and n is computed.
Let's try it on the first case:
power.t.test(delta=1, sd=2, sig.level=0.05, power=0.80, type="one.sample", alternative="one.sided")
##
## One-sample t test power calculation
##
## n = 26.13751
## delta = 1
## sd = 2
## sig.level = 0.05
## power = 0.8
## alternative = one.sided
This gives an n of 26.14, so we would take a sample of size 27 to ensure adequate power.
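If we want R to do that rounding, the object returned by power.t.test stores the computed sample size in its n component:
res = power.t.test(delta=1, sd=2, sig.level=0.05, power=0.80, type="one.sample", alternative="one.sided")
ceiling(res$n)    # round up to a whole-number sample size
## [1] 27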
Even though \(\sigma\) is not known, the above computation requires an estimate of it; here \(2\) is used.
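Since the answer depends on this guess, it can be worth checking how sensitive \(n\) is to the value used for \(\sigma\); for instance, redoing the computation with a larger guess:
power.t.test(delta=1, sd=2.5, sig.level=0.05, power=0.80, type="one.sample", alternative="one.sided")$n
## n grows roughly like sigma^2, so about (2.5/2)^2 = 1.56 times as large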
QUESTION: What size \(n\) is needed to ensure an effect size of \(3\) is detected when \(\sigma=0.5\), with \(\alpha=0.10\) and a power of \(1-\beta=0.80\)? Use power.t.test.
QUESTION: A new drug regimen has been developed to (hopefully) reduce weight in overweight teenagers. Weight reduction over the one-year course of treatment is measured by the change \(X\) in body mass index (BMI). Formally, we will test \(H_0: \mu = 0\) vs. \(H_a: \mu \neq 0\). Previous work shows that \(\sigma_X = 2\). A change in BMI of \(1.5\) is considered important to detect (if the true effect size is \(1.5\) or higher, we need the study to have a high probability of rejecting \(H_0\)). How many patients should be enrolled in the study if \(\alpha=0.05\) and the power is \(0.80\)?