Lecture 17

MVJ

12 April, 2018

Basic structure of statistical testing

Statistical hypothesis testing aims to provide quantifiable levels of certainty in claims about a data source. The way we create these claims is through the sampling distribution for some useful statistic.

  1. Propose a hypothetical true population distribution (the null hypothesis)
  2. Calculate a sample statistic \(S\) called the test statistic from the data
  3. Calculate the probability, in the sampling distribution and assuming the population distribution from step 1, of seeing a sample statistic at least as extreme as \(S\)

This probability is the p-value.

If this probability is low, we have reason to believe that the null hypothesis is not true. We are able to reject the null hypothesis in favor of the alternative.

Basic structure of statistical estimation

Statistical estimation aims to provide quantifiable levels of certainty in claims about a data source. The way we create these claims is through the sampling distribution for some useful statistic.

  1. Propose a hypothetical true population distribution (the null hypothesis)
  2. Calculate a sample statistic \(S\) called the test statistic from the data
  3. Calculate the range of possible parameters that would have made the test statistic \(S\) likely.

This range is the confidence interval.

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

Should we use self-reported scores from volunteer students?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

Should we use scores from the SATs that were taken?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We use a simple random sample (SRS) to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).

How much can you say about the population mean SAT-M score?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We use SRS to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).

From a large body of past measurements, we take the standard deviation of SAT-M scores in California to be \(\sigma=100\).

How much can you say about the population mean SAT-M score?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We use SRS to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).

From a large body of past measurements, we take the standard deviation of SAT-M scores in California to be \(\sigma=100\).

The central limit theorem tells us that the sample mean is approximately distributed as \[ \overline x \sim \mathcal N(\mu, \sigma/\sqrt{500}) \qquad\text{here,}\qquad \sigma/\sqrt{500} \approx 4.5 \]

The 68-95-99.7 rule says that there is a 95% chance of \(\overline x\) being within \(\pm 2\sigma/\sqrt{500} \approx \pm 9\) of \(\mu\). This means that \(\mu\) has a 95% chance of being within \(\pm 9\) of \(\overline x\): we can be 95% confident that \[ \overline x-9 = 486 < \mu < 504 = \overline x + 9 \]

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We estimated \(\overline x = 495\), \(\sigma = 100\).

The central limit theorem tells us that the sample mean is approximately distributed as \[ \overline x \sim \mathcal N(\mu, \sigma/\sqrt{500}) \qquad\text{here,}\qquad \sigma/\sqrt{500} \approx 4.5 \]

The 68-95-99.7 rule says that there is a 95% chance of \(\overline x\) being within \(\pm 2\sigma/\sqrt{500} \approx \pm 9\) of \(\mu\). This means that \(\mu\) has a 95% chance of being within \(\pm 9\) of \(\overline x\): we can be 95% confident that \[ \overline x-9 = 486 < \mu < 504 = \overline x + 9 \]

In reality, it gets more complicated: instead of a large body of past measurements, we can use the sample standard deviation. This produces the T-test.
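A minimal R sketch of this calculation, using the exact normal critical value 1.96 rather than the rounded factor 2 (all numbers are from the example above):

```r
xbar  <- 495                 # sample mean SAT-M score
sigma <- 100                 # standard deviation, assumed known
n     <- 500                 # sample size

se <- sigma / sqrt(n)        # standard error of the mean, about 4.47
z  <- qnorm(0.975)           # upper 2.5% point of the standard normal, about 1.96
me <- z * se                 # margin of error, about 8.8

c(lower = xbar - me, upper = xbar + me)   # roughly (486, 504)
```

With the rounded factor 2 this is exactly the interval \(486 < \mu < 504\) above.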

Confidence intervals: confidence levels

The confidence level measures how often we tolerate being wrong.

Confidence intervals: definition

A confidence interval for a parameter at confidence level \(\alpha\) is an interval computed from sample data by a method that has probability \(\alpha\) of producing an interval containing the true value of the parameter.

In a magic world, where we know the standard deviation of our population, the \(\alpha\) confidence interval for the mean can be calculated using the standard normal quantile function qnorm:

If \(z = z_{(1-\alpha)/2}\), computed in R as qnorm((1-\alpha)/2, lower.tail = FALSE), then

\[ \overline x - z\sigma/\sqrt{n} < \mu < \overline x + z\sigma/\sqrt{n} \]

The value \(z\sigma/\sqrt{n}\) is the margin of error.

Confidence intervals: why \(\alpha/2\)?
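
The interval is two-sided, so the probability \(1-\alpha\) of missing the true parameter is split evenly between the two tails; each tail therefore gets \((1-\alpha)/2\). A quick check in R (using \(\alpha = 0.95\)):

```r
alpha <- 0.95                                     # confidence level
z <- qnorm((1 - alpha) / 2, lower.tail = FALSE)   # upper critical value, about 1.96

# The central area between -z and z recovers the confidence level:
pnorm(z) - pnorm(-z)                              # 0.95
```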

Confidence intervals: Margin of error factors

Consider the margin of error

\[ m = z_{(1-\alpha)/2} \sigma / \sqrt{n} \]

What happens if…

Confidence intervals: Choosing sample size

It follows from the margin of error formula that if we require a margin of error less than some threshold \(m_0\), then

\[ m_0 > z_{(1-\alpha)/2} \sigma / \sqrt{n} \]

So to achieve that margin of error, we will need

\[ \sqrt{n} > \frac{z_{(1-\alpha)/2}\sigma}{m_0} \qquad\text{so}\qquad n > \left(\frac{z_{(1-\alpha)/2}\sigma}{m_0}\right)^2 \]

For a 95% confidence interval (approximating \(z_{(1-\alpha)/2}\approx 2\)) on our SAT-M data, with standard deviation 100, to get a margin of error of less than \(\pm 10\) we would need…

Confidence intervals: Choosing sample size

It follows from the margin of error formula that if we require a margin of error less than some threshold \(m_0\), then

\[ m_0 > z_{(1-\alpha)/2} \sigma / \sqrt{n} \]

So to achieve that margin of error, we will need

\[ \sqrt{n} > \frac{z_{(1-\alpha)/2}\sigma}{m_0} \qquad\text{so}\qquad n > \left(\frac{z_{(1-\alpha)/2}\sigma}{m_0}\right)^2 \]

For a 95% confidence interval (approximating \(z_{(1-\alpha)/2}\approx 2\)) on our SAT-M data, with standard deviation 100, to get a margin of error of less than \(\pm 10\) we would need…

\(n > (2\cdot100/10)^2 = 20^2 = 400\) samples.
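The same computation in R, using the exact critical value rather than the rounded factor 2 (the values \(m_0 = 10\) and \(\sigma = 100\) are from the example):

```r
sigma <- 100                    # known standard deviation of SAT-M scores
m0    <- 10                     # largest acceptable margin of error
z     <- qnorm(0.975)           # exact 97.5% point, about 1.96

(z * sigma / m0)^2              # about 384: round up, so 385 students suffice
```

With the rounded critical value 2, this bound becomes the 400 samples stated above.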

Hypotheses and testing

The null hypothesis represents the status quo, the claim that the test aims to disprove.

The alternative hypothesis is the negation of the null hypothesis.

To disprove the null hypothesis, we assume it to be true and calculate the probability of the data if it were true. If the data is unlikely under the null hypothesis, we reject the null hypothesis in favor of the alternative.

Null hypotheses

A null hypothesis is usually formulated as a specific value of a parameter (or a difference of parameters, or a quotient of parameters). The alternative hypothesis is usually formulated as an inequality.

We might not care about all possible deviations from the null hypothesis.

| Alternative hypothesis | Inequality   | What rejecting the null proves |
|------------------------|--------------|--------------------------------|
| Two-tailed             | not equal to | The true parameter is significantly different from the null hypothesis. |
| Upper tail             | greater than | The true parameter is significantly larger than the null hypothesis. |
| Lower tail             | less than    | The true parameter is significantly smaller than the null hypothesis. |

Null hypotheses

Do these questions require upper / lower / two-tailed alternative hypotheses?

Alternative hypotheses

Do these questions require upper / lower / two-tailed alternative hypotheses?

One-tailed vs Two-tailed

Null and alternative hypotheses

Pair up. Write out null and alternative hypotheses for these questions:

The structure of a hypothesis test.

  1. Formulate null and alternative hypotheses.
  2. Set a significance level \(\alpha\).
  3. Calculate a test statistic \(S\).
  4. Calculate the \(p\)-value for \(S\) (ie the probability of seeing \(S\) under the null hypothesis)
  5. If \(p<\alpha\), reject the null hypothesis in favor of the alternative.

The result is reported as
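As a sketch, here are the five steps carried out in R for the SAT-M data, against an illustrative null value \(\mu_0 = 500\), using the known-\(\sigma\) z-test from earlier in the lecture:

```r
# 1. Hypotheses: H0: mu = 500  vs  Ha: mu != 500   (mu0 = 500 is illustrative)
mu0 <- 500

# 2. Significance level
alpha <- 0.05

# 3. Test statistic: the z-score of the sample mean under H0
xbar <- 495; sigma <- 100; n <- 500
S <- (xbar - mu0) / (sigma / sqrt(n))       # about -1.12

# 4. p-value: probability of a statistic at least this extreme (two-sided)
p <- 2 * pnorm(-abs(S))                     # about 0.26

# 5. Decision
p < alpha                                   # FALSE: we do not reject H0
```

With a vector of raw scores instead of a known \(\sigma\), the same steps are what t.test(scores, mu = 500) performs.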

p-values, critical values, statistical significance

The \(p\)-value is…

A significance threshold expressed directly on the scale of \(S\) is called a critical value.

Statistical significance is not the only factor: low \(p\)-values can always be achieved with large enough data sets. Statistical significance has to be balanced against what the deviation means.
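A small R illustration of this point, with hypothetical numbers: the observed mean differs from the null value by a single SAT-M point, a practically negligible deviation, yet the p-value can be driven arbitrarily low by increasing the sample size:

```r
mu0   <- 500     # null value
sigma <- 100
xbar  <- 501     # observed mean just one point away from the null

for (n in c(100, 10000, 1000000)) {
  z <- (xbar - mu0) / (sigma / sqrt(n))
  cat("n =", n, " p =", signif(2 * pnorm(-abs(z)), 2), "\n")
}
# p goes from 0.92 at n = 100 down to essentially 0 at n = 1,000,000
```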

Effect sizes

One way to measure the practical importance of a deviation is through effect sizes. Effect sizes do not scale with the sample size: they measure the size of the deviation from the null hypothesis, rather than how unlikely that deviation is.

We will use Cohen’s \(d\) quite a lot: it is the deviation of the sample mean from the null value, measured in population standard deviations, \(d = (\overline x - \mu_0)/\sigma\) (in practice \(\sigma\) is replaced by the sample standard deviation \(s\)).
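
A sketch in R for the SAT-M example, against the illustrative null value \(\mu_0 = 500\):

```r
xbar <- 495; sigma <- 100
mu0  <- 500                    # illustrative null value

d <- (xbar - mu0) / sigma      # Cohen's d = -0.05: a very small effect,
d                              # and it does not change as n grows
```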

Error types

Quantifying certainty in statistics is usually done by trying to avoid being wrong. We distinguish different types of error.

|                 | Alternative is true | Alternative is false |
|-----------------|---------------------|----------------------|
| Significant     | True rejection      | False rejection      |
| Not significant | False acceptance    | True acceptance      |

Type I error / false rejection / false positive: rejects the null when we should not

Type II error / false acceptance / false negative: fails to reject the null when we should

Probability of false rejection: significance level

Probability of false acceptance: \(1 - {}\)power (equivalently, power is the probability of a true rejection)

Significance level and power

The significance level is a single fixed value for a single test, chosen before the test is performed.

Common significance levels are 1%, 5%, 10%. (primarily because Fisher included 5% tables in his 1925 textbook)

Power takes a different value for each effect size: usually we talk about power curves, which show how well a test performs across a range of possible outcomes.

A common power threshold is 80% at the effect sizes of interest.
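
For example, base R’s power.t.test can report how large a sample is needed to reach 80% power for a given effect size (a sketch; the 10-point deviation with standard deviation 100, i.e. \(d = 0.1\), is an illustrative choice, and the answer comes out just under 800 students):

```r
# How many students do we need for 80% power to detect a 10-point deviation
# (d = 0.1) with a one-sample t-test at the 5% level? n is left unspecified,
# so power.t.test solves for it.
power.t.test(delta = 10, sd = 100, sig.level = 0.05, power = 0.80,
             type = "one.sample", alternative = "two.sided")
```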

The danger of fixating on significance levels

“… surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?” (Rosnow & Rosenthal)

“for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of evidence and his ideas” (Fisher)

“Need we – should we – stick to p=0.05 if what we seek is a relatively pure list of appearances? No matter where our cutoff comes, we will not be sure of all appearances. Might it not be better to adjust the critical p moderately – say to 0.03 or 0.07 – whenever such a less standard value seems to offer a greater fraction of presumably real appearances among those significant at the critical p? We would then use different modifications for different sets of data. No one, to my knowledge, has set himself the twin problems of how to do this and how well doing this in a specific way performs.” (Tukey)

The danger of ignoring field standards

Some quotes from published papers

The danger of ignoring field standards

Power

Power influences

Power increases if…

Power curves

The functions power.t.test and power.prop.test can be used to calculate any one of sample size, effect size, significance level, and power from the other three.

With this, we can draw a graph that describes how power varies with effect size.
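
A sketch of such a graph, assuming a one-sample t-test at the 5% level with n = 500 and standard deviation 100 (illustrative values matching the SAT-M example):

```r
# Power of a one-sample t-test (n = 500, sd = 100, 5% level) as a function
# of the true deviation from the null value.
delta <- seq(0, 20, by = 0.5)
pw <- sapply(delta, function(d)
  power.t.test(n = 500, delta = d, sd = 100, sig.level = 0.05,
               type = "one.sample")$power)

plot(delta, pw, type = "l",
     xlab = "True deviation from the null value (SAT-M points)",
     ylab = "Power")
abline(h = 0.8, lty = 2)    # the common 80% threshold
```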

Power curves