Lecture 17

MVJ

12 April, 2018

Basic structure of statistical testing

Statistical hypothesis testing aims to provide quantifiable levels of certainty in claims about a data source. The way we create these claims is through the sampling distribution for some useful statistic.

  1. Propose a hypothetical true population distribution (the null hypothesis)
  2. Calculate a sample statistic \(S\) called the test statistic from the data
  3. Calculate the probability, in the sampling distribution and assuming the population distribution from step 1, of seeing a sample statistic at least as extreme as \(S\)

This probability is the p-value.

If this probability is low, we have reason to believe that the null hypothesis is not true. We are able to reject the null hypothesis in favor of the alternative.

Basic structure of statistical estimation

Statistical estimation aims to provide quantifiable levels of certainty in claims about a data source. The way we create these claims is through the sampling distribution for some useful statistic.

  1. Propose a hypothetical true population distribution (the null hypothesis)
  2. Calculate a sample statistic \(S\) called the test statistic from the data
  3. Calculate the range of possible parameters that would have made the test statistic \(S\) likely.

This range is the confidence interval.

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

Should we use self-reported scores from volunteer students?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

Should we use scores from the SATs that were taken?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We use a simple random sample (SRS) to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).

How much can you say about the population mean SAT-M score?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We use SRS to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).

From a large body of past measurements, we take the standard deviation of SAT-M scores in California to be \(\sigma=100\).

How much can you say about the population mean SAT-M score?

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We use SRS to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).

From a large body of past measurements, we take the standard deviation of SAT-M scores in California to be \(\sigma=100\).

The central limit theorem tells us that the sample mean is approximately distributed as \[ \overline x \sim \mathcal N(\mu, \sigma/\sqrt{500}) \qquad\text{here,}\qquad \sigma/\sqrt{500} \approx 4.5 \]

The 68-95-99.7 rule says that there is a 95% chance of \(\overline x\) being within \(\pm 2\sigma/\sqrt{500} \approx \pm 9\) of \(\mu\). This means that \(\mu\) has a 95% chance of being within \(\pm 9\) of \(\overline x\): we can be 95% confident that \[ \overline x-9 = 486 < \mu < 504 = \overline x + 9 \]

Confidence intervals: a worked example

We wish to measure the mean SAT-M score for high school seniors in California.

We estimated \(\overline x = 495\), \(\sigma = 100\).

The central limit theorem tells us that the sample mean is approximately distributed as \[ \overline x \sim \mathcal N(\mu, \sigma/\sqrt{500}) \qquad\text{here,}\qquad \sigma/\sqrt{500} \approx 4.5 \]

The 68-95-99.7 rule says that there is a 95% chance of \(\overline x\) being within \(\pm 2\sigma/\sqrt{500} \approx \pm 9\) of \(\mu\). This means that \(\mu\) has a 95% chance of being within \(\pm 9\) of \(\overline x\): we can be 95% confident that \[ \overline x-9 = 486 < \mu < 504 = \overline x + 9 \]

In reality, it gets more complicated: instead of a large body of past measurements, we can use the sample standard deviation. This produces the T-test.
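A minimal R sketch of this calculation, using the exact normal critical value 1.96 rather than the rounded factor 2 (all numbers are from the example above):

```r
xbar  <- 495                 # sample mean SAT-M score
sigma <- 100                 # standard deviation, assumed known
n     <- 500                 # sample size

se <- sigma / sqrt(n)        # standard error of the mean, about 4.47
z  <- qnorm(0.975)           # upper 2.5% point of the standard normal, about 1.96
me <- z * se                 # margin of error, about 8.8

c(lower = xbar - me, upper = xbar + me)   # roughly (486, 504)
```

With the rounded factor 2 this is exactly the interval \(486 < \mu < 504\) above.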

Confidence intervals: confidence levels

The confidence level measures how often we tolerate being wrong.

Confidence intervals: definition

A confidence interval for a parameter at confidence level \(\alpha\) is an interval computed from sample data by a method that has probability \(\alpha\) of producing an interval containing the true value of the parameter.

In a magic world, where we know the standard deviation of our population, the \(\alpha\) confidence interval for the mean can be calculated using the standard normal quantile function qnorm:

If \(z = z_{(1-\alpha)/2}\), computed in R as qnorm((1-\alpha)/2, lower.tail = FALSE), then

\[ \overline x - z\sigma/\sqrt{n} < \mu < \overline x + z\sigma/\sqrt{n} \]

The value \(z\sigma/\sqrt{n}\) is the margin of error.

Confidence intervals: why \(\alpha/2\)?
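
The interval is two-sided, so the probability \(1-\alpha\) of missing the true parameter is split evenly between the two tails; each tail therefore gets \((1-\alpha)/2\). A quick check in R (using \(\alpha = 0.95\)):

```r
alpha <- 0.95                                     # confidence level
z <- qnorm((1 - alpha) / 2, lower.tail = FALSE)   # upper critical value, about 1.96

# The central area between -z and z recovers the confidence level:
pnorm(z) - pnorm(-z)                              # 0.95
```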

Confidence intervals: Margin of error factors

Consider the margin of error

\[ m = z_{(1-\alpha)/2} \sigma / \sqrt{n} \]

What happens if…

Confidence intervals: Choosing sample size

It follows from the margin of error formula that if we require a margin of error less than some threshold \(m_0\), then

\[ m_0 > z_{(1-\alpha)/2} \sigma / \sqrt{n} \]

So to achieve that margin of error, we will need

\[ \sqrt{n} > \frac{z_{(1-\alpha)/2}\sigma}{m_0} \qquad\text{so}\qquad n > \left(\frac{z_{(1-\alpha)/2}\sigma}{m_0}\right)^2 \]

For a 95% confidence interval (approximating \(z_{(1-\alpha)/2}\approx 2\)) on our SAT-M data, with standard deviation 100, to get a margin of error of less than \(\pm 10\) we would need…

Confidence intervals: Choosing sample size

It follows from the margin of error formula that if we require a margin of error less than some threshold \(m_0\), then

\[ m_0 > z_{(1-\alpha)/2} \sigma / \sqrt{n} \]

So to achieve that margin of error, we will need

\[ \sqrt{n} > \frac{z_{(1-\alpha)/2}\sigma}{m_0} \qquad\text{so}\qquad n > \left(\frac{z_{(1-\alpha)/2}\sigma}{m_0}\right)^2 \]

For a 95% confidence interval (approximating \(z_{(1-\alpha)/2}\approx 2\)) on our SAT-M data, with standard deviation 100, to get a margin of error of less than \(\pm 10\) we would need…

\(n > (2\cdot100/10)^2 = 20^2 = 400\) samples.
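The same computation in R, using the exact critical value rather than the rounded factor 2 (the values \(m_0 = 10\) and \(\sigma = 100\) are from the example):

```r
sigma <- 100                    # known standard deviation of SAT-M scores
m0    <- 10                     # largest acceptable margin of error
z     <- qnorm(0.975)           # exact 97.5% point, about 1.96

(z * sigma / m0)^2              # about 384: round up, so 385 students suffice
```

With the rounded critical value 2, this bound becomes the 400 samples stated above.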

Hypotheses and testing

The null hypothesis represents the status quo, the claim that the test aims to disprove.

The alternative hypothesis is the negation of the null hypothesis.

To disprove the null hypothesis, we assume it to be true and calculate the probability of the data if it were true. If the data is unlikely under the null hypothesis, we reject the null hypothesis in favor of the alternative.

Null hypotheses

A null hypothesis is usually formulated as a specific value of a parameter (or a difference of parameters, or a quotient of parameters). The alternative hypothesis is usually formulated as an inequality.

We might not care about all possible deviations from the null hypothesis.

| Alternative hypothesis | Inequality   | What rejecting the null proves |
|------------------------|--------------|--------------------------------|
| Two-tailed             | not equal to | The true parameter is significantly different from the null hypothesis. |
| Upper tail             | greater than | The true parameter is significantly larger than the null hypothesis. |
| Lower tail             | less than    | The true parameter is significantly smaller than the null hypothesis. |

Null hypotheses

Do these questions require upper / lower / two-tailed alternative hypotheses?

Alternative hypotheses

Do these questions require upper / lower / two-tailed alternative hypotheses?

One-tailed vs Two-tailed

Null and alternative hypotheses

Pair up. Write out null and alternative hypotheses for these questions:

The structure of a hypothesis test.

  1. Formulate null and alternative hypotheses.
  2. Set a significance level \(\alpha\).
  3. Calculate a test statistic \(S\).
  4. Calculate the \(p\)-value for \(S\) (ie the probability of seeing \(S\) under the null hypothesis)
  5. If \(p<\alpha\), reject the null hypothesis in favor of the alternative.

The result is reported as
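As a sketch, here are the five steps carried out in R for the SAT-M data, against an illustrative null value \(\mu_0 = 500\), using the known-\(\sigma\) z-test from earlier in the lecture:

```r
# 1. Hypotheses: H0: mu = 500  vs  Ha: mu != 500   (mu0 = 500 is illustrative)
mu0 <- 500

# 2. Significance level
alpha <- 0.05

# 3. Test statistic: the z-score of the sample mean under H0
xbar <- 495; sigma <- 100; n <- 500
S <- (xbar - mu0) / (sigma / sqrt(n))       # about -1.12

# 4. p-value: probability of a statistic at least this extreme (two-sided)
p <- 2 * pnorm(-abs(S))                     # about 0.26

# 5. Decision
p < alpha                                   # FALSE: we do not reject H0
```

With a vector of raw scores instead of a known \(\sigma\), the same steps are what t.test(scores, mu = 500) performs.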

p-values, critical values, statistical significance

The \(p\)-value is…

A significance threshold expressed directly on the scale of \(S\) is called a critical value.

Statistical significance is not the only factor: low \(p\)-values can always be achieved with large enough data sets. Statistical significance has to be balanced against what the deviation means.
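A small R illustration of this point, with hypothetical numbers: the observed mean differs from the null value by a single SAT-M point, a practically negligible deviation, yet the p-value can be driven arbitrarily low by increasing the sample size:

```r
mu0   <- 500     # null value
sigma <- 100
xbar  <- 501     # observed mean just one point away from the null

for (n in c(100, 10000, 1000000)) {
  z <- (xbar - mu0) / (sigma / sqrt(n))
  cat("n =", n, " p =", signif(2 * pnorm(-abs(z)), 2), "\n")
}
# p goes from 0.92 at n = 100 down to essentially 0 at n = 1,000,000
```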

Effect sizes

One way to measure the practical importance of a deviation is through effect sizes. Effect sizes do not scale with the sample size: they measure the size of the deviation from the null hypothesis, rather than how unlikely that deviation is.

We will use Cohen’s \(d\) quite a lot: it is the deviation of the sample mean from the null value, measured in population standard deviations, \(d = (\overline x - \mu_0)/\sigma\) (in practice \(\sigma\) is replaced by the sample standard deviation \(s\)).
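
A sketch in R for the SAT-M example, against the illustrative null value \(\mu_0 = 500\):

```r
xbar <- 495; sigma <- 100
mu0  <- 500                    # illustrative null value

d <- (xbar - mu0) / sigma      # Cohen's d = -0.05: a very small effect,
d                              # and it does not change as n grows
```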

Error types

Quantifying certainty in statistics is usually done by trying to avoid being wrong. We distinguish different types of error.

|                 | Alternative is true | Alternative is false |
|-----------------|---------------------|----------------------|
| Significant     | True rejection      | False rejection      |
| Not significant | False acceptance    | True acceptance      |

Type I error / false rejection / false positive: rejects the null when we should not

Type II error / false acceptance / false negative: fails to reject the null when we should

Probability of false rejection: significance level

Probability of false acceptance: \(1 - {}\)power (equivalently, power is the probability of a true rejection)

Significance level and power

The significance level is a single fixed value for a single test, chosen before the test is performed.

Common significance levels are 1%, 5%, 10%. (primarily because Fisher included 5% tables in his 1925 textbook)

Power takes a different value for each effect size: usually we talk about power curves, which show how well a test performs across a range of possible outcomes.

A common power threshold is 80% at the effect sizes of interest.
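
For example, base R’s power.t.test can report how large a sample is needed to reach 80% power for a given effect size (a sketch; the 10-point deviation with standard deviation 100, i.e. \(d = 0.1\), is an illustrative choice, and the answer comes out just under 800 students):

```r
# How many students do we need for 80% power to detect a 10-point deviation
# (d = 0.1) with a one-sample t-test at the 5% level? n is left unspecified,
# so power.t.test solves for it.
power.t.test(delta = 10, sd = 100, sig.level = 0.05, power = 0.80,
             type = "one.sample", alternative = "two.sided")
```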

The danger of fixating on significance levels

“… surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?” (Rosnow & Rosenthal)

“for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of evidence and his ideas” (Fisher)

“Need we – should we – stick to p=0.05 if what we seek is a relatively pure list of appearances? No matter where our cutoff comes, we will not be sure of all appearances. Might it not be better to adjust the critical p moderately – say to 0.03 or 0.07 – whenever such a less standard value seems to offer a greater fraction of presumably real appearances among those significant at the critical p? We would then use different modifications for different sets of data. No one, to my knowledge, has set himself the twin problems of how to do this and how well doing this in a specific way performs.” (Tukey)

The danger of ignoring field standards

Some quotes from published papers

The danger of ignoring field standards

Power

Power influences

Power increases if…

Power curves

The functions power.t.test and power.prop.test can be used to calculate any one of sample size, effect size, significance level, and power from the other three.

With this, we can draw a graph that describes how power varies with effect size.
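
A sketch of such a graph, assuming a one-sample t-test at the 5% level with n = 500 and standard deviation 100 (illustrative values matching the SAT-M example):

```r
# Power of a one-sample t-test (n = 500, sd = 100, 5% level) as a function
# of the true deviation from the null value.
delta <- seq(0, 20, by = 0.5)
pw <- sapply(delta, function(d)
  power.t.test(n = 500, delta = d, sd = 100, sig.level = 0.05,
               type = "one.sample")$power)

plot(delta, pw, type = "l",
     xlab = "True deviation from the null value (SAT-M points)",
     ylab = "Power")
abline(h = 0.8, lty = 2)    # the common 80% threshold
```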

Power curves