MVJ
12 April, 2018
Statistical hypothesis testing aims to provide quantifiable levels of certainty in claims about a data source. The way we create these claims is through the sampling distribution for some useful statistic.
Assuming the null hypothesis is true, we compute the probability of seeing data at least as extreme as what we observed. This probability is the p-value.
If this probability is low, we have reason to believe that the null hypothesis is not true. We are able to reject the null hypothesis in favor of the alternative.
The same sampling distribution also lets us compute a range of plausible values for the parameter we are estimating. This range is the confidence interval.
We wish to measure the mean SAT-M score for high school seniors in California.
Should we use self-reported scores from volunteer students?
Should we use scores from the SATs that were taken?
We use a simple random sample (SRS) to select 500 students, and administer the SAT-M to these students. Within this sample, the mean score is \(\overline x = 495\).
How much can you say about the population mean SAT-M score?
From a large body of past measurements, we take the standard deviation of SAT-M scores in California to be \(\sigma=100\).
The central limit theorem tells us that the means distribute as \[ \overline x \sim \mathcal N(\mu, \sigma/\sqrt{500}) \qquad\text{here,}\qquad \sigma/\sqrt{500} \approx 4.5 \]
The 68-95-99.7 rule says that there is a 95% chance of \(\overline x\) being within \(\pm 2\sigma/\sqrt{500} \approx \pm 9\) of \(\mu\). This means that \(\mu\) has a 95% chance of being within \(\pm 9\) of \(\overline x\): we can be 95% confident that \[ \overline x-9 = 486 < \mu < 504 = \overline x + 9 \]
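As a sanity check, a short simulation (a Python sketch with hypothetical population parameters \(\mu=500\), \(\sigma=100\)) shows the sample means clustering around \(\mu\) with spread close to \(\sigma/\sqrt{500}\approx 4.5\):

```python
import random
import statistics

# Hypothetical population: SAT-M scores with mean 500 and sd 100.
# (The true mu is unknown in practice; we pick one for the simulation.)
random.seed(1)
mu, sigma, n = 500, 100, 500

# Draw many SRS samples of size n and record each sample mean.
means = [
    statistics.fmean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(2000)
]

# The sample means cluster around mu with sd close to sigma/sqrt(n) ~ 4.5.
print(round(statistics.fmean(means), 1))
print(round(statistics.stdev(means), 2))
```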
In reality, it gets more complicated: lacking a large body of past measurements, we have to use the sample standard deviation instead of \(\sigma\). This produces the \(t\)-test.
The confidence level measures how often we tolerate being wrong: a method at confidence level \(\alpha\) may miss the true parameter a fraction \(1-\alpha\) of the time.
A confidence interval for a parameter at confidence level \(\alpha\) is an interval computed from sample data by a method that has probability \(\alpha\) of producing an interval containing the true value of the parameter.
In a magic world, where we know the standard deviation of our population, the \(\alpha\) confidence interval for the mean can be calculated using the standard normal quantile function qnorm: if \(z = z_{(1-\alpha)/2} = \) qnorm((1-alpha)/2, lower.tail=FALSE) is the upper-tail critical value, then
\[ \overline x - z\sigma/\sqrt{n} < \mu < \overline x + z\sigma/\sqrt{n} \]
The value \(z\sigma/\sqrt{n}\) is the margin of error.
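A minimal Python sketch of this interval, using the standard library's NormalDist.inv_cdf as a stand-in for R's qnorm, with the SAT-M numbers from above:

```python
from math import sqrt
from statistics import NormalDist

# Upper-tail critical value for confidence level alpha,
# mirroring qnorm((1-alpha)/2, lower.tail=FALSE) in R.
alpha = 0.95
z = NormalDist().inv_cdf(1 - (1 - alpha) / 2)   # ~ 1.96

xbar, sigma, n = 495, 100, 500   # SAT-M example
m = z * sigma / sqrt(n)          # margin of error, ~ 8.77

print(round(xbar - m, 1), round(xbar + m, 1))   # → 486.2 503.8
```

The exact quantile 1.96 gives a slightly tighter interval than the rule-of-thumb \(\pm 2\) standard errors used earlier.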
Consider the margin of error
\[ m = z_{(1-\alpha)/2} \sigma / \sqrt{n} \]
What happens if…
From the margin of error formula it follows that if we require a margin of error less than some threshold \(m_0\), then
\[ m_0 > z_{(1-\alpha)/2} \sigma / \sqrt{n} \]
So to achieve that margin of error, we will need
\[ \sqrt{n} > \frac{z_{(1-\alpha)/2}\sigma}{m_0} \qquad\text{so}\qquad n > \left(\frac{z_{(1-\alpha)/2}\sigma}{m_0}\right)^2 \]
For a 95% confidence interval (approximating \(z_{(1-\alpha)/2}\approx 2\)) on our SAT-M data, with standard deviation 100, to get a margin of error of less than \(\pm 10\) we would need…
… \(n > (2\cdot100/10)^2 = 20^2 = 400\) samples.
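The same calculation as a small Python helper (sample_size is a hypothetical name; NormalDist.inv_cdf again plays the role of qnorm):

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(sigma, m0, alpha=0.95):
    """Smallest n whose margin of error stays below m0 at confidence alpha."""
    z = NormalDist().inv_cdf(1 - (1 - alpha) / 2)
    return ceil((z * sigma / m0) ** 2)

# With the exact z ~ 1.96 instead of the rule-of-thumb 2:
print(sample_size(100, 10))   # 385, close to the 400 from z ~ 2
```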
The null hypothesis represents the status quo, the claim that the test aims to disprove.
The alternative hypothesis is the negation of the null hypothesis.
To disprove the null hypothesis, we assume it to be true and calculate the probability of the data if it were true. If the data is unlikely under the null hypothesis, we reject the null hypothesis in favor of the alternative.
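A sketch of this procedure for the SAT-M numbers, testing \(H_0: \mu = 500\) two-sidedly in the known-\(\sigma\) (z-test) setting:

```python
from math import sqrt
from statistics import NormalDist

# Two-sided z-test of H0: mu = 500 against H1: mu != 500,
# using the running SAT-M example (known sigma assumed).
mu0, xbar, sigma, n = 500, 495, 100, 500

z = (xbar - mu0) / (sigma / sqrt(n))   # ~ -1.12
p = 2 * NormalDist().cdf(-abs(z))      # two-tailed p-value, ~ 0.26

print(round(p, 2))
```

With \(p \approx 0.26\), the data is not unlikely under the null hypothesis, so we fail to reject it at the usual 5% level.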
A null hypothesis is usually formulated as a specific value of a parameter (or a difference of parameters, or a quotient of parameters). The alternative hypothesis is usually formulated as an inequality.
We might not care about all possible deviations from the null hypothesis.
| Alternative hypothesis | Inequality | What rejecting the null proves |
|---|---|---|
| Two-tailed | not equal to | The true parameter differs significantly from the null value. |
| Upper tail | greater than | The true parameter is significantly larger than the null value. |
| Lower tail | less than | The true parameter is significantly smaller than the null value. |
Do these questions require upper / lower / two-tailed alternative hypotheses?
Pair up. Write out null and alternative hypotheses for these questions:
The result is reported as
The \(p\)-value is…
A significance threshold expressed directly on the test statistic \(S\), rather than on the \(p\)-value, is called a critical value.
Statistical significance is not the only factor: low \(p\)-values can always be achieved with large enough data sets. Statistical significance has to be balanced against what the deviation means.
One way to measure significance is through effect sizes. Effect sizes do not scale with the sample size, and measure the size of deviation from the null hypothesis, rather than whether it is a likely deviation.
We will use Cohen’s \(d\) quite a lot: it measures how far the sample mean lies from the null value, in population standard deviations, \(d = (\overline x - \mu_0)/\sigma\). Unlike the \(z\)-score of the sample mean, \((\overline x - \mu_0)/(\sigma/\sqrt n)\), it does not grow with the sample size.
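A quick illustration with the SAT-M numbers, contrasting Cohen's \(d\) with the test statistic:

```python
from math import sqrt

# Cohen's d for the SAT-M example against a null of mu0 = 500:
# deviation of the sample mean in units of the population sd.
mu0, xbar, sigma, n = 500, 495, 100, 500

d = (xbar - mu0) / sigma               # -0.05: a very small effect
z = (xbar - mu0) / (sigma / sqrt(n))   # the test statistic grows with sqrt(n)

print(d)              # -0.05
print(round(z, 2))    # -1.12
```

The deviation is tiny as an effect (\(d=-0.05\)), yet with a large enough \(n\) the same \(d\) would eventually produce a statistically significant \(z\).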
Quantifying certainty in statistics is usually done by trying to avoid being wrong. We distinguish different types of error.
| | Alternative is true | Alternative is false |
|---|---|---|
| Significant | True rejection | False rejection |
| Not significant | False acceptance | True acceptance |
Type I error / false rejection / false positive: rejects the null when we should not
Type II error / false acceptance / false negative: fails to reject the null when we should
Probability of false rejection: significance level
Probability of false acceptance: \(\beta\); its complement \(1-\beta\) is the power
Significance level is a single fixed value for a single test, before performing the test.
Common significance levels are 1%, 5%, 10% (primarily because Fisher included 5% tables in his 1925 textbook).
Power takes a different value for each effect size: usually we talk about power curves that show how well a test performs across a range of possible outcomes.
A common power threshold is 80% at the effect sizes of interest.
“… surely, God loves the .06 nearly as much as the .05. Can there be any doubt that God views the strength of evidence for or against the null as a fairly continuous function of the magnitude of p?” (Rosnow & Rosenthal)
“for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of evidence and his ideas” (Fisher)
“Need we – should we – stick to p=0.05 if what we seek is a relatively pure list of appearances? No matter where our cutoff comes, we will not be sure of all appearances. Might it not be better to adjust the critical p moderately – say to 0.03 or 0.07 – whenever such a less standard value seems to offer a greater fraction of presumably real appearances among those significant at the critical p? We would then use different modifications for different sets of data. No one, to my knowledge, has set himself the twin problems of how to do this and how well doing this in a specific way performs.” (Tukey)
Some quotes from published papers
Power increases if…
The R functions power.t.test and power.prop.test can be used to calculate any one of sample size, effect size, significance level, and power from the other three.
With this, we can find a graph that describes how power varies with effect size.