\[ \def\RR{{\mathbb{R}}} \def\PP{{\mathbb{P}}} \def\EE{{\mathbb{E}}} \def\VV{{\mathbb{V}}} \]
Let \(X_1 \sim Binomial(n_1, p_1)\) and \(X_2 \sim Binomial(n_2, p_2)\). We are interested in relative behaviors of \(p_1\) and \(p_2\).
There are two “obvious” ways to compare \(p_1\) to \(p_2\) - by their difference \(p_1-p_2\) or by their ratio \(p_1/p_2\).
The estimator \(\hat{p}_1 - \hat{p}_2\) of the difference \(p_1-p_2\) does not have a convenient exact distribution, but we can use the central limit theorem here: it is unbiased and asymptotically normal, and for \(n_1, n_2\) large enough that \(n_1\hat{p}_1, n_1\hat{q}_1, n_2\hat{p}_2, n_2\hat{q}_2 \geq 10\),
\[ \hat{p}_1-\hat{p}_2 \overset\approx\sim \mathcal{N}(p_1-p_2, \sigma^2_{\hat{p}_1} + \sigma^2_{\hat{p}_2}) \qquad\text{and}\qquad \sigma^2_{\hat{p}_1} + \sigma^2_{\hat{p}_2} = \frac{p_1q_1}{n_1} + \frac{p_2q_2}{n_2} \]
We may proceed to construct a Wald test as usual, with one important tweak - we need to keep track of what we are assuming at any given point.
If our assumption is \(p_1\neq p_2\) (because our null hypothesis is a non-zero difference in proportions, or because we are calculating power or confidence intervals), then we are justified (for large samples) to insert \(\hat{p}_1\) for \(p_1\) and \(\hat{p}_2\) for \(p_2\).
If we are assuming \(p_1=p_2=p\), then we need to pool our data. Instead of separate samples of size \(n_1\) and \(n_2\) from different populations, we have a single sample \(X_1+X_2\sim Binomial(n_1+n_2, p)\) and get a pooled estimator:
\[ \hat{p} = \frac{X_1+X_2}{n_1+n_2} = \frac{n_1}{n_1+n_2}\hat{p}_1 + \frac{n_2}{n_1+n_2}\hat{p}_2 \]
From the Wald construction we get:
Large-Sample Hypothesis Test for Difference of Proportions
Write \(\hat{p}_1 = X_1/n_1\), \(\hat{p}_2 = X_2/n_2\), \(\hat{p} = (X_1+X_2)/(n_1+n_2)\), and \(q_1, q_2, q, \hat{q}_1, \hat{q}_2, \hat{q}\) all defined as usual. The test statistic for \(H_0: p_1-p_2=0\) uses the pooled estimate: \[ z = \frac{\hat{p}_1-\hat{p}_2}{\sqrt{\hat{p}\hat{q}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \]
Alternative Hypothesis | Rejection Region for Level \(\alpha\) | Power at \(p_1\) and \(p_2\) | Common sample size \(n_1=n_2=n\) needed for power \(1-\beta\) at \(p_1\) and \(p_2\) |
---|---|---|---|
Upper Tail | \(z \geq z_\alpha\) | \(1-\Phi\left(\frac{z_\alpha\sqrt{\overline{p}\overline{q}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}-(p_1-p_2)}{\sigma}\right)\) | \(\frac{\left(z_\alpha\sqrt{(p_1+p_2)(q_1+q_2)/2}+z_\beta\sqrt{p_1q_1+p_2q_2}\right)^2}{(p_1-p_2)^2}\) |
Two-tailed | \(|z| \geq z_{\alpha/2}\) | \(1-\Phi\left(\frac{z_{\alpha/2}\sqrt{\overline{p}\overline{q}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}-(p_1-p_2)}{\sigma}\right) + \Phi\left(\frac{-z_{\alpha/2}\sqrt{\overline{p}\overline{q}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}-(p_1-p_2)}{\sigma}\right)\) | \(\frac{\left(z_{\alpha/2}\sqrt{(p_1+p_2)(q_1+q_2)/2}+z_\beta\sqrt{p_1q_1+p_2q_2}\right)^2}{(p_1-p_2)^2}\) |
Lower Tail | \(z \leq -z_\alpha\) | \(\Phi\left(\frac{-z_\alpha\sqrt{\overline{p}\overline{q}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}-(p_1-p_2)}{\sigma}\right)\) | \(\frac{\left(z_\alpha\sqrt{(p_1+p_2)(q_1+q_2)/2}+z_\beta\sqrt{p_1q_1+p_2q_2}\right)^2}{(p_1-p_2)^2}\) |
Here \(\overline{p} = (n_1p_1+n_2p_2)/(n_1+n_2)\), \(\overline{q} = 1-\overline{p}\), and \(\sigma=\sqrt{\frac{p_1q_1}{n_1}+\frac{p_2q_2}{n_2}}\).
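To make the construction concrete, here is a minimal Python sketch (the counts in `x` and `n` are made up for illustration) computing the pooled test statistic \(z\), its two-tailed p-value, and an unpooled 95% confidence interval for \(p_1-p_2\):

```python
import numpy as np
import scipy.stats

# Hypothetical counts for illustration: x successes out of n trials per group
x = np.array([45, 30])
n = np.array([200, 200])

p_hat = x / n                              # separate estimates of p1, p2
p_pool = x.sum() / n.sum()                 # pooled estimate under H0: p1 = p2

# Pooled test statistic for H0: p1 - p2 = 0, two-tailed p-value
z = (p_hat[0] - p_hat[1]) / np.sqrt(p_pool * (1 - p_pool) * (1 / n).sum())
p_value = 2 * scipy.stats.norm.sf(abs(z))

# Unpooled standard error for the confidence interval (no assumption p1 = p2)
se = np.sqrt((p_hat * (1 - p_hat) / n).sum())
z025 = scipy.stats.norm.isf(0.025)
diff = p_hat[0] - p_hat[1]
print(f"z = {z:.4f}, p = {p_value:.4f}, "
      f"95% CI for p1 - p2: ({diff - z025 * se:.4f}, {diff + z025 * se:.4f})")
```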
The parameter of interest is the risk ratio or relative risk \(RR = \frac{p_1}{p_2}\). We pick as our estimator of interest \(\widehat{RR} = \hat{p}_1/\hat{p}_2\).
For large \(n_1, n_2\), we get an approximately normal distribution of the log of the risk ratio: \[ \log{\widehat{RR}} = \log\hat{p}_1 - \log\hat{p}_2 \sim\mathcal{N}\left(\log{RR}, \frac{n_1-X_1}{n_1X_1} + \frac{n_2-X_2}{n_2X_2}\right) \]
From this, we may compute a large-sample confidence interval and formulate a large-sample test by the Wald approach, with test statistic \(z = \log\widehat{RR}\,/\,SE\) for \(H_0: RR = 1\), where \(SE = \sqrt{\frac{n_1-X_1}{n_1X_1} + \frac{n_2-X_2}{n_2X_2}}\):
Large Sample Risk Ratio Test
Alternative Hypothesis | Rejection Region for Level \(\alpha\) | p-value |
---|---|---|
Upper Tail | \(z > z_\alpha\) | \(1-\Phi(z)\) |
Two-tailed | \(|z| > z_{\alpha/2}\) | \(2\min\{\Phi(z), 1-\Phi(z)\}\) |
Lower Tail | \(z < -z_\alpha\) | \(\Phi(z)\) |
Example 1.3: Heart attack risk was reported to be reduced by taking aspirin, based on a randomized controlled experiment. The data given were \(x\) and \(n\) - the rest is computed:
import numpy as np
import scipy.stats

n = np.array([11034, 11037])        # group sizes: [control, aspirin]
x = np.array([189, 104])            # heart attacks observed in each group
p_hat = x/n                         # sample proportions
pqn = p_hat * (1-p_hat) / n         # per-group variance terms
rrvar = ((n-x)/(n*x)).sum()         # estimated variance of log(RR-hat)
p_diff = p_hat[1]-p_hat[0]          # aspirin minus control
RR_hat = p_hat[1]/p_hat[0]
se_diff = np.sqrt(pqn.sum())
se_RR = np.sqrt(rrvar)
z025 = scipy.stats.norm.isf(0.025)  # upper 2.5% point of the standard normal
print(rf"""
Treatment | $n$ | $x$ | $\hat{{p}}$ | Intervals | $\hat{{p}}_A-\hat{{p}}_C$ | $\widehat{{RR}}$ | 95% CI | low | high
-|-|-|-|-|-|-|-|-|-
Control | {n[0]} | {x[0]} | {p_hat[0]:.4f} | Estimators | {p_diff:.4f} | {RR_hat:.4f} | RR | {np.exp(np.log(RR_hat) - z025*se_RR):.4f} | {np.exp(np.log(RR_hat) + z025*se_RR):.4f}
Aspirin | {n[1]} | {x[1]} | {p_hat[1]:.4f} | Std. Err.s | {se_diff:.4f} | {se_RR:.4f} | Difference | {p_diff - z025*se_diff:.4f} | {p_diff + z025*se_diff:.4f}
""")
Treatment | \(n\) | \(x\) | \(\hat{p}\) | Intervals | \(\hat{p}_A-\hat{p}_C\) | \(\widehat{RR}\) | 95% CI | low | high |
---|---|---|---|---|---|---|---|---|---|
Control | 11034 | 189 | 0.0171 | Estimators | -0.0077 | 0.5501 | RR | 0.4337 | 0.6978 |
Aspirin | 11037 | 104 | 0.0094 | Std. Err.s | 0.0015 | 0.1213 | Difference | -0.0107 | -0.0047 |
Using the difference-of-proportions confidence interval, we can say that aspirin is associated with a reduction in the prevalence of heart attack of between roughly \(0.5\%\) and \(1.1\%\) in absolute terms.
Using the RR confidence interval, we can say that you are about twice as likely (between 43% more likely and 130% more likely) to have a heart attack without aspirin as with aspirin.
Probabilities may often be given as odds (also known as “odds for”):
\[ o = \frac{p}{1-p} \qquad\qquad p = \frac{o}{o+1} \]
Odds of \(1:1\) correspond to a probability of \(50\%\); odds of \(2:1\) correspond to approximately \(67\%\), etc.
Where probabilities are computed as \(\frac{\text{# favored outcomes}}{\text{# outcomes}}\), odds are computed as \(\frac{\text{# favored outcomes}}{\text{# unfavored outcomes}}\). They give an easy way to compute fair payouts in betting setups.
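As a quick sanity check of these conversion formulas, a small Python sketch (the helper names `prob_to_odds` and `odds_to_prob` are just for illustration):

```python
def prob_to_odds(p):
    """Odds 'for' corresponding to probability p."""
    return p / (1 - p)

def odds_to_prob(o):
    """Probability corresponding to odds 'for' o."""
    return o / (o + 1)

print(prob_to_odds(0.5))   # 1.0  -> odds of 1:1
print(odds_to_prob(2))     # 0.666... i.e. roughly 67%
```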
Commonly used especially in medical and epidemiological research is a statistic called the odds ratio (OR) to quantify the strength of association between two events. It is the ratio of conditional odds \(\frac{o(A | B)}{o(A | \text{not }B)}\).
The OR is often used in lieu of the risk ratio when one of the events is sufficiently rare (\(\PP(A) < 10\%\)) - because under this rare-event assumption, the odds ratio approximates the risk ratio.
|  | Treatment | Control | Margin |
|---|---|---|---|
| Event | \(X_{11}\) | \(X_{01}\) | \(X_{*1}\) |
| Non-event | \(X_{10}\) | \(X_{00}\) | \(X_{*0}\) |
| Margin | \(X_{1*}\) | \(X_{0*}\) | \(N\) |
With this notation, to compare the association of a treatment and an event, we might use the risk ratio: \[ \frac{\PP(\text{event}|\text{treatment})}{\PP(\text{event}|\text{control})} = \frac{X_{11}/X_{1*}}{X_{01}/X_{0*}} = \frac{X_{11}/(X_{11}+X_{10})}{X_{01}/(X_{01}+X_{00})} \]
Another option would be the odds ratio: \[ \frac{o(\text{event}|\text{treatment})}{o(\text{event}|\text{control})} = \frac{X_{11}/X_{10}}{X_{01}/X_{00}} \]
Under the rare disease assumption, we can expect \(X_{11}\) and \(X_{01}\) to be much smaller than \(X_{10}\) and \(X_{00}\), so that \[ \frac{X_{11}/(X_{11}+X_{10})}{X_{01}/(X_{01}+X_{00})} \approx \frac{X_{11}/X_{10}}{X_{01}/X_{00}} \]
While less conceptually accessible than the risk ratio, the odds ratio has important and useful theoretical properties:
It turns out that the log odds ratio can be estimated with the sample log odds ratio \(L\): \[ \begin{align*} \log OR &= \log \frac{p_{11}/p_{10}}{p_{01}/p_{00}} = \log p_{11} + \log p_{00} - \log p_{10} - \log p_{01} \\ L &= \log \frac{\hat p_{11}/\hat p_{10}}{\hat p_{01}/\hat p_{00}} = \log \frac{X_{11}/X_{10}}{X_{01}/X_{00}} = \log \frac{X_{11}X_{00}}{X_{01}X_{10}} \end{align*} \]
The sample log odds ratio is asymptotically approximately normal, unbiased, and with standard error \[ SE = \sqrt{\frac{1}{X_{11}}+\frac{1}{X_{01}}+\frac{1}{X_{10}}+\frac{1}{X_{00}}} \]
Using Wald’s construction, we get a hypothesis test and confidence intervals from this information - where for a CI for the odds ratio itself, we exponentiate the bounds of the Wald CI for \(\log OR\).
Using the radiation and disease example on the previous slide, we had \(X_{11}=20\), \(X_{10}=6\), \(X_{01}=10\), \(X_{00}=16\), which gives us
\[ \begin{align*} L &= \log 20 + \log 16 - \log 6 - \log 10 \approx 1.6740 & SE &= \sqrt{\frac{1}{20}+\frac{1}{6}+\frac{1}{10}+\frac{1}{16}} \approx 0.6158 \\ CI_L &= \left(L-z_{0.975}SE, L+z_{0.975}SE\right) \approx (0.4671, 2.8809) & CI_{OR} &= \exp[CI_L] \approx (1.595, 17.830) \\ p &= 2\min\{CDF(L/SE), 1-CDF(L/SE)\} \approx 0.0066 \end{align*} \]
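The same computation can be done in Python; a minimal sketch using the counts above:

```python
import numpy as np
import scipy.stats

# Cell counts from the radiation and disease example
x11, x10, x01, x00 = 20, 6, 10, 16

L = np.log(x11) + np.log(x00) - np.log(x10) - np.log(x01)   # sample log odds ratio
SE = np.sqrt(1 / x11 + 1 / x10 + 1 / x01 + 1 / x00)         # Wald standard error

z975 = scipy.stats.norm.isf(0.025)
ci_log = (L - z975 * SE, L + z975 * SE)                     # CI for log OR
ci_or = (np.exp(ci_log[0]), np.exp(ci_log[1]))              # exponentiate for the OR

z = L / SE                                                  # Wald statistic, H0: OR = 1
p = 2 * min(scipy.stats.norm.cdf(z), scipy.stats.norm.sf(z))

print(f"L = {L:.4f}, SE = {SE:.4f}, "
      f"CI_OR = ({ci_or[0]:.4f}, {ci_or[1]:.4f}), p = {p:.4f}")
```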
With \(n\) experimental units, two measurements are performed (paired data), producing a contingency table:
|  | After Success | After Failure | Marginals |
|---|---|---|---|
| Before Success | \(X_{11}\) | \(X_{10}\) | \(X_{1*}\) |
| Before Failure | \(X_{01}\) | \(X_{00}\) | \(X_{0*}\) |
| Marginals | \(X_{*1}\) | \(X_{*0}\) | \(N\) |
Notice that the marginal probabilities tell us about total successes before and after the treatment. So the null hypothesis (no increase) would be \(H_0:p_{*1} - p_{1*} = 0\), whereas the alternative (increased support) would be \(H_a:p_{*1} - p_{1*} > 0\).
We can estimate the difference by using the marginal sums:
\[ \hat{p}_{*1} = \frac{X_{*1}}{N} = \frac{X_{11}+X_{01}}{N} \qquad \hat{p}_{1*} = \frac{X_{1*}}{N} = \frac{X_{11}+X_{10}}{N} \qquad \hat{\Delta}_p = \hat{p}_{*1}-\hat{p}_{1*} = \frac{X_{01} - X_{10}}{N} \]
\(\hat\Delta_p\) estimates the difference of the marginal probabilities.
While each entry behaves like a \(Binomial(N, p_{ij})\) variable, their interactions are easier to understand with the multinomial distribution - the vector \((X_{11}, X_{10}, X_{01}, X_{00})\) follows \(Multinomial(N, (p_{11}, p_{10}, p_{01}, p_{00}))\).
We can write each count as a sum of indicator functions, \(X_i = \sum_k \boldsymbol{1}_i(k)\), where \(\boldsymbol{1}_i(k)\) is \(1\) if the \(k\)-th observation falls in category \(i\) and \(0\) otherwise; each indicator function is a Bernoulli-distributed variable.
Then: \[ \begin{align*} Cov(X_i, X_j) &= Cov\left(\sum_k\boldsymbol{1}_i(k), \sum_l\boldsymbol{1}_j(l)\right) \\ &= \sum_k\sum_l Cov(\boldsymbol{1}_i(k), \boldsymbol{1}_j(l)) \\ &= \sum_k\left[Cov(\boldsymbol{1}_i(k), \boldsymbol{1}_j(k)) + \sum_{l\neq k}\underbrace{Cov(\boldsymbol{1}_i(k),\boldsymbol{1}_j(l))}_{=0\text{ by independence}}\right] \\ \text{assuming } i\neq j \quad &= \sum_k\left[ \underbrace{\EE[\boldsymbol{1}_i(k)\cdot\boldsymbol{1}_j(k)]}_{=0} - \EE[\boldsymbol{1}_i(k)]\cdot\EE[\boldsymbol{1}_j(k)] \right] \\ &= -\sum_k\EE[\boldsymbol{1}_i(k)]\cdot\EE[\boldsymbol{1}_j(k)] = -Np_ip_j \end{align*} \]
The variables \(X_i, X_j\) from a multinomial vector are not independent, so any calculation of variance involving both will need to involve covariance.
\[ \begin{align*} \VV[X_i-X_j] &= \VV[X_i] + \VV[X_j] + 2\cdot(-1)\cdot Cov(X_i, X_j) \\ &= Np_i(1-p_i) + Np_j(1-p_j) - 2\cdot(-Np_ip_j) \\ &= \color{Teal}{Np_i} \color{DarkMagenta}{- Np_i^2} + \color{Teal}{Np_j} \color{DarkMagenta}{- Np_j^2 + 2Np_ip_j} \\ &= \color{Teal}{Np_i+Np_j} \color{DarkMagenta}{-N(p_i-p_j)^2} \end{align*} \]
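A short simulation can confirm the covariance formula \(Cov(X_i,X_j)=-Np_ip_j\); a sketch with arbitrarily chosen \(N\) and cell probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
p = np.array([0.35, 0.15, 0.20, 0.30])        # (p11, p10, p01, p00), arbitrary choice

draws = rng.multinomial(N, p, size=100_000)   # each row is one multinomial vector

# Compare the empirical covariance of X_01 and X_10 with the theoretical -N * p_i * p_j
emp_cov = np.cov(draws[:, 2], draws[:, 1])[0, 1]
print(f"empirical: {emp_cov:.2f}, theoretical: {-N * p[2] * p[1]:.2f}")
```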
Our null hypothesis of equal marginal probabilities is equivalent to \(p_{01} = p_{10}\). Under \(H_0\),
\[ \hat\Delta_p = \frac{X_{01}-X_{10}}{N} \overset\approx\sim \mathcal{N}\left( p_{01}-p_{10}, \frac{p_{01}+p_{10}-(p_{01}-p_{10})^2}{N} \right) = \mathcal{N}\left(0, \frac{p_{01}+p_{10}}{N}\right) \]
Estimating \(p_{01}+p_{10}\) by \((X_{01}+X_{10})/N\) gives the test statistic
\[ Z = \frac{\hat\Delta_p}{\sqrt{(X_{01}+X_{10})/N^2}} = \frac{X_{01}-X_{10}}{\sqrt{X_{01}+X_{10}}} \overset\approx\sim \mathcal{N}(0,1) \]
|  | After Success | After Failure | Marginals |
|---|---|---|---|
| Before Success | \(350\) | \(150\) | \(\color{DarkMagenta}{500}\) |
| Before Failure | \(200\) | \(300\) | \(\color{DarkMagenta}{500}\) |
| Marginals | \(\color{DarkMagenta}{550}\) | \(\color{DarkMagenta}{450}\) | \(1000\) |
So we compute: \(Z = \frac{200-150}{\sqrt{200+150}} \approx 2.6726\), for a two-sided p-value of \(0.0075\)
Edwards (1948) suggested a continuity-corrected version of this test (McNemar’s test), where the test statistic is adjusted to \[ Z^2 = \frac{(|X_{01}-X_{10}|-1)^2}{X_{01}+X_{10}} \sim \chi^2(1) \]
The chi-squared or standard normal approximations assume large enough data, usually taken to mean \(X_{01}+X_{10}\geq 25\). For smaller counts, an exact binomial test can be used instead:
Out of the \(X_{01}+X_{10}\) pairs whose outcome changed between the two measurements, we can view \(X_{01}\) - under the null hypothesis of no difference between before and after - as \(X_{01}\sim Binomial(X_{01}+X_{10}, 1/2)\), and compute a two-tailed p-value by doubling the smaller tail as usual.
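A minimal Python sketch of all three versions - the large-sample \(Z\) statistic, the continuity-corrected statistic, and the exact binomial test - applied to the discordant counts \(X_{01}=200\), \(X_{10}=150\) from the example above:

```python
import numpy as np
import scipy.stats

x01, x10 = 200, 150            # discordant counts from the example table

# Large-sample Z test
z = (x01 - x10) / np.sqrt(x01 + x10)
p_z = 2 * scipy.stats.norm.sf(abs(z))

# Continuity-corrected chi-squared version
chi2 = (abs(x01 - x10) - 1) ** 2 / (x01 + x10)
p_chi2 = scipy.stats.chi2(1).sf(chi2)

# Exact binomial version: X01 ~ Binomial(X01 + X10, 1/2) under H0
m = x01 + x10
p_exact = 2 * min(scipy.stats.binom.cdf(x01, m, 0.5),
                  scipy.stats.binom.sf(x01 - 1, m, 0.5))

print(f"z = {z:.4f} (p = {p_z:.4f}), chi2 = {chi2:.4f} (p = {p_chi2:.4f}), "
      f"exact p = {p_exact:.4f}")
```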
Suppose \(S_1\sim\chi^2(d_1)\) and \(S_2\sim\chi^2(d_2)\) are independent. Consider the quantity \(F = \frac{S_1/d_1}{S_2/d_2}\).
Snedecor (1934) tabulated the values of what is now known as the F distribution, naming it in honor of Fisher, while improving the presentation of Fisher’s work on analysis of variance (ANOVA). In Fisher’s original work, he gave critical values for \(z = \frac{1}{2}\log F\) instead.
It turns out that such fractions of scaled chi-squared variables have a PDF, defined on \([0,\infty)\): \[ PDF_{F(d_1,d_2)}(x) = \frac{\Gamma\left(\frac{d_1+d_2}{2}\right)}{\Gamma\left(\frac{d_1}{2}\right)\cdot\Gamma\left(\frac{d_2}{2}\right)}\cdot \left(\frac{d_1}{d_2}\right)^{d_1/2}\cdot\frac{x^{d_1/2-1}}{(1+d_1x/d_2)^{(d_1+d_2)/2}} \]
Computations with this PDF tend to be difficult: use existing software or probability tables instead of integrating it on your own.
It is immediate from the characterization as a ratio of chi-squared variables that if \(F\sim F(d_1,d_2)\), then \(1/F \sim F(d_2,d_1)\).
If \(T=Z/\sqrt{X/\nu}\) is a \(T(\nu)\)-distributed variable (with \(Z\sim\mathcal{N}(0,1)\) and \(X\sim\chi^2(\nu)\) independent), then \(T^2=Z^2/(X/\nu)\) is a ratio of two rescaled chi-squared variables - numerator with 1 degree of freedom and denominator with \(\nu\) degrees of freedom. So \(T^2\sim F(1,\nu)\).
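A quick numerical check of this relationship with scipy (the choices \(\nu = 7\) and cutoff \(2.0\) are arbitrary):

```python
import scipy.stats

nu, c = 7, 2.0   # arbitrary degrees of freedom and cutoff
# P(|T| > c) for T ~ t(nu) should match P(F > c^2) for F ~ F(1, nu)
print(2 * scipy.stats.t(nu).sf(c))       # two-tailed t probability
print(scipy.stats.f(1, nu).sf(c ** 2))   # upper-tail F probability
```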
\[ \text{If } F \sim F(d_1,d_2) \text{ then:}\qquad \EE[F] = \frac{d_2}{d_2-2} \text{ if } d_2>2\qquad \VV[F] = \frac{2d_2^2(d_1+d_2-2)}{d_1(d_2-2)^2(d_2-4)}\text{ if } d_2>4 \]
Theorem
Let \(X_1,\dots,X_{n_X}\sim\mathcal{N}(\mu_X,\sigma_X^2)\) iid and independently \(Y_1,\dots,Y_{n_Y}\sim\mathcal{N}(\mu_Y, \sigma_Y^2)\) iid.
Then \[ F = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} \sim F(n_X-1, n_Y-1) \]
We know \(\frac{(n_X-1)S_X^2}{\sigma_X^2}\sim\chi^2(n_X-1)\) and \(\frac{(n_Y-1)S_Y^2}{\sigma_Y^2}\sim\chi^2(n_Y-1)\). Divide each variable by its respective degrees of freedom, and we get a ratio of independent rescaled chi-squared variables, hence an F-distributed variable:
\[ \frac {\frac{\color{Teal}{(n_X-1)}S_X^2}{\sigma_X^2}{\Large/}\color{Teal}{(n_X-1)}} {\frac{\color{DarkMagenta}{(n_Y-1)}S_Y^2}{\sigma_Y^2}{\Large/}\color{DarkMagenta}{(n_Y-1)}} = \frac{S_X^2/\sigma_X^2}{S_Y^2/\sigma_Y^2} \]
Important
That this ratio follows an F-distribution relies crucially on the samples being independent and normal. This distribution, and everything we derive from it, is highly sensitive to departures from normality.
From this we get a pivot quantity as well as a test statistic. If (under the null hypothesis) we assume \(\sigma_X^2=\sigma_Y^2\), then the unknown variances cancel in the ratio, leaving the test statistic \(f = S_X^2/S_Y^2\).
The F test for equality of variances
Alternative Hypothesis | Rejection Region for Level \(\alpha\) | p-value |
---|---|---|
Upper Tail | \(f\geq F_{\alpha,n_X-1,n_Y-1}\) | \(1-CDF_{F(n_X-1,n_Y-1)}(f)\) |
Two-tailed | \(f\geq F_{\alpha/2,n_X-1,n_Y-1}\) or \(f\leq F_{1-\alpha/2,n_X-1,n_Y-1}\) | \(2\min\{CDF_{F(n_X-1,n_Y-1)}(f), 1-CDF_{F(n_X-1,n_Y-1)}(f)\}\) |
Lower Tail | \(f\leq F_{1-\alpha,n_X-1,n_Y-1}\) | \(CDF_{F(n_X-1,n_Y-1)}(f)\) |
|  | R | Python |
|---|---|---|
| Critical values \(F_{\alpha,d_1,d_2}\) | `qf(1-alpha, d1, d2)` | `scipy.stats.f(d1,d2).isf(alpha)` |
| Tail probabilities | `pf(f, d1, d2)` (lower), `pf(f, d1, d2, lower.tail=FALSE)` (upper) | `scipy.stats.f(d1,d2).cdf(f)` (lower), `scipy.stats.f(d1,d2).sf(f)` (upper) |
Here, we write \(F_{\alpha,d_1,d_2} = CDF_{F(d_1,d_2)}^{-1}(1-\alpha)\).
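Putting the pieces together, a minimal Python sketch of the two-tailed version of this test on simulated normal samples (the sample sizes and parameters are arbitrary choices):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=2.0, size=25)    # simulated sample X
y = rng.normal(loc=0.0, scale=2.0, size=30)    # simulated sample Y

f = x.var(ddof=1) / y.var(ddof=1)              # test statistic S_X^2 / S_Y^2
dist = scipy.stats.f(len(x) - 1, len(y) - 1)

p_two_tailed = 2 * min(dist.cdf(f), dist.sf(f))
crit_hi = dist.isf(0.025)                      # F_{alpha/2, n_X-1, n_Y-1}
crit_lo = dist.ppf(0.025)                      # F_{1-alpha/2, n_X-1, n_Y-1}

print(f"f = {f:.4f}, p = {p_two_tailed:.4f}, "
      f"reject at alpha = 0.05 if f >= {crit_hi:.4f} or f <= {crit_lo:.4f}")
```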