Two-sample Inference: Bootstrap

Mikael Vejdemo-Johansson

The Two-Sample Bootstrap

To use the bootstrap for two samples we follow a procedure very similar to the one-sample bootstrap. The main difference is that we draw separate resamples, with replacement, from each of the two groups simultaneously. For each pair of resamples we compute some statistic comparing the groups, and repeat this a large number of times.

If the resulting statistics seem normally distributed, we can use the normal bootstrap confidence interval; if not, we should use the basic, the percentile, or the studentized method, where for the studentized method we would collect \(t\)-values from each resample and find quantiles of those.
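
For the studentized variant in particular, a minimal sketch of a two-sample bootstrap-t interval might look as follows, with hypothetical vectors x and y standing in for the two groups (the exercise below uses the T-test, normal, and percentile methods instead):

Code
library(foreach)
# Hypothetical placeholder data for the two groups
x = rnorm(30, mean = 3.4, sd = 0.5)
y = rnorm(30, mean = 2.9, sd = 0.6)
obs.diff = mean(x) - mean(y)
obs.se = sqrt(var(x)/length(x) + var(y)/length(y))
# Collect bootstrap t-values: resample each group separately, with replacement
bs.t = foreach(i = 1:999, .combine = c) %do% {
  xs = sample(x, replace = TRUE)
  ys = sample(y, replace = TRUE)
  ((mean(xs) - mean(ys)) - obs.diff) /
    sqrt(var(xs)/length(xs) + var(ys)/length(ys))
}
# Studentized 95% CI: bootstrap t quantiles, reversed around the observed difference
c(lo = obs.diff - unname(quantile(bs.t, 0.975)) * obs.se,
  hi = obs.diff - unname(quantile(bs.t, 0.025)) * obs.se)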

Two-Sample Bootstrap; Exercise 10.69

We are given GPA scores from 30 students on so-called lifestyle floors in a dorm, and from 30 students on other floors:

Code
dorm.gpa = tibble(
  lifestyle=c(2.00, 2.25, 2.60, 2.90, 3.00, 3.00, 3.00, 3.00, 3.00, 3.20, 
              3.20, 3.25, 3.30, 3.30, 3.32, 3.50, 3.50, 3.60, 3.60, 3.70, 
              3.75, 3.75, 3.79, 3.80, 3.80, 3.90, 4.00, 4.00, 4.00, 4.00),
  normal=c(1.20, 2.00, 2.29, 2.45, 2.50, 2.50, 2.50, 2.50, 2.65, 2.70, 
           2.75, 2.75, 2.79, 2.80, 2.80, 2.80, 2.86, 2.90, 3.00, 3.07, 
           3.10, 3.25, 3.50, 3.54, 3.56, 3.60, 3.70, 3.75, 3.80, 4.00)
)
Code
ggplot(dorm.gpa, aes(sample=lifestyle)) +
  geom_qq() + geom_qq_line() +
  labs(title="Normal Probability Plot, Lifestyle")
ggplot(dorm.gpa, aes(sample=normal)) +
  geom_qq() + geom_qq_line() +
  labs(title="Normal Probability Plot, Other")

Two-Sample Bootstrap; Exercise 10.69

a. Obtain a 95% CI for the difference of means using the T-test

Code
t.test(dorm.gpa$lifestyle, dorm.gpa$normal)$conf.int
[1] 0.1582970 0.7350363
attr(,"conf.level")
[1] 0.95

This CI used 56.72 degrees of freedom, and a computed difference in means of 0.4467.

b. Obtain a bootstrap sample of 999 differences of means. Check the bootstrap distribution for normality.

Code
library(foreach)
# Resample each group separately, with replacement, 999 times,
# recording the two group means for each bootstrap replicate
bs.means = foreach(i=1:999, .combine=rbind) %do% {
  cbind(
    dorm.gpa %>% slice_sample(prop=1.0, replace=TRUE) %>% summarise(mean.lifestyle=mean(lifestyle)),
    dorm.gpa %>% slice_sample(prop=1.0, replace=TRUE) %>% summarise(mean.normal=mean(normal))
  )
}
bs.means$mean.diff = bs.means$mean.lifestyle - bs.means$mean.normal
ggplot(bs.means, aes(sample=mean.diff)) + geom_qq() + geom_qq_line() +
  labs(title="Normal Probability Plot, Bootstrap mean diff.")

Two-Sample Bootstrap; Exercise 10.69

c. Use the standard deviation of the bootstrap distribution with the mean and t critical values from a. to get a 95% CI for the difference of means.

\(t_{\alpha/2, 56.7} \approx 2\), \(\overline{X}-\overline{Y}\approx 0.4466\).

Code
mean.diff = mean(dorm.gpa$lifestyle) - mean(dorm.gpa$normal)
t.alpha = qt(0.975, 56.7)

c(lo=mean.diff - t.alpha*sd(bs.means$mean.diff),
  hi=mean.diff + t.alpha*sd(bs.means$mean.diff))
       lo        hi 
0.1625250 0.7308083 

d. Use the bootstrap sample and the percentile method to obtain a 95% CI.

Code
quantile(bs.means$mean.diff, c(0.025,0.975))
   2.5%   97.5% 
0.17660 0.73755 

Two-Sample Bootstrap; Exercise 10.69

e. Compare all three intervals

Code
bs.means.ci = rbind(
  t.test(dorm.gpa$lifestyle, dorm.gpa$normal)$conf.int,
  c(mean.diff - t.alpha*sd(bs.means$mean.diff), mean.diff + t.alpha*sd(bs.means$mean.diff)),
  quantile(bs.means$mean.diff, c(0.025,0.975))
) %>% as_tibble() %>% add_column(type=c("Welch.T", "normal", "percentile"), .before=1)
colnames(bs.means.ci) = c("type", "lo", "hi")
ggplot(bs.means.ci) +
  geom_point(aes(x=lo, y=type), size=5) +
  geom_point(aes(x=hi, y=type), size=5) +
  geom_segment(aes(x=lo, xend=hi, y=type, yend=type)) +
  geom_vline(aes(xintercept=mean.diff)) +
  scale_x_continuous(breaks=(1:20)/20)
kable(bs.means.ci)

type lo hi
Welch.T 0.158297 0.7350363
normal 0.162525 0.7308083
percentile 0.176600 0.7375500

Chapter 10; Exercise 87

Assuming \(n_X<n_Y\), we are to show that Welch’s degrees of freedom \(df > n_X-1\).

\[ \begin{align*} df &= \frac{(se_X^2+se_Y^2)^2}{\frac{se_X^4}{n_X-1}+\frac{se_Y^4}{n_Y-1}} & se_X^2 &= \frac{s_X^2}{n_X} & se_Y^2 &= \frac{s_Y^2}{n_Y} \\ &= (n_X-1)(n_Y-1)\frac{se_X^4 + se_Y^4 + \overbrace{2se_X^2se_Y^2}^{>0}}{(n_Y-1)se_X^4 + \underbrace{(n_X-1)se_Y^4}_{<(n_Y-1)se_Y^4\text{ because }n_X<n_Y}} \\ &\geq (n_X-1)\color{Teal}{(n_Y-1)} \frac{\color{DarkMagenta}{se_X^4 + se_Y^4}} {\color{Teal}{(n_Y-1)}\color{DarkMagenta}{(se_X^4 + se_Y^4)}} \\ &= n_X-1 \end{align*} \]

If we reduce the degrees of freedom, the T-distribution gets fatter tails, and hence a higher probability of extreme outcomes, which makes the p-values larger and the CIs wider.

It is a “safe” change, because choosing the degrees of freedom too low is conservative (i.e., it gives an actual significance level lower, and an actual confidence level higher, than the nominal one), so our probability of a false rejection is no larger than the nominal \(\alpha\).
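
As a quick numerical sanity check of the bound (a sketch with made-up sample standard deviations and sizes, not part of the exercise):

Code
welch.df = function(sx, sy, nx, ny) {
  se.x2 = sx^2/nx; se.y2 = sy^2/ny
  (se.x2 + se.y2)^2 / (se.x2^2/(nx - 1) + se.y2^2/(ny - 1))
}
# With n_X < n_Y the result should always exceed n_X - 1 = 7 here
welch.df(sx = 2.5, sy = 1.1, nx = 8, ny = 40)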

Chapter 10; Exercise 110

We are picking up McNemar’s test again, from last lecture. We are given pairs of individuals, one in each pair treated with Ergotamine and the other with Placebo for migraine headaches, and the following (fictitious) data:

              Ergotamine
               S    F
Placebo   S   44   34
          F   46   30

McNemar’s test statistic uses the two discordant counts (placebo S / ergotamine F, and placebo F / ergotamine S): \[ Z=\frac{46-34}{\sqrt{46+34}} \approx 1.341641 \\ p = 1-\Phi(1.341641) \approx 0.08985625 \]

An upper-tail McNemar test does not support rejecting the null hypothesis at \(\alpha=0.05\). Ergotamine does not seem significantly more effective than placebo.
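
Recomputing the statistic and p-value in R, as a minimal check of the arithmetic above:

Code
# Discordant counts: 46 (placebo F, ergotamine S) and 34 (placebo S, ergotamine F)
z = (46 - 34) / sqrt(46 + 34)
c(z = z, p = 1 - pnorm(z))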

Chapter 10; Exercise 111

Let \(X_1,\dots,X_{n_X}\sim Poisson(\lambda_X)\) iid and independently \(Y_1,\dots,Y_{n_Y}\sim Poisson(\lambda_Y)\) iid.

From \(\VV[\overline{P}]=\lambda/n\) for the mean of \(n\) iid \(Poisson(\lambda)\) samples \(P_i\), and from the independence of the two samples, it follows that \(\VV[\overline{X}-\overline{Y}] = \frac{\lambda_X}{n_X} + \frac{\lambda_Y}{n_Y}\).

Combining these we get a test statistic which, for large samples, is approximately standard normal under \(H_0\): \[ Z=\frac{\overline{X}-\overline{Y}}{\sqrt{\frac{\overline{X}}{n_X}+\frac{\overline{Y}}{n_Y}}} \]

Code
import numpy as np
import pandas as pd
import scipy.stats

# Observed counts per quadrat for the two regions
qcount = pd.DataFrame({
  "counts": [0,1,2,3,4,5,6,7],
  "region1": [28,40,28,17,8,2,1,1],
  "region2": [14,25,30,18,49,2,1,1]
})
X_mean = (qcount.counts*qcount.region1).sum()/qcount.region1.sum()
Y_mean = (qcount.counts*qcount.region2).sum()/qcount.region2.sum()
Z = (X_mean-Y_mean)/(np.sqrt(X_mean/qcount.region1.sum() + Y_mean/qcount.region2.sum()))
print(f"""
X_mean = {X_mean:.4f}
Y_mean = {Y_mean:.4f}
Z = {Z:.4f}
p = {2*scipy.stats.norm.sf(np.abs(Z))}
""")

X_mean = 1.6160
Y_mean = 2.5571
Z = -5.3287
p = 9.889433845220309e-08

Chapter 10; Exercise 112

Using the normal distribution of the test statistic from the last exercise as a pivotal quantity, we get, in the usual way:

\[ \lambda_X-\lambda_Y\in\overline{X}-\overline{Y} \pm z_{\alpha/2}\sqrt{\frac{\overline{X}}{n_X}+\frac{\overline{Y}}{n_Y}} \]

Using the data from the last example, we get:

Code
diff = X_mean - Y_mean
se = np.sqrt(X_mean/qcount.region1.sum() + Y_mean/qcount.region2.sum())
z = scipy.stats.norm.ppf(0.975)
print(f"""
CI = ({diff - z*se:.4f}, {diff + z*se:.4f})
""")

CI = (-1.2873, -0.5950)

Chapter 10; Exercise 113

Suppose we have:

  • one test for \(H_{01}:\theta\in\Omega_1\) vs. \(H_{a1}:\theta\not\in\Omega_1\) with rejection region \(R_1\) at significance level \(\alpha\),
  • another test for \(H_{02}:\theta\in\Omega_2\) vs. \(H_{a2}:\theta\not\in\Omega_2\) with rejection region \(R_2\) at significance level \(\alpha\)

And that \(\Omega_1\) and \(\Omega_2\) are disjoint sets of possible values for the parameter \(\theta\).

We have a proposed test (the intersection-union test, IUT) for \(H_0:\theta\in\Omega_1\cup\Omega_2\) vs. \(H_a:\theta\not\in\Omega_1\cup\Omega_2\) with proposed rejection region \(R_1\cap R_2\).

a. Show that the IUT is a level \(\alpha\) test.

We can split the computation into two cases:

\[ \PP(\text{reject} | H_0) = \PP(\text{reject}| H_{01} \text{ or } H_{02})\\ \begin{align*} \PP(\text{reject} | H_{01}) &= \PP(T\in R_1\cap R_2|H_{01}) & \PP(\text{reject} | H_{02}) &= \PP(T\in R_1\cap R_2|H_{02}) \\ &\leq \PP(T\in R_1 | H_{01}) = \alpha& &\leq \PP(T\in R_2|H_{02}) = \alpha \\ \end{align*} \]

The inequalities hold because \(R_1\cap R_2\subseteq R_1\) and \(R_1\cap R_2\subseteq R_2\): in each case we bound by the probability of the larger rejection region.

Chapter 10; Exercise 113

b. Let \(\mu_T\) be the mean value of some variable for a generic (test) drug, and \(\mu_R\) the same for a brand-name (reference) drug. Bioequivalence testing would test \(H_0:\mu_T/\mu_R\leq\delta_L\) or \(\mu_T/\mu_R\geq\delta_U\) versus \(H_a:\delta_L<\mu_T/\mu_R <\delta_U\) (bioequivalence), where \(\delta_L\) and \(\delta_U\) are standards set by regulatory agencies - FDA uses 0.80 and 1.25=1/0.80 for certain cases, and mandates \(\alpha=0.05\).

Take logarithms, and the question becomes one of testing for differences rather than ratios.

Let \(D\) be an estimator of \(\log\mu_T-\log\mu_R\) with standard error \(S_D\). The standardized variable \(T = \frac{D-(\log\mu_T-\log\mu_R)}{S_D}\sim T(\nu)\). The standard test procedure in this setup is the TOST (two one-sided tests), which uses the two test statistics \(T_U=(D-\log\delta_U)/S_D\) and \(T_L=(D-\log\delta_L)/S_D\).

Concrete examples

Using \(\nu=20\) degrees of freedom and a one-sided cut-off \(t_{0.05,20}\approx1.725\), we reject the \(L\) test when \(T_L\geq t_{0.05,20}\), reject the \(U\) test when \(T_U\leq -t_{0.05,20}\), and reject \(H_0\) (concluding bioequivalence) only when both reject. This gives the following outcomes:

                 Case 1   Case 2   Case 3
\(T_L\)             2.0      1.5      2.0
\(T_U\)            -1.5     -2.0     -2.0
Reject \(L\)        Yes      No       Yes
Reject \(U\)        No       Yes      Yes
Reject \(H_0\)      No       No       Yes
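
The decisions in the table can be reproduced directly; a small sketch in R using the cut-off \(t_{0.05,20}\):

Code
t.crit = qt(0.95, df = 20)   # one-sided cut-off, approx. 1.725
T.L = c(2.0, 1.5, 2.0)
T.U = c(-1.5, -2.0, -2.0)
data.frame(case = 1:3,
           reject.L = T.L >= t.crit,
           reject.U = T.U <= -t.crit,
           reject.H0 = (T.L >= t.crit) & (T.U <= -t.crit))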

Your Homework

10:14 - Justin Ramirez

The z-score for \(\theta\) is \[ z = \frac{\hat\theta-\theta_0}{\sigma_{\hat\theta}} \]

The Null Hypothesis is: \(H_0: \theta-\theta_0=0\).

The alternative hypothesis is: \(H_1:\theta-\theta_0\neq 0\)

We are given the parameter \(\theta=2\mu_1-\mu_2\), which means \(\hat\theta=2\overline{X}_1-\overline{X}_2\).

The expected value of \(\hat\theta\) is \(2\mu_1-\mu_2=\theta\), so \(\hat\theta\) is unbiased; next we need its standard deviation.

The variance is: \[ \VV(\hat\theta)=\VV(2\overline{X}_1-\overline{X}_2) = 4\VV(\overline{X}_1)+\VV(\overline{X}_2) = 4\left(\frac{\sigma_1^2}{n_1}\right)+\left(\frac{\sigma_2^2}{n_2}\right) \]

So the standard error is: \(\sigma_{\hat\theta}=\sqrt{4\left(\frac{\sigma_1^2}{n_1}\right)+\left(\frac{\sigma_2^2}{n_2}\right)}\)

So to get the z-score: \[ z = \frac{(2\overline{X}_1-\overline{X}_2)-(2\mu_1-\mu_2)}{\sqrt{4\left(\frac{\sigma_1^2}{n_1}\right)+\left(\frac{\sigma_2^2}{n_2}\right)}} \]

Plugging in, a test at \(\alpha=0.01\) gives:

\[ z = \frac{2\cdot 2.69 - 6.35}{\sqrt{4\left(\frac{2.3^2}{43}\right)+\left(\frac{4.03^2}{45}\right)}} \approx -1.0503, \] so at \(\alpha=0.01\) we fail to reject the null hypothesis.
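
Checking the arithmetic in R (a minimal sketch using the summary values plugged in above):

Code
z = (2*2.69 - 6.35) / sqrt(4*(2.3^2/43) + 4.03^2/45)
c(z = z, p.two.sided = 2*pnorm(-abs(z)))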

10:15 - Maxim Kleyer

a.: \(\beta(\mu') = \Phi\left(z_\alpha-\frac{(\mu_1-\mu_2)-\Delta_0}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}}\right)\), where \(m\) and \(n\) are the two sample sizes.

\(\sigma_1, \sigma_2\) are known. \(\mu_1-\mu_2>\Delta_0 \Rightarrow \mu_1-\mu_2-\Delta_0>0\)

Let \(n'=n+a\) for some \(a>0\).

\[ \frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n+a} < \frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n} \Rightarrow \frac{1}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n+a}}} > \frac{1}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}} \]

implies \[ \Phi\left(z_\alpha-\frac{(\mu_1-\mu_2)-\Delta_0}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n}}}\right) > \Phi\left(z_\alpha-\frac{(\mu_1-\mu_2)-\Delta_0}{\sqrt{\frac{\sigma_1^2}{m}+\frac{\sigma_2^2}{n+a}}}\right) \]

So the \(\beta\) value DECREASES as \(n\) increases.

b.: The \(\beta\) value decreases as the sample size increases.
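
As a numerical illustration (a sketch with made-up values for \(m\), \(\sigma_1\), \(\sigma_2\), \(\mu_1-\mu_2\) and \(\Delta_0\)), evaluating the formula from a. for increasing \(n\) shows the decrease:

Code
beta.upper = function(n, m = 20, s1 = 2, s2 = 3, d = 1, delta0 = 0, alpha = 0.05) {
  # beta(mu') for the upper-tailed test, as in part a.
  pnorm(qnorm(1 - alpha) - (d - delta0) / sqrt(s1^2/m + s2^2/n))
}
sapply(c(10, 20, 40, 80), beta.upper)   # should be decreasing in n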

10:27 - Nicholas Basile

\(\mu_1=\) true average maximum lean angle for YF

\(\mu_2=\) true average maximum lean angle for OF

\[ H_0:\mu_1-\mu_2=10 \qquad H_a:\mu_1-\mu_2>10 \]

\[ \begin{align*} YF: \overline{x} &= 30.7, & s_1^2 &= 7.57 & n_1&=10\\ OF: \overline{y} &= 16.2, & s_2^2 &= 19.7 & n_2&=5\\ \end{align*} \] \[ \begin{align*} t &= \frac{(\overline{x}-\overline{y})-(\mu_1-\mu_2)}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}} \approx \frac{(30.7-16.2)-10}{\sqrt{\frac{7.57}{10}+\frac{19.7}{5}}} \approx 2.076 \\ DoF &= \frac{\left(\frac{7.57}{10}+\frac{19.7}{5}\right)^2}{\frac{\left(\frac{7.57}{10}\right)^2}{10-1}+\frac{\left(\frac{19.7}{5}\right)^2}{5-1}} \approx 6 \quad \color{DarkMagenta}{\text{MVJ: Actually closer to 5.6; always round DoF down}} \\ \alpha &= 0.1 \\ \PP(t>2.076 | 6 DoF) &= 0.041\quad\color{DarkMagenta}{\text{MVJ: With 5.6 DoF, this comes out as 0.043}} \end{align*} \]

This is less than \(\alpha\), so we reject \(H_0\): the true average maximum lean angle for YF exceeds that of OF by more than 10°.
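
Redoing the computation from the summary statistics in R (a quick check; with the un-rounded degrees of freedom it should reproduce the values in the comments above):

Code
se2 = 7.57/10 + 19.7/5
t.stat = ((30.7 - 16.2) - 10) / sqrt(se2)
dof = se2^2 / ((7.57/10)^2/9 + (19.7/5)^2/4)
c(t = t.stat, df = dof, p = pt(t.stat, dof, lower.tail = FALSE))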

10:32 - Victoria Paukova

\[ \begin{align*} \overline{x} &= 4\,358 & n_1 &= 20 & s_1 &= 2\,218 \\ \overline{y} &= 5\,805 & n_2 &= 20 & s_2 &= 3\,990 \\ \end{align*} \\ H_0: \mu_1-\mu_2 = -1\,000 \qquad H_a : \mu_1-\mu_2>-1\,000 \\ t = \frac{4\,358-5\,805+1\,000}{\sqrt{245\,976.2+796\,005}} \approx -.4379 \\ \nu = 32 \qquad p=\Phi(-.4379) = 0.3322 \]

MVJ: How did you get the DoF here?

Even if we didn’t work with a specific significance level, this is a high enough p-value that it is safe to say we should not reject \(H_0\).
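
For reference, a quick sketch of what the Welch formula (from Exercise 87 above) gives for these summary statistics:

Code
se1.sq = 2218^2/20
se2.sq = 3990^2/20
(se1.sq + se2.sq)^2 / (se1.sq^2/19 + se2.sq^2/19)   # comes out near 29.7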

10:40 a. - Justin Ramirez

\(\overline{d}\) = the sample mean of the paired differences, computed using the following Python code:

Code
import pandas as pd
d = {'house': range(1,33+1),
     'indoor': [.07,.08,.09,.12,.12,.12,.13,.14,.15,.15,.17,.17,.18,.18,.18,.18,.19,.20,.22,.22,.23,.23,.25,.26,.28,.28,.29,.34,.39,.40,.45,.54,.62],
     'outdoor': [.29,.68,.47,.54,.97,.35,.49,.84,.86,.28,.32,.32,1.55,.66,.29,.21,1.02,1.59,.90,.52,.12,.54,.88,.49,1.24,.48,.27,.37,1.26,.70,.76,.99,.36]
}
df = pd.DataFrame(data=d)
df["di"] = df["indoor"]-df["outdoor"]
print(f"""
mean: {df["di"].mean()}\tsd: {df["di"].std()}
""")

mean: -0.42393939393939395  sd: 0.3867730228598179

We get \(\overline{d}=-0.42394\).

The sample standard deviation is: \(s_d=\sqrt{\frac{\sum(d_i-\overline{d})^2}{n-1}}=\sqrt{\frac{4.787}{32}}\approx 0.38677\)

The degrees of freedom are \(n-1=32\).

The level of significance is \(\alpha=0.05\).

Using the t table at 0.05 with 32 degrees of freedom: \(t_{\alpha/2,32}\approx 2.037\)

To get the CI: \[ -0.42394\pm2.037\cdot\left(\frac{0.38677}{\sqrt{33}}\right) \Rightarrow -0.42394\pm0.13714\Rightarrow(-0.561, -0.287) \]

10:40 b. - Justin Ramirez

The prediction interval for the 34th house would be: \[ \overline{d}\pm t_{0.025,32}\cdot s_d\sqrt{1+\frac{1}{n}} \\ \Rightarrow-0.42394\pm2.037\cdot0.38677\sqrt{1+\frac{1}{33}} \\ \Rightarrow-0.42394\pm0.799699 \\ \Rightarrow\boxed{(-1.224,0.376)} \]
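
Both intervals can be checked from the summary values in R (a small sketch using the rounded mean, standard deviation, and \(n=33\) from above):

Code
d.bar = -0.42394; s.d = 0.38677; n = 33
t.crit = qt(0.975, df = n - 1)
rbind(CI = d.bar + c(-1, 1) * t.crit * s.d / sqrt(n),
      PI = d.bar + c(-1, 1) * t.crit * s.d * sqrt(1 + 1/n))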

10:42 - Nicholas Basile

a.:

Code
df = tibble(
  isotopic=c(1509,1418,1561,1556,2169,1760,1098,1198,1479,1281,1414,1954,2174,2058),
  test=c(1498,1254,1336,1565,2000,1318,1410,1129,1342,1124,1468,1604,1722,1518)
) %>% mutate(difference=isotopic-test, infant=row_number())
ggplot(df, aes(sample=difference)) +
  geom_qq() + geom_qq_line()

It is plausible the differences are normally distributed.

b.: \[ \begin{align*} H_0&:\mu=0 & H_a&:\mu\neq 0 \\ \overline{x}&\approx167.21 & s&\approx228.21 \\ t &= \frac{\overline{x}-0}{s/\sqrt{n}}\approx2.74 & dof&:14-1=13 \\ \text{p-value}&:.016 < .05 \end{align*} \]

Rejects \(H_0\).
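
In R, the same result can be obtained directly with the paired t-test, continuing from the data frame defined in part a.; it should reproduce the \(t\), degrees of freedom, and p-value above:

Code
# Equivalent to a one-sample t-test on the differences
t.test(df$isotopic, df$test, paired=TRUE)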

c.: If the unpaired two-sample test is incorrectly used here, the true p-value is not obtained, and we may incorrectly fail to reject \(H_0\).

MVJ: In fact, for the two independent sample T-test, the outcome is \(t=1.4773\), \(dof=22.579\), \(p=0.1534\)