Rao’s Score Test; Two-sample Inference: Means

Mikael Vejdemo-Johansson

One more one-sample test…

Rao’s Score Test

Recall from MLE that we defined the score as \(U(\theta)=\frac{d\ell(\theta)}{d\theta}\) and the Fisher information as \(I(\theta)=-\EE\left[\frac{d^2}{d\theta^2}\ell(\theta)\right]\).

Theorem (Rao)

Asymptotically, as \(n\to\infty\), the score statistic approaches the distributions \[ S(\theta_0) = \frac{U(\theta_0)^2}{\mathcal{I}(\theta_0)} \overset\to\sim \chi^2(1) \qquad \text{or equivalently} \qquad \sqrt{S(\theta_0)} \overset\to\sim \mathcal{N}(0,1) \]

This gives rise to

The Rao Test / The Score Test / The Lagrange Multiplier Test

Null Hypothesis
\(H_0: \theta = \theta_0\)
Test Statistic
\(S = S(\theta_0) = \frac{U(\theta_0)^2}{\mathcal{I}_n(\theta_0)} = \frac{\ell'(\theta_0|x)^2}{-n\EE_\xi\left[\ell''(\theta_0|\xi)\right]}\)
or \(Z = \sqrt{S}\)
Rejection Region for Level \(\alpha\)
\(S > X_\alpha = CDF^{-1}_{\chi^2(1)}(1-\alpha)\)
\(Z > Z_\alpha = CDF^{-1}_{\mathcal{N}(0,1)}(1-\alpha)\)
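
For concreteness, here is a minimal sketch (not from the original notes) of turning an already-computed score statistic into a decision and p-value; the value of \(S\) below is made up purely for illustration:

Code
import scipy.stats

S = 4.2        # assumed value of U(theta_0)^2 / I_n(theta_0), for illustration only
alpha = 0.05

# chi^2(1) rejection threshold and p-value for the score test
threshold = scipy.stats.chi2(df=1).ppf(1 - alpha)
p_value = scipy.stats.chi2(df=1).sf(S)
print(f"reject H0: {S > threshold}, p-value: {p_value:.4f}")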

Rao’s Score Test for Small Deviations

Consider Neyman-Pearson: a most powerful test for \(H_0:\theta=\theta_0\) vs. \(H_a:\theta>\theta_0\) is one that rejects if \[ \frac{\mathcal{L}(\theta_0+h|x)}{\mathcal{L}(\theta_0 | x)} \geq K \quad\text{or equivalently}\quad \ell(\theta_0+h|x) - \ell(\theta_0|x) \geq \log K \]

For small \(h\), we can make a Taylor series approximation: \[ \ell(\theta_0+h|x) \approx \ell(\theta_0|x) + h\cdot\left(\frac{d}{d\theta}\ell(\theta)\right)_{\theta=\theta_0} \]

So we would get a most powerful test using \(U(\theta_0|x)\geq\frac{\log K}{h}\). If we knew the sampling distribution of \(U(\theta_0|x)\), we would be done here. For large samples, at least, we do know it approximately: \(U(\theta_0|x)\overset\approx\sim\mathcal{N}(0,\mathcal{I}_n(\theta_0))\).

Hence \(\frac{U(\theta_0|x)}{\sqrt{\mathcal{I}_n(\theta_0)}}\overset\approx\sim\mathcal{N}(0,1)\), and the test follows immediately.

Example: \(Binomial(n,p)\)

From \(\mathcal{L}(p|k) = {n\choose k}p^k(1-p)^{n-k}\) follows \[ \begin{align*} U(p) = \frac{d}{dp}\log\mathcal{L}(p|k) &= \frac{d}{dp}\left(\log{n\choose k} + k\log p + (n-k)\log(1-p)\right) \\ &= \frac{k}{p} - \frac{n-k}{1-p} = \frac{k-np}{p(1-p)}\\ \mathcal{I}(p) = -\EE_k\left[\frac{d^2}{dp^2}\log PDF(k|p)\right] &= -\EE_k\left[\frac{d}{dp}\left(U(p)\right)\right] = -\EE_k\left[\frac{k-n}{(1-p)^2} - \frac{k}{p^2}\right] \\ \text{because }\EE[k] = np\qquad\qquad &=-\frac{np-n}{(1-p)^2} + \frac{np}{p^2} = \frac{n}{1-p} + \frac{n}{p} \\ S = \frac{U^2}{\mathcal{I}} &= \left(\frac{k-np_0}{p_0(1-p_0)}\right)^2\cdot\frac{p_0(1-p_0)}{n} = \frac{(k-np_0)^2}{np_0(1-p_0)} \\ \sqrt{S} = \frac{k-np_0}{\sqrt{np_0(1-p_0)}} &= \frac{\frac{k}{n}-p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}} \end{align*} \]

…which we recognize from earlier proportions tests.
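
As a numerical sanity check (this sketch is not part of the original notes, and the values of \(n\), \(k\), and \(p_0\) are made up for illustration), we can verify that the score statistic coincides with the squared proportion z-statistic:

Code
import numpy as np
import scipy.stats

n, k, p0 = 100, 62, 0.5                  # assumed example values

U = (k - n * p0) / (p0 * (1 - p0))       # score at p0
I = n / (p0 * (1 - p0))                  # Fisher information at p0

S = U**2 / I                             # score statistic, approx. chi^2(1) under H0
z = (k / n - p0) / np.sqrt(p0 * (1 - p0) / n)   # proportion z-statistic

print(f"S = {S:.4f}, z^2 = {z**2:.4f}")  # these agree
print(f"p (chi^2): {scipy.stats.chi2(df=1).sf(S):.4f}")
print(f"p (two-sided z): {2 * scipy.stats.norm.sf(abs(z)):.4f}")   # identical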

Two-sample Testing

Two-sample Hypothesis Testing

So far most of our null hypotheses have been of the shape

Some parameter of the population from which we take a single sample takes a single specific value.

However, there is frequent need to compare not just one observation against our expectations, but to compare two observations against each other.

For possibly the primary source of why we care about comparing two samples, we turn to causality.

Causality

Causality is the (difficult) study of whether one thing causes another thing. It is notoriously difficult to establish, primarily because we cannot usually observe the exact same process once with and once without a potential cause present.

You have likely all heard at some point:

Correlation is not causation.

Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.

There are (philosophically) four possible causes for correlation:

  1. Direct causation: If I drop a pen, it falls down (i.e. the Earth has gravity). (in reality \(A\Rightarrow B\))
  2. Reverse causation: The faster a windmill rotates, the stronger the wind. So windmills cause wind. (in reality \(B\Rightarrow A\))
  3. Common-causal variable: There is some other, lurking variable that influences both observed variables. As ice cream sales increase, the rate of drowning accidents also increases. Therefore ice cream causes drowning. (in reality \(C\Rightarrow A\) and \(C\Rightarrow B\) - when it is nice, warm and sunny out, more people swim, and more people want ice cream)
  4. The relationship is entirely coincidental. Russian leaders have alternated between bald and hairy for nearly 200 years now.

Bradford Hill’s Criteria

Sir Austin Bradford Hill proposed 9 criteria to provide epidemiological evidence for a causal relationship between a potential cause and an effect:

  1. Strength. A large effect size increases the likelihood of causality.
  2. Consistency. An effect that can be independently reproduced increases likelihood of causality.
  3. Specificity. Causation is likely if there is a very specific population at a specific site and no other likely explanation.
  4. Temporality. The effect has to occur after the cause (and after any expected delay between the cause and the effect)
  5. Biological Gradient. Greater exposure should generally lead to greater incidence of the effect.
  6. Plausibility. It helps if there is a plausible mechanism for the cause to influence the effect.
  7. Coherence. If field findings and laboratory findings agree, it increases the likelihood.
  8. Experiment. The closest we come to counter-factual examinations (rewinding history and trying again) is in carefully and completely controlled experiments.
  9. Analogy. If similar causes have been established to produce similar effects, causality is more plausible.
  10. Reversibility. (bonus criterion) If the cause is removed, the effect should also disappear.

Mathematically Modeled Causation

We will work with counterfactual hypotheses - about potential outcomes if some treatment was taken or not.

Consider a set of experimental units \(I\). For any \(i\in I\), we define:

  * \(Y_{0i}\), the potential outcome for unit \(i\) without the treatment.
  * \(Y_{1i}\), the potential outcome for unit \(i\) with the treatment.
  * \(T_i\), a treatment indicator, so that \(T_i=1\) if unit \(i\) received treatment, and \(T_i=0\) if the unit did not.

We can define the individual treatment effect \(Y_{1i}-Y_{0i}\), but this cannot actually be measured. Instead we may focus on the average treatment effect \(ATE = \EE[Y_1-Y_0]\). Even easier to work with is the average treatment effect on the treated, \(ATT=\EE[Y_1-Y_0|T=1]\).

We still, however, only have quantities we are not able to actually measure - since we cannot simultaneously measure \(Y_{0i}\) and \(Y_{1i}\) for any one unit.

\[ \begin{align*} \EE[Y|T=1] - \EE[Y|T=0] &= \EE[Y_1|T=1] - \EE[Y_0|T=0] \\ &= \underbrace{\EE[Y_1|T=1] \color{SteelBlue}{-\EE[Y_0|T=1]}}_{ATT} + \underbrace{\color{SteelBlue}{\EE[Y_0|T=1]} - \EE[Y_0|T=0]}_{\text{Bias}} \end{align*} \]

ATT measures the effect of the treatment on the treated, and Bias measures the extent to which the treated and control groups differ without the treatment.

This means, if we could eliminate the Bias term, then we would be measuring precisely the impact of the treatment in general. And eliminating Bias means both groups are comparable before the treatment.

And if the two groups are comparable before the treatment, they are still exchangeable after the treatment. So not only does the association measure the ATT, it also measures the ATE.

Removing Bias: Randomized Experiments

Randomization of the treatment eliminates bias by making the potential outcomes independent of the treatment. In other words, knowing who got assigned which treatment should give us no information about what the outcomes would have looked like without the treatment.

Put differently: on average, the groups are indistinguishable.

Independence implies precisely that \(\EE[Y_0|T=0] = \EE[Y_0] = \EE[Y_0|T=1]\).
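
A small simulation sketch (not from the original notes; the potential-outcome model below is made up for illustration) of how randomization removes the bias term, while a confounded assignment does not:

Code
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Hypothetical potential outcomes: a lurking "health" variable drives both
# the outcome and (in the confounded scenario) who gets treated.
health = rng.normal(0, 1, N)
Y0 = 10 + 3 * health + rng.normal(0, 1, N)
Y1 = Y0 + 2                                               # true ATE = 2

T_conf = rng.random(N) < 1 / (1 + np.exp(-2 * health))    # healthier -> more likely treated
T_rand = rng.random(N) < 0.5                              # randomized assignment

Y_conf = np.where(T_conf, Y1, Y0)
Y_rand = np.where(T_rand, Y1, Y0)

print("confounded difference of means:", Y_conf[T_conf].mean() - Y_conf[~T_conf].mean())  # far from 2
print("randomized difference of means:", Y_rand[T_rand].mean() - Y_rand[~T_rand].mean())  # close to 2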

Gold Standard for Causality

The Randomized Controlled Trial (RCT) is the best way to eliminate Bias from group comparisons; as such, a carefully designed RCT is our best route to causal inference.

In an RCT, we would split our experimental units \(I\) into two groups: \(I_1\) and \(I_0\) (or \(I_T\) and \(I_C\)), apply treatment to the units in \(I_1\), and then compare the expected values between the two groups.

In other words, the underlying null hypothesis for an RCT is \(\boxed{H_0: \mu_{X} = \mu_{Y}}\).

Hypothesis Testing: Difference of population means

We know how this goes by now… We know a good estimator for \(\mu_X - \mu_Y\), namely \(\overline{X}-\overline{Y}\).

If \(X = X_1,\dots,X_{n_X} \sim \mathcal{N}(\mu_X, \sigma^2_X)\) and \(Y = Y_1,\dots,Y_{n_Y}\sim \mathcal{N}(\mu_Y, \sigma^2_Y)\) are independent iid samples, then by the rules for computing expected values and variances of sums of random variables, and by the theorem that a sum of independent normal variables is again normal, it follows:

\[ \begin{align*} \EE[\overline{X}-\overline{Y}] &= \EE\left[\frac{\sum X_i}{n_X} - \frac{\sum Y_i}{n_Y}\right] & \VV[\overline{X}-\overline{Y}] &= \VV\left[\frac{\sum X_i}{n_X} - \frac{\sum Y_i}{n_Y}\right] \\ &= \frac{\sum \EE[X_i]}{n_X} - \frac{\sum \EE[Y_i]}{n_Y} & &= \frac{1}{n_X^2}\sum\VV[X_i] + \frac{1}{n_Y^2}\sum\VV[Y_i] \\ &= \frac{n_X\mu_X}{n_X} - \frac{n_Y\mu_Y}{n_Y} = \mu_X-\mu_Y& &= \frac{n_X\sigma_X^2}{n_X^2} + \frac{n_Y\sigma_Y^2}{n_Y^2} = \frac{\sigma_X^2}{n_X} + \frac{\sigma_Y^2}{n_Y} \end{align*} \]

We are ignoring all the \(Cov(X_i,X_j)\) and \(Cov(Y_i,Y_j)\) terms in the variance computation because each sample was assumed to be iid, and we are ignoring all the \(Cov(X_i,Y_j)\) terms in the variance because the two samples were assumed to be independent.

Hypothesis Testing: Difference of population means

So \(\overline{X}-\overline{Y}\sim\mathcal{N}\left(\mu_X-\mu_Y, \frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)\) and we get

Two-sample test for population means with known variances

Null Hypothesis
\(H_0: \mu_X - \mu_Y = \Delta_0\) (usually with \(\Delta_0=0\), but we can test for any fixed difference we are interested in)
Test Statistic
\(Z = \frac{(\overline{X}-\overline{Y})-\Delta_0}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}\)
| Alternative Hypothesis | Rejection Region for Level \(\alpha\) | Power at \(\Delta\) | Sample size needed for power \(1-\beta\) at \(\Delta\) with balanced groups |
|-|-|-|-|
| Upper Tail | \(z \geq z_\alpha\) | \(1-\Phi\left(z_\alpha-\frac{\Delta-\Delta_0}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}\right)\) | \(n_X=n_Y=\frac{(\sigma_X^2+\sigma_Y^2)(z_\alpha+z_\beta)^2}{(\Delta-\Delta_0)^2}\) |
| Two-tailed | \(\lvert z\rvert \geq z_{\alpha/2}\) | \(1-\left[\Phi\left(z_{\alpha/2}-\frac{\Delta-\Delta_0}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}\right)-\Phi\left(-z_{\alpha/2}-\frac{\Delta-\Delta_0}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}\right)\right]\) | \(n_X=n_Y=\frac{(\sigma_X^2+\sigma_Y^2)(z_{\alpha/2}+z_\beta)^2}{(\Delta-\Delta_0)^2}\) |
| Lower Tail | \(z \leq -z_\alpha\) | \(\Phi\left(-z_\alpha-\frac{\Delta-\Delta_0}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}\right)\) | \(n_X=n_Y=\frac{(\sigma_X^2+\sigma_Y^2)(z_\alpha+z_\beta)^2}{(\Delta-\Delta_0)^2}\) |

If \(n_X>40\) and \(n_Y>40\), we can get away with estimating \(\sigma_X^2\approx S_X^2\) and \(\sigma_Y^2\approx S_Y^2\).
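
A short sketch of carrying out this test with estimated variances (the samples below are simulated, purely for illustration):

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
X = rng.normal(10.0, 2.0, size=60)     # assumed example samples, both larger than 40
Y = rng.normal(9.7, 2.5, size=75)

delta0 = 0.0
se = np.sqrt(X.var(ddof=1) / len(X) + Y.var(ddof=1) / len(Y))
z = (X.mean() - Y.mean() - delta0) / se
p = 2 * scipy.stats.norm.sf(abs(z))    # two-tailed p-value
print(f"z = {z:.4f}, p = {p:.4f}")

# Balanced sample size for power 0.9 at Delta = 1.0 with alpha = 0.05, two-tailed,
# using the formula from the table (variances estimated from the samples).
z_a, z_b = scipy.stats.norm.ppf(0.975), scipy.stats.norm.ppf(0.9)
n_per_group = (X.var(ddof=1) + Y.var(ddof=1)) * (z_a + z_b)**2 / (1.0 - delta0)**2
print(f"per-group sample size: {int(np.ceil(n_per_group))}")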

Estimation: Difference of population means

Knowing a pivot quantity allows us to derive a confidence interval - and \(\frac{(\overline{X}-\overline{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}\) is exactly the pivot we need.

Solve the simultaneous inequalities: \[ \PP\left(-z_{\alpha/2} \leq \frac{(\overline{X}-\overline{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}} \leq z_{\alpha/2}\right) \\ \PP\left( -(\overline{X}-\overline{Y}) -z_{\alpha/2}\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}} \leq -(\mu_X-\mu_Y) \leq -(\overline{X}-\overline{Y}) +z_{\alpha/2}\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}} \right) \\ \PP\left( (\overline{X}-\overline{Y}) +z_{\alpha/2}\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}} \geq \mu_X-\mu_Y \geq (\overline{X}-\overline{Y}) -z_{\alpha/2}\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}} \right) \]

Again, the Central Limit Theorem allows us to replace \(\sigma_X^2\approx s_X^2\) and \(\sigma_Y^2\approx s_Y^2\).
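
The corresponding large-sample confidence interval, as a self-contained sketch (again with simulated samples, just for illustration):

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(2)
X = rng.normal(10.0, 2.0, size=60)     # assumed example samples
Y = rng.normal(9.7, 2.5, size=75)

alpha = 0.05
z_half = scipy.stats.norm.ppf(1 - alpha / 2)
se = np.sqrt(X.var(ddof=1) / len(X) + Y.var(ddof=1) / len(Y))
diff = X.mean() - Y.mean()

# 95% confidence interval for mu_X - mu_Y
print(f"({diff - z_half * se:.4f}, {diff + z_half * se:.4f})")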

Two-Sample T-test: small normal samples

As usual, ensure that both your samples look normal with probability plots or QQ-plots before relying on this.

Theorem

With \(X_1,\dots,X_{n_X}\sim\mathcal{N}(\mu_X,\sigma_X^2)\) and \(Y_1,\dots,Y_{n_Y}\sim\mathcal{N}(\mu_Y,\sigma_Y^2)\), the standardized variable \[ T = \frac{(\overline{X}-\overline{Y})-(\mu_X-\mu_Y)}{\sqrt{se_X^2+se_Y^2}} \sim T(\nu) \quad\text{where}\quad se_X = \frac{s_X}{\sqrt{n_X}}, se_Y=\frac{s_Y}{\sqrt{n_Y}} \quad\text{and}\quad \nu \approx \frac{(se_X^2+se_Y^2)^2}{\frac{se_X^4}{n_X-1}+\frac{se_Y^4}{n_Y-1}} \]

To see this, start by dividing numerator and denominator of \(T\) by the standard deviation of the numerator (…just like for the one-sample case…) to get

\[ \frac{\frac{(\overline{X}-\overline{Y})-(\mu_X-\mu_Y)}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}}}}{\frac{\sqrt{se_X^2+se_Y^2}}{\sqrt{\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}} }} \]

The numerator is a standard normal, but the denominator is not, in general, the square root of a \(\chi^2\)-variable divided by its degrees of freedom (which is what we would need for a \(T\)-distribution).

But suppose there were some \(W\sim\chi^2(\nu)\) such that \[ \frac{S_X^2}{n_X}+\frac{S_Y^2}{n_Y} = \left(\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\right)\frac{W}{\nu} \]

We could try to determine a good value for \(\nu\) by matching expected values and variances.

Two-Sample T-test: small normal samples

We want to find \(W\sim\chi^2(\nu)\) such that \[ \frac{S_X^2}{n_X}+\frac{S_Y^2}{n_Y} = \left(\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\right)\frac{W}{\nu} \]

We know many of the component variables and their behaviors here: \[ \begin{align*} \EE[W] &= \nu & \VV[W] &= 2\nu & \frac{(n_X-1)S_X^2}{\sigma_X^2}&\sim\chi^2(n_X-1) & \frac{(n_Y-1)S_Y^2}{\sigma_Y^2}&\sim\chi^2(n_Y-1) \\ \EE\left[\frac{W}{\nu}\right] &= 1 & && \EE\left[\frac{S_X^2}{n_X}+\frac{S_Y^2}{n_Y}\right] &= \frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y} \end{align*} \]

So expected values agree between the two sides. (bear in mind that \(\sigma_X^2\) and \(\sigma_Y^2\) are deterministic, not random)

Two-Sample T-test: small normal samples

\[ 2(n_X-1) = \VV\left[\frac{(n_X-1)S_X^2}{\sigma_X^2}\right] = \frac{(n_X-1)^2}{\sigma_X^4}\VV[S_X^2] \\ \VV[S_X^2] = \frac{2\sigma_X^4}{n_X-1} \quad\text{and similar for }\VV[S_Y^2] \\ \VV\left[\frac{S_X^2}{n_X}+\frac{S_Y^2}{n_Y}\right] = \frac{1}{n_X^2}\VV[S_X^2] + \frac{1}{n_Y^2}\VV[S_Y^2] = \frac{2\sigma_X^4}{(n_X-1)n_X^2} + \frac{2\sigma_Y^4}{(n_Y-1)n_Y^2} \]

And for the other side of the equation, \[ \VV\left[\left(\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)\frac{W}{\nu}\right] = \left(\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)^2\frac{1}{\nu^2}\VV[W] = \left(\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)^2\frac{2\nu}{\nu^2} \]

Equating the two variances and solving for \(\nu\) gives \[ \nu = \frac{\left(\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}\right)^2}{\frac{\sigma_X^4}{(n_X-1)n_X^2}+\frac{\sigma_Y^4}{(n_Y-1)n_Y^2}} \] Plugging in \(S_X^2\) and \(S_Y^2\) for the unknown \(\sigma_X^2\) and \(\sigma_Y^2\) turns each \(\sigma^2/n\) into \(se^2\), which yields exactly the approximation for the degrees of freedom \(\nu\) stated in the theorem.

The resulting hypothesis test and confidence interval constructions are known as Welch’s Two Sample T-Test after the statistician who derived the approximation for \(\nu\).
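
A sketch checking the formula for \(\nu\) against scipy's built-in Welch test (the samples are simulated for illustration; `scipy.stats.ttest_ind` with `equal_var=False` performs Welch's test):

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(3)
X = rng.normal(5.0, 1.0, size=12)      # assumed small example samples
Y = rng.normal(4.2, 2.5, size=15)

se2_X = X.var(ddof=1) / len(X)
se2_Y = Y.var(ddof=1) / len(Y)

# Welch-Satterthwaite degrees of freedom, as derived above
nu = (se2_X + se2_Y)**2 / (se2_X**2 / (len(X) - 1) + se2_Y**2 / (len(Y) - 1))
T = (X.mean() - Y.mean()) / np.sqrt(se2_X + se2_Y)
p = 2 * scipy.stats.t(nu).sf(abs(T))

res = scipy.stats.ttest_ind(X, Y, equal_var=False)   # scipy's Welch test
print(f"by hand: T = {T:.4f}, nu = {nu:.2f}, p = {p:.4f}")
print(f"scipy:   T = {res.statistic:.4f}, p = {res.pvalue:.4f}")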

Pooled Variance Estimator

We might have reason to believe that the two populations have the same variance. If so, both samples contribute to estimating the same parameter - so by rights, we should be able to get a better estimate of that parameter by combining them.

A first approach could be to take the average of \(S_X^2\) and \(S_Y^2\) as an estimator: if \(\hat\theta_1\) and \(\hat\theta_2\) are independent unbiased estimators of \(\theta\), their average is also unbiased for \(\theta\), with variance: \[ \VV\left[\frac{\hat\theta_1+\hat\theta_2}{2}\right] = \frac{1}{4}\VV[\hat\theta_1] + \frac{1}{4}\VV[\hat\theta_2] \]

So the variance is half of the average of the variances.

This is fine for equal sample sizes - but if the sample sizes are unequal, the two estimators contain different amounts of information. Instead we'd want a weighted average.

Remember that \(\frac{(n-1)S^2}{\sigma^2}\) is \(\chi^2\)-distributed. But this is just our way of rewriting the claim that \(\frac{\sum(X_i-\overline{X})^2}{\sigma^2}\) is \(\chi^2\)-distributed.

Pooled Variance Estimator

Suppose we pick some \(S_p^2 = wS_X^2 + (1-w)S_Y^2\). We would like to pick \(w\) to minimize the variance of this (unbiased) estimator.

\[ \begin{align*} \VV[S_p^2] &= \VV[wS_X^2 + (1-w)S_Y^2] & \frac{d}{dw}\VV[S_p^2] = 2w\VV[S_X^2] - 2(1-w)\VV[S_Y^2] &= 0 \\ &= w^2\VV[S_X^2] + (1-w)^2\VV[S_Y^2] & \text{precisely when }\qquad \frac{w}{1-w} &= \frac{\VV[S_Y^2]}{\VV[S_X^2]} \end{align*} \]

The variance \(\VV[S^2]\) can be shown to take the value \(\frac{\mu_4}{n}-\frac{n-3}{n(n-1)}\sigma^4\) (where \(\mu_4\) is the 4th central moment).

For most distributions, this makes the weighting very complicated. However, for the normal distribution, \(\mu_4 = 3\sigma^4\), and thus

\[ \frac{\mu_4}{n}-\frac{n-1}{n(n-1)}\sigma^4 = \frac{3\sigma^4}{n}-\frac{n-3}{n(n-1)}\sigma^4 = \frac{2}{n-1}\sigma^4\\ \text{and therefore, since we assume equal variances}\\ \frac{w}{1-w} = \frac{\VV[S_Y^2]}{\VV[S_X^2]} = \frac{2\sigma^4/(n_X-1)}{2\sigma^4/(n_Y-1)} = \frac{n_X-1}{n_Y-1} \]

so we choose some \(w\) such that

\[ \begin{align*} w(n_Y-1) &= (1-w)(n_X-1) \\ w((n_Y-1) + (n_X-1)) &= n_X-1 \\ w &= \frac{n_X-1}{n_X+n_Y-2} \end{align*} \]

Pooled Variance Estimator

To sum up all this work, we can make a better estimator for \(\sigma^2\) by taking the weighted average, or pooled estimator

\[ S_P^2 = \frac{(n_X-1)S_X^2 + (n_Y-1)S_Y^2}{n_X+n_Y-2} \]

Note that \[ (n_X+n_Y-2)\frac{S_P^2}{\sigma^2} = \frac{(n_X-1)S_X^2}{\sigma^2} + \frac{(n_Y-1)S_Y^2}{\sigma^2} \]

is a sum of two independent variables distributed as \(\chi^2(n_X-1)\) and \(\chi^2(n_Y-1)\) respectively. It follows that the sum is distributed as a \(\chi^2(n_X-1+n_Y-1)=\chi^2(n_X+n_Y-2)\) variable.

Deriving a T-distributed test statistic and confidence interval with this variance estimator goes exactly as for every other time we have done it.
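
A sketch comparing this pooled construction with scipy's default two-sample test (samples simulated for illustration; `scipy.stats.ttest_ind` with `equal_var=True`, the default, is the pooled test):

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(4)
X = rng.normal(5.0, 1.5, size=10)      # assumed example samples with a common variance
Y = rng.normal(4.0, 1.5, size=14)

n_X, n_Y = len(X), len(Y)
Sp2 = ((n_X - 1) * X.var(ddof=1) + (n_Y - 1) * Y.var(ddof=1)) / (n_X + n_Y - 2)

T = (X.mean() - Y.mean()) / np.sqrt(Sp2 * (1 / n_X + 1 / n_Y))
p = 2 * scipy.stats.t(n_X + n_Y - 2).sf(abs(T))

res = scipy.stats.ttest_ind(X, Y, equal_var=True)    # scipy's pooled test
print(f"by hand: T = {T:.4f}, p = {p:.4f}")
print(f"scipy:   T = {res.statistic:.4f}, p = {res.pvalue:.4f}")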

Caveat!

Since the derivation crucially involved the behavior of the 4th central moment of a normal distribution, this approach is more sensitive to the normality assumption than most of the other procedures we have seen. And since the weighting works with ratios of the form \(\sigma_X^4/\sigma_Y^4\), even small deviations from equality of the two variances get amplified considerably.

This approach used to be widely recommended, in large part because the degrees of freedom are so much easier to compute, and because when the assumptions hold we do get sharper estimates and more precise statements (both in terms of p-values and in terms of confidence intervals).

However, it is easy to be led astray by small deviations from the assumptions, and the Welch T-test works regardless of whether the variances do or do not match. We could imagine first testing for equal variances, but the F-test that can do that is even more sensitive to data being normally distributed.

The safe bet is to only use pooled variances when you have very compelling evidence of equal variances.

Chapter 10, Exercises 33, 34

33.a

Use the pooled \(T\) pivot variable to obtain a pooled \(T\) confidence interval:

\[ \PP\left( -t_{\alpha/2, m+n-2} \leq \frac{(\overline{X}-\overline{Y})-(\mu_1-\mu_2)}{S_p\sqrt{\frac{1}{m}+\frac{1}{n}}} \leq t_{\alpha/2, m+n-2} \right) = 1-\alpha \\ -t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \leq (\overline{X}-\overline{Y})-(\mu_1-\mu_2) \leq t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \\ -(\overline{X}-\overline{Y}) - t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \leq -(\mu_1-\mu_2) \leq -(\overline{X}-\overline{Y}) + t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \\ (\overline{X}-\overline{Y}) + t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \geq (\mu_1-\mu_2) \geq (\overline{X}-\overline{Y}) - t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \]

So we get the pooled \(T\) confidence interval

\[ \mu_1-\mu_2 \in (\overline{X}-\overline{Y}) \pm t_{\alpha/2, m+n-2}S_p\sqrt{\frac{1}{m}+\frac{1}{n}} \]

Chapter 10, Exercises 33, 34

33.b

Samples of maximum output of moisture from ultrasonic humidifiers of two particular brands were taken, yielding values:

Code
import numpy as np

moisture_1 = np.array([14.0, 14.3, 12.2, 15.1])
moisture_2 = np.array([12.1, 13.6, 11.9, 11.2])

We compute:

Code
import scipy.stats

X_Y = moisture_1.mean()-moisture_2.mean()
df = len(moisture_1)+len(moisture_2)-2
# Pool the *sample* variances (ddof=1), then take the square root to get S_p.
Sp = np.sqrt(((len(moisture_1)-1)*moisture_1.var(ddof=1) + (len(moisture_2)-1)*moisture_2.var(ddof=1))/df)
tadf = scipy.stats.t(df).ppf(0.975)
margin = tadf * Sp * np.sqrt(1/len(moisture_1)+1/len(moisture_2))

print(f"""
$\overline{{X}}-\overline{{Y}}$ | $S_p$ | $t_{{0.025,{df}}}$ | Margin of Error | CI
-|-|-|-|-
{X_Y:.4f} | {Sp:.4f} | {tadf:.4f} | {margin:.4f} | ({X_Y-margin:.4f}, {X_Y+margin:.4f})
""")
| \(\overline{X}-\overline{Y}\) | \(S_p\) | \(t_{0.025,6}\) | Margin of Error | CI |
|-|-|-|-|-|
| 1.7000 | 1.1225 | 2.4469 | 1.9422 | (-0.2422, 3.6422) |

33.c

Using Welch’s T Confidence Interval instead, we compute:

Code
# Squared standard errors use the sample variances (ddof=1).
sesq_1 = moisture_1.var(ddof=1)/len(moisture_1)
sesq_2 = moisture_2.var(ddof=1)/len(moisture_2)
df_W = (sesq_1 + sesq_2)**2/((sesq_1**2)/(len(moisture_1)-1)+(sesq_2**2)/(len(moisture_2)-1))
tadf_W = scipy.stats.t(df_W).ppf(0.975)
margin_W = tadf_W*np.sqrt(sesq_1+sesq_2)

print(f"""
$\overline{{X}}-\overline{{Y}}$ | $se_1^2$ | $se_2^2$ | $t_{{0.025,{df_W:.4f}}}$ | Margin of Error | CI
-|-|-|-|-|-
{X_Y:.4f} | {sesq_1:.4f} | {sesq_2:.4f} | {tadf_W:.4f} | {margin_W:.4f} | ({X_Y-margin_W:.4f}, {X_Y+margin_W:.4f})
""")
| \(\overline{X}-\overline{Y}\) | \(se_1^2\) | \(se_2^2\) | \(t_{0.025,5.7899}\) | Margin of Error | CI |
|-|-|-|-|-|-|
| 1.7000 | 0.3750 | 0.2550 | 2.4686 | 1.9594 | (-0.2594, 3.6594) |

We can see that the margin of error is slightly larger for the Welch method, yielding a slightly wider confidence interval; in this case both intervals contain \(0\).

Chapter 10, Exercises 33, 34

34

Pooled T-Test for Difference of Means

Null Hypothesis
\(H_0: \mu_1-\mu_2 = \Delta_0\), often with \(\Delta_0=0\)
Test Statistic
\(T = \frac{(\overline{X}-\overline{Y})-\Delta_0}{S_p\sqrt{\frac1m+\frac1n}}\)
| Alternative Hypothesis | Rejection Region for Level \(\alpha\) | P-value |
|-|-|-|
| Upper Tail | \(T>t_{\alpha,m+n-2}\) | \(1-CDF_{T(m+n-2)}(T)\) |
| Two-tailed | \(\lvert T\rvert>t_{\alpha/2,m+n-2}\) | \(2\min\{1-CDF_{T(m+n-2)}(T), CDF_{T(m+n-2)}(T)\}\) |
| Lower Tail | \(T<-t_{\alpha,m+n-2}\) | \(CDF_{T(m+n-2)}(T)\) |

For our data on moisture output, we compute:

Code
sep = Sp*np.sqrt(1/len(moisture_1)+1/len(moisture_2))
T = X_Y/sep                          # Delta_0 = 0
p = 2*scipy.stats.t(df).sf(abs(T))   # two-tailed p-value

print(f"""
$\overline{{X}}-\overline{{Y}}$ | $\Delta_0$ | $S_p$ | $S_p\sqrt{{\\frac1m+\\frac1n}}$ | $T$ | df | $p$
-|-|-|-|-|-|-
{X_Y:.4f} | 0 | {Sp:.4f} | {sep:.4f} | {T:.4f} | {df} | {p:.4f}
""")
| \(\overline{X}-\overline{Y}\) | \(\Delta_0\) | \(S_p\) | \(S_p\sqrt{\frac1m+\frac1n}\) | \(T\) | df | \(p\) |
|-|-|-|-|-|-|-|
| 1.7000 | 0 | 1.1225 | 0.7937 | 2.1418 | 6 | 0.0760 |

With \(p\approx 0.076\), a two-tailed test cannot quite conclude at a 5% significance level that there is a difference in mean moisture output between the two brands.