8: Two-sample \(t\) tests

The benefits of comparative experiments are many. In Nature we read that

Good experimental designs limit the impact of variability and reduce sample-size requirements.

The sample-size requirements discussion relates to power and effect sizes, a topic for a different day.

This project is about experimental designs for observing the response of experimental units to a treatment, as compared to no treatment (such as a placebo), or for comparing the responses of experimental units to two different treatments.

In either case, we have sample data consisting of two samples, \(x_1, x_2, \dots, x_n\) and \(y_1, y_2, \dots, y_m\).

We assume these samples come from normal populations, an assumption necessary for identifying the sampling distribution of our test statistics. We do not assume \(n\) or \(m\) are large.

For this project, the null and alternative hypotheses always compare \(\mu_1\) to \(\mu_2\), the population means for the \(x\)s and the \(y\)s. The null will always be \(H_0: \mu_1 = \mu_2\), though in the real world a shift may be added. The alternative is one of the one-sided alternatives \(H_a: \mu_1 < \mu_2\) or \(H_a:\mu_1 > \mu_2\), or the two-sided alternative \(H_a: \mu_1 \neq \mu_2\).

For this project, the test statistic will always be of the form:

\[ T = \frac{(\bar{x} - \bar{y}) -(\mu_1 - \mu_2)}{SE}. \]

However, the SE will depend on assumptions.

We have discussed three different ways to compute the standard error, SE:

In a study on diet and weight loss, it would be intuitive to compare the amount of weight each person lost, as opposed to comparing just the initial weights with just the final weights. The latter would be problematic, as the effect of how much weight was lost would be swamped by the variability between the subjects' weights. By comparing differences, that between-subject variability is controlled.

The general paired-sample design is similar: the members of the two groups are paired off, so the samples must have equal sizes.

In this case, the differences, \(x_i - y_i\) form a single sample, and the distribution of \(T\) will be the \(t\) distribution with \(n-1\) degrees of freedom.
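As a quick check of this, the paired statistic can be computed by hand from the differences. The sketch below uses simulated, hypothetical data (not data from this project's examples):

```r
# Sketch: paired t statistic by hand, using simulated data
set.seed(1)
n <- 10
x <- rnorm(n, mean = 5)            # hypothetical "before" measurements
y <- x + rnorm(n, sd = 0.5)        # hypothetical "after" measurements, paired with x
d <- x - y                         # the differences form a single sample
T_obs <- mean(d) / (sd(d) / sqrt(n))
# the built-in paired test computes the same statistic, with n - 1 = 9 df:
t.test(x, y, paired = TRUE)$statistic
```

The by-hand value T_obs matches the statistic reported by t.test with paired=TRUE.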

Here we have unpaired experimental units. The sample sizes need not be the same. The variance of \(\bar{x} - \bar{y}\) is \(\sigma_1^2/n_1 + \sigma_2^2/n_2\). To find the standard error we use the sample standard deviations \(s_1\) and \(s_2\) to estimate \(\sigma_1\) and \(\sigma_2\), to get:

\[ SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}. \]

With this the test statistic will have the \(t\) distribution, as well. The degrees of freedom are conservatively the smaller of \(n_1-1\) and \(n_2-1\), though we will see that R uses an approximate value that is less conservative.
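The unpooled SE and the conservative by-hand degrees of freedom can be sketched as follows, again with simulated, hypothetical data:

```r
# Sketch: unequal-variance (Welch) SE by hand, using simulated data
set.seed(2)
x <- rnorm(12, mean = 10, sd = 2)   # hypothetical sample 1, n1 = 12
y <- rnorm(9,  mean = 11, sd = 3)   # hypothetical sample 2, n2 = 9
n1 <- length(x); n2 <- length(y)
SE <- sqrt(var(x)/n1 + var(y)/n2)
T_obs <- (mean(x) - mean(y)) / SE
df_conservative <- min(n1, n2) - 1  # the by-hand rule: smaller of n1-1 and n2-1
# R's approximate (less conservative) degrees of freedom:
t.test(x, y)$parameter
```

R's approximate value is always at least as large as the conservative choice.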

When the variances are assumed to be equal in the populations, then the samples can be combined to give a better estimate for \(\sigma = \sigma_1 = \sigma_2\). The pooled sample variance is

\[ s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2};\quad\text{ and } SE = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}. \]

As pooling provides a generally better estimate for the variability, the degrees of freedom increase, in this case to \(n_1 + n_2 - 2\).
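A sketch of the pooled computation, once more with simulated, hypothetical data:

```r
# Sketch: pooled SE and degrees of freedom by hand, using simulated data
set.seed(3)
x <- rnorm(12, mean = 10, sd = 2)   # hypothetical sample 1
y <- rnorm(9,  mean = 11, sd = 2)   # hypothetical sample 2, same population SD
n1 <- length(x); n2 <- length(y)
sp2 <- ((n1 - 1)*var(x) + (n2 - 1)*var(y)) / (n1 + n2 - 2)  # pooled variance
SE <- sqrt(sp2) * sqrt(1/n1 + 1/n2)
T_obs <- (mean(x) - mean(y)) / SE
# same statistic as the built-in test, with n1 + n2 - 2 = 19 df:
t.test(x, y, var.equal = TRUE)$statistic
```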

Finally, if you have the data stored in two variables, all three tests are computed with one R function, t.test, through two arguments: paired=TRUE or var.equal=TRUE. If neither is given, there is no assumption of pairing or equal variances (just independent samples, the Welch test).

We do three examples to illustrate:

The shoes data set in the MASS package gives “A list of two vectors, giving the wear of shoes of materials A and B for one foot each of ten boys.” (This is a classic example of paired data.) To run a significance test of equal population means, against a two-sided alternative (the default), we have:

library(MASS)   # loads the MASS package
t.test(shoes$A, shoes$B, paired=TRUE)
## 
##  Paired t-test
## 
## data:  shoes$A and shoes$B
## t = -3.3489, df = 9, p-value = 0.008539
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6869539 -0.1330461
## sample estimates:
## mean of the differences 
##                   -0.41

Hunting through the output yields p-value = 0.008539, a small \(p\)-value suggesting the difference in population means is statistically significant at the \(\alpha=0.05\) significance level with a two-sided test. You will also find t = -3.3489, the observed value of the \(T\) statistic, and df = 9 for \(9 = 10 - 1\) degrees of freedom.

What if paired=TRUE was not specified? This would assume the two samples were independent. We will see that the variability between the subjects hides the differences between the shoe materials:

t.test(shoes$A, shoes$B)
## 
##  Welch Two Sample t-test
## 
## data:  shoes$A and shoes$B
## t = -0.36891, df = 17.987, p-value = 0.7165
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.745046  1.925046
## sample estimates:
## mean of x mean of y 
##     10.63     11.04

Looking at the output we see

t = -0.36891, df = 17.987, p-value = 0.7165

This shows a \(p\)-value bigger than \(\alpha\). The observed value of \(T\) is different, as a different SE is used. The df = 17.987 uses an approximation. Were this done by hand, you would use the smaller of \(10-1\) and \(10-1\), or just \(9\).

Finally, what if we assumed the samples were independent and the population standard deviations equal? Then t.test(shoes$A, shoes$B, var.equal=TRUE) would report:

t = -0.36891, df = 18, p-value = 0.7165

This is basically the same! Why? This isn't guaranteed in general, but here the sample sizes are balanced, and with \(n_1 = n_2\) the two SE formulas agree exactly, so the observed value of \(T\) is unchanged. The sample standard deviations being close (2.45 and 2.51…) also keeps the approximate degrees of freedom (17.987) very near the pooled value (18).
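This can be checked directly on the shoes data: with equal sample sizes, the unpooled and pooled SE formulas reduce to the same number.

```r
# Sketch: comparing the two SE computations for the shoes data
library(MASS)                                  # for the shoes data set
n1 <- length(shoes$A); n2 <- length(shoes$B)   # both are 10
SE_welch <- sqrt(var(shoes$A)/n1 + var(shoes$B)/n2)
sp2 <- ((n1 - 1)*var(shoes$A) + (n2 - 1)*var(shoes$B)) / (n1 + n2 - 2)
SE_pooled <- sqrt(sp2) * sqrt(1/n1 + 1/n2)
c(SE_welch, SE_pooled)   # identical when n1 = n2; only the df differ
```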


Okay, enough background. Now, we have some problems.

The MASS package has a data set UScereal with measurements on various cereal brands. Imagine that these are a sample from the population of potential cereal brands out there. We can compare across different factors.

For example, to get the data for calories by shelves 1 and 2, we have:

shelf1 = subset(UScereal, shelf==1, select=calories, drop=TRUE)
shelf2 = subset(UScereal, shelf==2, select=calories, drop=TRUE)

A boxplot shows:

boxplot(shelf1, shelf2)

QUESTION: The first shelf appears to have a different center than the second. Is the difference statistically significant using a two-sided test, with an assumption of independent samples?

QUESTION: Repeat the last question, only compare the values in sugars. Is the difference in means statistically significant?

QUESTION: Make a boxplot of the sample data for sugars. Comment on the assumption that the samples come from normal populations.

QUESTION: Kelloggs and Nabisco dominate the shelf space for cereals, as far as UScereal is concerned. To get just those data for calories, we have:

kelloggs = subset(UScereal, mfr=="K", select=calories, drop=TRUE)
nabisco = subset(UScereal, mfr=="N", select=calories, drop=TRUE)

Taking this data as samples from the populations of all cereals these two companies could produce, is the difference in calories statistically significant with \(\alpha = 0.05\)?

QUESTION: The immer data set contains yields in 1931 and 1932 for several locations and several varieties of barley. The variables immer$Y1 and immer$Y2 match off both the location and variety. Is there a difference in yields between the two years, assuming this data to be a random sample? Use \(\alpha=0.05\). What assumptions do you make?

QUESTION: Data for the Manchuria variety alone can be extracted through

m1 = subset(immer, Var=="M", select="Y1", drop=TRUE)
m2 = subset(immer, Var=="M", select="Y2", drop=TRUE)

This data is still paired. Is the difference statistically significant?