We need to cover two more types of tests for two-sample problems, but first a quick review. (Well, nothing ever seems to be quick, but bear with me.)
The standard $t$-test begins with a statistic
$$t = \frac{(\bar{y_1} - \bar{y_2}) - (\mu_1 - \mu_2)}{ \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} } }$$
The assumptions that allow us to know the distribution of $t$ are that the individual observations are approximately normal. (The statistic is somewhat robust to departures from normality.) If this is the case, then the distribution of $t$ is approximately the $t$-distribution with the following degrees of freedom
$$\frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2 }{ \frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1} }$$
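Since the course uses MINITAB, the following Python sketch is my own addition: it simply evaluates the degrees-of-freedom formula above so you can see it in action (the function name and sample values are illustrative).

```python
# Sketch (not from the notes): the degrees-of-freedom formula above,
# written as a small Python function.

def welch_df(s1, s2, n1, n2):
    """Degrees of freedom for the unequal-variance two-sample t."""
    v1, v2 = s1**2 / n1, s2**2 / n2   # per-sample variance contributions
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

# Sanity check: equal variances and equal sample sizes give n1 + n2 - 2.
print(welch_df(1, 1, 10, 10))  # approximately 18
```

Note that when the two sample variances and sizes agree, the formula collapses to the familiar $n_1 + n_2 - 2$.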
This does not assume that the variances are equal. If the variances are assumed equal, then one pools the data to make a better estimate of $\sigma$, the common standard deviation. If we denote the estimate by $s_p$, then one has
$$s_p^2 = \frac{(n_1-1)s_1^2 + (n_2 - 1) s_2^2}{n_1 +n_2 -2}.$$
And the new $t$ statistic
$$t = \frac{(\bar{y_1} - \bar{y_2}) - (\mu_1 - \mu_2)}{ s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2} } },$$
again has the $t$ distribution, but with the more manageable $n_1 + n_2 -2$ degrees of freedom.
Another simplification to the above comes when the samples are paired up in an attempt to minimize differences between the pairs. If one sample is denoted $X_1,X_2,\dots,X_n$ and the other $Y_1,Y_2,\dots,Y_n$ (same size), then the essential quantity is the sequence of differences $D_i = X_i - Y_i$. If the individual observations are normal (or nearly normal), then the sequence $D_1,D_2,\dots,D_n$ may be analyzed with a one-sample $t$-test.
For example, suppose Reebok wishes to test a new sole on a sneaker against the current one. It creates pairs of shoes, each having identical-looking but different soles; the left and right are assigned randomly. These shoes are then tested on 10 children and wear is measured. Suppose larger numbers indicate more wear. The data are given by
old | 90 | 100 | 119 | 98 | 88 | 110 | 86 | 112 | 112 | 102 |
---|---|---|---|---|---|---|---|---|---|---|
new | 83 | 93 | 114 | 93 | 81 | 101 | 80 | 105 | 106 | 94 |
Differences | -7 | -7 | -5 | -5 | -7 | -9 | -6 | -7 | -6 | -8 |
Are these two shoe soles identical? Let’s try the standard $t$-test.
First we check for equal variances with side-by-side boxplots. The spreads seem similar, so we would try the $t$-test with pooled data. Our hypotheses are
$$H_0: \mu_1 -\mu_2 = 0,\quad H_A: \mu_1 - \mu_2 > 0$$
That is, the alternative is that the new sole is better (less wear). Our $t$-statistic under the null hypothesis gives a $p$-value of 0.2072.
However, if we apply the simple one-sample $t$-test to the differences with the hypotheses
$$H_0: \mu = 0,\quad H_A: \mu > 0,$$
where $\mu$ is the mean of the differences (old minus new), we get a $p$-value of 3.93e-08, or essentially 0.
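To see the contrast numerically, here is a short Python/scipy sketch (my own addition; the notes do this in MINITAB) running both the pooled two-sample test and the paired test on the sneaker data from the table above:

```python
# Sketch (assumes scipy is available): pooled two-sample t-test vs. the
# paired t-test on the sneaker wear data.
from scipy import stats

old = [90, 100, 119, 98, 88, 110, 86, 112, 112, 102]
new = [83, 93, 114, 93, 81, 101, 80, 105, 106, 94]

# Two-sample t-test with pooled variance -- ignores the pairing.
t2, p2 = stats.ttest_ind(old, new, equal_var=True)
print(p2)   # about 0.21 -- no evidence of a difference

# Paired t-test (a one-sample t-test on the differences) -- uses the pairing.
tp, pp = stats.ttest_rel(old, new)
print(pp)   # far below 0.001 -- strong evidence of a difference
```

The pairing removes the large child-to-child variation, which is why the paired test is so much more sensitive. (scipy reports two-sided $p$-values by default; pass `alternative='greater'` for the one-sided version.)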
As an aside, for those of you who like this stuff, essentially what we are assuming is that
$$Y_i = X_i + \epsilon_i$$
where $\epsilon_i$ is normal with unknown mean $\mu$ and standard deviation $\sigma$. We are testing whether $\mu=0$ or not. When we study regression, the assumption will be that $Y_i = b_0 + b_1 X_i + \epsilon_i$, which looks very similar.
Sometimes the assumption of normality, or near normality, is just not true. Then what can you do? A rank-transform test is available to test whether the two distributions have the same shape but different centers. For example, it can detect the difference between a normal with mean 10 and standard deviation 10 and a normal with mean 20 and standard deviation 10, but not necessarily the difference between a normal with mean 10 and standard deviation 10 and one with mean 10 and standard deviation 20.
The basic idea of the rank transform is to combine all the numbers together and then rank them from smallest to largest. To each rank we assign a minus sign if the original number came from the first sample and a plus sign if it came from the second. This yields a set of numbers whose average should be near 0 when the two samples come from the same distribution. A one-sample $t$-test is then applied as though the $t$-distribution were appropriate.
For example, suppose the first sample is 1, 3, 5 and the second is 4, 6, 7. The combined ranking is 1,3,4,5,6,7, and so the signed-rank sequence is -1,-2,3,-4,5,6. Applying the $t$-test gives a $t$-statistic of 0.70 with 5 degrees of freedom and a $p$-value of 0.5139.
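As a quick check (my own Python sketch, not part of the MINITAB workflow), the signed-rank sequence above can be fed directly to a one-sample $t$-test:

```python
# Sketch: reproduce the rank-transform example. The combined sorted data are
# 1,3,4,5,6,7; first-sample values get minus signs, second-sample plus signs.
from scipy import stats

signed_ranks = [-1, -2, 3, -4, 5, 6]
t, p = stats.ttest_1samp(signed_ranks, 0)
print(t, p)  # about 0.70 and 0.51, matching the text
```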
Actually, this is a bad example, as there aren't enough numbers in the two lists to allow much variation in the possible answers. You should have at least 30 observations.
How does this work? Well, the idea is that if one of two similarly shaped distributions is a shift of the other, then the rankings will be biased in the direction of the shift. If the bias is large enough, then the $t$-test will have a small $p$-value.
How do we do this in MINITAB? We need to rank the data and then apply the $t$-test. The trick is to stack the data (as we did with the boxplot) and keep track of the subscripts. Then use the Rank Data function and store the ranks in a new column. Finally, use the 2-sample $t$-test on the rank column with the subscripts from before.
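The same stack-rank-test recipe can be sketched in Python (an illustration of the procedure, not MINITAB itself); scipy's built-in Wilcoxon rank-sum test is shown alongside for comparison. The data values are the small made-up samples from the example above.

```python
# Sketch of the stack -> rank -> two-sample-t workflow described above,
# plus scipy's built-in rank-sum test of the same hypothesis.
from scipy import stats

sample1 = [1, 3, 5]
sample2 = [4, 6, 7]

# Stack the data (keeping track of which sample each value came from),
# then rank the stacked column.
stacked = sample1 + sample2
ranks = stats.rankdata(stacked)          # ties would get average ranks
r1, r2 = ranks[:len(sample1)], ranks[len(sample1):]

# Two-sample t-test on the ranks -- the rank-transform test.
t, p_rank_t = stats.ttest_ind(r1, r2, equal_var=True)
print(p_rank_t)

# The classical Wilcoxon rank-sum test for comparison.
z, p_wilcoxon = stats.ranksums(sample1, sample2)
print(p_wilcoxon)
```

With samples this small, neither test comes close to rejecting, which echoes the warning above about needing more observations.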
Download the data set Twin. This is data for 9 pairs of identical twins. One twin was chosen at random and given a drug, the other presumably a placebo. Both took the same intelligence test.
- Are these a matched sample?
- Do we need the data to be approximately normal? Is it?
- What does a test of the hypothesis that the drug has no effect, vs. it having an effect, yield?
Look at the data set Asthmati. This is data from an experiment where a group of subjects was given (randomly assigned) both a drug and a placebo on separate dates and then tested for asthma relief.
- Why would you think to use a matched-pairs $t$-test?
- Are the differences approximately normal? Can you use the matched-pairs test?
- If everything seems appropriate, test the hypothesis that the drug has no effect against the alternative that it has a positive effect (the value is reduced). What do you conclude?
Look at the data set Darwin. These are data collected by Darwin himself, who was studying the effects of cross-fertilization vs. self-fertilization. Two such plants were placed in the same pot and their heights measured.
- Are the data symmetric or skewed?
- Are these matched samples?
- Which is more appropriate for the differences – the $t$-test or the Wilcoxon signed-rank test? Do both and discuss.
Return to the data set Censored. Use the Wilcoxon rank-sum test to test the hypothesis that the two distributions have the same median survival time.
The Wilcoxon test seems like magic. Let's see how well it does when we know the answer. That is, create two sets of data with known distributions, and test to see if the Wilcoxon test can figure this out. Here are some pairs to try. In each case, try 20 rows of data for each variable.
- Both are normal with mean 10 and standard deviation 10.
- One is normal with mean 10 and standard deviation 10, the other normal with mean 10 and standard deviation 30.
- Both are exponential with mean 10.
- One is exponential with mean 10, the other exponential with mean 20.
- One is exponential with mean 10, the other normal with mean 10 and standard deviation 10.
How did it do?
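The experiment can also be simulated outside MINITAB. Here is a Python sketch of the five scenarios (my own addition, using scipy's rank-sum test in place of MINITAB's and a fixed seed so one run is reproducible; your results will vary with the seed):

```python
# Sketch: simulate the five scenarios with 20 observations each and run the
# Wilcoxon rank-sum test on each pair.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20

scenarios = {
    "normal(10,10) vs normal(10,10)":
        (rng.normal(10, 10, n), rng.normal(10, 10, n)),
    "normal(10,10) vs normal(10,30)":
        (rng.normal(10, 10, n), rng.normal(10, 30, n)),
    "exp(10) vs exp(10)":
        (rng.exponential(10, n), rng.exponential(10, n)),
    "exp(10) vs exp(20)":
        (rng.exponential(10, n), rng.exponential(20, n)),
    "exp(10) vs normal(10,10)":
        (rng.exponential(10, n), rng.normal(10, 10, n)),
}

results = {}
for name, (x, y) in scenarios.items():
    z, p = stats.ranksums(x, y)
    results[name] = p
    print(f"{name}: p = {p:.3f}")
```

As the discussion above suggests, expect small $p$-values mainly when one distribution is a shift of the other, and little power against a pure difference in spread.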