Lesson 2

The two sample t-test


Lesson 2 covers the two-sample $t$-test. The basic test question here is whether two population means are equal. This is “answered” through a significance test. Answered is in quotes, as we do not really answer the question; rather, we pose it as a null hypothesis and let the data (our sample means, basically) guide us through a $p$-value.

Lesson 2 uses the t.test function of R for the heavy lifting, but shows that even a simple function in R can require us to consider how we use it: using two variables, using a formula interface, specifying parameters, …

This set of questions is for self quizzing as you read the material.


As mentioned, the default for t.test when used with two samples is not to assume the population variances are equal. They may be, they may not be; we just don't assume. There are formal statistical tests for equality of variance, but often the assumption is made based on a graphical examination and an understanding of the underlying source of randomness that might lead one to believe the variances are equal.

From the graph of the data below, does an equal variance assumption seem reasonable?

bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
DF <- data.frame(bottom = bottom, surface = surface)
boxplot(DF)

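One such formal test (mentioned above, though not used in the lesson) is the F test for equality of variances, available in R as var.test. A quick sketch with the lesson's data:

```r
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)

# F test of H0: the two population variances are equal
res <- var.test(bottom, surface)
res$estimate  # ratio of the sample variances
res$p.value   # a large p-value is consistent with equal variances
```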


If we assume equal variances, what are the degrees of freedom used in a two-sample $t$-test?
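As a check on your answer, passing var.equal = TRUE to t.test runs the pooled test, whose degrees of freedom are $n_1 + n_2 - 2$ (this call is just an illustration, not part of the lesson's narrative):

```r
# the lesson's data
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)

# equal-variance (pooled) two-sample t-test
res <- t.test(bottom, surface, var.equal = TRUE)
res$parameter  # degrees of freedom: n1 + n2 - 2 = 6 + 6 - 2 = 10
```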


As an aside, this is how the boxplots can be done with ggplot2. We use stack to create a data frame with two columns: values, holding the numeric values, and ind, a factor indicating which sample each value belongs to.

require(ggplot2)
p <- ggplot(stack(DF), aes(ind, values))
p + geom_boxplot()



Buried deep in any use of t.test is the assumption that the data involved come from a normal population, or at least one not so long-tailed or skewed as to affect the use of the $t$-distribution to describe the test statistic.

A visual test of the normality can also be done with boxplots.

Select which interpretations from the above boxplot diagrams are true:
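Quantile-quantile plots, though not covered in this lesson, are another standard visual check of normality; a minimal sketch using one of the lesson's samples:

```r
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)

# points falling near the reference line suggest the normal model is plausible
qqnorm(bottom)
qqline(bottom)
```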


In Gosset's paper on the $t$-test he writes:

It is usual, however, to assume a normal distribution, because, in a very large number of cases, this gives an approximation so close that a small sample will give no real information as to the manner in which the population deviates from normality: since some law of distribution must be assumed it is better to work with a curve whose area and ordinates are tabled, and whose properties are well known. This assumption is accordingly made in the present paper, so that its conclusions are not strictly applicable to populations known not to be normally distributed; yet it appears probable that the deviation from normality must be very extreme to lead to serious error. STUDENT, http://www.york.ac.uk/depts/maths/histstat/student.pdf

The question of deviation from the normality assumption being extreme is vague. Do the side-by-side boxplots show an extreme deviation?


Density plots can also show such information. Here is one way to do them in ggplot2:

p <- ggplot(stack(DF), aes(x = values, group = ind, colour = ind, 
    y = ..density..))
p + geom_density()



The following output comes from calling t.test:


    Welch Two Sample t-test

data:  bottom and surface 
t = 1.0097, df = 9.662, p-value = 0.3373
alternative hypothesis: true difference in means is not equal to 0 
95 percent confidence interval:
 -0.1115787  0.2949121 
sample estimates:
mean of x mean of y 
0.5361667 0.4445000 
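For reference, the call that produces this output, using the lesson's data (the default var.equal = FALSE gives the Welch test):

```r
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)

# Welch two-sample t-test (unequal variances, the default)
res <- t.test(bottom, surface)
res
```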

What is the null hypothesis (it isn't printed)?


What does the value labeled t report?


Why is the degrees of freedom not an integer?
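One way to see where the value comes from: the Welch test uses the Welch–Satterthwaite approximation for the degrees of freedom. A sketch of that computation (the formula is standard, though not shown in the lesson):

```r
bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)

v1 <- var(bottom) / length(bottom)    # s1^2 / n1
v2 <- var(surface) / length(surface)  # s2^2 / n2

# Welch-Satterthwaite degrees of freedom: generally not an integer
df <- (v1 + v2)^2 /
  (v1^2 / (length(bottom) - 1) + v2^2 / (length(surface) - 1))
round(df, 3)  # matches the df reported by t.test
```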


What is the value of $\alpha$? (The confidence level is 1 - $\alpha$)


The $p$-value is greater than $\alpha$, and the confidence interval contains 0. Could it have been that the confidence interval did not contain 0?


The sample means are given, with the mean of y (surface) being smaller. Did you know that already from the graphics?


Can you tell what the original sample sizes are from the output?


The standard error (SE) is the benchmark that gives us a sense of
scale for deciding whether a value is extreme. It is so important, it
must be reported. Can you tell what the standard error is from the
output?
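One observation (not spelled out in the lesson): the SE is not printed directly, but since the test statistic is $t = (\bar{x} - \bar{y})/SE$, it can be recovered from the values that are printed:

```r
# values read off the printed t.test output
xbar <- 0.5361667
ybar <- 0.4445000
t_stat <- 1.0097

# back out the standard error of the difference in means
se <- (xbar - ybar) / t_stat
round(se, 3)  # about 0.091
```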