3/25/2020
Recall the setup for the last example on likelihood ratios: \[ X_{1,i} \sim \mathcal{N}(\mu_1,\sigma^2) \qquad X_{2,i} \sim \mathcal{N}(\mu_2,\sigma^2) \qquad X_{3,i} \sim \mathcal{N}(\mu_3,\sigma^2) \] with unknown means and unknown but identical variance.
Our interest is in \(H_0:\mu_1=\mu_2=\mu_3\) vs. \(H_A:\) at least one pair unequal.
In the example, we found a likelihood ratio test to reduce to rejecting \(H_0\) when \[ \frac{\hat\sigma_0^2}{\hat\sigma^2} > k \] for some \(k\), where \(\hat\sigma^2\) is the pooled sample variance estimator and \(\hat\sigma_0^2\) is the usual sample variance taken on all the data.
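For reference, here is a sketch of that reduction, writing \(n\) for the total number of observations and plugging the maximum likelihood variance estimates back into the normal likelihoods:
\[ \Lambda = \frac{\sup_{H_0}L}{\sup L} = \frac{(2\pi\hat\sigma_0^2)^{-n/2}e^{-n/2}}{(2\pi\hat\sigma^2)^{-n/2}e^{-n/2}} = \left(\frac{\hat\sigma^2}{\hat\sigma_0^2}\right)^{n/2} \]
so rejecting for small \(\Lambda\) is the same as rejecting for large \(\hat\sigma_0^2/\hat\sigma^2\).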
ANOVA and the \(t\)-test fit into a sequence of tests focused on population means:
Test | \(H_0\) |
---|---|
One-sample | \(\mu = \mu_0\) |
Two-sample | \(\mu_1 = \mu_2\) |
Many-sample (ANOVA) | \(\mu_1 = \dots = \mu_k\) |
Continuous (Regression) | \(\mu(x) = f(x)\) |
ANOVA is all about sample variances - so there will be many sums of squares involved. We write:
\[ Y_{i1},\dots,Y_{i,n_i}\sim\mathcal{N}(\mu_i,\sigma^2) \\ \overline Y_{i*} = \frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij} \qquad \overline Y = \frac{1}{\sum n_i}\sum_{i=1}^k\sum_{j=1}^{n_i}Y_{ij} \]
where we have \(k\) different groups, each containing \(n_i\) samples. We separate the in-group means from the global mean - and will be examining each of these separately.
Type | Formula |
---|---|
Total Sum of Squares | \(TSS = \sum_{i=1}^k\sum_{j=1}^{n_i}(Y_{ij}-\overline Y)^2\) |
Sum of Squares for Errors | \(SSE = \sum_{i=1}^k\sum_{j=1}^{n_i}(Y_{ij}-\overline Y_{i*})^2\) |
Sum of Squares for Treatments | \(SST = \sum_{i=1}^k\sum_{j=1}^{n_i}(\overline Y_{i*}-\overline Y)^2\) |
Mean Squares for Errors | \(MSE = SSE/DoFE\) |
Mean Squares for Treatments | \(MST = SST/DoFT\) |
We will investigate the degrees of freedom \(DoFE\) and \(DoFT\) more later.
The quantity \(TSS/(n-1)\), where \(n = \sum_i n_i\) is the total number of observations, is the classical sample variance calculated for the entire dataset at once.
\(MSE\) is the pooled sample variance that we have met occasionally before.
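As a concrete illustration, here is a minimal R sketch computing each of these quantities; the data and variable names are made up for demonstration:

```r
# Hypothetical data: k = 3 groups of unequal sizes
y <- list(
  g1 = c(5.1, 4.8, 5.6, 5.0),
  g2 = c(6.2, 5.9, 6.5),
  g3 = c(4.9, 5.2, 5.0, 5.3, 4.7)
)
k  <- length(y)
ni <- sapply(y, length)    # group sizes n_i
n  <- sum(ni)              # total sample size n

ybar_i <- sapply(y, mean)  # group means Ybar_{i*}
ybar   <- mean(unlist(y))  # global mean Ybar

TSS <- sum((unlist(y) - ybar)^2)
SSE <- sum(sapply(y, function(g) sum((g - mean(g))^2)))
SST <- sum(ni * (ybar_i - ybar)^2)  # inner sum over j gives a factor n_i

MSE <- SSE / (n - k)  # pooled sample variance
MST <- SST / (k - 1)
TSS / (n - 1)         # classical sample variance of all the data
```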
Each of \(TSS/(n-1)\), \(MSE\) and \(MST\) is (under \(H_0\)) an unbiased estimator of \(\sigma^2\).
\(MSE\) is an unbiased estimator of \(\sigma^2\) even without support from \(H_0\).
\(TSS = SST + SSE\)
\[ TSS = \sum_i\sum_j(Y_{ij}-\overline Y)^2 = \sum_i\sum_j(\color{blue}{(Y_{ij}-\overline Y_{i*})} + \color{green}{(\overline Y_{i*}-\overline Y)})^2 \\ = \sum_i\sum_j\left[ \color{blue}{(Y_{ij}-\overline Y_{i*})^2} + 2\color{blue}{(Y_{ij}-\overline Y_{i*})}\color{green}{(\overline Y_{i*}-\overline Y)} + \color{green}{(\overline Y_{i*}-\overline Y)^2} \right] \]
Let’s look at the term \(2\color{blue}{(Y_{ij}-\overline Y_{i*})}\color{green}{(\overline Y_{i*}-\overline Y)}\) for a fixed \(i\):
\[ \sum_j2\color{blue}{(Y_{ij}-\overline Y_{i*})}\color{green}{(\overline Y_{i*}-\overline Y)} = 2\color{green}{(\overline Y_{i*}-\overline Y)}\sum_j\color{blue}{(Y_{ij}-\overline Y_{i*})} =\\ 2{(\overline Y_{i*}-\overline Y)}\left(\sum_j Y_{ij} - n_i\overline Y_{i*}\right) =2{(\overline Y_{i*}-\overline Y)}\left(n_i\overline Y_{i*} - n_i\overline Y_{i*}\right) =0 \]
Returning to \(TSS\):
\[ TSS = \\ \sum_i\sum_j\left[ \color{blue}{(Y_{ij}-\overline Y_{i*})^2} + \color{purple}{2(Y_{ij}-\overline Y_{i*})(\overline Y_{i*}-\overline Y)} + \color{green}{(\overline Y_{i*}-\overline Y)^2} \right] = \\ \color{blue}{\sum_i\sum_j(Y_{ij}-\overline Y_{i*})^2} + \color{purple}{\sum_i 0} + \color{green}{\sum_i\sum_j(\overline Y_{i*}-\overline Y)^2} =\\ \color{blue}{SSE} + \color{green}{SST} \]
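Continuing the R sketch above, the identity can also be confirmed numerically:

```r
all.equal(TSS, SST + SSE)  # TRUE, up to floating-point error
```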
If \(U\sim\chi^2(n)\) and \(V\sim\chi^2(m)\) are independent then \(U+V\sim\chi^2(n+m)\).
Recall that if moment generating functions are equal, then so are the probability distributions. A \(\chi^2(n)\) variable has \(MGF=(1-2t)^{-n/2}\).
\[ MGF_{U+V} = \mathbb{E}[e^{t(U+V)}] = \mathbb{E}[e^{tU}e^{tV}] = \\ \mathbb{E}[e^{tU}]\,\mathbb{E}[e^{tV}] \quad\text{(by independence)} = \\ MGF_U\cdot MGF_V = (1-2t)^{-n/2}(1-2t)^{-m/2} =\\ (1-2t)^{-(n+m)/2} \]
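A quick Monte Carlo sanity check in R (a sketch; the degrees of freedom 3 and 5 are arbitrary choices):

```r
# U ~ chisq(3) and V ~ chisq(5) independent, so U + V should be chisq(8)
set.seed(1)
u <- rchisq(1e5, df = 3)
v <- rchisq(1e5, df = 5)
ks.test(u + v, "pchisq", df = 8)  # large p-value: consistent with chisq(8)
```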
If \(U\sim\chi^2(n)\), \(V\) is independent of \(U\), and \(U+V\sim\chi^2(n+m)\), then \(V\sim\chi^2(m)\).
A \(\chi^2(n)\) variable has \(MGF=(1-2t)^{-n/2}\).
\[ (1-2t)^{-(n+m)/2} = MGF_{U+V} = \\ MGF_U\cdot MGF_V = (1-2t)^{-n/2}MGF_V \\ \text{So } MGF_V = (1-2t)^{-(n+m)/2}/(1-2t)^{-n/2} = (1-2t)^{-m/2} \]
Recall \[ SSE = \sum_i\sum_j(Y_{ij}-\overline Y_{i*})^2 \qquad S_i^2 = \frac{1}{n_i-1}\sum_j(Y_{ij}-\overline Y_{i*})^2 \]
It follows that \(SSE\) can be written in terms of the group sample variances:
\[ SSE = \sum_i(n_i-1)S_i^2 \]
We know that \((n-1)S^2/\sigma^2\sim\chi^2(n-1)\) for the sample variance \(S^2\) of \(n\) iid normal observations.
Hence, each \((n_i-1)S_i^2/\sigma^2\sim\chi^2(n_i-1)\).
It follows by the first theorem (addition of \(\chi^2\) DoF) that \[ \frac{SSE}{\sigma^2} = \sum_i\frac{(n_i-1)S_i^2}{\sigma^2} \sim\chi^2\left(\sum_i(n_i-1)\right) = \chi^2(n-k) \]
Under the null hypothesis \(\mu_1 = \dots = \mu_k\), all the \(Y_{ij}\) are iid. Then \(TSS = (n-1)S^2\) for the ordinary sample variance \(S^2\) of all the data.
It follows that \[ \frac{TSS}{\sigma^2} = \frac{(n-1)S^2}{\sigma^2} \sim\chi^2(n-1) \]
Since \(TSS = SST + SSE\), it follows that \[ \frac{TSS}{\sigma^2} = \frac{SST}{\sigma^2} + \frac{SSE}{\sigma^2} \] and by the second theorem (subtraction of \(\chi^2\) DoF) - using the fact that \(SST\) and \(SSE\) are independent, since the group means are independent of the within-group deviations - that \[ \frac{SST}{\sigma^2}\sim\chi^2((n-1) - (n-k)) = \chi^2(k-1) \]
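This too can be sanity-checked by simulation in R (a sketch; the group sizes and \(\sigma\) are arbitrary):

```r
# Simulate under H0 (all group means equal) and check SST/sigma^2 ~ chisq(k-1)
set.seed(2)
k0 <- 3; n0 <- c(4, 3, 5); sigma0 <- 2
sst_scaled <- replicate(1e4, {
  g    <- lapply(n0, function(m) rnorm(m, mean = 0, sd = sigma0))
  gbar <- mean(unlist(g))
  sum(n0 * (sapply(g, mean) - gbar)^2) / sigma0^2
})
ks.test(sst_scaled, "pchisq", df = k0 - 1)  # consistent with chisq(2)
```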
To summarize, we now know that (under \(H_0\); the \(SSE\) row holds regardless):
Quantity | Distribution | Degrees of Freedom |
---|---|---|
\(TSS/\sigma^2\) | \(\chi^2(n-1)\) | \(n-1\) |
\(SSE/\sigma^2\) | \(\chi^2(n-k)\) | \(n-k\) |
\(SST/\sigma^2\) | \(\chi^2(k-1)\) | \(k-1\) |
Thus \[ \frac {\left.\frac{SST}{\sigma^2}\right/(k-1)} {\left.\frac{SSE}{\sigma^2}\right/(n-k)} \sim F_{n-k}^{k-1} \]
We define \[ DoFE = n-k \qquad DoFT = k-1 \\ F = \frac {\left.\frac{SST}{\color{red}{\sigma^2}}\right/(k-1)} {\left.\frac{SSE}{\color{red}{\sigma^2}}\right/(n-k)} = \frac{SST/(k-1)}{SSE/(n-k)} = \frac{MST}{MSE} \]
This \(F\)-statistic follows - under \(H_0\) - a known probability distribution, so we can use it to create a statistical test.
Under \(H_A\), the between-group variation increases, which inflates \(MST\) but not \(MSE\) - so we reject \(H_0\) if \(F > F_\alpha\) for a threshold \(F_\alpha\) chosen from the \(F_{n-k}^{k-1}\) distribution.
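Reusing the made-up quantities from the earlier R sketch, the test can be carried out directly:

```r
# F-test by hand (sketch, reusing MST, MSE, k, n from the earlier sketch)
F_stat  <- MST / MSE
alpha   <- 0.05
F_alpha <- qf(1 - alpha, df1 = k - 1, df2 = n - k)  # rejection threshold
p_value <- pf(F_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
F_stat > F_alpha  # TRUE means reject H0 at level alpha
p_value
```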
It is very common to summarize all the components of the calculation of the ANOVA \(F\)-statistic and its \(p\)-value in a single table:
Source | DoF | SS | MS | F | p |
---|---|---|---|---|---|
Treatments | \(DoFT\) | \(SST\) | \(MST\) | \(MST/MSE\) | \(1-F_{F_{n-k}^{k-1}}(F)\) |
Error | \(DoFE\) | \(SSE\) | \(MSE\) | ||
Total | \(n-1\) | \(TSS\) | \(S^2\) |
The table can be printed in R for a fitted model (linear regression model or multiple means ANOVA) using the command `anova`.
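For example, continuing with the made-up data from the earlier sketch:

```r
# Fit a group-means model and print the ANOVA table
dat <- data.frame(
  y     = unlist(y),
  group = factor(rep(names(y), times = ni))
)
fit <- lm(y ~ group, data = dat)
anova(fit)  # Source / DoF / SS / MS / F / p, as in the table above
```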
Regardless of whether or not \(H_0\) is true, \(MSE\) is an unbiased pooled estimator of \(\sigma^2\). Since it uses all the available data, it produces a better estimate (narrower confidence intervals) than any of the group-specific sample variances would in isolation.
Write \(S=\sqrt{MSE}\) and set \(t_{\alpha} = F^{-1}_{t(n-k)}(1-\alpha)\). Then \[ \mu_i \in \overline{Y}_{i*}\pm t_{\alpha/2}S/\sqrt{n_i} \\ \mu_i-\mu_j\in(\overline Y_{i*}-\overline Y_{j*})\pm t_{\alpha/2}S\sqrt{\frac{1}{n_i}+\frac{1}{n_j}} \]
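In R, again reusing the earlier sketch's objects (a sketch, not a canned routine):

```r
# 95% CIs based on the pooled estimate S = sqrt(MSE), with n - k DoF
S      <- sqrt(MSE)
t_crit <- qt(1 - 0.05/2, df = n - k)
ybar_i[1] + c(-1, 1) * t_crit * S / sqrt(ni[1])    # CI for mu_1
(ybar_i[1] - ybar_i[2]) +
  c(-1, 1) * t_crit * S * sqrt(1/ni[1] + 1/ni[2])  # CI for mu_1 - mu_2
```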
These intervals are only fully valid if you are interested in exactly one mean or mean difference among all possible ones. The reason is that the tolerated error probabilities compound across multiple intervals.
Suppose we are seeking simultaneous confidence intervals \(I_1,\dots,I_m\) for parameters \(\theta_1,\dots,\theta_m\) such that \[ \mathbb{P}(\theta_i\in I_i\text{ for all $i$}) = 1-\alpha \]
Picking each interval to be a \(1-\alpha\) interval will not give this probability - because we are combining several events.
The failure of all intervals to work simultaneously is called a family-wise error, and the probability \(\alpha\) here is the family-wise error rate.
Recall that \(\overline{A_1\cap\dots\cap A_m} = \overline A_1\cup\dots\cup\overline A_m\). By sub-additivity of probabilities, \[ \begin{aligned} \mathbb{P}(A_1\cap\dots\cap A_m) &= 1-\mathbb{P}(\overline A_1\cup\dots\cup\overline A_m) \\ &\geq 1-\sum\mathbb{P}(\overline A_i) \\ &=1-\sum\alpha_i \end{aligned} \]
If each of our confidence intervals is a \((1-\alpha)\) confidence interval, the probability that all of them cover simultaneously could be as small as \(1-m\alpha\) - equivalently, the family-wise error rate could be as large as \(m\alpha\).
Bonferroni’s inequality suggests a method to control the family-wise error rate. For a family-wise error rate of \(\alpha\) - a probability of all confidence intervals to simultaneously contain their respective parameters of \(1-\alpha\) - we choose to construct each interval as a \((1-\alpha/m)\) confidence interval.
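As a sketch in R, Bonferroni-corrected intervals for all pairwise mean differences in the running made-up example:

```r
# All m = choose(k, 2) pairwise differences at family-wise level alpha
alpha <- 0.05
m     <- choose(k, 2)
t_bon <- qt(1 - (alpha / m) / 2, df = n - k)  # each interval at level 1 - alpha/m
pairs <- combn(k, 2)
for (p in seq_len(m)) {
  i <- pairs[1, p]; j <- pairs[2, p]
  ci <- (ybar_i[i] - ybar_i[j]) +
    c(-1, 1) * t_bon * S * sqrt(1/ni[i] + 1/ni[j])
  cat(sprintf("mu_%d - mu_%d: [%.3f, %.3f]\n", i, j, ci[1], ci[2]))
}
```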
Bonferroni’s Method is known to be overly conservative (rejects the null too rarely). Better methods have been proposed - Holm’s Method and Hochberg’s Method - but these are out of scope for this course.
Family-Wise Error Rate can be considered to be overly harsh - especially for large sets of simultaneous CIs. An alternative is to control the False Discovery Rate - allow a rate of up to \(\alpha\) erroneous CIs.
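Base R's `p.adjust` implements several of these corrections; a sketch with made-up p-values:

```r
# Adjusting a vector of (made-up) p-values for multiple testing
p <- c(0.001, 0.008, 0.012, 0.030, 0.040, 0.200)
p.adjust(p, method = "bonferroni")  # controls the family-wise error rate
p.adjust(p, method = "BH")          # Benjamini-Hochberg: controls the FDR
```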