\[ \def\RR{{\mathbb{R}}} \def\PP{{\mathbb{P}}} \def\EE{{\mathbb{E}}} \def\VV{{\mathbb{V}}} \]
Many of our tests have focused on comparing proportions of labels in some categorical data:
Null Hypothesis Type | Test Type |
---|---|
\(H_0:p = p_0\) | Binomial test, Normal approximation test of proportions |
\(H_0:p_1 = p_2\) | Normal approximation test of proportions, McNemar’s test |
We now take the next step in this progression: comparing proportions across more than two subgroups of a set of observations.
For a starting point, we will assume that we have some number of observations following a multinomial probability distribution.
Given a vector of probabilities \(\boldsymbol p = (p_1,\dots,p_m)\) that sum to 1, the Multinomial distribution \(Multinomial(n,\boldsymbol p)\) counts, for each outcome type, the number of occurrences in \(n\) draws of a value from \(1,\dots,m\) with probabilities \(\PP(\text{outcome }j)=p_j\). Each observation \(\boldsymbol k=(k_1,\dots,k_m)\) is a vector of integers, summing to \(n\).
By just picking any one outcome and calling it success, it is easy to see that any one entry \(k_j\sim Binomial(n, p_j)\).
When developing McNemar’s test in Lecture 15, we established that two entries \(k_i, k_j\) of a multinomial vector have covariance \(Cov(k_i, k_j) = -np_ip_j\).
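As a quick numerical sanity check of this covariance formula, we can simulate; a minimal sketch (the sample size, probabilities, and seed are arbitrary choices):

```python
import numpy as np

# Empirical check of Cov(k_i, k_j) = -n p_i p_j for a multinomial vector.
rng = np.random.default_rng(0)
n = 100
p = np.array([0.2, 0.3, 0.5])
draws = rng.multinomial(n, p, size=200_000)

# empirical covariance between the first two entries vs. the theoretical value
emp_cov = np.cov(draws[:, 0], draws[:, 1])[0, 1]
theory = -n * p[0] * p[1]   # = -6
```

With 200,000 draws, the empirical covariance lands very close to \(-6\).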
Since each entry is binomial, the expected value of any one entry \(\EE[k_j]=np_j\). It is very attractive to try to build a pivot or a test statistic out of the squared errors \((k_j-np_j)^2\) – however, these are sensitive to scale: an error of 5 yields much more information if we expected 10 than if we expected 100.
Pearson suggested focusing on the quantity \(X = \sum{(k_j-np_j)^2\over np_j}\) to assess whether a proposed multinomial distribution is plausible.
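For a small concrete example (hypothetical counts, not from any data set), the statistic can be computed by hand or with scipy.stats.chisquare, which implements exactly this quantity:

```python
import numpy as np
from scipy import stats

# hypothetical counts k from n = 100 draws, hypothesized probabilities p0
k = np.array([18, 55, 27])
p0 = np.array([0.2, 0.5, 0.3])
n = k.sum()

# Pearson's statistic: (2^2/20) + (5^2/50) + (3^2/30) = 0.2 + 0.5 + 0.3 = 1.0
X = ((k - n*p0)**2 / (n*p0)).sum()
stat, pval = stats.chisquare(k, f_exp=n*p0)   # scipy computes the same statistic
```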
Following the 6th proof in Benhamou and Melot: Seven proofs of the Pearson Chi-squared independence test and its graphical interpretation, we will prove Pearson’s Theorem next:
Theorem
Let \(\boldsymbol K\sim Multinomial(n,\boldsymbol p)\). Then, as \(n\to\infty\), \[ X=\sum{(K_j-np_j)^2\over np_j}\xrightarrow{dist.}\chi^2(m-1) \]
We will proceed with a proof by induction over the number of categories \(m\).
Induction base: If \(m=2\), \[ \begin{align*} X={(k-np)^2\over np}+{((n-k)-nq)^2\over nq} &= {q(k-np)^2+p((n-k)-nq)^2\over npq} = \\ = {(k-np)^2q + (\overbrace{n(1-q)-k}^{=np-k})^2p\over npq} &= {(k-np)^2(\overbrace{q+p}^{=1})\over npq} = \left(k-np\over \sqrt{npq}\right)^2 \end{align*} \]
We recognize this last expression: since \(k\sim Binomial(n,p)\), we know that \({k-np\over\sqrt{npq}}\xrightarrow{dist.}\mathcal{N}(0,1)\), and hence \(X\xrightarrow{dist.}\chi^2(1)\).
Induction step: this is encapsulated in the following Proposition:
Proposition
If for every \(m\)-dimensional multinomial distribution we know \(X\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X\xrightarrow{dist.}\chi^2(m)\).
In the following proof, we will add \(m\) and \(m+1\) as indices to disambiguate between the currently considered \(m\)-dimensional and \((m+1)\)-dimensional cases.
Proposition
If for every \(m\)-dimensional multinomial distribution we know \(X_{m}\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X_{m+1}\xrightarrow{dist.}\chi^2(m)\).
Suppose \(\boldsymbol k\sim Multinomial(n,\boldsymbol p)\) is the \(m+1\)-dimensional case. We will rewrite \(X_{m+1}\) as follows: \[ \begin{align*} X_{m+1} &= \sum_{j=1}^{m+1}{(k_{j}-np_{j})^2\over np_{j}} \\ &= \sum_{j=1}^{m-1}{(k_{j}-np_{j})^2\over np_{j}} + {(k_{m}-np_{m})^2\over np_{m}} + {(k_{m+1}-np_{m+1})^2\over np_{m+1}} \\ &= \sum_{j=1}^{m-1}{(k_{j}-np_{j})^2\over np_{j}} + {((k_{m}+k_{m+1})-n(p_{m}+p_{m+1}))^2\over n(p_{m}+p_{m+1})} + U^2 \\ &= X_{m} + U^2 \end{align*} \]
where \(X_m\) corresponds to observing \(Multinomial(n, (p_{1},\dots,p_{m-1},(p_{m}+p_{m+1})))\): the multinomial distribution where we just combined the last two categories into one category.
This combination will of course not be an exact correspondence - there’s an error term that we name \(U^2\) and that is defined as \[ U^2 = {(k_{m}-np_{m})^2\over np_{m}} + {(k_{m+1}-np_{m+1})^2\over np_{m+1}} - {((k_{m}+k_{m+1})-n(p_{m}+p_{m+1}))^2\over n(p_{m}+p_{m+1})} \]
We have two steps left to finish the proof.
First, since this \(X_m\) is built from an \(m\)-dimensional multinomial distribution, \(X_m\xrightarrow{dist.}\chi^2(m-1)\) by our assumption.
Second, we will prove \(U^2\xrightarrow{dist.}\chi^2(1)\). Since \(X_{m+1}=X_m+U^2\), the additivity property for independent chi-squared variables then finishes the proof.
Write \(T_m=k_m-np_m\) and \(T_{m+1}=k_{m+1}-np_{m+1}\).
\[ \begin{align*} U^2 &= {T_m^2\over np_m} + {T_{m+1}^2\over np_{m+1}} -{(\color{Teal}{k_{m}}+\color{DarkMagenta}{k_{m+1}}-n(\color{Teal}{p_m}+\color{DarkMagenta}{p_{m+1}}))^2\over n(p_m+p_{m+1})} \\ &= {{T_m^2\over p_m}(p_m+p_{m+1})+{T_{m+1}^2\over p_{m+1}}(p_m+p_{m+1})-(\color{Teal}{T_m}+\color{DarkMagenta}{T_{m+1}})^2\over n(p_m+p_{m+1})} \\ &= {\color{SteelBlue}{T_m^2}+\color{DarkGoldenrod}{T_{m+1}^2}+{p_{m+1}\over p_m}T_m^2 + {p_m\over p_{m+1}}T_{m+1}^2-(\color{SteelBlue}{T_m^2}+2T_mT_{m+1}+\color{DarkGoldenrod}{T_{m+1}^2})\over n(p_m+p_{m+1})} \end{align*} \]
\[ \begin{align*} U^2 &= {1\over n(p_m+p_{m+1})}\left({{p_{m+1}\over p_m}T_m^2 + {p_m\over p_{m+1}}T_{m+1}^2-2T_mT_{m+1}}\right) \\ &= {1\over n(p_m+p_{m+1})}\left(\sqrt{p_{m+1}\over p_m}T_m - \sqrt{p_m\over p_{m+1}}T_{m+1}\right)^2 \\ &= {1\over n(p_m+p_{m+1})}\left({p_{m+1}}T_m - {p_m}T_{m+1}\over\sqrt{p_mp_{m+1}}\right)^2 \\ &= \left({p_{m+1}}T_m - {p_m}T_{m+1}\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}\right)^2 = V^2 \end{align*} \]
\[ \begin{align*} {p_{m+1}}T_m - {p_m}T_{m+1} &= {p_{m+1}}(k_m-np_m) - {p_m}(k_{m+1}-np_{m+1}) \\ &= k_mp_{m+1} - k_{m+1}p_m + np_{m+1}p_m - np_mp_{m+1} \\ &= k_mp_{m+1} - k_{m+1}p_m \end{align*} \]
Hence \(V = {k_mp_{m+1}-k_{m+1}p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}}\) is a linear combination of two components that each converge in distribution to a normal distribution, so the linear combination converges to a normal distribution as well. Which one? Let’s compute \(\EE\) and \(\VV\)!
\[ \begin{align*} \EE[V] &= \EE\left[{k_mp_{m+1}-k_{m+1}p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}}\right] \\ &= {\EE[k_m]p_{m+1}-\EE[k_{m+1}]p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}} \\ &= {np_mp_{m+1}-np_{m+1}p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}} = 0 \end{align*} \]
\[ \begin{align*} \VV[k_mp_{m+1}-k_{m+1}p_m] &= p_{m+1}^2\VV[k_m] + p_m^2\VV[k_{m+1}] -2p_mp_{m+1}Cov(k_m,k_{m+1})\\ &= p_{m+1}^2np_mq_m+p_m^2np_{m+1}q_{m+1}-2p_mp_{m+1}(-np_mp_{m+1}) \\ &= np_mp_{m+1}(p_{m+1}-p_{m}p_{m+1}+p_m-p_{m}p_{m+1}+2p_mp_{m+1}) \\ &= np_mp_{m+1}(p_m+p_{m+1}) \end{align*} \]
Which is exactly what’s under the square root in the denominator of \(V\). Hence \(V\xrightarrow{dist.}\mathcal{N}(0,1)\) and thus \(U^2\xrightarrow{dist.}\chi^2(1)\).
Strictly speaking, we would also need to show that \(U^2\) is independent of \(X_m\), which can be done by computing the covariance of any one term in \(X_m\) with \(U^2\) and arguing about multivariate normal vectors.
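Pearson’s Theorem is also easy to check by simulation; a minimal sketch (sample sizes and probabilities are arbitrary choices):

```python
import numpy as np
from scipy import stats

# For a 4-category multinomial, X should be approximately chi^2(3) distributed.
rng = np.random.default_rng(1)
n = 500
p = np.array([0.1, 0.2, 0.3, 0.4])
K = rng.multinomial(n, p, size=100_000)

# Pearson's statistic for each simulated vector
X = ((K - n*p)**2 / (n*p)).sum(axis=1)

# the fraction of simulated X above the chi^2(3) 95% quantile should be ~5%
frac = (X > stats.chi2(3).ppf(0.95)).mean()
```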
Pearson’s Chi-Squared Test of Goodness of Fit
From the 2000 Census, we get the following distribution of racial self-identification:
 | White | Black | Asian | Other | Mixed | Total |
---|---|---|---|---|---|---|
NYC | 3 579 700 | 2 130 202 | 784 811 | 1 121 159 | 392 406 | 8 008 278 |
USA | 75.1% | 12.3% | 3.6% | 6.6% | 2.4% | 100% |
Question: Do the demographics of NYC significantly differ from those of the US at large?
By multiplying the total population of NYC with the percentages observed in the country at large, we get the following:
 | White | Black | Asian | Other | Mixed | Total |
---|---|---|---|---|---|---|
NYC | 3 579 700 | 2 130 202 | 784 811 | 1 121 159 | 392 406 | 8 008 278 |
USA | 75.1% | 12.3% | 3.6% | 6.6% | 2.4% | 100% |
Expected | 6 014 216.8 | 985 018.2 | 288 298.0 | 528 546.3 | 192 198.7 | |
\(O-E\) | -2.435e+06 | 1.145e+06 | 4.965e+05 | 5.926e+05 | 2.002e+05 | |
\({(O-E)^2\over E}\) | 9.855e+05 | 1.331e+06 | 8.551e+05 | 6.644e+05 | 2.085e+05 | 4.045e+06 |
The resulting test statistic, \(X\approx4.045\times10^6\) with 4 degrees of freedom, yields a p-value far smaller than R can represent: even a p-value threshold of \(1\times10^{-32}\) would correspond to a critical value of a mere 156.
We can thus confidently say that the racial makeup of NYC is different from that of the entire USA.
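The whole computation can be reproduced with scipy.stats.chisquare; a sketch:

```python
import numpy as np
from scipy import stats

# NYC 2000 census counts, and the nationwide proportions
observed = np.array([3579700, 2130202, 784811, 1121159, 392406])
p_usa = np.array([0.751, 0.123, 0.036, 0.066, 0.024])

# expected counts: NYC total times the USA proportions
res = stats.chisquare(observed, f_exp=observed.sum()*p_usa)
# res.statistic is about 4.045e6; res.pvalue underflows to zero
```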
A paper1 investigates Benford’s Law (first digits of “natural” numbers often distribute as \(\PP(D=d) = \log_{10}\left(1+1/d\right)\)). They report, from a collection of genomic data, observed digit frequencies:
import numpy as np
from tabulate import tabulate

observed = np.array([48, 14, 12, 6, 18, 5, 7, 8, 9])
expected = observed.sum()*np.log10(1+1/np.arange(1,10))
print(tabulate(
    [[""] + np.arange(1,10).tolist(),
     ["Observed"] + observed.tolist(),
     ["Predicted %"] + (np.log10(1+1/np.arange(1,10))*100).tolist(),
     ["Expected"] + expected.tolist(),
     ["Difference"] + (observed-expected).tolist(),
     ["Std. Resid."] + ((observed-expected)/np.sqrt(expected)).tolist(),
     ["Chi-2 Term"] + ((observed-expected)**2/expected).tolist()
    ],
    headers="firstrow"
))
 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
---|---|---|---|---|---|---|---|---|---|
Observed | 48 | 14 | 12 | 6 | 18 | 5 | 7 | 8 | 9 |
Predicted % | 30.103 | 17.6091 | 12.4939 | 9.691 | 7.91812 | 6.69468 | 5.79919 | 5.11525 | 4.57575 |
Expected | 38.2308 | 22.3636 | 15.8672 | 12.3076 | 10.056 | 8.50224 | 7.36498 | 6.49637 | 5.8112 |
Difference | 9.76919 | -8.36359 | -3.86722 | -6.30757 | 7.94398 | -3.50224 | -0.364977 | 1.50363 | 3.1888 |
Std. Resid. | 1.57998 | -1.76857 | -0.970842 | -1.79794 | 2.5051 | -1.2011 | -0.134487 | 0.589937 | 1.3228 |
Chi-2 Term | 2.49634 | 3.12784 | 0.942534 | 3.2326 | 6.27553 | 1.44264 | 0.0180867 | 0.348025 | 1.7498 |
Note that since only the cell for digit 9 has a small expected count (\(E\approx5.8\)), Cochran’s condition is fulfilled (all cells \(E\geq 1\), at most 20% of cells with \(1\leq E\leq 5\), here at most \(\approx 11\%\)).
We get an accumulated chi-2 test statistic of 19.6334 which, with \(9-1=8\) degrees of freedom, yields a p-value of 0.0118.
This p-value would support a claim that this data set does not seem to fully conform to Benford’s Law.
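The test itself is a one-liner with scipy.stats.chisquare, for instance:

```python
import numpy as np
from scipy import stats

# observed first-digit counts and the Benford-predicted expected counts
observed = np.array([48, 14, 12, 6, 18, 5, 7, 8, 9])
expected = observed.sum() * np.log10(1 + 1/np.arange(1, 10))

X, p = stats.chisquare(observed, f_exp=expected)
# X is about 19.63 with 8 degrees of freedom, p about 0.0118
```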
The chi-squared test for Goodness of Fit has to be modified a bit when the hypotheses are composite - i.e., when there is more than one possible set of probability assignments to the categories under the null hypothesis.
Similar to the likelihood ratio tests, the way out here is to perform the test for the most likely set of parameters determining the probability assignments. In other words, before performing the chi-squared test, we would estimate some number of parameters with maximum likelihood.
The effect of these estimations is to lower the degrees of freedom for the chi-square distribution by the number of estimates we have used.
Pearson’s Chi-Squared Test of Goodness of Fit with estimated parameters
Flashlights are sold with 4 batteries included. A random sample of 150 packages is taken, and the number of defective batteries reported:
Defective | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Frequency | 26 | 51 | 47 | 16 | 10 |
We are testing whether these numbers indicate a good fit with a Binomial model for defective batteries: i.e., is it the case that \(X\sim Binomial(4,\theta)\) for some appropriate \(\theta\)?
The likelihood is \[ \begin{align*} \mathcal{L} &= \PP(X=0)^{26}\cdot\PP(X=1)^{51}\cdot\PP(X=2)^{47}\cdot\PP(X=3)^{16}\cdot\PP(X=4)^{10} \\ &= \left({4\choose 0}\theta^0(1-\theta)^4\right)^{26}\cdot \left({4\choose 1}\theta^1(1-\theta)^3\right)^{51}\cdot \left({4\choose 2}\theta^2(1-\theta)^2\right)^{47}\cdot \\ &\phantom{=}\cdot \left({4\choose 3}\theta^3(1-\theta)^1\right)^{16}\cdot \left({4\choose 4}\theta^4(1-\theta)^0\right)^{10} \\ &= {4\choose 0}^{26}\cdot{4\choose 1}^{51}\cdot{4\choose 2}^{47}\cdot{4\choose 3}^{16}\cdot{4\choose 4}^{10}\cdot\theta^{51+47\cdot2+16\cdot3+10\cdot4}(1-\theta)^{26\cdot4+51\cdot3+47\cdot2+16} \end{align*} \]
The log likelihood is (writing \(C\) for the log of the product of powers of binomial coefficients): \[ \begin{align*} \ell &= C + 233\log\theta + 367\log(1-\theta) \\ \frac{d\ell}{d\theta} &= {233\over\theta} + {367\over1-\theta}\cdot(-1) = 0 \\ 233(1-\theta) &= 367\theta \\ 233 &= 600\theta & \hat\theta &= 233/600\approx 0.38833 \end{align*} \]
With the MLE \(\hat\theta=233/600\), we now compute expected cell counts for the binomial model, and the corresponding chi-square test statistic.
defective | 0.0000 | 1.0000 | 2.0000 | 3.0000 | 4.0000 |
observed | 26.0000 | 51.0000 | 47.0000 | 16.0000 | 10.0000 |
expected | 20.9967 | 53.3213 | 50.7787 | 21.4922 | 3.4112 |
difference | 5.0033 | -2.3213 | -3.7787 | -5.4922 | 6.5888 |
residual | 1.0919 | -0.3179 | -0.5303 | -1.1847 | 3.5674 |
chisq | 1.1922 | 0.1011 | 0.2812 | 1.4035 | 12.7263 |
Summing the chi-square terms in the table we get a test statistic of \(X\approx15.704\).
With 5 cells and 1 estimated parameter, \(k-1-m = 5-1-1 = 3\) degrees of freedom, which means that the observed test statistic corresponds to a p-value of \(p\approx0.0013\).
This small p-value gives us reason to reject the null hypothesis: the binomial model does not appear to be a good fit for the observed data - the excess of packages with all 4 batteries defective is particularly striking, with that cell alone contributing 12.73 to the statistic.
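The expected counts and the test statistic can be reproduced in Python, for instance (variable names are ours):

```python
import numpy as np
from scipy import stats

freq = np.array([26, 51, 47, 16, 10])   # packages with 0..4 defective batteries
n = freq.sum()                          # 150 packages

# MLE: total defectives over total batteries = 233/600
theta = (freq * np.arange(5)).sum() / (4*n)

expected = n * stats.binom.pmf(np.arange(5), 4, theta)
X = ((freq - expected)**2 / expected).sum()
p = stats.chi2.sf(X, df=5 - 1 - 1)
# X is about 15.70, p about 0.0013
```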
For some cases it may make sense to group categories together in order to make the minimal expected cell count \(\geq 5\). However, when doing this, there are two complications that enter:
Theorem
Let \(\hat\theta_1,\dots,\hat\theta_m\) be MLEs of \(\theta_1,\dots,\theta_m\) based on estimating with the full sample \(X_1,\dots,X_n\) rather than the grouped data. Then the critical value \(c_\alpha\) that specifies a level \(\alpha\) upper-tailed test satisfies \[ \chi^2_{\alpha,k-1-m}\leq c_\alpha\leq \chi^2_{\alpha,k-1} \]
A test procedure that uses MLEs from the full un-grouped data thus has a region without a clear result, in addition to a reject region and a region that fails to reject.
Investigators looked at 300 chromosomes and counted the number of sister-chromatid exchanges on each. They hypothesize a Poisson model.
Compute an MLE \(\hat\lambda\), and then group \(\geq 8\) into a single cell and check goodness of fit.
The likelihood is: \[ \begin{align*} \mathcal{L} &= p(0)^{6}\cdot p(1)^{24}\cdot p(2)^{42}\cdot p(3)^{59}\cdot p(4)^{62}\cdot p(5)^{44}\cdot p(6)^{41}\cdot p(7)^{14}\cdot p(8)^{6}\cdot p(9)^{2} \\ &= \left({\lambda^0e^{-\lambda}\over 0!}\right)^{6}\cdot \left({\lambda^1e^{-\lambda}\over 1!}\right)^{24}\cdot \left({\lambda^2e^{-\lambda}\over 2!}\right)^{42}\cdot \left({\lambda^3e^{-\lambda}\over 3!}\right)^{59}\cdot \left({\lambda^4e^{-\lambda}\over 4!}\right)^{62}\cdot \\ &\phantom{=}\cdot \left({\lambda^5e^{-\lambda}\over 5!}\right)^{44}\cdot \left({\lambda^6e^{-\lambda}\over 6!}\right)^{41}\cdot \left({\lambda^7e^{-\lambda}\over 7!}\right)^{14}\cdot \left({\lambda^8e^{-\lambda}\over 8!}\right)^{6}\cdot \left({\lambda^9e^{-\lambda}\over 9!}\right)^{2} \\ &= {\lambda^{\overbrace{24+42\cdot2+59\cdot3+62\cdot4+44\cdot5+41\cdot6+14\cdot7+6\cdot8+2\cdot9}^{=1163}}e^{-300\lambda}\over (0!)^{6}\cdot(1!)^{24}\cdot\ldots\cdot(9!)^{2}} \end{align*} \]
The log likelihood is (writing \(C\) for the log of the product of factorials): \[ \begin{align*} \ell &= 1163\log\lambda - 300\lambda - C \\ {d\ell \over d\lambda} &= {1163\over\lambda} - 300 & \hat\lambda &= {1163\over 300}\approx 3.87666 \end{align*} \]
Observed and expected counts, assuming a \(Poisson(3.87666)\) model, are:
x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
observed | 6 | 24 | 42 | 59 | 62 | 44 | 41 | 14 | 6 | 2 |
expected | 6.22 | 24.10 | 46.71 | 60.36 | 58.50 | 45.35 | 29.30 | 16.23 | 7.86 | 3.39 |
But we cannot use this as-is: the expected counts from 9 and up are too small (and there are infinitely many such categories), so we need to group values.
Grouping the data, we get instead:
df.16[9,] = df.16[9,] + df.16[10,]
df.16 = df.16[-10,]
df.16$x = as.character(df.16$x)
df.16$x[9] = "≥8"
df.16$expected[9] = 300*ppois(7, 1163/300, lower.tail = FALSE)
df.16$difference = df.16$observed-df.16$expected
df.16$residual = df.16$difference/sqrt(df.16$expected)
df.16$chisq = df.16$residual^2
df.16 %>% kable(digits=2)
x | observed | expected | difference | residual | chisq |
---|---|---|---|---|---|
0 | 6 | 6.22 | -0.22 | -0.09 | 0.01 |
1 | 24 | 24.10 | -0.10 | -0.02 | 0.00 |
2 | 42 | 46.71 | -4.71 | -0.69 | 0.47 |
3 | 59 | 60.36 | -1.36 | -0.18 | 0.03 |
4 | 62 | 58.50 | 3.50 | 0.46 | 0.21 |
5 | 44 | 45.35 | -1.35 | -0.20 | 0.04 |
6 | 41 | 29.30 | 11.70 | 2.16 | 4.67 |
7 | 14 | 16.23 | -2.23 | -0.55 | 0.31 |
≥8 | 8 | 13.24 | -5.24 | -1.44 | 2.07 |
These observations combine to a test statistic of \(X\approx7.811\).
With 9 categories and 1 estimated quantity, the lower bound has 7 degrees of freedom and the upper bound has 8 degrees of freedom.
So for comparing with the lower bound, we need a p-value with 7 DoF; for comparing with the upper bound, we need it with 8 DoF. We get \(p\approx0.35\) (7 DoF) and \(p\approx0.45\) (8 DoF).
These p-values give us a clear signal that we fail to reject the null hypothesis. A Poisson model may work quite well for this situation.
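This example, too, can be reproduced in Python; a sketch (variable names are ours):

```python
import numpy as np
from scipy import stats

counts = np.array([6, 24, 42, 59, 62, 44, 41, 14, 6, 2])  # exchanges 0..9
lam = (counts * np.arange(10)).sum() / counts.sum()        # MLE 1163/300

# group ">= 8" into a single cell, using the Poisson tail for its probability
obs = np.append(counts[:8], counts[8:].sum())
probs = np.append(stats.poisson.pmf(np.arange(8), lam), stats.poisson.sf(7, lam))
exp = counts.sum() * probs

X = ((obs - exp)**2 / exp).sum()
p_low = stats.chi2.sf(X, df=7)   # lower bound: k-1-m = 9-1-1
p_high = stats.chi2.sf(X, df=8)  # upper bound: k-1
```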
This test is one option1 for testing Goodness of Fit against continuous distributions.
The way to do that is by transforming the continuous distribution fit into a discrete fit: we compare histogram bin counts to the counts the fitted distribution leads us to expect.
Histogram bins should be picked to ensure either Fisher’s or Cochran’s criterion is fulfilled - common methods include either partitioning into equal-width bins, or into equal-expected-count bins.
Since the specific distribution we are fitting against is usually estimated on the full (un-binned) data, the same requirements apply as in the grouped data case above: use both \(k-1\) and \(k-m-1\) degrees of freedom, for \(k\) categories and \(m\) independent estimated parameters:
Reject if the \(k-1\) DoF p-value indicates rejection; fail to reject if the \(k-m-1\) DoF p-value indicates failure to reject; the test is inconclusive if neither p-value supports its conclusion.
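This decision rule can be packaged as a small helper function (a sketch; the function name and signature are ours):

```python
from scipy import stats

# Decision rule for the grouped-data test with full-data MLEs: compare the
# statistic against both chi-squared bounds; in between, the test is inconclusive.
def grouped_gof_decision(X, k, m, alpha=0.05):
    if stats.chi2.sf(X, df=k-1) < alpha:      # beyond the upper bound: reject
        return "reject"
    if stats.chi2.sf(X, df=k-1-m) >= alpha:   # below the lower bound: fail to reject
        return "fail to reject"
    return "inconclusive"
```

For \(k=9\) cells and \(m=1\) estimated parameter at \(\alpha=0.05\), a statistic between \(\chi^2_{0.05,7}\approx14.07\) and \(\chi^2_{0.05,8}\approx15.51\) lands in the inconclusive region.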
In 10.78, we were considering a set of scores from a control group and an experimental group, and used bootstraps to create confidence intervals due to doubts of the normality of the original data.
import pandas as pd

scores = pd.DataFrame({
"score": [34,27,26,33,23,37,24,34,22,23,32, 5,
30,35,28,25,37,28,26,29,22,33,31,23,
37,29, 0,30,34,26,28,27,32,29,31,33,
28,21,34,29,33, 6, 8,29,36, 7,21,30,
28,34,28,35,30,34, 9,38, 9,27,25,33,
9 ,23,32,25,37,28,23,26,34,32,34, 0,
24,30,36,28,38,35,16,37,25,34,38,34,31, # end of E
37,22,29,29,33,22,32,36,29, 6, 4,37,
0 ,36, 0,32,27, 7,19,35,26,22,28,28,
32,35,28,33,35,24,21, 0,32,28,27, 8,
30,37, 9,33,30,36,28, 3, 8,31,29, 9,
0 , 0,35,25,29, 3,33,33,28,32,39,20,
32,22,24,20,32, 7, 8,33,29, 9, 0,30,
26,25,32,38,22,29,29],
"group": ["E"]*85 + ["C"]*79
})
from matplotlib import pyplot
import seaborn.objects as so

f = pyplot.figure(figsize=(8,6))
(
so.Plot(data=scores, x="score")
.on(f)
.add(so.Line(), so.KDE(), color="group")
.add(so.Line(color="black"), so.KDE())
.label(title="Score Distributions for the Experimental and Control groups")
.show()
)
We can use the chi-square test of Goodness of Fit to check whether this could reasonably have been a normal distribution at all. Since we have two groups, we should check for normality of the residuals: we standardize each group separately, subtracting the group mean and dividing by the group standard deviation.
The plan now is to divide up the range into bins, count each bin, and check whether the counts match our expectations if this had been a normal distribution.
With 164 observations total, if we are to fulfill Fisher’s criterion, each bin must have no less than 5 expected entries. So with equal-mass bins we can split our data into 32 bins, each with an expected probability mass of 1/32.
Our bin boundaries thus are simply the standard normal quantiles of multiples of \(1/32\) - ie, \(\Phi^{-1}(k/32)\):
array([ -inf, -1.86273187, -1.53412054, -1.3180109 , -1.15034938,
-1.00999017, -0.88714656, -0.77642176, -0.67448975, -0.57913216,
-0.48877641, -0.40225007, -0.31863936, -0.23720211, -0.15731068,
-0.07841241, 0. , 0.07841241, 0.15731068, 0.23720211,
0.31863936, 0.40225007, 0.48877641, 0.57913216, 0.67448975,
0.77642176, 0.88714656, 1.00999017, 1.15034938, 1.3180109 ,
1.53412054, 1.86273187, inf])
From our construction, we expect 5.125 entries in each bin. We can use, for instance, the numpy.histogram command to split the actual data we have into their respective bins, and we get the observed counts:
import numpy as np
import scipy.stats
from tabulate import tabulate

# standardize each group separately: subtract the group mean,
# divide by the group standard deviation
scores["zscore"] = scores.groupby("group").score.transform(
    lambda s: (s - s.mean())/s.std()
)
# bin boundaries: the standard normal quantiles of k/32, k = 0,...,32
boundaries = scipy.stats.norm.ppf(np.arange(33)/32)

hist, _ = np.histogram(scores.zscore, bins=boundaries)
diff = hist - 164/32
resid = (hist - 164/32)/np.sqrt(164/32)
resid2 = resid**2
X = resid2.sum()
p_r = scipy.stats.chi2(31).sf(X)
p_a = scipy.stats.chi2(29).sf(X)
table_resid = [f"{r:.3f}" for r in resid.tolist()]
table_resid2 = [f"{r:.3f}" for r in resid2.tolist()]
table = [
    ["Observed"] + hist[:16].tolist(),
    [""] + hist[16:].tolist(),
    ["Differences"] + diff[:16].tolist(),
    [""] + diff[16:].tolist(),
    ["Residuals"] + table_resid[:16],
    [""] + table_resid[16:],
    ["Chi2 Terms"] + table_resid2[:16],
    [""] + table_resid2[16:]
]
print(tabulate(table, tablefmt="simple"))
Observed | 15 | 4 | 5 | 4 | 0 | 0 | 0 | 2 | 2 | 5 | 1 | 4 | 5 | 5 | 4 | 3 |
10 | 2 | 7 | 7 | 6 | 11 | 7 | 6 | 18 | 9 | 6 | 11 | 5 | 0 | 0 | 0 | |
Differences | 9.875 | -1.125 | -0.125 | -1.125 | -5.125 | -5.125 | -5.125 | -3.125 | -3.125 | -0.125 | -4.125 | -1.125 | -0.125 | -0.125 | -1.125 | -2.125 |
4.875 | -3.125 | 1.875 | 1.875 | 0.875 | 5.875 | 1.875 | 0.875 | 12.875 | 3.875 | 0.875 | 5.875 | -0.125 | -5.125 | -5.125 | -5.125 | |
Residuals | 4.362 | -0.497 | -0.055 | -0.497 | -2.264 | -2.264 | -2.264 | -1.38 | -1.38 | -0.055 | -1.822 | -0.497 | -0.055 | -0.055 | -0.497 | -0.939 |
2.153 | -1.38 | 0.828 | 0.828 | 0.387 | 2.595 | 0.828 | 0.387 | 5.687 | 1.712 | 0.387 | 2.595 | -0.055 | -2.264 | -2.264 | -2.264 | |
Chi2 Terms | 19.027 | 0.247 | 0.003 | 0.247 | 5.125 | 5.125 | 5.125 | 1.905 | 1.905 | 0.003 | 3.32 | 0.247 | 0.003 | 0.003 | 0.247 | 0.881 |
4.637 | 1.905 | 0.686 | 0.686 | 0.149 | 6.735 | 0.686 | 0.149 | 32.345 | 2.93 | 0.149 | 6.735 | 0.003 | 5.125 | 5.125 | 5.125 |
Summing up the chi-squared contributions we get the test statistic and p-value: X: 116.585, p: 1.853e-12 (31 DoF to reject); 7.612e-12 (29 DoF to fail to reject), which is strong support for the claim that the normal distribution does not accurately describe the data.
Alternatively, if we want to work with equal width bins (except for the tails), we could, for instance, split the normal distribution into tails that each expect 5 entries, and split the distance between the tails into equal width parts so that the smallest of them expects 5 entries. With 164 observations, this means we need to compute:
\[ {5\over 164}\approx 0.0305 \qquad z_{low} = \Phi^{-1}(0.0305)\approx -1.873 \qquad z_{high} \approx 1.873 \]
Computing also \(\Phi^{-1}(2\cdot 0.0305)\approx -1.546\) we get a binwidth of \(0.327\), producing about \(11.5\) bins between \(z_{low}\) and \(z_{high}\).
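These quantities can be verified with scipy; a sketch:

```python
import numpy as np
from scipy import stats

p_tail = 5/164                        # each tail should expect 5 entries
z_low = stats.norm.ppf(p_tail)        # about -1.873
z_high = -z_low

# the width that gives the smallest interior bin an expected count of 5
width = stats.norm.ppf(2*p_tail) - z_low    # about 0.327
n_bins = (z_high - z_low) / width           # about 11.5
```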
Putting in exactly 11 bins (slightly too large is better than slightly too small), we get the following:
evenboundaries = [-np.inf] + np.linspace(-1.873, 1.873, 12).tolist() + [np.inf]
evenhist, _ = np.histogram(scores.zscore, bins=evenboundaries)
evenprob = [scipy.stats.norm.cdf(hi)-scipy.stats.norm.cdf(lo) for (lo,hi) in zip(evenboundaries[:-1], evenboundaries[1:])]
evenexpected = [len(scores)*ep for ep in evenprob]
O = np.array(evenhist)
E = np.array(evenexpected)
print(tabulate([
["Observed"] + evenhist.tolist(),
["Expected"] + evenexpected,
["Difference"] + (O-E).tolist(),
["Std. Resid."] + ((O-E)/np.sqrt(E)).tolist(),
["Chi2 Term"] + ((O-E)**2/E).tolist()
], floatfmt=".2f"))
Observed | 15.00 | 4.00 | 9.00 | 0.00 | 4.00 | 15.00 | 24.00 | 31.00 | 37.00 | 20.00 | 5.00 | 0.00 | 0.00 |
Expected | 5.01 | 5.28 | 8.85 | 13.22 | 17.62 | 20.94 | 22.17 | 20.94 | 17.62 | 13.22 | 8.85 | 5.28 | 5.01 |
Difference | 9.99 | -1.28 | 0.15 | -13.22 | -13.62 | -5.94 | 1.83 | 10.06 | 19.38 | 6.78 | -3.85 | -5.28 | -5.01 |
Std. Resid. | 4.47 | -0.56 | 0.05 | -3.64 | -3.25 | -1.30 | 0.39 | 2.20 | 4.62 | 1.86 | -1.29 | -2.30 | -2.24 |
Chi2 Term | 19.94 | 0.31 | 0.00 | 13.22 | 10.53 | 1.68 | 0.15 | 4.84 | 21.31 | 3.47 | 1.67 | 5.28 | 5.01 |
With a chi-2 test statistic of 87.41, we get a p-value of 1.5624e-13 (12 DoF to reject) and 1.7437e-14 (10 DoF to fail to reject).
There are several common tests for normality - all of them have as null hypothesis that the data follows a normal distribution: a rejected null hypothesis means you cannot use methods that assume normality.
- Anderson-Darling: scipy.stats.anderson in Python; in R, the nortest package.
- Shapiro-Wilk: scipy.stats.shapiro in Python; in R, shapiro.test (with further variants in nortest).
- Kolmogorov-Smirnov: scipy.stats.kstest in Python. Give scipy.stats.norm(mu,sigma).cdf as second argument. In R, ks.test. Give "norm" as second argument, and mean and standard deviation as additional arguments as if giving them to pnorm.
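As a usage sketch of the Python variants (on simulated data; all parameter choices here are arbitrary):

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
data = rng.normal(loc=10, scale=2, size=200)

# Shapiro-Wilk: H0 is that the data comes from a normal distribution
stat, p = scipy.stats.shapiro(data)

# Kolmogorov-Smirnov against a normal distribution fitted to the data
mu, sigma = data.mean(), data.std()
ks = scipy.stats.kstest(data, scipy.stats.norm(mu, sigma).cdf)
```

Since the simulated data really is normal, these tests should usually fail to reject.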