Categorical Data

Mikael Vejdemo-Johansson

One proportion, Two proportions, Many proportions

Many of our tests have focused on comparing proportions of labels in some categorical data:

Null Hypothesis | Test Type
\(H_0:p = p_0\) | Binomial test, Normal approximation test of proportions
\(H_0:p_1 = p_2\) | Normal approximation test of proportions, McNemar’s test

We now take the next step in this progression: comparing proportions across more than two subgroups of a set of observations.

For a starting point, we will assume that we have some number of observations following a multinomial probability distribution.

Comparing many proportions at once

Given a vector of probabilities \(\boldsymbol p = (p_1,\dots,p_m)\) that sum to 1, the Multinomial distribution \(Multinomial(n,\boldsymbol p)\) counts the number of outcomes from \(n\) draws of a value from \(1,\dots,m\) with probabilities \(\PP(\text{outcome }j)=p_j\). Each observation \(\boldsymbol k=(k_1,\dots,k_m)\) is a vector of integers, summing to \(n\).

By just picking any one outcome and calling it success, it is easy to see that any one entry \(k_j\sim Binomial(n, p_j)\).

When developing McNemar’s test in Lecture 15, we established that for two entries of a Multinomial vector \(k_i, k_j\), their covariance is \(Cov(k_i, k_j) = -np_ip_j\)
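
As a quick sanity check (not part of the lecture), we can verify this covariance formula by simulation; the sample size, probabilities and number of replicates below are arbitrary illustrative choices.

Code
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, np.array([0.2, 0.3, 0.5])
draws = rng.multinomial(n, p, size=50_000)      # 50 000 Multinomial(n, p) vectors

empirical = np.cov(draws[:, 0], draws[:, 1])[0, 1]
print(empirical, -n * p[0] * p[1])              # both should be close to -6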

Since each entry is binomial, the expected value of any one entry \(\EE[k_j]=np_j\). It is very attractive to try to build a pivot or a test statistic out of the squared errors \((k_j-np_j)^2\) – however, these are sensitive to scale: an error of 5 yields much more information if we expected 10 than if we expected 100.

Pearson suggested that we may focus on the quantity \(X = \sum{(k_j-np_j)^2\over np_j}\) to see whether a proposed multinomial distribution seems likely.
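
As a sketch of what this looks like in code (my own illustration, with arbitrary \(n\) and \(\boldsymbol p\)), the statistic is a one-liner once we have the observed counts:

Code
import numpy as np
import scipy.stats

rng = np.random.default_rng(1)
n, p = 200, np.array([0.1, 0.2, 0.3, 0.4])
k = rng.multinomial(n, p)                       # one observed count vector

X = ((k - n * p) ** 2 / (n * p)).sum()          # Pearson's X
print(X, scipy.stats.chi2(len(p) - 1).sf(X))    # asymptotic p-value, see the theorem below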

Asymptotics for \(X\)

Following the 6th proof in Benhamou and Melot: Seven proofs of the Pearson Chi-squared independence test and its graphical interpretation, we will prove Pearson’s Theorem next:

Theorem

Let \(\boldsymbol K\sim Multinomial(n,\boldsymbol p)\). Then, as \(n\to\infty\), \[ X=\sum{(K_j-np_j)^2\over np_j}\xrightarrow{dist.}\chi^2(m-1) \]

Proof

We will proceed with a proof by induction over the number of categories \(m\).

Pearson’s Theorem

Theorem

\[ X=\sum{(K_j-np_j)^2\over np_j}\xrightarrow{dist.}\chi^2(m-1) \]

Proof

Induction base: If \(m=2\), \[ \begin{align*} X={(k-np)^2\over np}+{((n-k)-nq)^2\over nq} &= {q(k-np)^2+p((n-k)-nq)^2\over npq} = \\ = {(k-np)^2q + (\overbrace{n(1-q)-k}^{=np-k})^2p\over npq} &= {(k-np)^2(\overbrace{q+p}^{=1})\over npq} = \left(k-np\over \sqrt{npq}\right)^2 \end{align*} \]

We recognize this last fraction: since \(k\sim Binomial(n,p)\), the Central Limit Theorem gives \({k-np\over\sqrt{npq}}\xrightarrow{dist.}\mathcal{N}(0,1)\), and hence \(X\xrightarrow{dist.}\chi^2(1)\).
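
A quick numeric illustration (mine, with arbitrary numbers) of the identity we just derived: for \(m=2\) the two-cell statistic is exactly the squared \(z\)-statistic from the normal approximation test of a proportion.

Code
import numpy as np

n, p, k = 50, 0.3, 19                               # illustrative values
q = 1 - p
X = (k - n*p)**2/(n*p) + ((n - k) - n*q)**2/(n*q)   # two-cell Pearson statistic
z = (k - n*p)/np.sqrt(n*p*q)                        # normal-approximation z-statistic
print(X, z**2)                                      # identical up to rounding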

Pearson’s Theorem

Theorem

\[ X=\sum{(K_j-np_j)^2\over np_j}\xrightarrow{dist.}\chi^2(m-1) \]

Proof

Induction Step: The induction step is encapsulated in the following Proposition:

Proposition

If for every \(m\)-dimensional multinomial distribution we know \(X\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X\xrightarrow{dist.}\chi^2(m)\).

In the following proof, we will add \(m\) and \(m+1\) as indices to disambiguate between the currently considered \(m\)-dimensional and \((m+1)\)-dimensional cases.

Pearson’s Theorem - induction step

Proposition

If for every \(m\)-dimensional multinomial distribution we know \(X_{m}\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X_{m+1}\xrightarrow{dist.}\chi^2(m)\).

Proof of the proposition

Suppose \(\boldsymbol k\sim Multinomial(n,\boldsymbol p)\) is the \(m+1\)-dimensional case. We will rewrite \(X_{m+1}\) as follows: \[ \begin{align*} X_{m+1} &= \sum_{j=1}^{m+1}{(k_{j}-np_{j})^2\over np_{j}} \\ &= \sum_{j=1}^{m-1}{(k_{j}-np_{j})^2\over np_{j}} + {(k_{m}-np_{m})^2\over np_{m}} + {(k_{m+1}-np_{m+1})^2\over np_{m+1}} \\ &= \sum_{j=1}^{m-1}{(k_{j}-np_{j})^2\over np_{j}} + {((k_{m}+k_{m+1})-n(p_{m}+p_{m+1}))^2\over n(p_{m}+p_{m+1})} + U^2 \\ &= X_{m} + U^2 \end{align*} \]

where \(X_m\) corresponds to observing \(Multinomial(n, (p_{1},\dots,p_{m-1},(p_{m}+p_{m+1})))\): the multinomial distribution where we just combined the last two categories into one category.

Pearson’s Theorem - induction step

Proposition

If for every \(m\)-dimensional multinomial distribution we know \(X_{m}\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X_{m+1}\xrightarrow{dist.}\chi^2(m)\).

Proof of the proposition

This combination will of course not be an exact correspondence - there’s an error term that we name \(U^2\) and that is defined as \[ U^2 = {(k_{m}-np_{m})^2\over np_{m}} + {(k_{m+1}-np_{m+1})^2\over np_{m+1}} - {((k_{m}+k_{m+1})-n(p_{m}+p_{m+1}))^2\over n(p_{m}+p_{m+1})} \]

Pearson’s Theorem - induction step

Proposition

If for every \(m\)-dimensional multinomial distribution we know \(X_{m}\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X_{m+1}\xrightarrow{dist.}\chi^2(m)\).

Proof of the proposition

We have two steps left to finish the proof.

First, since this \(X_m\) is built from an \(m\)-dimensional multinomial distribution, \(X_m\xrightarrow{dist.}\chi^2(m-1)\) by our assumption.

Second, we will prove \(U^2\sim\chi^2(1)\). By the subtraction property for chi-squared variables that we used before, this will finish the proof.

Pearson’s Theorem - induction step

Proposition

If for every \(m\)-dimensional multinomial distribution we know \(X_{m}\xrightarrow{dist.}\chi^2(m-1)\) with our definitions and setup, then for any \(m+1\)-dimensional multinomial distribution, \(X_{m+1}\xrightarrow{dist.}\chi^2(m)\).

Proof of the proposition

Write \(T_m=k_m-np_m\) and \(T_{m+1}=k_{m+1}-np_{m+1}\).

\[ \begin{align*} U^2 &= {T_m^2\over np_m} + {T_{m+1}^2\over np_{m+1}} -{(\color{Teal}{k_{m}}+\color{DarkMagenta}{k_{m+1}}-n(\color{Teal}{p_m}+\color{DarkMagenta}{p_{m+1}}))^2\over n(p_m+p_{m+1})} \\ &= {{T_m^2\over p_m}(p_m+p_{m+1})+{T_{m+1}^2\over p_{m+1}}(p_m+p_{m+1})-(\color{Teal}{T_m}+\color{DarkMagenta}{T_{m+1}})^2\over n(p_m+p_{m+1})} \\ &= {\color{SteelBlue}{T_m^2}+\color{DarkGoldenrod}{T_{m+1}^2}+{p_{m+1}\over p_m}T_m^2 + {p_m\over p_{m+1}}T_{m+1}^2-(\color{SteelBlue}{T_m^2}+2T_mT_{m+1}+\color{DarkGoldenrod}{T_{m+1}^2})\over n(p_m+p_{m+1})} \end{align*} \]

Pearson’s Theorem - induction step

Proof of the proposition

\[ \begin{align*} U^2 &= {1\over n(p_m+p_{m+1})}\left({{p_{m+1}\over p_m}T_m^2 + {p_m\over p_{m+1}}T_{m+1}^2-2T_mT_{m+1}}\right) \\ &= {1\over n(p_m+p_{m+1})}\left(\sqrt{p_{m+1}\over p_m}T_m - \sqrt{p_m\over p_{m+1}}T_{m+1}\right)^2 \\ &= {1\over n(p_m+p_{m+1})}\left({p_{m+1}}T_m - {p_m}T_{m+1}\over\sqrt{p_mp_{m+1}}\right)^2 \\ &= \left({p_{m+1}}T_m - {p_m}T_{m+1}\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}\right)^2 = V^2 \end{align*} \]

Pearson’s Theorem - induction step

Proof of the proposition

\[ \begin{align*} {p_{m+1}}T_m - {p_m}T_{m+1} &= {p_{m+1}}(k_m-np_m) - {p_m}(k_{m+1}-np_{m+1}) \\ &= k_mp_{m+1} - k_{m+1}p_m + np_{m+1}p_m - np_mp_{m+1} \\ &= k_mp_{m+1} - k_{m+1}p_m \end{align*} \]

Hence \(V = {k_mp_{m+1}-k_{m+1}p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}}\) is a linear combination of two components that each converges in distribution to a normal distribution - hence, their linear combination converges to a normal distribution. Which one? Let’s compute \(\EE\) and \(\VV\)!

Pearson’s Theorem - induction step

Proof of the proposition

\[ \begin{align*} \EE[V] &= \EE\left[{k_mp_{m+1}-k_{m+1}p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}}\right] \\ &= {\EE[k_m]p_{m+1}-\EE[k_{m+1}]p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}} \\ &= {np_mp_{m+1}-np_{m+1}p_m\over\sqrt{np_mp_{m+1}(p_m+p_{m+1})}} = 0 \end{align*} \]

Pearson’s Theorem - induction step

Proof of the proposition

\[ \begin{align*} \VV[k_mp_{m+1}-k_{m+1}p_m] &= p_{m+1}^2\VV[k_m] + p_m^2\VV[k_{m+1}] -2p_mp_{m+1}Cov(k_m,k_{m+1})\\ &= p_{m+1}^2np_mq_m+p_m^2np_{m+1}q_{m+1}-2p_mp_{m+1}(-np_mp_{m+1}) \\ &= np_mp_{m+1}(p_{m+1}-p_{m}p_{m+1}+p_m-p_{m}p_{m+1}+2p_mp_{m+1}) \\ &= np_mp_{m+1}(p_m+p_{m+1}) \end{align*} \]

This is exactly the quantity under the square root in the denominator of \(V\). Hence \(V\xrightarrow{dist.}\mathcal{N}(0,1)\) and thus \(U^2\xrightarrow{dist.}\chi^2(1)\).

Strictly speaking, we would also need to show that \(U^2\) is independent of \(X_m\), which can be done by computing the covariance of any one term in \(X_m\) with \(U^2\) and arguing about multivariate normal vectors.

Goodness of Fit

Validity of the approximations of Pearson’s theorem
Fisher (1925): valid if \(np_i\geq 5\) for each of the \(p_i\).
Cochran (1954): Fisher is too conservative. Instead, \(np_i\geq 1\) for all cells, and \(np_i\geq 5\) for at least 80% of the cells (so that no more than 20% fall in \(1\leq np_i\leq 5\))
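
A small helper (hypothetical, with my own naming) that checks both conditions for a vector of expected cell counts; the example input is the vector of expected counts from the Benford example further below, rounded.

Code
import numpy as np

def check_conditions(expected):
    """Check Fisher's and Cochran's validity conditions for a vector of expected counts."""
    expected = np.asarray(expected, dtype=float)
    fisher = bool((expected >= 5).all())
    cochran = bool((expected >= 1).all() and (expected >= 5).mean() >= 0.8)
    return {"fisher": fisher, "cochran": cochran}

print(check_conditions([38.2, 22.4, 15.9, 12.3, 10.1, 8.5, 7.4, 6.5, 5.8]))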

Pearson’s Chi-Squared Test of Goodness of Fit

Null Hypothesis
\(H_0:\boldsymbol p=\boldsymbol p_0\) (ie \(p_1=p_{1,0}, p_2=p_{2,0},\dots\))
Alternative Hypothesis
\(H_a:\) at least one \(p_i\neq p_{i,0}\).
Test Statistic
\(X = \sum{(n_i - np_{i,0})^2\over np_{i,0}} = \sum{(O_i-E_i)^2\over E_i}\)
Where we write \(O_i\) for the observed count of the i:th category and \(E_i\) for the expected count.
Rejection Region
\(X \geq \chi^2_{\alpha,k-1}\)
p-value
\(1-CDF_{\chi^2(k-1)}(X)\)
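
In Python, scipy.stats.chisquare computes exactly this statistic and p-value; the observed counts and null proportions below are made up for illustration.

Code
import numpy as np
import scipy.stats

observed = np.array([18, 55, 27])                  # illustrative counts
p0 = np.array([0.1, 0.6, 0.3])                     # hypothesized proportions
expected = observed.sum() * p0

result = scipy.stats.chisquare(f_obs=observed, f_exp=expected)
print(result.statistic, result.pvalue)             # compared against chi-squared(k-1)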

Example

Demographics of NYC

From the 2000 Census, we get the following distribution of racial self-identification:

  White Black Asian Other Mixed Total
NYC 3 579 700 2 130 202 784 811 1 121 159 392 406 8 008 278
USA 75.1% 12.3% 3.6% 6.6% 2.4% 100%

Question: Does the demographics of NYC significantly differ from the demographics of the US at large?

Example

Demographics of NYC

By multiplying the total population of NYC with the percentages observed in the country at large, we get the following:

  White Black Asian Other Mixed Total
NYC 3 579 700 2 130 202 784 811 1 121 159 392 406 8 008 278
USA 75.1% 12.3% 3.6% 6.6% 2.4% 100%
Expected 6 014 216.8 985 018.2 288 298.0 528 546.3 192 198.7
\(O-E\) -2.435e+06 1.145e+06 4.965e+05 5.926e+05 2.002e+05
\((O-E)^2\over E\) 9.855e+05 1.331e+06 8.551e+05 6.644e+05 2.085e+05 4.045e+06

The resulting p-value is far smaller than anything R can represent: even for a p-value threshold as extreme as \(1\times10^{-32}\), the corresponding critical value (at 4 degrees of freedom) would have been a mere 156, while our test statistic is over 4 million.

We can thus confidently say that the racial makeup of NYC is different from that of the entire USA.
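
The same computation as a Python sketch; the counts and proportions are the ones in the table above, and the p-value underflows to zero in double precision.

Code
import numpy as np
import scipy.stats

observed = np.array([3_579_700, 2_130_202, 784_811, 1_121_159, 392_406])
usa_proportions = np.array([0.751, 0.123, 0.036, 0.066, 0.024])
expected = observed.sum() * usa_proportions

X = ((observed - expected)**2 / expected).sum()
print(X)                                           # about 4.045e6, as in the table
print(scipy.stats.chi2(len(observed) - 1).sf(X))   # 0.0 (underflow)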

Smaller Example

A paper by Lesperance et al. investigates Benford’s Law (first digits of “natural” numbers often distribute as \(\PP(D=d) = \log_{10}\left(1+1/d\right)\)). They report, from a collection of genomic data, observed digit frequencies:

Code
import numpy as np
from tabulate import tabulate

observed = [48, 14, 12, 6, 18, 5, 7, 8, 9]
# Expected counts under Benford's Law: n * log10(1 + 1/d) for digits d = 1,...,9
expected = (sum(observed)*np.log10(1+1/np.arange(1,10)))
print(tabulate(
  [[""] + np.arange(1,10).tolist(),
   ["Observed"] + observed,
   ["Predicted %"] + (np.log10(1+1/np.arange(1,10))*100).tolist(),
   ["Expected"] + expected.tolist(),
   ["Difference"] + (observed-expected).tolist(),
   ["Std. Resid."] + ((observed-expected)/np.sqrt(expected)).tolist(),
   ["Chi-2 Term"] + ((observed-expected)**2/expected).tolist()
  ],
  headers="firstrow"
))
1 2 3 4 5 6 7 8 9
Observed 48 14 12 6 18 5 7 8 9
Predicted % 30.103 17.6091 12.4939 9.691 7.91812 6.69468 5.79919 5.11525 4.57575
Expected 38.2308 22.3636 15.8672 12.3076 10.056 8.50224 7.36498 6.49637 5.8112
Difference 9.76919 -8.36359 -3.86722 -6.30757 7.94398 -3.50224 -0.364977 1.50363 3.1888
Std. Resid. 1.57998 -1.76857 -0.970842 -1.79794 2.5051 -1.2011 -0.134487 0.589937 1.3228
Chi-2 Term 2.49634 3.12784 0.942534 3.2326 6.27553 1.44264 0.0180867 0.348025 1.7498

Smaller Example

Lesperance et al report, from a collection of genomic data, observed digit frequencies. Note that every expected count is at least 5 (the smallest, for the digit 9, is about 5.8), so Fisher’s criterion, and therefore also Cochran’s weaker condition, is fulfilled.

Code
print(tabulate(
  [[""] + np.arange(1,10).tolist(),
   ["Observed"] + observed,
   ["Predicted %"] + (np.log10(1+1/np.arange(1,10))*100).tolist(),
   ["Expected"] + expected.tolist(),
   ["Chi-2 Term"] + ((observed-expected)**2/expected).tolist()
  ],
  headers="firstrow"
))
1 2 3 4 5 6 7 8 9
Observed 48 14 12 6 18 5 7 8 9
Predicted % 30.103 17.6091 12.4939 9.691 7.91812 6.69468 5.79919 5.11525 4.57575
Expected 38.2308 22.3636 15.8672 12.3076 10.056 8.50224 7.36498 6.49637 5.8112
Chi-2 Term 2.49634 3.12784 0.942534 3.2326 6.27553 1.44264 0.0180867 0.348025 1.7498
Code
print(f"""
We get an accumulated chi-2 test statistic of {((observed-expected)**2/expected).sum():.4f}, which yields a p-value of {scipy.stats.chi2(len(observed)-1).sf(((observed-expected)**2/expected).sum()):.4f}.
""")

We get an accumulated chi-2 test statistic of 19.6334, which yields a p-value of 0.0118.

This p-value would support a claim that this data set does not seem to fully conform to Benford’s Law.

Goodness of Fit for composite hypotheses

The chi-squared test for Goodness of Fit has to be modified a bit when the hypotheses are composite - ie, when there is more than one possible set of probability assignments to the categories compatible with the null hypothesis.

As with the likelihood ratio tests, the way out is to perform the test at the most plausible parameter values: before performing the chi-squared test, we estimate the parameters that determine the probability assignments by maximum likelihood.

The effect of these estimations is to lower the degrees of freedom of the chi-squared distribution by the number of parameters we have estimated.
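
A generic sketch of that workflow (function and parameter names are my own): plug in a function mapping the estimated parameters to cell probabilities, and subtract one extra degree of freedom per estimated parameter.

Code
import numpy as np
import scipy.stats

def gof_with_estimates(observed, cell_probs, theta_hat, n_estimated):
    """Chi-squared GoF test where cell_probs(theta_hat) returns the null cell probabilities."""
    observed = np.asarray(observed, dtype=float)
    expected = observed.sum() * cell_probs(theta_hat)
    X = ((observed - expected)**2 / expected).sum()
    dof = len(observed) - 1 - n_estimated
    return X, scipy.stats.chi2(dof).sf(X)

Exercise 13.15 below follows exactly this pattern, with a \(Binomial(4,\hat\theta)\) model and one estimated parameter.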

Goodness of Fit

Validity of the approximations of Pearson’s theorem
Fisher (1925): valid if \(np_i(\boldsymbol{\hat\theta})\geq 5\) for each of the \(p_i\).
Cochran (1954): Fisher is too conservative. Instead, \(np_i(\boldsymbol{\hat\theta})\geq 1\) for all cells, and \(np_i(\boldsymbol{\hat\theta})\geq 5\) for at least 80% of the cells (so that no more than 20% fall in \(1\leq np_i(\boldsymbol{\hat\theta})\leq 5\))

Pearson’s Chi-Squared Test of Goodness of Fit with estimated parameters

Null Hypothesis
\(H_0:\boldsymbol p=\boldsymbol p_0(\boldsymbol\theta)\) (ie \(p_1=p_{1,0}(\boldsymbol\theta), p_2=p_{2,0}(\boldsymbol\theta),\dots\)) with \(\boldsymbol\theta=(\theta_1,\dots,\theta_m)\).
Alternative Hypothesis
\(H_a:\) at least one \(p_i\neq p_{i,0}(\boldsymbol\theta)\).
Test Statistic
\(X = \sum{(n_i - np_{i,0}(\boldsymbol{\hat\theta}))^2\over np_{i,0}(\boldsymbol{\hat\theta})} = \sum{(O_i-E_i)^2\over E_i}\)
Where we write \(O_i\) for the observed count of the i:th category and \(E_i\) for the expected count.
Rejection Region
\(X \geq \chi^2_{\alpha,k-1-m}\)
p-value
\(1-CDF_{\chi^2(k-1-m)}(X)\)

Estimated parameters and degrees of freedom

Exercise 13.15

Flashlights are sold with 4 batteries included. A random sample of 150 packages is taken, and the number of defective batteries reported:

Defective 0 1 2 3 4
Frequency 26 51 47 16 10

We are testing whether these numbers indicate a good fit with a Binomial model for defective batteries: ie, is it the case that \(X\sim Binomial(4,\theta)\) for some appropriate \(\theta\)?

Estimated parameters and degrees of freedom

Exercise 13.15

The likelihood is \[ \begin{align*} \mathcal{L} &= \PP(X=0)^{26}\cdot\PP(X=1)^{51}\cdot\PP(X=2)^{47}\cdot\PP(X=3)^{16}\cdot\PP(X=4)^{10} \\ &= \left({4\choose 0}\theta^0(1-\theta)^4\right)^{26}\cdot \left({4\choose 1}\theta^1(1-\theta)^3\right)^{51}\cdot \left({4\choose 2}\theta^2(1-\theta)^2\right)^{47}\cdot \\ &\phantom{=}\cdot \left({4\choose 3}\theta^3(1-\theta)^1\right)^{16}\cdot \left({4\choose 4}\theta^4(1-\theta)^0\right)^{10} \\ &= {4\choose 0}^{26}\cdot{4\choose 1}^{51}\cdot{4\choose 2}^{47}\cdot{4\choose 3}^{16}\cdot{4\choose 4}^{10}\cdot\theta^{51+47\cdot2+16\cdot3+10\cdot4}(1-\theta)^{26\cdot4+51\cdot3+47\cdot2+16} \end{align*} \]

Estimated parameters and degrees of freedom

Exercise 13.15

The log likelihood is (writing \(C\) for the log of the product of powers of binomial coefficients): \[ \begin{align*} \ell &= C + 233\log\theta + 367\log(1-\theta) \\ \frac{d\ell}{d\theta} &= {233\over\theta} + {367\over1-\theta}\cdot(-1) = 0 \\ 233(1-\theta) &= 367\theta \\ 233 &= 600\theta & \hat\theta &= 233/600\approx 0.38833 \end{align*} \]
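
A quick numeric cross-check (mine) of this maximization, using the log-likelihood from the slide up to the constant \(C\):

Code
import numpy as np
from scipy.optimize import minimize_scalar

neg_loglik = lambda theta: -(233*np.log(theta) + 367*np.log(1 - theta))
result = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(result.x, 233/600)                           # both approximately 0.38833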

Estimated parameters and degrees of freedom

Exercise 13.15

With the MLE \(\hat\theta=233/600\), we now compute expected cell counts for the binomial model, and the corresponding chi-square test statistic.

Code
df = tibble(
  defective=0:4,
  observed=c(26,51,47,16,10),
  expected=sum(observed)*dbinom(defective,4,233/600),
  difference=observed-expected,
  residual=difference/sqrt(expected), # standardized residual (O-E)/sqrt(E)
  chisq=residual^2
)
df %>% t %>% kable(digits=4)
defective 0.0000 1.0000 2.0000 3.0000 4.0000
observed 26.0000 51.0000 47.0000 16.0000 10.0000
expected 20.9967 53.3213 50.7787 21.4922 3.4112
difference 5.0033 -2.3213 -3.7787 -5.4922 6.5888
residual 1.0919 -0.3179 -0.5303 -1.1847 3.5674
chisq 1.1922 0.1011 0.2812 1.4035 12.7262

Estimated parameters and degrees of freedom

Exercise 13.15

Summing the chi-square terms in the table we get a test statistic of \(X\approx 15.70\).

With 5 cells and 1 estimated value, \(k-1-m = 5-1-1 = 3\) degrees of freedom, which means that the observed test statistic corresponds to a p-value of:

Code
1-pchisq(sum(df$chisq), 3)
[1] 0.001304

This small p-value gives us reason to reject the null hypothesis: the binomial model does not seem to be a good fit for the observed data. The lack of fit is driven almost entirely by the last cell - far more packages have all 4 batteries defective than the model predicts.

Grouped cells

For some cases it may make sense to group categories together in order to make the minimal expected cell count \(\geq 5\). However, when doing this, there are two complications that enter:

  1. The MLE must be computed from the cell counts, not the observed and un-grouped data - which may be difficult.
  2. The actual distribution of the estimators shifts slightly, requiring a different treatment of critical values.

Theorem

Let \(\hat\theta_1,\dots,\hat\theta_m\) be MLEs of \(\theta_1,\dots,\theta_m\) based on estimating with the full sample \(X_1,\dots,X_n\) rather than the grouped data. Then the critical value \(c_\alpha\) that specifies a level \(\alpha\) upper-tailed test satisfies \[ \chi^2_{\alpha,k-1-m}\leq c_\alpha\leq \chi^2_{\alpha,k-1} \]

A test procedure that uses MLEs from the full un-grouped data thus has a region without a clear result, in addition to a reject region and a region that fails to reject.
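
A sketch (a hypothetical helper, with my own names) of the resulting three-way decision rule:

Code
import scipy.stats

def grouped_gof_decision(X, k, m, alpha=0.05):
    """Decision rule when the MLEs come from the full un-grouped data."""
    upper = scipy.stats.chi2.ppf(1 - alpha, k - 1)       # upper bound on the critical value
    lower = scipy.stats.chi2.ppf(1 - alpha, k - 1 - m)   # lower bound on the critical value
    if X >= upper:
        return "reject"
    if X < lower:
        return "fail to reject"
    return "inconclusive"

The example below reports both of the corresponding p-values and reads off the decision from them.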

Grouped cells

Exercise 13:16

Investigators looked at 300 chromosomes and counted the number of sister-chromatid exchanges on each. They hypothesize a Poisson model.

Compute an MLE \(\hat\lambda\), and then group \(\geq 8\) into a single cell and check goodness of fit.

Code
df.16 = tibble(
  x=0:9,
  observed=c(6,24,42,59,62,44,41,14,6,2)
)
df.16 %>% t %>% kable(digits=0)
x 0 1 2 3 4 5 6 7 8 9
observed 6 24 42 59 62 44 41 14 6 2

Grouped cells

Exercise 13:16

The likelihood is: \[ \begin{align*} \mathcal{L} &= p(0)^{6}\cdot p(1)^{24}\cdot p(2)^{42}\cdot p(3)^{59}\cdot p(4)^{62}\cdot p(5)^{44}\cdot p(6)^{41}\cdot p(7)^{14}\cdot p(8)^{6}\cdot p(9)^{2} \\ &= \left({\lambda^0e^{-\lambda}\over 0!}\right)^{6}\cdot \left({\lambda^1e^{-\lambda}\over 1!}\right)^{24}\cdot \left({\lambda^2e^{-\lambda}\over 2!}\right)^{42}\cdot \left({\lambda^3e^{-\lambda}\over 3!}\right)^{59}\cdot \left({\lambda^4e^{-\lambda}\over 4!}\right)^{62}\cdot \\ &\phantom{=}\cdot \left({\lambda^5e^{-\lambda}\over 5!}\right)^{44}\cdot \left({\lambda^6e^{-\lambda}\over 6!}\right)^{41}\cdot \left({\lambda^7e^{-\lambda}\over 7!}\right)^{14}\cdot \left({\lambda^8e^{-\lambda}\over 8!}\right)^{6}\cdot \left({\lambda^9e^{-\lambda}\over 9!}\right)^{2} \\ &= {\lambda^{\overbrace{24+42\cdot2+59\cdot3+62\cdot4+44\cdot5+41\cdot6+14\cdot7+6\cdot8+2\cdot9}^{=1163}}e^{-300\lambda}\over (0!)^{6}\cdot(1!)^{24}\cdots(9!)^{2}} \end{align*} \]

Grouped cells

Exercise 13:16

The log likelihood is (writing \(C\) for the log of the product of powers of factorials): \[ \begin{align*} \ell &= 1163\log\lambda - 300\lambda - C \\ {d\ell \over d\lambda} &= {1163\over\lambda} - 300 & \hat\lambda &= {1163\over 300}\approx 3.87666 \end{align*} \]

Grouped cells

Exercise 13:16

Observed and estimated counts, assuming a \(Poisson(3.87666)\) model, are:

Code
df.16$expected = 300*dpois(df.16$x, 1163/300)

df.16 %>% t %>% kable(digits=2)
x 0.00 1.0 2.00 3.00 4.0 5.00 6.0 7.00 8.00 9.00
observed 6.00 24.0 42.00 59.00 62.0 44.00 41.0 14.00 6.00 2.00
expected 6.22 24.1 46.71 60.36 58.5 45.35 29.3 16.23 7.86 3.39

But we cannot use this table as-is: the expected counts for 9 and above are too small (and there are infinitely many such categories), so we need to group the tail values.

Grouped cells

Exercise 13:16

Grouping the data, we get instead:

Code
df.16[9,] = df.16[9,] + df.16[10,]
df.16 = df.16[-10,]
df.16$x = as.character(df.16$x)
df.16$x[9] = "≥8"

df.16$expected[9] = 300*ppois(7, 1163/300, lower.tail = FALSE)

df.16$difference = df.16$observed-df.16$expected
df.16$residual = df.16$difference/sqrt(df.16$expected) # standardized residual (O-E)/sqrt(E)
df.16$chisq = df.16$residual^2

df.16 %>% kable(digits=2)
x observed expected difference residual chisq
0 6 6.22 -0.22 -0.09 0.01
1 24 24.10 -0.10 -0.02 0.00
2 42 46.71 -4.71 -0.69 0.47
3 59 60.36 -1.36 -0.17 0.03
4 62 58.50 3.50 0.46 0.21
5 44 45.35 -1.35 -0.20 0.04
6 41 29.30 11.70 2.16 4.67
7 14 16.23 -2.23 -0.55 0.31
≥8 8 13.24 -5.24 -1.44 2.07

Grouped cells

Exercise 13:16

These observations combine to a test statistic of \(X\approx 7.81\).

With 9 categories and 1 estimated quantity, the lower bound has 7 degrees of freedom and the upper bound has 8 degrees of freedom.

So for comparing with the lower bound, we need a p-value computed with 7 DoF, and for comparing with the upper bound, one computed with 8 DoF. We get:

Code
pchisq(sum(df.16$chisq), c(7,8), lower.tail=FALSE)
[1] 0.3494 0.4520

These p-values give us a clear signal that we fail to reject the null hypothesis. A Poisson model may work quite well for this situation.

Continuous Goodness of Fit

This test is one option for testing Goodness of Fit against continuous distributions.

The way to do that is to transform the continuous fit into a discrete one: bin the data into a histogram and compare the observed bin counts to the counts the fitted distribution leads us to expect.

Histogram bins should be picked to ensure either Fisher’s or Cochran’s criterion is fulfilled - common methods include either partitioning into equal-width bins, or into equal-expected-count bins.

Since the specific distribution we are fitting against is usually estimated on the full (un-binned) data, the same requirements apply as in the grouped data case above: use both \(k-1\) and \(k-m-1\) degrees of freedom, for \(k\) categories and \(m\) independent estimated parameters:

Reject if the \(k-1\) p-value indicates rejection; fail to reject if the \(k-m-1\) p-value indicates failure to reject; the test is inconclusive if neither of these holds.

Example

Return to Exercise 10.78

In 10.78, we were considering a set of scores from a control group and an experimental group, and used bootstraps to create confidence intervals due to doubts of the normality of the original data.

Code
import pandas as pd
import seaborn.objects as so
from matplotlib import pyplot

scores = pd.DataFrame({
  "score": [34,27,26,33,23,37,24,34,22,23,32, 5,
            30,35,28,25,37,28,26,29,22,33,31,23,
            37,29, 0,30,34,26,28,27,32,29,31,33,
            28,21,34,29,33, 6, 8,29,36, 7,21,30,
            28,34,28,35,30,34, 9,38, 9,27,25,33,
            9 ,23,32,25,37,28,23,26,34,32,34, 0,
            24,30,36,28,38,35,16,37,25,34,38,34,31, # end of E
            37,22,29,29,33,22,32,36,29, 6, 4,37,
            0 ,36, 0,32,27, 7,19,35,26,22,28,28,
            32,35,28,33,35,24,21, 0,32,28,27, 8,
            30,37, 9,33,30,36,28, 3, 8,31,29, 9,
            0 , 0,35,25,29, 3,33,33,28,32,39,20,
            32,22,24,20,32, 7, 8,33,29, 9, 0,30,
            26,25,32,38,22,29,29],
  "group": ["E"]*85 + ["C"]*79
})

f = pyplot.figure(figsize=(8,6))
(
  so.Plot(data=scores, x="score")
  .on(f)
  .add(so.Line(), so.KDE(), color="group")
  .add(so.Line(color="black"), so.KDE())
  .label(title="Score Distributions for the Experimental and Control groups")
  .show()
)

We can use the chi-square test of Goodness of Fit to check whether this could reasonably have been a normal distribution at all. Since we have two groups, we should check the residuals for normality: we standardize each group separately, subtracting the group mean and dividing by the group standard deviation.

Code
scores["zscore"] = scores.groupby("group").transform(lambda x: (x-x.mean())/x.std())

Example

Return to Exercise 10.78

The plan now is to divide up the range into bins, count each bin, and check whether the counts match our expectations if this had been a normal distribution.

With 164 observations in total, if we are to fulfill Fisher’s criterion, each bin must have no less than 5 expected entries. So with equal-mass bins we can split our data into 32 bins, each with an expected probability mass of 1/32.

Our bin boundaries thus are simply the standard normal quantiles of multiples of \(1/32\) - ie, \(\Phi^{-1}(k/32)\):

Code
boundaries = scipy.stats.norm.ppf([k/32 for k in range(33)])
boundaries
array([       -inf, -1.86273187, -1.53412054, -1.3180109 , -1.15034938,
       -1.00999017, -0.88714656, -0.77642176, -0.67448975, -0.57913216,
       -0.48877641, -0.40225007, -0.31863936, -0.23720211, -0.15731068,
       -0.07841241,  0.        ,  0.07841241,  0.15731068,  0.23720211,
        0.31863936,  0.40225007,  0.48877641,  0.57913216,  0.67448975,
        0.77642176,  0.88714656,  1.00999017,  1.15034938,  1.3180109 ,
        1.53412054,  1.86273187,         inf])

Example

Return to Exercise 10.78

From our construction, we expect 5.125 entries in each bin. We can use, for instance, the numpy.histogram command to split the actual data we have into their respective bins, and we get the observed counts:

Code
from tabulate import tabulate

hist, _ = np.histogram(scores.zscore, bins=boundaries)
diff = hist-164/32
resid = (hist-164/32)/np.sqrt(164/32)
resid2 = resid**2
X = resid2.sum()
p_r = scipy.stats.chi2(31).sf(X)
p_a = scipy.stats.chi2(29).sf(X)

table_resid = [f"{r:.3f}" for r in resid.tolist()]
table_resid2 = [f"{r:.3f}" for r in resid2.tolist()]
table = [
  ["Observed"] + hist[:16].tolist(),
  [""] + hist[16:].tolist(),
  ["Differences"] + diff[:16].tolist(),
  [""] + diff[16:].tolist(),
  ["Residuals"] + table_resid[:16],
  [""] + table_resid[16:],
  ["Chi2 Terms"] + table_resid2[:16],
  [""] + table_resid2[16:]
]
print(tabulate(table, tablefmt="simple"))
Observed 15 4 5 4 0 0 0 2 2 5 1 4 5 5 4 3
10 2 7 7 6 11 7 6 18 9 6 11 5 0 0 0
Differences 9.875 -1.125 -0.125 -1.125 -5.125 -5.125 -5.125 -3.125 -3.125 -0.125 -4.125 -1.125 -0.125 -0.125 -1.125 -2.125
4.875 -3.125 1.875 1.875 0.875 5.875 1.875 0.875 12.875 3.875 0.875 5.875 -0.125 -5.125 -5.125 -5.125
Residuals 4.362 -0.497 -0.055 -0.497 -2.264 -2.264 -2.264 -1.38 -1.38 -0.055 -1.822 -0.497 -0.055 -0.055 -0.497 -0.939
2.153 -1.38 0.828 0.828 0.387 2.595 0.828 0.387 5.687 1.712 0.387 2.595 -0.055 -2.264 -2.264 -2.264
Chi2 Terms 19.027 0.247 0.003 0.247 5.125 5.125 5.125 1.905 1.905 0.003 3.32 0.247 0.003 0.003 0.247 0.881
4.637 1.905 0.686 0.686 0.149 6.735 0.686 0.149 32.345 2.93 0.149 6.735 0.003 5.125 5.125 5.125

Summing up the chi-squared contributions we get the test statistic and p-value: X: 116.585, p: 7.612e-12 (31 DoF to reject); 1.853e-12 (29 DoF to fail to reject), which is strong support for the claim that the normal distribution does not accurately describe the data.

Example

Return to Exercise 10.78

Alternatively, if we want to work with equal width bins (except for the tails), we could, for instance, split the normal distribution into tails that each expect 5 entries, and split the distance between the tails into equal width parts so that the smallest of them expects 5 entries. With 164 observations, this means we need to compute:

\[ {5\over 164}\approx 0.0305 \qquad z_{low} = \Phi^{-1}(0.0305)\approx -1.873 \qquad z_{high} \approx 1.873 \]

Computing also \(\Phi^{-1}(2\cdot 0.0305)\approx -1.546\) we get a binwidth of \(0.327\), producing about \(11.5\) bins between \(z_{low}\) and \(z_{high}\).

Example

Return to Exercise 10.78

Putting in exactly 11 bins between the tails (slightly too wide is better than slightly too narrow), we get the following:

Code
evenboundaries = [-np.inf] + np.linspace(-1.873, 1.873, 12).tolist() + [np.inf]
evenhist, _ = np.histogram(scores.zscore, bins=evenboundaries)

evenprob = [scipy.stats.norm.cdf(hi)-scipy.stats.norm.cdf(lo) for (lo,hi) in zip(evenboundaries[:-1], evenboundaries[1:])]
evenexpected = [len(scores)*ep for ep in evenprob]
O = np.array(evenhist)
E = np.array(evenexpected)

print(tabulate([
  ["Observed"] + evenhist.tolist(),
  ["Expected"] + evenexpected,
  ["Difference"] + (O-E).tolist(),
  ["Std. Resid."] + ((O-E)/np.sqrt(E)).tolist(),
  ["Chi2 Term"] + ((O-E)**2/E).tolist()
], floatfmt=".2f"))
Observed 15.00 4.00 9.00 0.00 4.00 15.00 24.00 31.00 37.00 20.00 5.00 0.00 0.00
Expected 5.01 5.28 8.85 13.22 17.62 20.94 22.17 20.94 17.62 13.22 8.85 5.28 5.01
Difference 9.99 -1.28 0.15 -13.22 -13.62 -5.94 1.83 10.06 19.38 6.78 -3.85 -5.28 -5.01
Std. Resid. 4.47 -0.56 0.05 -3.64 -3.25 -1.30 0.39 2.20 4.62 1.86 -1.29 -2.30 -2.24
Chi2 Term 19.94 0.31 0.00 13.22 10.53 1.68 0.15 4.84 21.31 3.47 1.67 5.28 5.01
Code
print(f"""

With a chi-2 test statistic of {((O-E)**2/E).sum():.2f} we get a p-value of {scipy.stats.chi2(len(O)-1).sf(((O-E)**2/E).sum()):.4e} ({len(O)-1} DoF to reject) and {scipy.stats.chi2(len(O)-1-2).sf(((O-E)**2/E).sum()):.4e} ({len(O)-1-2} DoF to fail to reject)
""")

With a chi-2 test statistic of 87.41 we get a p-value of 1.5624e-13 (12 DoF to reject) and 1.7437e-14 (10 DoF to fail to reject)

Specific Tests for Normality

There are several common tests for normality - all of them have as null hypothesis that the data follows a normal distribution: a rejected null hypothesis means you cannot use methods that assume normality.

Anderson-Darling Test
Measures area between a fitted line and step function in a probability plot, weighting more heavily in the tails. The statistic is compared to tabulated critical values for its own asymptotic distribution.
In Python: scipy.stats.anderson
In R: Contained in the package nortest
Shapiro-Wilk Test
Critical values determined by simulations. No closed form distribution known.
In Python: scipy.stats.shapiro
In R: shapiro.test
Ryan-Joiner Test (equivalent to the Shapiro-Francia test)
Described in textbook. Compares correlation coefficient for probability plot against critical values determined by simulations. No closed form distribution known.
In Python: Not implemented in SciPy.
In R: Shapiro-Francia contained in the package nortest.
Kolmogorov-Smirnov Test
Find the maximal vertical distance between the ECDF and the hypothesized CDF, and compare it to the Kolmogorov-Smirnov distribution.
In Python: scipy.stats.kstest. Give scipy.stats.norm(mu,sigma).cdf as second argument.
In R: ks.test. Give "pnorm" as the second argument, and the mean and standard deviation as additional arguments, as you would give them to pnorm.
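
Illustrative calls (a sketch, assuming the standardized scores.zscore column from the earlier example is still available) for the tests that SciPy provides:

Code
import scipy.stats

z = scores.zscore                                         # standardized scores from the example above
print(scipy.stats.shapiro(z))                             # Shapiro-Wilk statistic and p-value
print(scipy.stats.anderson(z, dist="norm"))               # Anderson-Darling statistic and critical values
print(scipy.stats.kstest(z, scipy.stats.norm(0, 1).cdf))  # Kolmogorov-Smirnov against N(0,1)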