MVJ
12 April, 2018
For numeric data, we use the mean as the primary testable sample statistic, and the t-test to test for means and differences in means.
For categorical data, the mean does not exist: for categorical data we can count occurrences but not do much else.
Any test on categorical data has to be a test that counts occurrences and compares the count (or a proportion) to an expected distribution.
Recall the binomial situation criteria: a fixed number \(n\) of trials; each trial has exactly two outcomes (success and failure); the trials are independent; and the probability of success \(p\) is the same for every trial.
If these are fulfilled, then the count of successes follows the binomial distribution, and we know the sampling distribution of the count \(\hat n\) exactly.
To use the binomial distribution for a test, we count the successes in the sample and compare the observed count to the binomial distribution determined by the null hypothesis.
This method uses the exact sampling distribution, but the confidence intervals tend to be a little wider (less exact) than they could be.
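As a sketch of what this computation looks like (the numbers here are illustrative: 7 successes in 20 trials against a null hypothesis of \(p_0 = 1/3\), matching the outputs shown later), the exact two-sided p-value sums the probabilities of all outcomes at most as likely as the observed count:

```r
x <- 7; n <- 20; p.0 <- 1/3        # illustrative values

# Sampling distribution of the count under the null hypothesis
probs <- dbinom(0:n, size = n, prob = p.0)

# Exact two-sided p-value: sum over all outcomes no more likely than the
# observed one (with a small tolerance for rounding, as binom.test uses)
p.value <- sum(probs[probs <= dbinom(x, n, p.0) * (1 + 1e-7)])

p.value    # agrees with binom.test(x, n, p = p.0)$p.value
```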
Most common - and most straightforward to interpret - effect sizes for counts and proportions are the odds ratio and relative risk.
If the probability of success is \(p\), then the odds are defined as \(p/(1-p)\). To compare two different odds, we can take the ratio of the odds: \[ OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} \]
For a concrete example: in Roulette, there are 38 possible outcomes. A bet on red is a simultaneous bet on 18 of these outcomes, for a probability of \(18/38\approx0.47\). This gives us odds of \((18/38) / (20/38) \approx 0.9\) or about 9 to 10.
A bet on a column is a simultaneous bet on 12 of the outcomes for a probability of \(12/38\approx0.32\). This gives us odds of \((12/38) / (26/38) \approx 0.46\) or about 1 to 2.
The odds ratio of a bet on red over a bet on a column is \[ \frac{(18/38) / (20/38)}{(12/38) / (26/38)} \approx 1.95 \] roughly doubling the odds of winning if we move from a column bet to a red bet.
The relative risk is the quotient of probabilities and measures the expected increase / decrease in success between the groups (or between the null hypothesis and the observed values): \[ RR = \frac{p_1}{p_2} \]
Continuing the Roulette example:
A bet on red is a simultaneous bet on 18 of 38 outcomes, for a probability of \(18/38\approx0.47\).
A bet on a column is a simultaneous bet on 12 of the outcomes for a probability of \(12/38\approx0.32\).
The relative risk of a red bet over a column bet is \[ RR = \frac{18/38}{12/38} = \frac{18}{12} = 1.5 \]
We would expect to win 50% more often with a red bet than with a column bet.
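The two roulette computations can be checked directly in R:

```r
# Probabilities of winning a red bet and a column bet in Roulette
p.red <- 18/38
p.col <- 12/38

odds.red <- p.red / (1 - p.red)   # 18 to 20, i.e. 0.9
odds.col <- p.col / (1 - p.col)   # 12 to 26, about 0.46

OR <- odds.red / odds.col         # odds ratio of red over column
RR <- p.red / p.col               # relative risk of red over column
c(OR = OR, RR = RR)               # 1.95 and 1.5
```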
Suppose we compare proportions between observed and null hypothesis - or between experiment and control.
RR is | OR is | Interpretation |
---|---|---|
=1 | =1 | No effect |
<1 | <1 | Lower rate than hypothesis - lower rate than control |
>1 | >1 | Higher rate than hypothesis - higher rate than control |
Uses `library(mosaic)`. Given: the number of successes `x` and the number of trials `n`; also possible: a dataset `df` with a variable `v` containing observations, or a vector `x` containing observations.
Null hypothesis The probability of success is equal to `p.0`.
Alternative hypothesis The probability of success is [greater / not equal / less] than `p.0`.
Test statistic: the observed count \(\hat n\) of successes.
If using the `library(mosaic)` extension, the first entry (or level) will be used as success, all others as failure. The function `relevel` can be used to reorder the levels so that the correct level is used for testing.
Requirements The data is expected to come from a binomial setting. This is not tested by the software, but rather argued directly from the data collection and description.
Command: `binom.test` with arguments `x`, `n`, `p`, `alternative`, `conf.level`. For `x` the number of successes and `n` the number of trials:
test = binom.test(x, n, p=p.0, alternative="two.sided", conf.level=0.99)
test
##
##  Exact binomial test
##
## data: x out of n
## number of successes = 7, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1138798 0.6565686
## sample estimates:
## probability of success
## 0.35
Command: `binom.test` with arguments `x`, `alternative`, `conf.level`. For `s` the number of successes and `f` the number of failures, passed together as `c(s,f)`:
test = binom.test(c(s,f), p=p.0, alternative="two.sided", conf.level=0.99)
test
##
##  Exact binomial test
##
## data: c(s, f)
## number of successes = 7, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1138798 0.6565686
## sample estimates:
## probability of success
## 0.35
Command: `binom.test` with arguments `alternative`, `conf.level`. For `df$v` a categorical variable and `"label"` the value representing success:
df$v = relevel(df$v, "label")
test = binom.test(~v, data=df, p=p.0, alternative="two.sided", conf.level=0.99)
test
##
##  Exact binomial test
##
## data: df$v [with success = label]
## number of successes = 7, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1138798 0.6565686
## sample estimates:
## probability of success
## 0.35
Command: `binom.test` with arguments `alternative`, `conf.level`. For `df$v` a categorical variable and `"label"` the value representing success:
test = binom.test(df$v == "label", p=p.0, alternative="two.sided", conf.level=0.99)
test
##
##  Exact binomial test
##
## data: df$v == "label" [with success = TRUE]
## number of successes = 7, number of trials = 20, p-value = 1
## alternative hypothesis: true probability of success is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1138798 0.6565686
## sample estimates:
## probability of success
## 0.35
Effect size: Odds ratio or Relative risk
\[ p = \frac{x}{n} \qquad OR = \frac{p/(1-p)}{p_0/(1-p_0)} \qquad RR = \frac{p}{p_0} \]
For the one-sample case, this requires slightly more code. With `x` the number of successes and `n` the number of trials:
observed = matrix(c(p.0, (1-p.0), x, n-x), nrow=2, byrow=TRUE)
c(oddsRatio(observed),
relrisk(observed))
## OR RR
## 1.076923 1.050000
Effect size: Odds ratio or Relative risk
\[ p = \frac{s}{s+f} \qquad OR = \frac{p/(1-p)}{p_0/(1-p_0)} \qquad RR = \frac{p}{p_0} \]
For the one-sample case, this requires slightly more code. With `s` the number of successes and `f` the number of failures:
observed = matrix(c(p.0, (1-p.0), s, f), nrow=2, byrow=TRUE)
c(oddsRatio(observed),
relrisk(observed))
## OR RR
## 1.076923 1.050000
Effect size: Odds ratio or Relative risk
\[ p = \frac{\text{sum(df\$v == "label")}}{\text{nrow(df)}} \qquad OR = \frac{p/(1-p)}{p_0/(1-p_0)} \qquad RR = \frac{p}{p_0} \]
For the one-sample case, this requires slightly more code. With `df$v` a categorical variable and `"label"` the value representing success:
observed = matrix(c(p.0, (1-p.0), table(df$v != "label")), nrow=2, byrow=TRUE)  # FALSE (= success) is tabulated first
c(oddsRatio(observed),
relrisk(observed))
## OR RR
## 1.076923 1.050000
For large enough samples, both the sample count and the sample proportion follow approximately a normal distribution.
Large enough is usually taken to mean that one should expect at least 10 successes and at least 10 failures.
We also require the full population to be at least 20 times larger than the sample size.
The threshold comes from requiring that the approximating normal distribution takes negative (and thus utterly unreasonable) values at a rate of less than 5%.
Given that we observe \(x\) successes in \(n\) trials, the sample proportion is \(\overline{p} = x/n\) with standard error \(SE_{\overline{p}} = \sqrt{\overline{p}(1-\overline{p})/n}\). A confidence interval is \(\overline{p} \pm z^*\cdot SE_{\overline{p}}\), where the critical value \(z^*\) is computed as `qnorm(1-alpha/2)` and the margin of error is \(z^*\cdot SE_{\overline{p}}\).
We can test the null hypothesis of \(p=p_0\) by computing a z-score for the normal approximation.
The z-score is \[ z = \frac{\overline{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}} \]
This is tested against the standard normal distribution \(\mathcal N(0,1)\).
Here, we use the normal distribution and not the t distribution since under the null hypothesis we know the population variance.
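The z-score test can be carried out by hand and compared to `prop.test`; the numbers are again the illustrative 7 successes in 20 trials against \(p_0 = 1/3\):

```r
x <- 7; n <- 20; p.0 <- 1/3        # illustrative values
p.hat <- x / n

# z-score under the null hypothesis (known variance, hence the normal
# distribution and pnorm, not the t distribution)
z <- (p.hat - p.0) / sqrt(p.0 * (1 - p.0) / n)
p.value <- 2 * pnorm(-abs(z))      # two-sided test against N(0,1)

# Without continuity correction, prop.test's X-squared statistic is z^2
c(z = z, p.value = p.value)
```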
Uses `library(mosaic)`. Given: the number of successes `x` and the number of trials `n`; also possible: a dataset `df` with a variable `v` containing observations, or a vector `x` containing observations.
Null hypothesis The probability of success is equal to `p.0`.
Alternative hypothesis The probability of success is [greater / not equal / less] than `p.0`.
The function `relevel` can be used to reorder the levels so that the correct level is used for testing.
Requirements `n*p.0 > 10` and `n*(1-p.0) > 10`, and population size at least `20*n`.
Command: `prop.test` with arguments `x`, `n`, `p`, `alternative`, `conf.level`. For `x` the number of successes and `n` the number of trials:
test = prop.test(x, n, p=p.0, alternative="two.sided", conf.level=0.99)
test
##
## 1-sample proportions test with continuity correction
##
## data: x out of n
## X-squared = 2.3666e-31, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1359357 0.6426790
## sample estimates:
## p
## 0.35
Command: `prop.test` with arguments `p`, `alternative`, `conf.level`. For `df$v` a categorical variable and `"label"` the value representing success:
df$v = relevel(df$v, "label")
test = prop.test(~v, data=df, p=p.0, alternative="two.sided", conf.level=0.99)
test
##
## 1-sample proportions test with continuity correction
##
## data: df$v [with success = label]
## X-squared = 2.3666e-31, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1359357 0.6426790
## sample estimates:
## p
## 0.35
Command: `prop.test` with arguments `p`, `alternative`, `conf.level`. For `df$v` a categorical variable and `"label"` the value representing success:
test = prop.test(df$v == "label", p=p.0, alternative="two.sided", conf.level=0.99)
test
##
## 1-sample proportions test with continuity correction
##
## data: df$v == "label" [with success = TRUE]
## X-squared = 2.3666e-31, df = 1, p-value = 1
## alternative hypothesis: true p is not equal to 0.3333333
## 99 percent confidence interval:
## 0.1359357 0.6426790
## sample estimates:
## p
## 0.35
From the formula for the margin of error, we can derive the sample size required to reach a particular margin of error \(m\) at a given confidence level, with critical value \(z^*\).
\[ m = z^*\cdot SE_p = z^*\sqrt{\frac{p(1-p)}{n}} \qquad\text{so}\qquad n = \left(\frac{z^*}{m}\right)^2p(1-p) \]
The product \(p(1-p)\) can never be larger than \(1/4\), which leads to the simplified formula below; it might overestimate the needed sample size for very small or very large population probabilities.
\[ n = \frac{1}{4}\left(\frac{z^*}{m}\right)^2 \]
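For example, a margin of error of at most 3 percentage points at 95% confidence requires:

```r
m <- 0.03                     # desired margin of error
z.star <- qnorm(1 - 0.05/2)   # critical value for 95% confidence, ~1.96

# Conservative bound using p(1-p) <= 1/4
n <- ceiling((z.star / m)^2 / 4)
n
## [1] 1068
```

This is roughly the sample size used by many opinion polls.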
For a power analysis, the function `power.prop.test` allows you to calculate sample sizes for hypothesis testing.
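A sketch of such a power calculation (the proportions 0.45 and 0.55 are illustrative):

```r
# Per-group sample size to detect a difference between p1 = 0.45 and
# p2 = 0.55 with 80% power at the default 5% significance level
res <- power.prop.test(p1 = 0.45, p2 = 0.55, power = 0.80)
ceiling(res$n)    # required sample size in *each* group
```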
The median of a variable is a value \(M\) such that 50% of \(x\) is less than or equal to \(M\).
This means that we can build a test for medians using a proportions test: if the proportion of \(x\geq M\) is significantly different from 0.5, this gives evidence against \(M\) being the median. If the proportion of \(x\geq M\) is larger than 0.5, it means the true median is larger, if the proportion is smaller, then so is the true median.
Null hypothesis The median of `x` is equal to `M.0`.
Alternative hypothesis The median of `x` is [greater / not equal / less] than `M.0`.
Requirements The same as for the chosen option between `binom.test` and `prop.test`.
Using `binom.test`, with values in `x` and null hypothesis `M.0`:
test = binom.test(x >= M.0)
Using `prop.test`, with values in `x` and null hypothesis `M.0`:
test = prop.test(x >= M.0)
Both take `alternative` and `conf.level` as options.
Doing this with `x >= M.0` means that for an upper-tail median test, an upper-tail proportions test can be used, and for a lower-tail median test, a lower-tail proportions test can be used.
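A worked sketch with made-up data; without the `mosaic` extension, the successes are counted explicitly:

```r
x <- c(3, 7, 8, 9, 12, 13, 14, 18, 21, 25)   # made-up observations
M.0 <- 10                                     # hypothesized median

# 6 of the 10 values are >= 10; test that proportion against 0.5
test <- binom.test(sum(x >= M.0), length(x), p = 0.5)
test$p.value
```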
A study compared Instagram use among young women and young men. It surveyed 1069 participants, and (among other questions) asked whether they use Instagram.
| | No | Yes |
|---|---|---|
| Women | 209 | 328 |
| Men | 298 | 234 |
| Sex | n | X | p |
|---|---|---|---|
| Women | 537 | 328 | 0.611 |
| Men | 532 | 234 | 0.440 |
| Total | 1069 | 562 | 0.526 |
Exact binomial testing no longer makes any sense: we do not know a sampling distribution for a difference of counts.
The normal approximation still works. Since \(p_m\) and \(p_w\) are both approximately normally distributed, so is \(p_m-p_w\).
With this we can get confidence intervals and hypothesis tests for the difference in proportions.
Given: two success/trial pairs \(x_1, n_1\) and \(x_2, n_2\), with sample proportions \(\overline{p}_1 = x_1/n_1\) and \(\overline{p}_2 = x_2/n_2\). The standard error of the difference is \(SE_D = \sqrt{\overline{p}_1(1-\overline{p}_1)/n_1 + \overline{p}_2(1-\overline{p}_2)/n_2}\); a confidence interval is \(\overline{p}_1-\overline{p}_2 \pm z^*\cdot SE_D\), where \(z^*\) is computed as `qnorm(1-alpha/2)` and the margin of error is \(z^*\cdot SE_D\).
Under the null hypothesis, \(p_1 = p_2 = p\). We can estimate this common value as \[ \hat p = \frac{x_1+x_2}{n_1+n_2} \] and compute the z-score \[ z = \frac{\overline{p}_1-\overline{p}_2}{\sqrt{\hat p(1-\hat p)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)}} \]
We test this against the standard normal distribution \(\mathcal N(0,1)\).
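Carried out by hand for the Instagram counts above (328 of 537 women, 234 of 532 men use Instagram):

```r
x.1 <- 328; n.1 <- 537    # women
x.2 <- 234; n.2 <- 532    # men

p.1 <- x.1 / n.1
p.2 <- x.2 / n.2
p.pooled <- (x.1 + x.2) / (n.1 + n.2)   # common value under the null

# Pooled z-score and two-sided p-value
z <- (p.1 - p.2) /
  sqrt(p.pooled * (1 - p.pooled) * (1/n.1 + 1/n.2))
p.value <- 2 * pnorm(-abs(z))
c(z = z, p.value = p.value)
```

`prop.test` reports a statistic close to \(z^2\); the small difference comes from its continuity correction.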
Null hypothesis The two probabilities of success are equal, \(p_1 = p_2\).
Alternative hypothesis \(p_1\) is [greater / not equal / less] than \(p_2\).
The two-way table can be constructed through:
test.table = tally(k ~ (v == "label"), data=df)
Requirements `x.1 > 5`, `x.2 > 5`, `n.1 - x.1 > 5`, `n.2 - x.2 > 5`, and population sizes at least `20*n.1` and `20*n.2` respectively.
Command: `prop.test` with arguments `x`, `n`, `alternative`, `conf.level`; using `x.1`, `x.2`, `n.1`, `n.2`:
test = prop.test(c(x.1,x.2), c(n.1,n.2), alternative="two.sided", conf.level=0.99)
test
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: c(x.1, x.2) out of c(n.1, n.2)
## X-squared = 0.66143, df = 1, p-value = 0.4161
## alternative hypothesis: two.sided
## 99 percent confidence interval:
## -0.172063 0.372063
## sample estimates:
## prop 1 prop 2
## 0.46 0.36
Command: `prop.test` with arguments `alternative`, `conf.level`; using a two-way table `test.table`. This function is broken in the package `mosaic`, so the base version is called explicitly as `stats::prop.test`:
test = stats::prop.test(test.table, alternative="two.sided", conf.level=0.99)
test
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: test.table
## X-squared = 0.66143, df = 1, p-value = 0.4161
## alternative hypothesis: two.sided
## 99 percent confidence interval:
## -0.172063 0.372063
## sample estimates:
## prop 1 prop 2
## 0.46 0.36
Command: `prop.test` with arguments `alternative`, `conf.level`; using a data frame `df` with columns `v` for categorical values and `k` for subpopulations:
test = prop.test((v == "label") ~ k, data = df, alternative="two.sided", conf.level=0.99)
test
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: tally((v == "label") ~ k)
## X-squared = 0.66143, df = 1, p-value = 0.4161
## alternative hypothesis: two.sided
## 99 percent confidence interval:
## -0.172063 0.372063
## sample estimates:
## prop 1 prop 2
## 0.46 0.36
Effect size: Odds ratio or Relative risk
\[ p_1 = \frac{x_1}{n_1} \qquad p_2 = \frac{x_2}{n_2} \qquad OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} \qquad RR = \frac{p_1}{p_2} \]
With `x.1`, `x.2`, `n.1`, `n.2`:
observed = matrix(c(x.1, n.1-x.1, x.2, n.2-x.2), nrow=2, byrow=TRUE)
c(oddsRatio(observed),
relrisk(observed))
## OR RR
## 0.6603261 0.7826087
Effect size: Odds ratio or Relative risk
\[ p_1 = \frac{s_1}{s_1+f_1} \qquad p_2 = \frac{s_2}{s_2+f_2} \qquad OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} \qquad RR = \frac{p_1}{p_2} \]
With `s.1`, `s.2`, `f.1`, `f.2` counts of successes and failures:
observed = matrix(c(s.1, f.1, s.2, f.2), nrow=2, byrow=TRUE)
c(oddsRatio(observed),
relrisk(observed))
## OR RR
## 0.6603261 0.7826087
Effect size: Odds ratio or Relative risk
\[ p_1 = \frac{x_1}{n_1} \qquad p_2 = \frac{x_2}{n_2} \qquad OR = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} \qquad RR = \frac{p_1}{p_2} \]
With the two-way table `test.table`:
c(oddsRatio(test.table),
relrisk(test.table))
## OR RR
## 0.6603261 0.7826087
Recall our Instagram data:
| | No | Yes |
|---|---|---|
| Women | 209 | 328 |
| Men | 298 | 234 |
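The commands below use a two-way table `instag.table`, which is not constructed in these notes; one way to build it from the counts above (note the row and column order, which matters for the effect sizes) is:

```r
# Reconstructing the two-way table from the counts above; the column
# order No-before-Yes matches the table as printed
instag.table <- as.table(matrix(
  c(209, 328,
    298, 234),
  nrow = 2, byrow = TRUE,
  dimnames = list(Sex = c("Women", "Men"), Instagram = c("No", "Yes"))))
instag.table
```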
We clear the 10 successes and 10 failures hurdle comfortably: the test can be used.
stats::prop.test(t(instag.table))
##
## 2-sample test for equality of proportions with continuity
## correction
##
## data: t(instag.table)
## X-squared = 30.641, df = 1, p-value = 3.104e-08
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.2324113 -0.1103909
## sample estimates:
## prop 1 prop 2
## 0.4122288 0.5836299
c(oddsRatio(instag.table), relrisk(instag.table))
## OR RR
## 1.998610 1.439238
Careful! Remember that the two-way table ordered the columns as «Does not use» first, and «Does use» second.
So the probability of non-usage increases by about 44% (RR ≈ 1.44) going from women to men.
Fixing the order means using `relevel` cleverly when loading the data and creating the two-way table.
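A minimal sketch of `relevel` on a hypothetical factor; the first level is what the tests and effect sizes treat as success:

```r
v <- factor(c("Yes", "No", "No", "Yes", "No"))
levels(v)                    # "No" "Yes": alphabetical by default

v <- relevel(v, ref = "Yes")
levels(v)                    # "Yes" "No": "Yes" is now the success level
```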