Lecture 8

MVJ

12 April, 2018

Datasaurus

The last lab finished with 13 datasets. For these datasets, the means, standard deviations, and correlations were:

d mean(x) mean(y) sd(x) sd(y) cor(x, y)
D1 54.27 47.83 16.77 26.94 -0.06
D2 54.27 47.83 16.77 26.94 -0.07
D3 54.27 47.84 16.76 26.93 -0.07
D4 54.26 47.83 16.77 26.94 -0.06
D5 54.26 47.84 16.77 26.93 -0.06
D6 54.26 47.83 16.77 26.94 -0.06
D7 54.27 47.84 16.77 26.94 -0.07
D8 54.27 47.84 16.77 26.94 -0.07
D9 54.27 47.83 16.77 26.94 -0.07
D10 54.27 47.84 16.77 26.93 -0.06
D11 54.27 47.84 16.77 26.94 -0.07
D12 54.27 47.83 16.77 26.94 -0.07
D13 54.26 47.84 16.77 26.93 -0.07

Datasaurus

A linear model depends only on the two means, the two standard deviations, and the correlation. Hence, the linear regressions for all 13 datasets will also be almost identical.
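Since the fitted line is determined entirely by these five numbers, it can be computed directly from them. A minimal R sketch, using the rounded summary values reported for D1 (so the coefficients are only approximate):

```r
# Simple linear regression from summary statistics alone.
# Values are the rounded summaries reported for dataset D1.
mx <- 54.27; my <- 47.83        # means
sx <- 16.77; sy <- 26.94        # standard deviations
r  <- -0.06                     # correlation

slope     <- r * sy / sx        # b1 = cor(x, y) * sd(y) / sd(x)
intercept <- my - slope * mx    # b0 = mean(y) - b1 * mean(x)
c(intercept = intercept, slope = slope)
```

Plugging in any row of the table gives nearly the same line, roughly y ≈ 53 − 0.1x.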

Datasaurus

It may be tempting to view datasets with equal means, standard deviations, correlations, and trend lines as inherently similar to each other.


Anscombe’s Quartet

You might point out that none of the Datasaurus datasets look like a linear regression would be appropriate in the first place…

A more classical example is Anscombe’s Quartet.

dataset mean.x mean.y sd.x sd.y cor
1 9 7.5 3.32 2.03 0.82
2 9 7.5 3.32 2.03 0.82
3 9 7.5 3.32 2.03 0.82
4 9 7.5 3.32 2.03 0.82
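Anscombe’s quartet ships with base R as the data frame `anscombe` (columns x1..x4 and y1..y4), so the matching summaries in the table above can be checked directly:

```r
# anscombe is a built-in data frame with columns x1..x4 and y1..y4
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.x = mean(x), mean.y = mean(y),
    sd.x = sd(x), sd.y = sd(y), cor = cor(x, y))
})
round(stats, 2)   # one column per dataset; all four nearly identical
```

Despite the matching summaries, plotting the four datasets reveals strikingly different shapes.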

Recall: numeric-numeric paired data

With numeric-numeric paired data, we can compute summary statistics (means, standard deviations, correlation), draw scatter plots, and fit linear models.

What about categorical data?

Categorical-numeric: subpopulation split

Pairing a categorical variable c with a numeric variable x can be interpreted as splitting data into sub-populations. Describing and modeling interactions corresponds to describing and modeling different subpopulations.
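The subpopulation view can be illustrated in base R alone. A sketch using a small made-up data frame (standing in for data such as the titanic set used below): splitting the numeric variable by the categorical one yields one vector per subpopulation, each of which can then be summarized separately.

```r
# Toy data frame: `group` is categorical, `value` is numeric
d <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 3, 2, 4, 6)
)
by_group <- split(d$value, d$group)   # one numeric vector per subpopulation
sapply(by_group, mean)                # per-group means: a = 2, b = 4
```

Any numeric summary (median, sd, quantiles) can be applied per group in the same way.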

Categorical-numeric

gf_freqpoly(~age, data=titanic, color=~pclass, bins=20)

Categorical-numeric

gf_boxplot(age~pclass, data=titanic)

Categorical-numeric

favstats(age~pclass, data=titanic) %>% kable(digits=3)
pclass min Q1 median Q3 max mean sd n missing
1st 0.917 28 39 50 80 39.160 14.548 284 39
2nd 0.667 22 29 36 70 29.507 13.639 261 16
3rd 0.167 18 24 32 74 24.816 11.958 501 208

Categorical-categorical

When faced with paired categorical data, there is a limit to how much we can do.

Categorical data primarily allows us one operation: counting occurrences of a given label.

For paired data, we could count occurrences of each given pair of labels.

Two-way tables

Definition A two-way table generated from two categorical variables \(x\) and \(y\) is a matrix where rows correspond to the labels of \(x\), columns correspond to the labels of \(y\), and each entry counts the observations carrying that pair of labels.

Two-way tables can be created using tally (from mosaic) or the base R function table.

Two-way tables

tally(pclass~survived, data=titanic) %>% kable
TRUE FALSE
1st 200 123
2nd 119 158
3rd 181 528
table(titanic$pclass, titanic$survived) %>% kable
FALSE TRUE
1st 123 200
2nd 158 119
3rd 528 181

N-way tables

Co-occurrence count tables can be created for any number of variables. These are increasingly difficult to print out as the number of variables increases.

tally(~pclass + survived + sex, data=titanic) 
## , , sex = female
## 
##       survived
## pclass TRUE FALSE
##    1st  139     5
##    2nd   94    12
##    3rd  106   110
## 
## , , sex = male
## 
##       survived
## pclass TRUE FALSE
##    1st   61   118
##    2nd   25   146
##    3rd   75   418

Plotting a 2-way table

2-way tables are tricky to plot. One solution can be to build a heatmap of sorts, another to use a bar plot.

As a first step, convert the table to a data frame to make it accessible for our plotting commands.

pclass.survived = tally(~pclass + survived, data=titanic)
pclass.survived.df = as.data.frame(pclass.survived)
pclass.survived.df %>% kable()
pclass survived Freq
1st TRUE 200
2nd TRUE 119
3rd TRUE 181
1st FALSE 123
2nd FALSE 158
3rd FALSE 528

Plotting a 2-way table

Next, use the gf_bar command to create a bar plot. We already have the counts and don’t want gf_bar to count for us; we tell it so by passing the parameter stat="identity".

gf_bar(Freq ~ survived | pclass, data=pclass.survived.df, stat="identity")

Plotting a 2-way table

For a heat map instead, we can use gf_tile

gf_tile(pclass ~ survived, fill=~Freq, data=pclass.survived.df) +
  scale_fill_viridis_c()

Plotting an N-way table

titanic.table = tally(~pclass + survived + sex + cut(age, 4), data=titanic) %>% 
  as.data.frame()
gf_tile(pclass ~ survived | sex ~ cut.age..4., fill=~Freq, data=titanic.table) +
  scale_fill_viridis_c()

Derived data

From an N-way table, we may be interested in several different derived data tables: joint proportions, conditional proportions, and marginal counts.

Joint proportions

Calculated using prop.table.

Each entry is the proportion out of all observations that have this particular combination of labels.

pclass.survived %>% prop.table() %>% kable(digits=3)
TRUE FALSE
1st 0.153 0.094
2nd 0.091 0.121
3rd 0.138 0.403
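The same computation works in base R alone, rebuilding the pclass-by-survived counts from the earlier slide as a plain matrix (so no packages are assumed):

```r
# Counts copied from the tally(pclass ~ survived) table
tab <- matrix(c(200, 119, 181, 123, 158, 528), nrow = 3,
              dimnames = list(pclass   = c("1st", "2nd", "3rd"),
                              survived = c("TRUE", "FALSE")))
joint <- prop.table(tab)   # divide every cell by the grand total (1309)
round(joint, 3)
sum(joint)                 # joint proportions always sum to 1
```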

Conditional proportions

Calculated using prop.table with a second argument picking out which variable(s) denote the subpopulations

Each entry is the proportion out of observations in that subpopulation that have this particular combination of labels.

pclass.survived %>% prop.table(1) %>% kable(digits=3)
TRUE FALSE
1st 0.619 0.381
2nd 0.430 0.570
3rd 0.255 0.745
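With the same counts rebuilt as a base-R matrix, prop.table(tab, 1) normalizes within each row, so each row (each passenger class) sums to 1:

```r
# Counts copied from the tally(pclass ~ survived) table
tab <- matrix(c(200, 119, 181, 123, 158, 528), nrow = 3,
              dimnames = list(pclass   = c("1st", "2nd", "3rd"),
                              survived = c("TRUE", "FALSE")))
cond <- prop.table(tab, 1)   # margin 1: condition on pclass (rows)
round(cond, 3)
rowSums(cond)                # every row sums to 1
```

Choosing margin 2 instead would condition on survival status, giving column-wise proportions.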

Marginal counts

Marginal counts will drop variables; an N-way table will become an (N-1)-way table. A 1-way table is a vector with counts. A 0-way table is the total number of elements.

Marginal counts are generated using margin.table with a second parameter specifying which variables to keep.

pclass.survived %>% margin.table() %>% kable()
1309

Marginal counts

pclass.survived %>% margin.table(1)
## pclass
## 1st 2nd 3rd 
## 323 277 709

Marginal counts

tally(~pclass+survived+sex, data=titanic) %>% margin.table(c(1,3)) %>% kable()
female male
1st 144 179
2nd 106 171
3rd 216 493

Simpson’s Paradox

Data on the performance of two customer service representatives, Alexis and Peyton: whether or not they handled customer issues within 10 minutes.

Met goal? Alexis Peyton
Yes 172 (86%) 118 (59%)
No 28 (14%) 82 (41%)
Total 200 200

Who is more successful?

Simpson’s Paradox

Let’s look closer. Data was collected in two different weeks. In week 1, problems were all easy to resolve. Week 2 was immediately after a new product launch: problems were significantly harder to resolve.

Met goal? Alexis (wk 1) Peyton (wk 1) Alexis (wk 2) Peyton (wk 2)
Yes 162 (90%) 19 (95%) 10 (50%) 99 (55%)
No 18 (10%) 1 (5%) 10 (50%) 81 (45%)
Total 180 20 20 180

Who is more successful?
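The reversal can be checked directly from the counts on the slide:

```r
# Success rates within each week
alexis_w1 <- 162 / 180;  peyton_w1 <- 19 / 20    # week 1: 0.90 vs 0.95
alexis_w2 <- 10 / 20;    peyton_w2 <- 99 / 180   # week 2: 0.50 vs 0.55

peyton_w1 > alexis_w1    # TRUE: Peyton wins week 1
peyton_w2 > alexis_w2    # TRUE: Peyton wins week 2

# Aggregated over both weeks, the comparison reverses:
alexis_all <- (162 + 10) / 200   # 0.86
peyton_all <- (19 + 99) / 200    # 0.59
alexis_all > peyton_all          # TRUE: Alexis wins overall
```

The reversal is driven by the unequal mix: Alexis handled mostly easy week-1 issues, Peyton mostly hard week-2 issues.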

Simpson’s Paradox

This phenomenon is called Simpson’s Paradox: a trend that appears in subgroups of the data may vanish or even reverse when the subgroups are aggregated.

Causality

Correlation measures the association between two variables, that is, how similarly they behave. But an association may or may not reflect a direct cause-and-effect relationship.

Let’s take a look at some examples.

Causality

Association between variables can be due to direct causation, common response to a lurking variable, or confounding.

Two variables are confounded if their effects on the response cannot be distinguished. Confounding variables may not even be known; identifying all the variables that need to be measured is an essential challenge in research.

Causality

Establishing causation is best done by conducting an experiment in which only the explanatory variable is changed: all possible confounders are controlled.

In the absence of a carefully designed experiment, we may propose 5 criteria for establishing causation:

  1. Strong association
  2. Consistent association
  3. Positive association (more cause yields more effect)
  4. Temporal order (cause precedes effect in time: no time travel!)
  5. Plausibility