MVJ
12 April, 2018
Last lab finished with 13 datasets. For these datasets, the means, standard deviations, and correlations were:
d | mean(x) | mean(y) | sd(x) | sd(y) | cor(x, y) |
---|---|---|---|---|---|
D1 | 54.27 | 47.83 | 16.77 | 26.94 | -0.06 |
D2 | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
D3 | 54.27 | 47.84 | 16.76 | 26.93 | -0.07 |
D4 | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
D5 | 54.26 | 47.84 | 16.77 | 26.93 | -0.06 |
D6 | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
D7 | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
D8 | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
D9 | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
D10 | 54.27 | 47.84 | 16.77 | 26.93 | -0.06 |
D11 | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
D12 | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
D13 | 54.26 | 47.84 | 16.77 | 26.93 | -0.07 |
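If these 13 datasets are the ones from the datasauRus package's datasaurus_dozen data frame (an assumption; the lab may have provided the data in a different form), a table like the one above could be reproduced roughly as follows:
library(datasauRus)   # assumed source of the 13 datasets
library(dplyr)
library(knitr)
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean.x = mean(x), mean.y = mean(y),
            sd.x = sd(x), sd.y = sd(y),
            cor = cor(x, y)) %>%
  kable(digits = 2)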
Linear models depend only on the two means, the two standard deviations, and the correlation. Hence, the linear regressions for all 13 datasets will also be almost identical.
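To see why, recall that the least-squares slope is cor(x, y) * sd(y) / sd(x) and the intercept is mean(y) - slope * mean(x). A quick sketch using the rounded summaries from the table above (the exact numbers are only illustrative):
# least-squares line from summary statistics alone
r <- -0.06; sx <- 16.77; sy <- 26.94; mx <- 54.27; my <- 47.83
slope     <- r * sy / sx          # about -0.096
intercept <- my - slope * mx      # about 53.1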
It may be tempting to view datasets with equal means, standard deviations, correlations, and trend lines as inherently similar to each other.
You might point out that none of the Datasaurus datasets looks like a linear regression is appropriate…
A more classical example is Anscombe’s Quartet.
dataset | mean.x | mean.y | sd.x | sd.y | cor |
---|---|---|---|---|---|
1 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
2 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
3 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
4 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
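Anscombe’s quartet ships with base R as the data frame anscombe (columns x1–x4 and y1–y4), so the table above can be recomputed directly; a sketch:
# summaries for each of the four Anscombe datasets, using the built-in `anscombe` data
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.x = mean(x), mean.y = mean(y), sd.x = sd(x), sd.y = sd(y), cor = cor(x, y))
}) %>% t() %>% kable(digits = 2)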
With numeric-numeric paired data, we can:
gf_point(y~x, data=dataset) %>% gf_smooth(method="lm")   # scatter plot with fitted trend line
cor(y~x, data=dataset)                                    # correlation coefficient
lm(y~x, data=dataset)                                     # linear regression model
What about categorical data?
Pairing a categorical variable `c` with a numeric variable `x` can be interpreted as splitting the data into sub-populations. Describing and modeling interactions corresponds to describing and modeling different subpopulations.
gf_histogram(~x, data=dataset, fill=~c)     # histograms filled by group
gf_freqpoly(~x, data=dataset, color=~c)     # or frequency polygons colored by group
gf_boxplot(x~c, data=dataset)               # side-by-side boxplots
favstats(x~c, data=dataset)                 # numerical summaries per group
gf_freqpoly(~age, data=titanic, color=~pclass, bins=20)
gf_boxplot(age~pclass, data=titanic)
favstats(age~pclass, data=titanic) %>% kable(digits=3)
pclass | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
1st | 0.917 | 28 | 39 | 50 | 80 | 39.160 | 14.548 | 284 | 39 |
2nd | 0.667 | 22 | 29 | 36 | 70 | 29.507 | 13.639 | 261 | 16 |
3rd | 0.167 | 18 | 24 | 32 | 74 | 24.816 | 11.958 | 501 | 208 |
When faced with paired categorical data, there is a limit to how much we can do.
Categorical data primarily allows us to do one thing: count occurrences of a given label.
For paired data, we could count occurrences of each given pair of labels.
Definition: A two-way table generated from two categorical variables \(x\) and \(y\) is a matrix in which rows correspond to the levels of \(x\), columns correspond to the levels of \(y\), and each entry counts the observations with that particular combination of labels.
Two-way tables can be created using `tally(x~y, data=dataset)` (included in `mosaic`) or `table(dataset$x, dataset$y)` (using base R).
tally(pclass~survived, data=titanic) %>% kable
 | TRUE | FALSE |
---|---|---|
1st | 200 | 123 |
2nd | 119 | 158 |
3rd | 181 | 528 |
table(titanic$pclass, titanic$survived) %>% kable
 | FALSE | TRUE |
---|---|---|
1st | 123 | 200 |
2nd | 158 | 119 |
3rd | 528 | 181 |
Co-occurrence count tables can be created for any number of variables. These are increasingly difficult to print out as the number of variables increases.
tally(~pclass + survived + sex, data=titanic)
## , , sex = female
##
## survived
## pclass TRUE FALSE
## 1st 139 5
## 2nd 94 12
## 3rd 106 110
##
## , , sex = male
##
## survived
## pclass TRUE FALSE
## 1st 61 118
## 2nd 25 146
## 3rd 75 418
2-way tables are tricky to plot. One solution can be to build a heatmap of sorts, another to use a bar plot.
As a first step, convert the table to a data frame to make it accessible for our plotting commands.
pclass.survived = tally(~pclass + survived, data=titanic)
pclass.survived.df = as.data.frame(pclass.survived)
pclass.survived.df %>% kable()
pclass | survived | Freq |
---|---|---|
1st | TRUE | 200 |
2nd | TRUE | 119 |
3rd | TRUE | 181 |
1st | FALSE | 123 |
2nd | FALSE | 158 |
3rd | FALSE | 528 |
Next, use the `gf_bar` command to create a bar plot. We already have the counts and don’t want `gf_bar` to count for us; we can do this by using the parameter `stat="identity"`.
gf_bar(Freq ~ survived | pclass, data=pclass.survived.df, stat="identity")
For a heat map instead, we can use `gf_tile`:
gf_tile(pclass ~ survived, fill=~Freq, data=pclass.survived.df) +
scale_fill_viridis_c()
titanic.table = tally(~pclass + survived + sex + cut(age, 4), data=titanic) %>%
as.data.frame()
gf_tile(pclass ~ survived | sex ~ cut.age..4., fill=~Freq, data=titanic.table) +
scale_fill_viridis_c()
From an N-way table, we may be interested in several different derived data tables: proportion tables, conditional proportion tables, and marginal count tables.
Proportion tables are calculated using `prop.table`. Each entry is the proportion of all observations that have this particular combination of labels.
pclass.survived %>% prop.table() %>% kable(digits=3)
 | TRUE | FALSE |
---|---|---|
1st | 0.153 | 0.094 |
2nd | 0.091 | 0.121 |
3rd | 0.138 | 0.403 |
Conditional proportion tables are calculated using `prop.table` with a second argument picking out which variable(s) denote the subpopulations. Each entry is the proportion of observations in that subpopulation that have this particular combination of labels.
pclass.survived %>% prop.table(1) %>% kable(digits=3)
 | TRUE | FALSE |
---|---|---|
1st | 0.619 | 0.381 |
2nd | 0.430 | 0.570 |
3rd | 0.255 | 0.745 |
Marginal counts will drop variables; an N-way table will become an (N-1)-way table. A 1-way table is a vector of counts. A 0-way table is the total number of elements. Marginal counts are generated using `margin.table` with a second parameter specifying which variables to keep.
pclass.survived %>% margin.table() %>% kable()
1309
pclass.survived %>% margin.table(1)
## pclass
## 1st 2nd 3rd
## 323 277 709
tally(~pclass+survived+sex, data=titanic) %>% margin.table(c(1,3)) %>% kable()
 | female | male |
---|---|---|
1st | 144 | 179 |
2nd | 106 | 171 |
3rd | 216 | 493 |
Data on customer service representative performance: it records whether each representative handled customer issues within 10 minutes or not.
Met goal? | Alexis | Peyton |
---|---|---|
Yes | 172 (86%) | 118 (59%) |
No | 28 (14%) | 82 (41%) |
Total | 200 | 200 |
Who is more successful?
Let’s look closer. The data was collected in two different weeks. In week 1, problems were all easy to resolve. Week 2 was immediately after a new product launch: problems were significantly harder to resolve.
Met goal? | Alexis (week 1) | Peyton (week 1) | Alexis (week 2) | Peyton (week 2) |
---|---|---|---|---|
Yes | 162 (90%) | 19 (95%) | 10 (50%) | 99 (55%) |
No | 18 (10%) | 1 (5%) | 10 (50%) | 81 (45%) |
Total | 180 | 20 | 20 | 180 |
Who is more successful?
This phenomenon is called Simpson’s Paradox: a trend present in subgroups of the data may vanish or even reverse when the data is aggregated.
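The reversal can be verified directly from the counts above (a small sketch; the object names are just for illustration):
# per-week counts for each representative, taken from the tables above
alexis <- data.frame(week = c(1, 2), yes = c(162, 10), total = c(180, 20))
peyton <- data.frame(week = c(1, 2), yes = c(19, 99),  total = c(20, 180))
alexis$yes / alexis$total                # within each week: 0.90, 0.50
peyton$yes / peyton$total                # within each week: 0.95, 0.55; Peyton is ahead both weeks
sum(alexis$yes) / sum(alexis$total)      # aggregated: 172/200 = 0.86
sum(peyton$yes) / sum(peyton$total)      # aggregated: 118/200 = 0.59; Alexis is ahead overall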
Correlation measures the association of one variable with another: how similarly they behave. But an association may or may not reflect a direct cause-and-effect relationship.
Association between variables can be due to:
Two variables are confounded if their effects on the response cannot be distinguished. Confounding variables may not even be known; finding all variables that need to be measured is an essential challenge in research.
Establishing causation is best done by conducting an experiment where only the explanatory variable is changed: all possible confounders are controlled.
In the absence of a carefully designed experiment, we may propose 5 criteria for establishing causation: