MVJ
12 April, 2018
Last lab finished with 13 datasets. For these datasets, the means, standard deviations, and correlations were:
d | mean(x) | mean(y) | sd(x) | sd(y) | cor(x, y) |
---|---|---|---|---|---|
D1 | 54.27 | 47.83 | 16.77 | 26.94 | -0.06 |
D2 | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
D3 | 54.27 | 47.84 | 16.76 | 26.93 | -0.07 |
D4 | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
D5 | 54.26 | 47.84 | 16.77 | 26.93 | -0.06 |
D6 | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
D7 | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
D8 | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
D9 | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
D10 | 54.27 | 47.84 | 16.77 | 26.93 | -0.06 |
D11 | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
D12 | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
D13 | 54.26 | 47.84 | 16.77 | 26.93 | -0.07 |
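If these 13 datasets are the ones from the datasauRus package's datasaurus_dozen data frame (an assumption; the lab may have provided the data in a different form), a table like the one above could be reproduced roughly as follows:
library(datasauRus)   # assumed source of the 13 datasets
library(dplyr)
library(knitr)
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(mean.x = mean(x), mean.y = mean(y),
            sd.x = sd(x), sd.y = sd(y),
            cor = cor(x, y)) %>%
  kable(digits = 2)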
Linear models depend only on the two means, the two standard deviations, and the correlation. Hence, the linear regressions for all 13 datasets will also be almost identical.
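To see why, recall that the least-squares slope is cor(x, y) * sd(y) / sd(x) and the intercept is mean(y) - slope * mean(x). A quick sketch using the rounded summaries from the table above (the exact numbers are only illustrative):
# least-squares line from summary statistics alone
r <- -0.06; sx <- 16.77; sy <- 26.94; mx <- 54.27; my <- 47.83
slope     <- r * sy / sx          # about -0.096
intercept <- my - slope * mx      # about 53.1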
It may be tempting to view datasets with equal means, standard deviations, correlations, and trend lines as inherently similar to each other.
You might point out that none of the Datasaurus datasets looks like a linear regression is appropriate…
A more classical example is Anscombe’s Quartet.
dataset | mean.x | mean.y | sd.x | sd.y | cor |
---|---|---|---|---|---|
1 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
2 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
3 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
4 | 9 | 7.5 | 3.32 | 2.03 | 0.82 |
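Anscombe’s quartet ships with base R as the data frame anscombe (columns x1–x4 and y1–y4), so the table above can be recomputed directly; a sketch:
# summaries for each of the four Anscombe datasets, using the built-in `anscombe` data
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean.x = mean(x), mean.y = mean(y), sd.x = sd(x), sd.y = sd(y), cor = cor(x, y))
}) %>% t() %>% kable(digits = 2)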
With numeric-numeric paired data, we can:
gf_point(y~x, data=dataset) %>% gf_smooth(method="lm")   # scatter plot with fitted trend line
cor(y~x, data=dataset)                                    # correlation coefficient
lm(y~x, data=dataset)                                     # linear regression model
What about categorical data?
Pairing a categorical variable `c` with a numeric variable `x` can be interpreted as splitting the data into sub-populations. Describing and modeling interactions corresponds to describing and modeling different subpopulations.
gf_histogram(~x, data=dataset, fill=~c)     # histograms filled by group
gf_freqpoly(~x, data=dataset, color=~c)     # or frequency polygons colored by group
gf_boxplot(x~c, data=dataset)               # side-by-side boxplots
favstats(x~c, data=dataset)                 # numerical summaries per group
gf_freqpoly(~age, data=titanic, color=~pclass, bins=20)
gf_boxplot(age~pclass, data=titanic)
favstats(age~pclass, data=titanic) %>% kable(digits=3)
pclass | min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|---|
1st | 0.917 | 28 | 39 | 50 | 80 | 39.160 | 14.548 | 284 | 39 |
2nd | 0.667 | 22 | 29 | 36 | 70 | 29.507 | 13.639 | 261 | 16 |
3rd | 0.167 | 18 | 24 | 32 | 74 | 24.816 | 11.958 | 501 | 208 |
When faced with paired categorical data, there is a limit to how much we can do.
Categorical data primarily allows us to do one thing: count occurrences of a given label.
For paired data, we could count occurrences of each given pair of labels.
Definition: A two-way table generated from two categorical variables \(x\) and \(y\) is a matrix in which rows correspond to the levels of \(x\), columns correspond to the levels of \(y\), and each entry counts the observations with that particular combination of labels.
Two-way tables can be created using `tally(x~y, data=dataset)` (included in `mosaic`) or `table(dataset$x, dataset$y)` (using base R).
tally(pclass~survived, data=titanic) %>% kable
 | TRUE | FALSE |
---|---|---|
1st | 200 | 123 |
2nd | 119 | 158 |
3rd | 181 | 528 |
table(titanic$pclass, titanic$survived) %>% kable
 | FALSE | TRUE |
---|---|---|
1st | 123 | 200 |
2nd | 158 | 119 |
3rd | 528 | 181 |
Co-occurrence count tables can be created for any number of variables. These are increasingly difficult to print out as the number of variables increases.
tally(~pclass + survived + sex, data=titanic)
## , , sex = female
##
## survived
## pclass TRUE FALSE
## 1st 139 5
## 2nd 94 12
## 3rd 106 110
##
## , , sex = male
##
## survived
## pclass TRUE FALSE
## 1st 61 118
## 2nd 25 146
## 3rd 75 418
2-way tables are tricky to plot. One solution can be to build a heatmap of sorts, another to use a bar plot.
As a first step, convert the table to a data frame to make it accessible for our plotting commands.
pclass.survived = tally(~pclass + survived, data=titanic)
pclass.survived.df = as.data.frame(pclass.survived)
pclass.survived.df %>% kable()
pclass | survived | Freq |
---|---|---|
1st | TRUE | 200 |
2nd | TRUE | 119 |
3rd | TRUE | 181 |
1st | FALSE | 123 |
2nd | FALSE | 158 |
3rd | FALSE | 528 |
Next, use the `gf_bar` command to create a bar plot. We already have the counts and don’t want `gf_bar` to count for us; we can do this by using the parameter `stat="identity"`.
gf_bar(Freq ~ survived | pclass, data=pclass.survived.df, stat="identity")
For a heat map instead, we can use `gf_tile`:
gf_tile(pclass ~ survived, fill=~Freq, data=pclass.survived.df) +
scale_fill_viridis_c()
titanic.table = tally(~pclass + survived + sex + cut(age, 4), data=titanic) %>%
as.data.frame()
gf_tile(pclass ~ survived | sex ~ cut.age..4., fill=~Freq, data=titanic.table) +
scale_fill_viridis_c()
From an N-way table, we may be interested in several different derived data tables: proportion tables, conditional proportion tables, and marginal count tables.
Proportion tables are calculated using `prop.table`. Each entry is the proportion of all observations that have this particular combination of labels.
pclass.survived %>% prop.table() %>% kable(digits=3)
 | TRUE | FALSE |
---|---|---|
1st | 0.153 | 0.094 |
2nd | 0.091 | 0.121 |
3rd | 0.138 | 0.403 |
Conditional proportion tables are calculated using `prop.table` with a second argument picking out which variable(s) denote the subpopulations. Each entry is the proportion of observations in that subpopulation that have this particular combination of labels.
pclass.survived %>% prop.table(1) %>% kable(digits=3)
 | TRUE | FALSE |
---|---|---|
1st | 0.619 | 0.381 |
2nd | 0.430 | 0.570 |
3rd | 0.255 | 0.745 |
Marginal counts will drop variables; an N-way table will become an (N-1)-way table. A 1-way table is a vector of counts. A 0-way table is the total number of elements. Marginal counts are generated using `margin.table` with a second parameter specifying which variables to keep.
pclass.survived %>% margin.table() %>% kable()
1309
pclass.survived %>% margin.table(1)
## pclass
## 1st 2nd 3rd
## 323 277 709
tally(~pclass+survived+sex, data=titanic) %>% margin.table(c(1,3)) %>% kable()
 | female | male |
---|---|---|
1st | 144 | 179 |
2nd | 106 | 171 |
3rd | 216 | 493 |
Data on customer service representative performance: it records whether each representative handled customer issues within 10 minutes or not.
Met goal? | Alexis | Peyton |
---|---|---|
Yes | 172 (86%) | 118 (59%) |
No | 28 (14%) | 82 (41%) |
Total | 200 | 200 |
Who is more successful?
Let’s look closer. The data was collected in two different weeks. In week 1, problems were all easy to resolve. Week 2 was immediately after a new product launch: problems were significantly harder to resolve.
Met goal? | Alexis (week 1) | Peyton (week 1) | Alexis (week 2) | Peyton (week 2) |
---|---|---|---|---|
Yes | 162 (90%) | 19 (95%) | 10 (50%) | 99 (55%) |
No | 18 (10%) | 1 (5%) | 10 (50%) | 81 (45%) |
Total | 180 | 20 | 20 | 180 |
Who is more successful?
This phenomenon is called Simpson’s Paradox: a trend present in subgroups of the data may vanish or even reverse when the data is aggregated.
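The reversal can be verified directly from the counts above (a small sketch; the object names are just for illustration):
# per-week counts for each representative, taken from the tables above
alexis <- data.frame(week = c(1, 2), yes = c(162, 10), total = c(180, 20))
peyton <- data.frame(week = c(1, 2), yes = c(19, 99),  total = c(20, 180))
alexis$yes / alexis$total                # within each week: 0.90, 0.50
peyton$yes / peyton$total                # within each week: 0.95, 0.55; Peyton is ahead both weeks
sum(alexis$yes) / sum(alexis$total)      # aggregated: 172/200 = 0.86
sum(peyton$yes) / sum(peyton$total)      # aggregated: 118/200 = 0.59; Alexis is ahead overall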
Correlation measures the association of one variable with another: how similarly they behave. But an association may or may not reflect a direct cause-and-effect relationship.
Association between variables can be due to:
Two variables are confounded if their effects on the response cannot be distinguished. Confounding variables may not even be known; finding all variables that need to be measured is an essential challenge in research.
Establishing causation is best done by conducting an experiment where only the explanatory variable is changed: all possible confounders are controlled.
In the absence of a carefully designed experiment, we may propose 5 criteria for establishing causation: