6 February, 2018
dataset | mean.x | mean.y | sd.x | sd.y | cor |
---|---|---|---|---|---|
a | 54.27 | 47.83 | 16.77 | 26.94 | -0.06 |
b | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
c | 54.27 | 47.84 | 16.76 | 26.93 | -0.07 |
d | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
e | 54.26 | 47.84 | 16.77 | 26.93 | -0.06 |
f | 54.26 | 47.83 | 16.77 | 26.94 | -0.06 |
g | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
h | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
i | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
j | 54.27 | 47.84 | 16.77 | 26.93 | -0.06 |
k | 54.27 | 47.84 | 16.77 | 26.94 | -0.07 |
l | 54.27 | 47.83 | 16.77 | 26.94 | -0.07 |
m | 54.26 | 47.84 | 16.77 | 26.93 | -0.07 |
Summary statistics are notoriously weak: they are not a complete description of a dataset.
Classic example: Anscombe's quartet
Modern example: Autodesk Research's Datasaurus dozen
Basic scheme: derive high density joint probabilities from categorical observations.
transaction.id | item.name |
---|---|
1 | andouille |
1 | okra |
1 | chicken |
1 | bell pepper |
1 | celery |
2 | crawfish |
2 | quiche |
2 | toast |
After one-hot encoding, each transaction becomes a row of binary indicator variables:
transaction.id | andouille | okra | chicken | bell.pepper | celery | crawfish | quiche | toast |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
3 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 |
4 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
5 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
Look for combinations that are surprisingly probable, compared to what their size or support would suggest.
Fundamental problem of association rules: locate subsets \(s_j\subseteq S_j\) of the possible values \(S_j\) of each variable \(X_j\) such that \[ \mathbb{P}\left(\bigcap_{j=1}^p(X_j\in s_j)\right) \] is large. In other words: locate well-supported conjunctive rules.
We can simplify the fundamental problem by only considering subsets \(s_j\) that are either a single value, or all possible values; for each variable we either don't care, or require a specific choice.
Using dummy variables (one-hot encoding), we can transform the problem into one involving only binary variables.
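A sketch of this transformation in Python with pandas (the column names and items are hypothetical, matching the toy tables above):

```python
import pandas as pd

# Long-format transactions, one row per (transaction, item) pair
transactions = pd.DataFrame({
    "transaction_id": [1, 1, 1, 2, 2],
    "item_name": ["andouille", "okra", "chicken", "crawfish", "quiche"],
})

# One-hot encode: one row per transaction, one binary column per item
onehot = (pd.crosstab(transactions["transaction_id"],
                      transactions["item_name"])
            .clip(upper=1))  # collapse repeat purchases to 0/1

print(onehot)
```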
Fundamental problem of market basket analysis: locate subsets \(S\) of the integers \(1, \dots, K\) such that for dummy variables \(Z_k\), \[ \mathbb{P}\left(\prod_{k\in S}Z_k = 1\right) \] is large.
Item set: subset \(S\) of integers \(1, \dots, K\).
Size: the number of elements of an item set.
Support: proportion \(T(S)\) of observations containing the item set.
Adapted fundamental problem: Find all item sets with support exceeding specified lower bound \(t\).
For large datasets, with a minimum support high enough that the number of item sets stays small, the Apriori algorithm makes market basket analysis feasible.
Plan of attack: Eliminate rare itemsets and extend common ones.
This prunes the search tree aggressively: once size \(k\) item sets with high support are isolated, only the size \(k+1\) item sets that can be formed out of those item sets need to be considered.
Agrawal and Srikant, Fast algorithms for mining association rules, VLDB '94
Input: \(N\times p\) binary matrix \(X\); support threshold \(t\). `T(itemset)` returns its support.

```
L[1] = {1-itemsets with support ≥ t}
k = 2
count = dictionary with integer entries, defaulting to 0
while L[k-1] is not empty:
    candidates  = { I extended with item a, for I in L[k-1] and a not in I }
    candidates -= { J if J has a subset of size k-1 that is not in L[k-1] }
    for each transaction x in X:
        for c in candidates:
            if c is a subset of x:
                count[c] += 1
    L[k] = { c for c in candidates if count[c] / N ≥ t }
    k += 1
return union of all L[j]
```
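A minimal, unoptimized Python sketch of the pseudocode above, with itemsets as frozensets and transactions as sets of items:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support (fraction of transactions
    containing them) is at least min_support."""
    n = len(transactions)
    items = {item for t in transactions for item in t}
    # L1: frequent single items
    frequent = {frozenset([i]) for i in items
                if sum(i in t for t in transactions) / n >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # Extend each frequent (k-1)-itemset by one new item
        candidates = {I | {a} for I in frequent for a in items if a not in I}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Single pass over the data to count candidate occurrences
        count = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    count[c] += 1
        frequent = {c for c in candidates if count[c] / n >= min_support}
        result |= frequent
        k += 1
    return result
```

Real implementations gain their speed from data structures such as hash trees for the candidate-subset checks; this sketch only shows the control flow.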
The main trick in implementing this algorithm is choosing data structures that make the hidden costly steps (for instance, pruning candidate itemsets) fast.
Strength: only requires a single pass over the dataset for each size of the itemsets.
For sparse data, and/or high threshold, the algorithm terminates quickly.
Let \(S\) be a high-support item set. We split \(S\) into disjoint subsets \(A\) and \(B\), and write \(A\Rightarrow B\) for the corresponding association rule.
The support of a rule is \(T(A\Rightarrow B) = T(A\cup B)\).
The confidence or predictability \(C(A\Rightarrow B)\) is defined by \[
C(A\Rightarrow B) = \frac{T(A\Rightarrow B)}{T(A)}
\]
This estimates conditional probability \(\mathbb{P}(B | A)\). If \(A\) and \(B\) are in fact independent, we would expect \(\mathbb{P}(B|A) = \mathbb{P}(B)\).
The lift of a rule is how many times more confident the rule is than pure chance warrants \[
L(A\Rightarrow B) = \frac{C(A\Rightarrow B)}{T(B)}
\]
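The three quantities can be estimated directly from a list of transactions; a sketch (the function name and example data are made up):

```python
def rule_stats(transactions, A, B):
    """Support, confidence, and lift of the rule A => B,
    estimated from a list of transactions (sets of items)."""
    n = len(transactions)
    T_A  = sum(A <= t for t in transactions) / n       # support of antecedent
    T_B  = sum(B <= t for t in transactions) / n       # support of consequent
    T_AB = sum(A | B <= t for t in transactions) / n   # support of the rule
    confidence = T_AB / T_A    # estimates P(B | A)
    lift = confidence / T_B    # equals 1 if A and B were independent
    return T_AB, confidence, lift
```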
For an association rule analysis, a second threshold \(c\) is set: association rules \(A\Rightarrow B\) derived from high-support item sets are retained if \(C(A\Rightarrow B) > c\).
The output from Apriori is typically a database of rules that can be queried on rule components, by confidence, by lift, and by support. E.g.:
Display all rules in which ice skates are the consequent, the confidence is over 80%, the support is over 2%.
This would search for rules \(A\Rightarrow \{\)ice skates\(\}\) such that over 2% of all transactions contain every item in the rule, and such that the support of the rule is at least 80% of the support of the antecedent (the non-ice-skate item set).
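The same query over a hypothetical in-memory list of rules (all item names and numbers below are made up for illustration):

```python
# Hypothetical mined rules: (antecedent, consequent, support, confidence)
rules = [
    ({"skate laces"},         "ice skates", 0.025, 0.85),
    ({"helmet"},              "ice skates", 0.010, 0.90),  # support too low
    ({"skate laces", "puck"}, "ice skates", 0.030, 0.70),  # confidence too low
    ({"milk"},                "bread",      0.200, 0.60),  # wrong consequent
]

# "ice skates as consequent, confidence over 80%, support over 2%"
hits = [(lhs, rhs) for (lhs, rhs, support, confidence) in rules
        if rhs == "ice skates" and confidence > 0.80 and support > 0.02]
```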
Example: the mushroom dataset, one-hot encoded into items of the form `Variable=level`:

- Class: edible, poisonous
- CapShape: bell, conical, flat, knobbed, sunken, convex
- CapSurf: fibrous, grooves, smooth, scaly
- CapColor: buff, cinnamon, red, gray, brown, pink, green, purple, white, yellow
- Bruises: no, bruises
- Odor: almond, creosote, foul, anise, musty, none, pungent, spicy, fishy
- GillAttached: attached, free
- GillSpace: close, crowded
- GillSize: broad, narrow
- GillColor: buff, red, gray, chocolate, black, brown, orange, pink, green, purple, white, yellow
- StalkShape: enlarging, tapering
- StalkRoot: bulbous, club, equal, rooted
- SurfaceAboveRing: fibrous, silky, smooth, scaly
- SurfaceBelowRing: fibrous, silky, smooth, scaly
- ColorAboveRing: buff, cinnamon, red, gray, brown, orange, pink, white, yellow
- ColorBelowRing: buff, cinnamon, red, gray, brown, orange, pink, white, yellow
- VeilType: partial
- VeilColor: brown, orange, white, yellow
- RingNumber: none, one, two
- RingType: evanescent, flaring, large, none, pendant
- Spore: buff, chocolate, black, brown, orange, green, purple, white, yellow
- Population: brown, yellow
- Habitat: woods, grasses, leaves, meadows, paths, urban, waste
R has the package arules. Python does not have a widely used implementation.
  | lhs |  | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {} | => | {GillAttached=free} | 0.9741507 | 0.9741507 | 1.000000 | 7914 |
[2] | {} | => | {VeilColor=white} | 0.9753816 | 0.9753816 | 1.000000 | 7924 |
[3] | {} | => | {VeilType=partial} | 1.0000000 | 1.0000000 | 1.000000 | 8124 |
[4] | {Habitat=grasses} | => | {GillAttached=free} | 0.2644018 | 1.0000000 | 1.026535 | 2148 |
[5] | {Habitat=grasses} | => | {VeilColor=white} | 0.2644018 | 1.0000000 | 1.025240 | 2148 |
[6] | {Habitat=grasses} | => | {VeilType=partial} | 0.2644018 | 1.0000000 | 1.000000 | 2148 |
```
## set of 12765 rules
##
## rule length distribution (lhs + rhs): sizes
##    1    2    3    4    5    6    7    8    9   10
##    3  105  686 1998 3193 3118 2098 1071  400   93
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   5.000   6.000   5.697   7.000  10.000
##
## summary of quality measures:
##     support         confidence          lift            count
##  Min.   :0.2501   Min.   :0.9500   Min.   :0.9746   Min.   :2032
##  1st Qu.:0.2659   1st Qu.:1.0000   1st Qu.:1.0252   1st Qu.:2160
##  Median :0.2935   Median :1.0000   Median :1.0265   Median :2384
##  Mean   :0.3098   Mean   :0.9932   Mean   :1.2450   Mean   :2516
##  3rd Qu.:0.3269   3rd Qu.:1.0000   3rd Qu.:1.3926   3rd Qu.:2656
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.9265   Max.   :8124
##
## mining info:
##  data ntransactions support confidence
##     .          8124    0.25       0.95
```
```r
mushroom.ap.poisonous = subset(mushroom.ap, subset = rhs %in% "Class=poisonous")
```
  | lhs |  | rhs | support | confidence | lift | count |
---|---|---|---|---|---|---|---|
[1] | {Odor=foul} | => | {Class=poisonous} | 0.27 | 1 | 2.07 | 2160 |
[2] | {Odor=foul,GillSpace=close} | => | {Class=poisonous} | 0.27 | 1 | 2.07 | 2160 |
[3] | {Odor=foul,RingNumber=one} | => | {Class=poisonous} | 0.27 | 1 | 2.07 | 2160 |
[4] | {Odor=foul,GillAttached=free} | => | {Class=poisonous} | 0.27 | 1 | 2.07 | 2160 |
[5] | {Odor=foul,VeilColor=white} | => | {Class=poisonous} | 0.27 | 1 | 2.07 | 2160 |
[6] | {Odor=foul,VeilType=partial} | => | {Class=poisonous} | 0.27 | 1 | 2.07 | 2160 |
Implicit null model: the uniform distribution.
The method is unable to find rules with very high \(L(A\Rightarrow B)\) but very low \(\mathbb{P}(A)\) and/or \(\mathbb{P}(B)\). Example: caviar \(\Rightarrow\) vodka. Very high lift, far more likely than chance to co-occur, but so rare overall that the support threshold filters it out immediately.
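A toy numeric illustration of this failure mode (all counts made up): suppose caviar and vodka each appear in 10 of 1,000 baskets, always together.

```python
n = 1000
caviar_baskets = vodka_baskets = joint_baskets = 10  # always bought together

support    = joint_baskets / n                   # rule support: only 1%
confidence = support / (caviar_baskets / n)      # P(vodka | caviar) = 1
lift       = confidence / (vodka_baskets / n)    # enormous: ~100x chance

min_support = 0.25
survives = support >= min_support                # the rule is pruned anyway
```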