6 February, 2018

Homework

dataset mean.x mean.y sd.x sd.y cor
a 54.27 47.83 16.77 26.94 -0.06
b 54.27 47.83 16.77 26.94 -0.07
c 54.27 47.84 16.76 26.93 -0.07
d 54.26 47.83 16.77 26.94 -0.06
e 54.26 47.84 16.77 26.93 -0.06
f 54.26 47.83 16.77 26.94 -0.06
g 54.27 47.84 16.77 26.94 -0.07
h 54.27 47.84 16.77 26.94 -0.07
i 54.27 47.83 16.77 26.94 -0.07
j 54.27 47.84 16.77 26.93 -0.06
k 54.27 47.84 16.77 26.94 -0.07
l 54.27 47.83 16.77 26.94 -0.07
m 54.26 47.84 16.77 26.93 -0.07
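
The table above is a grouped summary; a minimal sketch of how to reproduce it in pandas, assuming a long-format data frame with columns dataset, x, and y (hypothetical names):

import pandas as pd

def dataset_summary(df):
    # df: long-format table with columns "dataset", "x", "y" (hypothetical names)
    out = df.groupby("dataset").agg(
        mean_x=("x", "mean"), mean_y=("y", "mean"),
        sd_x=("x", "std"), sd_y=("y", "std"))
    # per-dataset correlation between x and y
    out["cor"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
    return out.round(2)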

Anscombe quartet – Datasaurus

Summary statistics are notoriously weak: they do not give a complete description of a dataset.

Classic example: the Anscombe quartet
Modern example: the Datasaurus Dozen (Autodesk Research)

Association rules

Basic scheme: derive high-density joint probabilities from categorical observations.

transaction.id item.name
1 andouille
1 okra
1 chicken
1 bell pepper
1 celery
2 crawfish
2 quiche
2 toast

The same transactions, one-hot encoded as a binary incidence matrix:

transaction.id andouille okra chicken bell.pepper celery crawfish quiche toast
1 1 1 1 1 1 0 0 0
2 0 0 0 0 0 1 1 1
3 1 0 0 1 1 1 0 0
4 0 0 1 1 1 0 0 0
5 0 1 1 1 0 1 1 1
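
The encoding step itself is mechanical; a sketch in pandas (the names tx and incidence are illustrative, and only the first two transactions are spelled out):

import pandas as pd

# the long transaction table from the slide, as a DataFrame
tx = pd.DataFrame({
    "transaction.id": [1, 1, 1, 1, 1, 2, 2, 2],
    "item.name": ["andouille", "okra", "chicken", "bell pepper", "celery",
                  "crawfish", "quiche", "toast"],
})
# cross-tabulate transactions against items, then binarize
incidence = (pd.crosstab(tx["transaction.id"], tx["item.name"]) > 0).astype(int)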

Plan of attack

Look for surprisingly probable combinations, judged against their size and support

Fundamental problem of association rules: locate subsets \(s_j\subseteq S_j\) of the possible values \(S_j\) of each variable \(X_j\) such that \[ \mathbb{P}\left(\bigcap_{j=1}^p (X_j\in s_j)\right) \] is large. Locate well-supported conjunctive rules.

Market Basket Analysis

We can simplify the fundamental problem by only considering subsets \(s_j\) that are either a single value, or all possible values; for each variable we either don't care, or require a specific choice.

Using dummy variables (one-hot encoding), we can transform the problem into one involving only binary variables.

Fundamental problem of market basket analysis: locate subsets \(S\) of the integers \(1, \dots, K\) such that, for dummy variables \(Z_k\), \[ \mathbb{P}\left(\prod_{k\in S}Z_k = 1\right) \] is large.

Market Basket Analysis

Item set: subset \(S\) of integers \(1, \dots, K\).

Size: the number of elements of an item set.

Support: proportion \(T(S)\) of observations containing the item set.

Adapted fundamental problem: Find all item sets with support exceeding specified lower bound \(t\).
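
A small illustration of computing \(T(S)\) on the toy incidence matrix from earlier (the names X, items, and T_S are illustrative):

import numpy as np

# incidence matrix from the earlier slide: rows = transactions, columns = items
items = ["andouille", "okra", "chicken", "bell.pepper", "celery",
         "crawfish", "quiche", "toast"]
X = np.array([[1, 1, 1, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 1, 1, 1]])
S = [items.index(i) for i in ("chicken", "bell.pepper", "celery")]
# fraction of rows containing every item of S
T_S = X[:, S].all(axis=1).mean()   # support T(S): 2 of 5 transactions, 0.4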

Apriori Algorithm

For large datasets, provided the minimum support is set high enough that the number of frequent item sets stays low, the Apriori algorithm makes Market Basket Analysis feasible.

Plan of attack: Eliminate rare itemsets and extend common ones.

This prunes the search tree aggressively: once size \(k\) item sets with high support are isolated, only the size \(k+1\) item sets that can be formed out of those item sets need to be considered.

Agrawal and Srikant, Fast algorithms for mining association rules, VLDB '94

Apriori Algorithm

Input: \(N\times p\) binary matrix \(X\) (equivalently, \(N\) transactions as item sets) and a support threshold \(t\); \(T(\cdot)\) returns the support of an item set.

from itertools import combinations

def apriori(transactions, t):
    # transactions: a list of item sets; t: minimum support, as a proportion
    N = len(transactions)
    items = set().union(*transactions)
    def support(S):
        return sum(S <= x for x in transactions) / N
    L = {1: {frozenset([a]) for a in items if support(frozenset([a])) >= t}}
    k = 2
    while L[k - 1]:
        # extend each frequent (k-1)-item set by one item not already in it
        candidates = {I | {a} for I in L[k - 1] for a in items if a not in I}
        # prune: every (k-1)-subset of a surviving candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L[k - 1] for s in combinations(c, k - 1))}
        # one pass over the data counts all remaining candidates at once
        count = {c: 0 for c in candidates}
        for x in transactions:
            for c in candidates:
                if c <= x:
                    count[c] += 1
        L[k] = {c for c in candidates if count[c] / N >= t}
        k += 1
    return set().union(*L.values())
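
As a sanity check, a minimal run on the toy transactions from earlier (this assumes the apriori function above; the threshold is chosen so every single item is frequent):

transactions = [
    {"andouille", "okra", "chicken", "bell pepper", "celery"},
    {"crawfish", "quiche", "toast"},
    {"andouille", "bell pepper", "celery", "crawfish"},
    {"chicken", "bell pepper", "celery"},
    {"okra", "chicken", "bell pepper", "crawfish", "quiche", "toast"},
]
frequent = apriori(transactions, t=0.4)
# e.g. {"bell pepper", "celery"} appears in 3 of 5 transactions, so it is returned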

Apriori Algorithm

The main trick in implementing this algorithm is to pick data structures that make the hidden complexities (for instance, pruning candidate item sets) fast.

Strength: it requires only a single pass over the dataset for each item-set size.

For sparse data and/or a high support threshold, the algorithm terminates quickly.

From item sets to association rules

Let \(S\) be a high-support item set. We split \(S\) into disjoint subsets \(A\) and \(B\) and write \(A\Rightarrow B\) for the corresponding association rule.

The support of a rule is \(T(A\Rightarrow B) = T(A\cup B)\).
The confidence or predictability \(C(A\Rightarrow B)\) is defined by \[ C(A\Rightarrow B) = \frac{T(A\Rightarrow B)}{T(A)} \]

This estimates the conditional probability \(\mathbb{P}(B\mid A)\). If \(A\) and \(B\) are in fact independent, we would expect \(\mathbb{P}(B\mid A) = \mathbb{P}(B)\).
The lift of a rule is how many times more confident the rule is than pure chance warrants: \[ L(A\Rightarrow B) = \frac{C(A\Rightarrow B)}{T(B)} \]
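
A worked example on the toy transactions from earlier, for the rule \(\{\text{bell pepper}\}\Rightarrow\{\text{celery}\}\): both items appear together in 3 of 5 transactions and bell pepper alone in 4 of 5, so \[ T = \tfrac{3}{5} = 0.6, \qquad C = \frac{0.6}{0.8} = 0.75, \qquad L = \frac{0.75}{0.6} = 1.25. \] Seeing bell pepper raises the chance of celery 25% above its base rate.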

From item sets to association rules

For an association-rule analysis, a second threshold \(c\) is set. Association rules \(A\Rightarrow B\) built from high-support item sets are retained if \(C(A\Rightarrow B) > c\).

The output of Apriori is typically stored as a database that can be queried on rule components, by confidence, by lift, and by support. E.g.

Display all rules in which ice skates are the consequent, the confidence is over 80%, and the support is over 2%.

This searches for rules \(A\Rightarrow \{\)ice skates\(\}\) such that over 2% of all transactions contain every item in the rule, and such that the support of the rule is at least 80% of the support of the antecedent \(A\) (the non-ice-skate items in the rule).

Example

The items below come from one-hot encoding the UCI Mushroom dataset (8124 observations): each categorical variable expands into one item per level.

Class=edible
Class=poisonous
CapShape=bell
CapShape=conical
CapShape=flat
CapShape=knobbed
CapShape=sunken
CapShape=convex
CapSurf=fibrous
CapSurf=grooves
CapSurf=smooth
CapSurf=scaly
CapColor=buff
CapColor=cinnamon
CapColor=red
CapColor=gray
CapColor=brown
CapColor=pink
CapColor=green
CapColor=purple
CapColor=white
CapColor=yellow
Bruises=no
Bruises=bruises
Odor=almond
Odor=creosote
Odor=foul
Odor=anise
Odor=musty
Odor=none
Odor=pungent
Odor=spicy
Odor=fishy
GillAttached=attached
GillAttached=free
GillSpace=close
GillSpace=crowded
GillSize=broad
GillSize=narrow
GillColor=buff
GillColor=red
GillColor=gray
GillColor=chocolate
GillColor=black
GillColor=brown
GillColor=orange
GillColor=pink
GillColor=green
GillColor=purple
GillColor=white
GillColor=yellow
StalkShape=enlarging
StalkShape=tapering
StalkRoot=bulbous
StalkRoot=club
StalkRoot=equal
StalkRoot=rooted
SurfaceAboveRing=fibrous
SurfaceAboveRing=silky
SurfaceAboveRing=smooth
SurfaceAboveRing=scaly
SurfaceBelowRing=fibrous
SurfaceBelowRing=silky
SurfaceBelowRing=smooth
SurfaceBelowRing=scaly
ColorAboveRing=buff
ColorAboveRing=cinnamon
ColorAboveRing=red
ColorAboveRing=gray
ColorAboveRing=brown
ColorAboveRing=orange
ColorAboveRing=pink
ColorAboveRing=white
ColorAboveRing=yellow
ColorBelowRing=buff
ColorBelowRing=cinnamon
ColorBelowRing=red
ColorBelowRing=gray
ColorBelowRing=brown
ColorBelowRing=orange
ColorBelowRing=pink
ColorBelowRing=white
ColorBelowRing=yellow
VeilType=partial
VeilColor=brown
VeilColor=orange
VeilColor=white
VeilColor=yellow
RingNumber=none
RingNumber=one
RingNumber=two
RingType=evanescent
RingType=flaring
RingType=large
RingType=none
RingType=pendant
Spore=buff
Spore=chocolate
Spore=black
Spore=brown
Spore=orange
Spore=green
Spore=purple
Spore=white
Spore=yellow
Population=brown
Population=yellow
Habitat=woods
Habitat=grasses
Habitat=leaves
Habitat=meadows
Habitat=paths
Habitat=urban
Habitat=waste

Example

R has the package arules. Python does not have a widely used implementation.

lhs rhs support confidence lift count
[1] {} => {GillAttached=free} 0.9741507 0.9741507 1.000000 7914
[2] {} => {VeilColor=white} 0.9753816 0.9753816 1.000000 7924
[3] {} => {VeilType=partial} 1.0000000 1.0000000 1.000000 8124
[4] {Habitat=grasses} => {GillAttached=free} 0.2644018 1.0000000 1.026535 2148
[5] {Habitat=grasses} => {VeilColor=white} 0.2644018 1.0000000 1.025240 2148
[6] {Habitat=grasses} => {VeilType=partial} 0.2644018 1.0000000 1.000000 2148

Example

## set of 12765 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3    4    5    6    7    8    9   10 
##    3  105  686 1998 3193 3118 2098 1071  400   93 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   5.697   7.000  10.000 
## 
## summary of quality measures:
##     support         confidence          lift            count     
##  Min.   :0.2501   Min.   :0.9500   Min.   :0.9746   Min.   :2032  
##  1st Qu.:0.2659   1st Qu.:1.0000   1st Qu.:1.0252   1st Qu.:2160  
##  Median :0.2935   Median :1.0000   Median :1.0265   Median :2384  
##  Mean   :0.3098   Mean   :0.9932   Mean   :1.2450   Mean   :2516  
##  3rd Qu.:0.3269   3rd Qu.:1.0000   3rd Qu.:1.3926   3rd Qu.:2656  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.9265   Max.   :8124  
## 
## mining info:
##  data ntransactions support confidence
##     .          8124    0.25       0.95

Example

mushroom.ap.poisonous = subset(mushroom.ap, 
    subset=rhs %in% "Class=poisonous")
lhs rhs support confidence lift count
[1] {Odor=foul} => {Class=poisonous} 0.27 1 2.07 2160
[2] {Odor=foul,GillSpace=close} => {Class=poisonous} 0.27 1 2.07 2160
[3] {Odor=foul,RingNumber=one} => {Class=poisonous} 0.27 1 2.07 2160
[4] {Odor=foul,GillAttached=free} => {Class=poisonous} 0.27 1 2.07 2160
[5] {Odor=foul,VeilColor=white} => {Class=poisonous} 0.27 1 2.07 2160
[6] {Odor=foul,VeilType=partial} => {Class=poisonous} 0.27 1 2.07 2160

Weaknesses and assumptions

Implicit null model: the uniform distribution

Unable to find rules with very high \(L(A\Rightarrow B)\) but very low \(\mathbb{P}(A)\) and/or \(\mathbb{P}(B)\). Example: caviar \(\Rightarrow\) vodka has very high lift (the two co-occur far more often than chance predicts), but the item set is so rare that the support threshold filters it out immediately.