6 February, 2018

Homework

dataset mean.x mean.y sd.x sd.y cor
a 54.27 47.83 16.77 26.94 -0.06
b 54.27 47.83 16.77 26.94 -0.07
c 54.27 47.84 16.76 26.93 -0.07
d 54.26 47.83 16.77 26.94 -0.06
e 54.26 47.84 16.77 26.93 -0.06
f 54.26 47.83 16.77 26.94 -0.06
g 54.27 47.84 16.77 26.94 -0.07
h 54.27 47.84 16.77 26.94 -0.07
i 54.27 47.83 16.77 26.94 -0.07
j 54.27 47.84 16.77 26.93 -0.06
k 54.27 47.84 16.77 26.94 -0.07
l 54.27 47.83 16.77 26.94 -0.07
m 54.26 47.84 16.77 26.93 -0.07
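
The table above is a grouped summary; a minimal sketch of how to reproduce it in pandas, assuming a long-format data frame with columns dataset, x, and y (hypothetical names):

import pandas as pd

def dataset_summary(df):
    # df: long-format table with columns "dataset", "x", "y" (hypothetical names)
    out = df.groupby("dataset").agg(
        mean_x=("x", "mean"), mean_y=("y", "mean"),
        sd_x=("x", "std"), sd_y=("y", "std"))
    # per-dataset correlation between x and y
    out["cor"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
    return out.round(2)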

Anscombe quartet – Datasaurus

Summary statistics are notoriously weak: they do not give a complete description of a dataset.

Classic example: the Anscombe quartet
Modern example: the Datasaurus Dozen (Autodesk Research)

Association rules

Basic scheme: derive high-density joint probabilities from categorical observations.

transaction.id item.name
1 andouille
1 okra
1 chicken
1 bell pepper
1 celery
2 crawfish
2 quiche
2 toast

The same transactions, one-hot encoded as a binary incidence matrix:

transaction.id andouille okra chicken bell.pepper celery crawfish quiche toast
1 1 1 1 1 1 0 0 0
2 0 0 0 0 0 1 1 1
3 1 0 0 1 1 1 0 0
4 0 0 1 1 1 0 0 0
5 0 1 1 1 0 1 1 1
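
The encoding step itself is mechanical; a sketch in pandas (the names tx and incidence are illustrative, and only the first two transactions are spelled out):

import pandas as pd

# the long transaction table from the slide, as a DataFrame
tx = pd.DataFrame({
    "transaction.id": [1, 1, 1, 1, 1, 2, 2, 2],
    "item.name": ["andouille", "okra", "chicken", "bell pepper", "celery",
                  "crawfish", "quiche", "toast"],
})
# cross-tabulate transactions against items, then binarize
incidence = (pd.crosstab(tx["transaction.id"], tx["item.name"]) > 0).astype(int)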

Plan of attack

Look for surprisingly probable combinations, judged against their size and support

Fundamental problem of association rules: locate subsets \(s_j\subseteq S_j\) of the possible values \(S_j\) of each variable \(X_j\) such that \[ \mathbb{P}\left(\bigcap_{j=1}^p (X_j\in s_j)\right) \] is large. Locate well-supported conjunctive rules.

Market Basket Analysis

We can simplify the fundamental problem by only considering subsets \(s_j\) that are either a single value, or all possible values; for each variable we either don't care, or require a specific choice.

Using dummy variables (one-hot encoding), we can transform the problem into one involving only binary variables.

Fundamental problem of market basket analysis: locate subsets \(S\) of the integers \(1, \dots, K\) such that, for dummy variables \(Z_k\), \[ \mathbb{P}\left(\prod_{k\in S}Z_k = 1\right) \] is large.

Market Basket Analysis

Item set: subset \(S\) of integers \(1, \dots, K\).

Size: the number of elements of an item set.

Support: proportion \(T(S)\) of observations containing the item set.

Adapted fundamental problem: Find all item sets with support exceeding specified lower bound \(t\).
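
A small illustration of computing \(T(S)\) on the toy incidence matrix from earlier (the names X, items, and T_S are illustrative):

import numpy as np

# incidence matrix from the earlier slide: rows = transactions, columns = items
items = ["andouille", "okra", "chicken", "bell.pepper", "celery",
         "crawfish", "quiche", "toast"]
X = np.array([[1, 1, 1, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 1, 1],
              [1, 0, 0, 1, 1, 1, 0, 0],
              [0, 0, 1, 1, 1, 0, 0, 0],
              [0, 1, 1, 1, 0, 1, 1, 1]])
S = [items.index(i) for i in ("chicken", "bell.pepper", "celery")]
# fraction of rows containing every item of S
T_S = X[:, S].all(axis=1).mean()   # support T(S): 2 of 5 transactions, 0.4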

Apriori Algorithm

For large datasets, provided the minimum support is set high enough that the number of frequent item sets stays low, the Apriori algorithm makes Market Basket Analysis feasible.

Plan of attack: Eliminate rare itemsets and extend common ones.

This prunes the search tree aggressively: once size \(k\) item sets with high support are isolated, only the size \(k+1\) item sets that can be formed out of those item sets need to be considered.

Agrawal and Srikant, Fast algorithms for mining association rules, VLDB '94

Apriori Algorithm

Input: \(N\times p\) binary matrix \(X\) (equivalently, \(N\) transactions as item sets) and a support threshold \(t\); \(T(\cdot)\) returns the support of an item set.

from itertools import combinations

def apriori(transactions, t):
    # transactions: a list of item sets; t: minimum support, as a proportion
    N = len(transactions)
    items = set().union(*transactions)
    def support(S):
        return sum(S <= x for x in transactions) / N
    L = {1: {frozenset([a]) for a in items if support(frozenset([a])) >= t}}
    k = 2
    while L[k - 1]:
        # extend each frequent (k-1)-item set by one item not already in it
        candidates = {I | {a} for I in L[k - 1] for a in items if a not in I}
        # prune: every (k-1)-subset of a surviving candidate must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in L[k - 1] for s in combinations(c, k - 1))}
        # one pass over the data counts all remaining candidates at once
        count = {c: 0 for c in candidates}
        for x in transactions:
            for c in candidates:
                if c <= x:
                    count[c] += 1
        L[k] = {c for c in candidates if count[c] / N >= t}
        k += 1
    return set().union(*L.values())
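
As a sanity check, a minimal run on the toy transactions from earlier (this assumes the apriori function above; the threshold is chosen so every single item is frequent):

transactions = [
    {"andouille", "okra", "chicken", "bell pepper", "celery"},
    {"crawfish", "quiche", "toast"},
    {"andouille", "bell pepper", "celery", "crawfish"},
    {"chicken", "bell pepper", "celery"},
    {"okra", "chicken", "bell pepper", "crawfish", "quiche", "toast"},
]
frequent = apriori(transactions, t=0.4)
# e.g. {"bell pepper", "celery"} appears in 3 of 5 transactions, so it is returned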

Apriori Algorithm

The main trick in implementing this algorithm is to pick data structures that make the hidden complexities (for instance, pruning candidate item sets) fast.

Strength: it requires only a single pass over the dataset for each item-set size.

For sparse data and/or a high support threshold, the algorithm terminates quickly.

From item sets to association rules

Let \(S\) be a high-support item set. We split \(S\) into disjoint subsets \(A\) and \(B\) and write \(A\Rightarrow B\) for the corresponding association rule.

The support of a rule is \(T(A\Rightarrow B) = T(A\cup B)\).
The confidence or predictability \(C(A\Rightarrow B)\) is defined by \[ C(A\Rightarrow B) = \frac{T(A\Rightarrow B)}{T(A)} \]

This estimates the conditional probability \(\mathbb{P}(B\mid A)\). If \(A\) and \(B\) are in fact independent, we would expect \(\mathbb{P}(B\mid A) = \mathbb{P}(B)\).
The lift of a rule is how many times more confident the rule is than pure chance warrants: \[ L(A\Rightarrow B) = \frac{C(A\Rightarrow B)}{T(B)} \]
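
A worked example on the toy transactions from earlier, for the rule \(\{\text{bell pepper}\}\Rightarrow\{\text{celery}\}\): both items appear together in 3 of 5 transactions and bell pepper alone in 4 of 5, so \[ T = \tfrac{3}{5} = 0.6, \qquad C = \frac{0.6}{0.8} = 0.75, \qquad L = \frac{0.75}{0.6} = 1.25. \] Seeing bell pepper raises the chance of celery 25% above its base rate.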

From item sets to association rules

For an association-rule analysis, a second threshold \(c\) is set. Association rules \(A\Rightarrow B\) built from high-support item sets are retained if \(C(A\Rightarrow B) > c\).

The output of Apriori is typically stored as a database that can be queried on rule components, by confidence, by lift, and by support. E.g.

Display all rules in which ice skates are the consequent, the confidence is over 80%, and the support is over 2%.

This searches for rules \(A\Rightarrow \{\)ice skates\(\}\) such that over 2% of all transactions contain every item in the rule, and such that the support of the rule is at least 80% of the support of the antecedent \(A\) (the non-ice-skate items in the rule).

Example

The items below come from one-hot encoding the UCI Mushroom dataset (8124 observations): each categorical variable expands into one item per level.

Class=edible
Class=poisonous
CapShape=bell
CapShape=conical
CapShape=flat
CapShape=knobbed
CapShape=sunken
CapShape=convex
CapSurf=fibrous
CapSurf=grooves
CapSurf=smooth
CapSurf=scaly
CapColor=buff
CapColor=cinnamon
CapColor=red
CapColor=gray
CapColor=brown
CapColor=pink
CapColor=green
CapColor=purple
CapColor=white
CapColor=yellow
Bruises=no
Bruises=bruises
Odor=almond
Odor=creosote
Odor=foul
Odor=anise
Odor=musty
Odor=none
Odor=pungent
Odor=spicy
Odor=fishy
GillAttached=attached
GillAttached=free
GillSpace=close
GillSpace=crowded
GillSize=broad
GillSize=narrow
GillColor=buff
GillColor=red
GillColor=gray
GillColor=chocolate
GillColor=black
GillColor=brown
GillColor=orange
GillColor=pink
GillColor=green
GillColor=purple
GillColor=white
GillColor=yellow
StalkShape=enlarging
StalkShape=tapering
StalkRoot=bulbous
StalkRoot=club
StalkRoot=equal
StalkRoot=rooted
SurfaceAboveRing=fibrous
SurfaceAboveRing=silky
SurfaceAboveRing=smooth
SurfaceAboveRing=scaly
SurfaceBelowRing=fibrous
SurfaceBelowRing=silky
SurfaceBelowRing=smooth
SurfaceBelowRing=scaly
ColorAboveRing=buff
ColorAboveRing=cinnamon
ColorAboveRing=red
ColorAboveRing=gray
ColorAboveRing=brown
ColorAboveRing=orange
ColorAboveRing=pink
ColorAboveRing=white
ColorAboveRing=yellow
ColorBelowRing=buff
ColorBelowRing=cinnamon
ColorBelowRing=red
ColorBelowRing=gray
ColorBelowRing=brown
ColorBelowRing=orange
ColorBelowRing=pink
ColorBelowRing=white
ColorBelowRing=yellow
VeilType=partial
VeilColor=brown
VeilColor=orange
VeilColor=white
VeilColor=yellow
RingNumber=none
RingNumber=one
RingNumber=two
RingType=evanescent
RingType=flaring
RingType=large
RingType=none
RingType=pendant
Spore=buff
Spore=chocolate
Spore=black
Spore=brown
Spore=orange
Spore=green
Spore=purple
Spore=white
Spore=yellow
Population=brown
Population=yellow
Habitat=woods
Habitat=grasses
Habitat=leaves
Habitat=meadows
Habitat=paths
Habitat=urban
Habitat=waste

Example

R has the package arules. Python does not have a widely used implementation.

lhs rhs support confidence lift count
[1] {} => {GillAttached=free} 0.9741507 0.9741507 1.000000 7914
[2] {} => {VeilColor=white} 0.9753816 0.9753816 1.000000 7924
[3] {} => {VeilType=partial} 1.0000000 1.0000000 1.000000 8124
[4] {Habitat=grasses} => {GillAttached=free} 0.2644018 1.0000000 1.026535 2148
[5] {Habitat=grasses} => {VeilColor=white} 0.2644018 1.0000000 1.025240 2148
[6] {Habitat=grasses} => {VeilType=partial} 0.2644018 1.0000000 1.000000 2148

Example

## set of 12765 rules
## 
## rule length distribution (lhs + rhs):sizes
##    1    2    3    4    5    6    7    8    9   10 
##    3  105  686 1998 3193 3118 2098 1071  400   93 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   5.697   7.000  10.000 
## 
## summary of quality measures:
##     support         confidence          lift            count     
##  Min.   :0.2501   Min.   :0.9500   Min.   :0.9746   Min.   :2032  
##  1st Qu.:0.2659   1st Qu.:1.0000   1st Qu.:1.0252   1st Qu.:2160  
##  Median :0.2935   Median :1.0000   Median :1.0265   Median :2384  
##  Mean   :0.3098   Mean   :0.9932   Mean   :1.2450   Mean   :2516  
##  3rd Qu.:0.3269   3rd Qu.:1.0000   3rd Qu.:1.3926   3rd Qu.:2656  
##  Max.   :1.0000   Max.   :1.0000   Max.   :2.9265   Max.   :8124  
## 
## mining info:
##  data ntransactions support confidence
##     .          8124    0.25       0.95

Example

mushroom.ap.poisonous = subset(mushroom.ap, 
    subset=rhs %in% "Class=poisonous")
lhs rhs support confidence lift count
[1] {Odor=foul} => {Class=poisonous} 0.27 1 2.07 2160
[2] {Odor=foul,GillSpace=close} => {Class=poisonous} 0.27 1 2.07 2160
[3] {Odor=foul,RingNumber=one} => {Class=poisonous} 0.27 1 2.07 2160
[4] {Odor=foul,GillAttached=free} => {Class=poisonous} 0.27 1 2.07 2160
[5] {Odor=foul,VeilColor=white} => {Class=poisonous} 0.27 1 2.07 2160
[6] {Odor=foul,VeilType=partial} => {Class=poisonous} 0.27 1 2.07 2160

Weaknesses and assumptions

Implicit null model: the uniform distribution

Unable to find rules with very high \(L(A\Rightarrow B)\) but very low \(\mathbb{P}(A)\) and/or \(\mathbb{P}(B)\). Example: caviar \(\Rightarrow\) vodka has very high lift (the two co-occur far more often than chance predicts), but the item set is so rare that the support threshold filters it out immediately.