Bayes Classifier¶

Suppose we have $K$ classes $C_1,\dots,C_K$ and $N$ features $\mathbf{x} = (x_1,\dots,x_N)$.

A Bayesian approach would seek the conditional probabilities

$$ \mathbb{P}(C_k|\mathbf{x}) = \frac{\mathbb{P}(C_k)\mathbb{P}(\mathbf{x}|C_k)} {\mathbb{P}(\mathbf{x})} $$

The denominator does not depend on the class, so it can be ignored when comparing classes; the numerator is the joint probability

\begin{align*} \mathbb{P}(x_1,\dots,x_N,C_k) &= \mathbb{P}(x_1|x_2,\dots,x_N,C_k)\,\mathbb{P}(x_2,\dots,x_N,C_k) \\ &= \dots \\ &= \mathbb{P}(x_1|x_2,\dots,x_N,C_k)\,\mathbb{P}(x_2|x_3,\dots,x_N,C_k) \cdots\mathbb{P}(x_{N-1}|x_N,C_k)\,\mathbb{P}(x_N|C_k)\,\mathbb{P}(C_k) \end{align*}

Optimal Bayes Classifier¶

If we knew the joint probability exactly, we could pick the class with the largest posterior probability (the maximum a posteriori rule):

$$ \hat{f}(\mathbf{x}) = \mathrm{arg}\max_{C_k}\mathbb{P}(C_k|\mathbf{x}) $$
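
As a minimal sketch (hypothetical numbers, assuming the posterior probabilities $\mathbb{P}(C_k|\mathbf{x})$ have already been computed), the decision rule is just an argmax over the classes:

```python
import numpy as np

# Hypothetical posterior probabilities P(C_k | x) for K = 3 classes,
# evaluated at one observation x.
posterior = np.array([0.2, 0.5, 0.3])
classes = ["C_1", "C_2", "C_3"]

# The optimal Bayes classifier picks the class with the largest posterior.
prediction = classes[int(np.argmax(posterior))]
print(prediction)  # -> C_2
```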

A theorem by Wald (1950) tells us that every admissible classifier is an optimal Bayes classifier with respect to some prior.

admissible classifier: a classifier that is not dominated, i.e. no other classifier performs at least as well on every possible input and strictly better on some

Naive Bayes Classifier¶

The Naive Bayes Classifier assumes that the $x_1,\dots,x_N$ are conditionally independent: given the class $C_k$, the features are independent as random variables. This allows us to drastically simplify the joint probability:

$$ \mathbb{P}(C_k|x_1,\dots,x_N)\propto \mathbb{P}(C_k)\prod_{i=1}^N\mathbb{P}(x_i|C_k) $$

These individual probabilities can be estimated from a training set, which already gives us a working classifier.
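
As a rough sketch (toy, made-up binary features; plain counting with Laplace smoothing, not the sklearn estimators discussed below), the estimation and prediction could look like:

```python
import numpy as np

# Hypothetical training set: 5 observations, 3 binary features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1])

classes, counts = np.unique(y, return_counts=True)
prior = counts / len(y)                      # estimate of P(C_k)

# Estimate of P(x_i = 1 | C_k), with Laplace smoothing to avoid zero probabilities.
likelihood = np.array([(X[y == c].sum(axis=0) + 1) / (counts[k] + 2)
                       for k, c in enumerate(classes)])

def predict(x):
    # P(C_k) * prod_i P(x_i | C_k); for x_i = 0 use 1 - P(x_i = 1 | C_k).
    per_class = prior * np.prod(np.where(x == 1, likelihood, 1 - likelihood), axis=1)
    return classes[np.argmax(per_class)]

print(predict(np.array([1, 0, 1])))  # -> 0
```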

Types of Naive Bayes Classifiers¶

We still need to establish what, exactly, $\mathbb{P}(x_i|C_k)$ means. There are several common options:

  • Gaussian Naive Bayes: expects continuous features and models each $\mathbb{P}(x_i|C_k)$ with a Gaussian distribution (see the density after this list).
  • Multinomial Naive Bayes: expects discrete count features (e.g. word counts), modelled with a per-class multinomial distribution.
  • Bernoulli Naive Bayes: expects binary feature vectors; it explicitly penalizes the non-occurrence of a feature, whereas the multinomial model simply ignores features that do not occur.
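
In the Gaussian case, for instance, the per-class likelihood of feature $x_i$ is the normal density, where $\mu_{ik}$ and $\sigma_{ik}^2$ denote the mean and variance of $x_i$ among the training examples of class $C_k$:

$$ \mathbb{P}(x_i|C_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\!\left(-\frac{(x_i-\mu_{ik})^2}{2\sigma_{ik}^2}\right) $$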

Naive Bayes in sklearn¶

  • naive_bayes.GaussianNB
  • naive_bayes.MultinomialNB
  • naive_bayes.BernoulliNB

All three can be trained incrementally via the partial_fit method.
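
A minimal usage sketch with made-up data (two continuous features, two classes), feeding GaussianNB the training set in two batches:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: 6 observations, 2 continuous features, 2 classes.
X = np.array([[1.0, 2.1], [3.1, 0.5], [0.9, 1.8],
              [2.9, 0.7], [1.2, 2.0], [3.3, 0.4]])
y = np.array([0, 1, 0, 1, 0, 1])

clf = GaussianNB()

# Incremental training in two batches: the first call to partial_fit must
# be told all classes that will ever appear.
clf.partial_fit(X[:3], y[:3], classes=np.array([0, 1]))
clf.partial_fit(X[3:], y[3:])

print(clf.predict([[1.1, 1.9]]))        # -> [0]
print(clf.predict_proba([[1.1, 1.9]]))  # estimated posteriors P(C_k | x)
```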

Empirical Bayes¶

To apply a Bayesian approach to a single observation, we must specify our prior belief explicitly. With a large sample, however, the prior can itself be estimated from the data, so the observed sample effectively takes the place of a subjective prior.

In practice, this tends to result in $\mathbb{E}(\theta|\mathbf{x})$ being a weighted average of the sample estimate and the prior estimate.
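
As a concrete illustration, consider the standard Beta-Binomial conjugate model (not specific to the notes above): with a Beta$(\alpha,\beta)$ prior on a success probability $\theta$ and $x$ successes observed in $n$ trials, the posterior mean is

$$ \mathbb{E}(\theta|x) = \frac{\alpha+x}{\alpha+\beta+n} = \frac{n}{\alpha+\beta+n}\,\frac{x}{n} + \frac{\alpha+\beta}{\alpha+\beta+n}\,\frac{\alpha}{\alpha+\beta}, $$

a weighted average of the sample proportion $x/n$ and the prior mean $\alpha/(\alpha+\beta)$. Empirical Bayes estimates the prior parameters $\alpha,\beta$ from the data rather than fixing them in advance.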