A primer on Bayesian statistics¶

In the Bayesian perspective on statistics, the parameters of a model are considered random variables; the task of Bayesian modeling is to update the distributions of these parameters when faced with new data.

The fundamental tool for these updates is Bayes' theorem, which relates the probability of a parameter given some data to the probability of the data given the parameter:

$$ \mathbb{P}(\theta | x) = \frac{\mathbb{P}(x|\theta)\mathbb{P}(\theta)}{\mathbb{P}(x)} $$

For many applications, we can disregard the normalizing factor $\mathbb{P}(x)$, which does not depend on $\theta$, and rewrite the formula as

$$ \mathbb{P}(\theta | x) \propto \mathbb{P}(x|\theta)\mathbb{P}(\theta) $$
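To make the proportionality concrete, here is a minimal sketch (my own illustration, not part of the original notebook): a coin with unknown bias $\theta$ is evaluated on a grid, the unnormalized posterior is likelihood times prior, and normalizing over the grid recovers $\mathbb{P}(\theta | x)$. The data are made up.

import numpy as np

# Toy example: a coin with unknown bias theta, observed to give 7 heads in 10 flips.
heads, flips = 7, 10

theta = np.linspace(0.001, 0.999, 999)      # grid of candidate parameter values
prior = np.ones_like(theta)                  # flat prior P(theta)
prior /= prior.sum()
likelihood = theta**heads * (1 - theta)**(flips - heads)   # P(x | theta)

posterior = likelihood * prior               # proportional to P(theta | x)
posterior /= posterior.sum()                 # normalize over the grid

print(theta[posterior.argmax()])             # posterior mode, close to 7/10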


Since the Bayes' theorem update relies on $\mathbb{P}(\theta)$, we need some idea of how we believe the parameters to be distributed before we see any data. This is called the prior distribution. A different choice of prior can change the results, but with enough data most reasonable priors lead to similar posteriors.
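Continuing the little grid sketch above (again my own illustration), we can compare the posteriors obtained from a flat prior and a strongly skewed prior: with 10 flips their posterior means disagree noticeably, with 10,000 they are nearly identical.

import numpy as np

theta = np.linspace(0.001, 0.999, 999)
flat_prior = np.ones_like(theta)
skewed_prior = theta**20                     # a prior that strongly favours large theta

for heads, flips in ((7, 10), (7000, 10000)):
    loglik = heads * np.log(theta) + (flips - heads) * np.log(1 - theta)
    likelihood = np.exp(loglik - loglik.max())    # rescale to avoid numerical underflow
    for prior in (flat_prior, skewed_prior):
        posterior = likelihood * prior
        posterior /= posterior.sum()
        print(flips, round(float((theta * posterior).sum()), 3))   # posterior mean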

Worked example¶

We consider a dataset on diabetes: 442 patients, each measured on several baseline predictors, together with a quantitative measure of disease progression one year later.
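The notebook does not show how the data were loaded. Assuming the description matches scikit-learn's built-in diabetes dataset (442 patients, ten predictors, with the predictor columns standardised), the arrays used below could be obtained along these lines:

from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
BMI = diabetes.data[:, 2]        # column 2 is the (standardised) body mass index predictor
DPI = diabetes.target            # disease progression one year after baseline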

In [2]:
fig1
Out[2]:
[Figure: fig1, a plot of the diabetes data]

A linear regression seems like a good place to start here. This model could be specified as:

$$ \mathrm{DPI} \sim \mathcal{N}(\alpha \cdot \mathrm{BMI} + \beta, \sigma^2) $$

We can pick priors for $\alpha$, $\beta$, and $\sigma$, say

$$ \alpha \sim \mathcal{N}(0, 10^6) \qquad \beta \sim \mathcal{N}(0, 10^6) \qquad \log\sigma \sim \mathcal{N}(0, 10^6) $$

With this model, we can evaluate $\mathbb{P}(x|\alpha,\beta,\sigma)$ and $\mathbb{P}(\alpha,\beta,\sigma)$ for any particular choice of values. Once we can compute these, we can use Bayes' theorem to judge how likely any particular choice of parameters is given the data.

In [3]:
import pymc3 as pm
model = pm.Model()

with model:
    # Vague priors matching the specification above
    alpha = pm.Normal("alpha", mu=0, sd=1e6)
    beta = pm.Normal("beta", mu=0, sd=1e6)
    log_sigma = pm.Normal("log_sigma", mu=0, sd=1e6)
    # Likelihood: BMI and DPI are the data arrays loaded earlier
    dpi = pm.Normal("dpi", mu=alpha*BMI + beta, sd=pm.math.exp(log_sigma), observed=DPI)
In [4]:
with model:
    # Draw 11,000 posterior samples per chain with the auto-assigned NUTS sampler
    trace = pm.sample(11000)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [log_sigma, beta, alpha]
Sampling 2 chains: 100%|██████████| 23000/23000 [00:41<00:00, 558.22draws/s] 
The acceptance probability does not match the target. It is 0.9140088884583417, but should be close to 0.8. Try to increase the number of tuning steps.
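The warning at the end is harmless here, but if one wanted to address it, PyMC3 lets you raise the number of tuning steps and the NUTS acceptance target when sampling; something along these lines would be a reasonable sketch (the specific numbers are arbitrary):

with model:
    # More tuning steps and a higher acceptance target usually silence this warning
    trace = pm.sample(11000, tune=2000, target_accept=0.9)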
In [5]:
# Plot a handful of posterior samples (thin black lines), the posterior-mean
# regression line (red), and a band of +/- 2 posterior-mean sigma around it.
# plot, fill_between, scatter, sort and exp are assumed to come from an
# earlier %pylab inline import.
alpha_mean = trace["alpha"][1000:].mean()
beta_mean = trace["beta"][1000:].mean()
sigma_mean = exp(trace["log_sigma"][1000:].mean())

for i in range(10):
    pt = trace.point(1000 + i)
    plot(BMI, pt["alpha"]*BMI + pt["beta"], 'k', alpha=0.5)
plot(BMI, alpha_mean*BMI + beta_mean, 'r')
fill_between(x=sort(BMI),
             y1=alpha_mean*sort(BMI) + beta_mean + 2*sigma_mean,
             y2=alpha_mean*sort(BMI) + beta_mean - 2*sigma_mean,
             color='r', alpha=0.25)
scatter(BMI, DPI)
Out[5]:
<matplotlib.collections.PathCollection at 0x11fe6aef0>
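To look at the posterior more quantitatively than the plot above, one could also ask PyMC3 for a summary and trace plots (not shown in the original notebook):

pm.summary(trace)      # posterior means, standard deviations and credible intervals
pm.traceplot(trace)    # marginal densities and sampling traces per parameter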

Analysis vs Computation¶

Classically, Bayesian analysis has focused strongly on methods that keep the computations analytically tractable. The primary tool for this is choosing a prior that is conjugate to the chosen likelihood: with a conjugate prior, the posterior stays in the same family as the prior, and the hyperparameters go through a straightforward update to incorporate the new observations.
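For instance (a sketch of the standard Beta-Binomial conjugacy, not tied to the diabetes model above), a Beta(a, b) prior on a success probability combined with k successes in n binomial trials gives a Beta(a + k, b + n - k) posterior, so the "update" is just adding counts to the hyperparameters:

def update_beta_binomial(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior + binomial data -> Beta posterior."""
    return a + successes, b + failures

a, b = 1, 1                                  # flat Beta(1, 1) prior
a, b = update_beta_binomial(a, b, 7, 3)      # observe 7 successes, 3 failures
print(a, b, a / (a + b))                     # Beta(8, 4) posterior, mean 2/3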

A computationally heavier alternative is to build an empirical approximation of the posterior distribution by simulation. This can be done with Markov Chain Monte Carlo or with Variational Inference, both of which benefit greatly from being able to compute gradients of the log-probability functions.
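Both options are available for the model above: the NUTS sampling shown earlier is the MCMC route, and a variational approximation could be fitted along these lines (ADVI here, with an arbitrary number of iterations):

with model:
    approx = pm.fit(n=20000, method="advi")   # mean-field ADVI on the same model
    vi_trace = approx.sample(2000)            # draw samples from the fitted approximation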