7 February, 2018

Bounding variance

Once we can find unbiased estimators, the only remaining source of (mean squared) error is variance.

Many of the quality notions for estimators deal with variance: lower bounds on the variance that can possibly be attained, and ways to compare the variances of different estimators.

The first step is to measure how much a sample could possibly tell us about a distribution in the first place: a sharp peak is easier to locate than a flatter, more spread-out one.

Log Likelihood and Fisher Information

We define the score of a random variable \(X\) conditioned on a parameter \(\theta\) to be the partial derivative of the log likelihood: \(\frac{\partial}{\partial\theta}\log\mathcal{L}(\theta|X)\).

The variance of the score coincides (under some mild regularity conditions) with the negative expected second derivative of the log likelihood, and is called the Fisher information of \(\theta\): \[ \mathcal{I}_X(\theta) = \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log\mathcal{L}(\theta|X)\right)^2\middle|\theta\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log\mathcal{L}(\theta|X)\middle|\theta\right] \] (Since the score has mean zero, as we show in the Cramér-Rao proof below, its variance is just the expected square.)

This means that Fisher information measures how sharply curved the log likelihood is, on average: the sharper the peak, the more one observation tells us about \(\theta\).
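The two expressions for \(\mathcal{I}_X(\theta)\) can be compared directly by simulation. Here is a minimal sketch (added for illustration, with arbitrary parameter values) for a normal model with known variance, where the score for \(\mu\) is \((X-\mu)/\sigma^2\) and the information is \(1/\sigma^2\):

```python
import numpy as np

# Monte Carlo check that Var(score) = -E[d^2/dmu^2 log L] for X ~ N(mu, sigma^2)
# with sigma known. Here the score is (x - mu)/sigma^2 and the second derivative
# of the log likelihood is the constant -1/sigma^2, so both sides equal 1/sigma^2.
rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=1_000_000)

score = (x - mu) / sigma**2                       # d/dmu log L(mu|x)
second_deriv = np.full_like(x, -1.0 / sigma**2)   # d^2/dmu^2 log L(mu|x)

print("Var(score)    :", score.var())             # ~ 1/sigma^2 = 0.25
print("-E[2nd deriv] :", -second_deriv.mean())    # exactly 0.25 here
```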

Properties of Fisher Information

Claim If \(T\) is a sufficient statistic, \(\mathcal{I}_T(\theta) = \mathcal{I}_X(\theta)\).

If \(T\) is sufficient, the factorization theorem gives \(\mathcal{L}(\theta|X)=g(T,\theta)h(X)\) with \(h\) free of \(\theta\), so \[ \mathcal{I}_X(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log\mathcal{L}(\theta|X)\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\left(\log g(T,\theta)+\log h(X)\right)\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log g(T,\theta)\right] = \mathcal{I}_T(\theta) \] since the \(\log h(X)\) term vanishes under differentiation in \(\theta\), and \(g(T,\theta)\) is proportional (with a \(\theta\)-free constant) to the likelihood of \(\theta\) given \(T\).
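For a concrete instance (an illustration added here, keeping only the \(\mu\)-dependent part of each log likelihood): take \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) with \(\sigma^2\) known. The sample mean \(\bar X\) is sufficient for \(\mu\), and since \(\bar X\sim\mathcal{N}(\mu,\sigma^2/n)\), it carries the same information as the full sample:

\[ \mathcal{I}_{\bar X}(\mu) = -\mathbb{E}\left[\frac{\partial^2}{\partial\mu^2}\left(-\frac{(\bar X-\mu)^2}{2\sigma^2/n}\right)\right] = \frac{n}{\sigma^2}, \qquad \mathcal{I}_{X_1,\dots,X_n}(\mu) = -\mathbb{E}\left[\frac{\partial^2}{\partial\mu^2}\left(-\sum_j\frac{(X_j-\mu)^2}{2\sigma^2}\right)\right] = \frac{n}{\sigma^2} \]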

Claim For \(X\) and \(Y\), \(\mathcal{I}_{X,Y}(\theta)=\mathcal{I}_X(\theta)+\mathcal{I}_{Y|X}(\theta)\). If \(X\) and \(Y\) are independent, \(\mathcal{I}_{X,Y}(\theta)=\mathcal{I}_X(\theta)+\mathcal{I}_{Y}(\theta)\).

\[ \mathcal{I}_{X,Y}(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p(X,Y|\theta)\right] = -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p(Y|X,\theta)p(X|\theta)\right] = \\ -\mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p(Y|X,\theta)\right] - \mathbb{E}\left[\frac{\partial^2}{\partial\theta^2}\log p(X|\theta)\right] = \mathcal{I}_{Y|X}(\theta) + \mathcal{I}_X(\theta) \]
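In particular, applying the independent case repeatedly to \(n\) i.i.d. copies of \(X\) gives the additivity rule used later for repeated samples:

\[ \mathcal{I}_{X_1,\dots,X_n}(\theta) = n\,\mathcal{I}_{X_1}(\theta) \]

This is the step behind the \(100/pq\) figure in the Bernoulli example below.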

Cramér-Rao bound

The first use we make of Fisher information is a general lower bound on the variance of an estimator:

Theorem The variance of any unbiased estimator \(\hat\theta\) of a parameter \(\theta\) is bounded below

\[ \mathbb{V}(\hat\theta) \geq \frac{1}{\mathcal{I}(\theta)} \]

In the case we described earlier, estimating some \(g(\theta)\) using a statistic \(T(X)\) with \(\mathbb{E}T=g(\theta)\), the bound takes the form

\[ \mathbb{V}(T) \geq \frac{g'(\theta)^2}{\mathcal{I}(\theta)} \]


Proof Write \(V\) for the score:

\[ V = \frac{\partial}{\partial\theta}\log\mathcal{L}(\theta|X) = \frac{\frac{\partial}{\partial\theta}\mathcal{L}(\theta|X)}{\mathcal{L}(\theta|X)} \]

The expectation \(\mathbb{E}V=0\) because (interchanging the derivative and the integral, which the regularity conditions permit) \[ \mathbb{E}V = \int\mathcal{L}(\theta|x)\frac{\frac{\partial}{\partial\theta}\mathcal{L}(\theta|x)}{\mathcal{L}(\theta|x)}dx = \int\frac{\partial}{\partial\theta}\mathcal{L}(\theta|x)dx = \frac{\partial}{\partial\theta}\int\mathcal{L}(\theta|x)dx = \\ \frac{\partial}{\partial\theta}\int p(x|\theta)dx = \frac{\partial}{\partial\theta} 1 = 0 \]


The covariance of \(V\) and \(T\) is \[ Cov(V,T) = \mathbb{E}(VT)-\mathbb{E}V\cdot\mathbb{E}T = \mathbb{E}(VT) = \\ \int T(x)\frac{\frac{\partial}{\partial\theta}\mathcal{L}(\theta|x)}{\mathcal{L}(\theta|x)}p(x|\theta)dx = \int T(x)\frac{\partial}{\partial\theta}\mathcal{L}(\theta|x)dx = \\ \frac{\partial}{\partial\theta}\int T(x)p(x|\theta)dx = \frac{\partial}{\partial\theta}\mathbb{E}T = g'(\theta) \] using \(\mathbb{E}V=0\), \(p(x|\theta)=\mathcal{L}(\theta|x)\), and \(\mathbb{E}T=g(\theta)\).


So we know \(Cov(V,T)=g'(\theta)\). Now, the Cauchy-Schwarz inequality shows

\[ |g'(\theta)| = |Cov(V,T)| \leq \sqrt{\mathbb{V}T\cdot\mathbb{V}V} \]

Square and rearrange \[ \mathbb{V}T \geq \frac{g'(\theta)^2}{\mathbb{V}V} = \frac{g'(\theta)^2}{\mathcal{I}(\theta)} \]

Example

Suppose \(X\sim\text{Bernoulli}(p)\). \(\mathbb{E}X=p\). Then

\[ \ell(p|x) = \log\mathcal{L}(p|x) = \log(p^x(1-p)^{1-x}) = \\ x\log p + (1-x)\log(1-p) \\ \ell'(p|x) = \frac{x}{p}-\frac{1-x}{1-p} \quad \ell''(p|x) = -\frac{x}{p^2}-\frac{1-x}{(1-p)^2} \]

The Fisher information is \[ \mathcal{I}(p) = -\mathbb{E}\,\ell''(p|X) = \frac{\mathbb{E}X}{p^2}+\frac{1-\mathbb{E}X}{(1-p)^2} = \\ \frac{p}{p^2}+\frac{1-p}{(1-p)^2} = \frac{1}{p}+\frac{1}{1-p}=\frac{1}{p(1-p)} \]
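This value can be checked numerically by estimating the information both as the variance of the score and as minus the mean second derivative. A minimal sketch (added for illustration; \(p=0.3\) is an arbitrary choice):

```python
import numpy as np

# Check the Bernoulli Fisher information two ways: as the variance of the score
# and as minus the expected second derivative of the log likelihood.
rng = np.random.default_rng(1)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)            # Bernoulli(p) draws

score = x / p - (1 - x) / (1 - p)                 # ell'(p|x)
second = -x / p**2 - (1 - x) / (1 - p) ** 2       # ell''(p|x)

print("Var(score):", score.var())                 # ~ 1/(p(1-p))
print("-E[ell''] :", -second.mean())              # ~ 1/(p(1-p))
print("exact     :", 1 / (p * (1 - p)))           # 4.7619...
```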


The Cramér-Rao bound is therefore \(\mathbb{V}\hat p\geq 1/\mathcal{I}(p) = p(1-p)\) for any unbiased estimator \(\hat p\) of \(p\).

Proof details

Along the way, we used

Claim \(|Cov(V,T)| \leq \sqrt{\mathbb{V}T\cdot\mathbb{V}V}\).

Proof Write \(\sigma_X=\sqrt{\mathbb{V}X}\) and \(\sigma_Y=\sqrt{\mathbb{V}Y}\), then \[ \mathbb{E}[(X-\mathbb{E}X)\sigma_Y\pm(Y-\mathbb{E}Y)\sigma_X]^2 = \\ \mathbb{E}[(X-\mathbb{E}X)^2\sigma_Y^2 + (Y-\mathbb{E}Y)^2\sigma_X^2 \pm 2\sigma_X\sigma_Y(X-\mathbb{E}X)(Y-\mathbb{E}Y)] \\ \text{Extracting constants, we get} \\ \sigma_Y^2\mathbb{E}(X-\mathbb{E}X)^2 + \sigma_X^2\mathbb{E}(Y-\mathbb{E}Y)^2 \pm 2\sigma_X\sigma_Y\mathbb{E}[(X-\mathbb{E}X)(Y-\mathbb{E}Y)] = \\ \sigma^2_Y\mathbb{V}X + \sigma^2_X\mathbb{V}Y\pm2\sigma_X\sigma_Y Cov(X,Y) = \\ 2\sigma_X\sigma_Y(\sigma_X\sigma_Y\pm Cov(X,Y)) \] This is non-negative because it is the expectation of a square.


We derived \(2\sigma_X\sigma_Y(\sigma_X\sigma_Y\pm Cov(X,Y)) \geq 0\) for both choices of sign.

If \(\sigma_X\sigma_Y>0\), dividing by \(2\sigma_X\sigma_Y\) gives \(\sigma_X\sigma_Y\pm Cov(X,Y)\geq 0\) for both signs, which is exactly \(|Cov(X,Y)|\leq\sigma_X\sigma_Y\).

If \(\sigma_X\sigma_Y=0\), then \(X\) or \(Y\) is almost surely constant, so \(Cov(X,Y)=0\) and the inequality holds trivially.

Generalization

One route to proving Cramér-Rao goes through proving a far more general inequality: the Hammersley-Chapman-Robbins inequality.

Claim \[ \mathbb{V}T \geq \frac{[g(\theta+\Delta)-g(\theta)]^2} {\mathbb{E}\left[\frac{\mathcal{L}(\theta+\Delta|X)}{\mathcal{L}(\theta|X)}-1\right]^2} \]

The connection to Cramér-Rao comes from dividing the numerator and denominator by \(\Delta^2\) to form difference quotients; the derivatives appear in the limit \(\Delta\to0\).
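Spelled out (assuming enough smoothness to pass the limit inside the expectation), the bound becomes

\[ \mathbb{V}T \geq \frac{\left[\frac{g(\theta+\Delta)-g(\theta)}{\Delta}\right]^2} {\mathbb{E}\left[\frac{1}{\Delta}\left(\frac{\mathcal{L}(\theta+\Delta|X)}{\mathcal{L}(\theta|X)}-1\right)\right]^2} \xrightarrow{\ \Delta\to0\ } \frac{g'(\theta)^2}{\mathbb{E}\left[\frac{\partial}{\partial\theta}\log\mathcal{L}(\theta|X)\right]^2} = \frac{g'(\theta)^2}{\mathcal{I}(\theta)} \]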

Efficiency

How close does a specific unbiased estimator get to this theoretical bound?

The answer is the estimator's efficiency:

\[ \text{eff}(\hat\theta, \theta) = \frac{1/\mathcal{I}(\theta)}{\mathbb{V}\hat\theta} = \frac{\text{possible}}{\text{actual}} \]

To compare two estimators of the same quantity, we can calculate their relative efficiency

\[ \text{eff}(\hat\theta_1, \hat\theta_2) = \frac{\text{eff}(\hat\theta_1)}{\text{eff}(\hat\theta_2)} = \frac{\mathbb{V}\hat\theta_2}{\mathbb{V}\hat\theta_1} \]

Example

Recall the estimators of \(p\) given the success count \(N\) from 100 Bernoulli trials (writing \(q=1-p\)):

Estimator          Bias              Variance             MSE
\(N/100\)          \(0\)             \(pq/100\)           \(pq/100\)
\((N+3)/100\)      \(3/100\)         \(pq/100\)           \((100pq+9)/100^2\)
\((N+3)/106\)      \((3-6p)/106\)    \(100pq/106^2\)      \((9-8p)(1+8p)/106^2\)

The Fisher information in one Bernoulli trial is \(1/pq\), so the information in \(100\) independent trials is \(100/pq\), and the Cramér-Rao bound is \(pq/100\). The unbiased estimator \(N/100\) attains it, so its efficiency is 1.
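These variances, MSEs, and the bound are easy to check by simulation. A minimal sketch (added for illustration; \(p=0.4\) and the replication count are arbitrary choices):

```python
import numpy as np

# Simulate the success count N ~ Binomial(100, p) and compare the variance and
# MSE of the three estimators of p with the Cramer-Rao bound pq/100.
rng = np.random.default_rng(2)
p, n, reps = 0.4, 100, 200_000
N = rng.binomial(n, p, size=reps)

estimators = {
    "N/100": N / 100,
    "(N+3)/100": (N + 3) / 100,
    "(N+3)/106": (N + 3) / 106,
}
crb = p * (1 - p) / n
print("Cramer-Rao bound:", crb)
for name, est in estimators.items():
    var = est.var()
    mse = np.mean((est - p) ** 2)
    print(f"{name:>10}  var={var:.5f}  mse={mse:.5f}  crb/var={crb / var:.3f}")
```

The printed ratio crb/var is the efficiency for the unbiased estimator \(N/100\); for the biased estimators it is only a variance comparison, not an efficiency in the strict sense.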


Multivariate Fisher information

When \(\theta\) is a vector, the Fisher information becomes a matrix, the negative expected Hessian of the log likelihood:

\[ \mathcal{I}(\theta)_{i,j} = \mathbb{E}\left[ \frac{\partial}{\partial\theta_i}\log p(X|\theta)\cdot \frac{\partial}{\partial\theta_j}\log p(X|\theta) \right] = \\ -\mathbb{E}\left[ \frac{\partial^2}{\partial\theta_i\partial\theta_j}\log p(X|\theta)\right] \]

Multivariate variance bounds

Suppose \(\delta\) is an unbiased estimator of \(g(\theta)\), with \(\theta\) multidimensional. Then

\[ \mathbb{V}\delta \geq \nabla g(\theta)^T\mathcal{I}(\theta)^{-1}\nabla g(\theta) \]
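For the coordinate function \(g(\theta)=\theta_i\) the gradient is the standard basis vector \(e_i\), so the bound reads off the inverse information matrix:

\[ \mathbb{V}\delta \geq e_i^T\,\mathcal{I}(\theta)^{-1}e_i = \left(\mathcal{I}(\theta)^{-1}\right)_{ii} \]

When \(\mathcal{I}(\theta)\) is diagonal, as in the example below, this is just \(1/\mathcal{I}(\theta)_{ii}\).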

Example

Let \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\).

\[ \ell(X|\mu,\sigma^2) = -\frac{1}{2}\left[ n\log(2\pi) + n\log(\sigma^2) + \sum (x_j-\mu)^2/\sigma^2 \right] \\ \frac{\partial}{\partial\mu} \ell(X|\mu,\sigma^2) = \sum (x_j-\mu)/\sigma^2 \\ \frac{\partial}{\partial\sigma^2} \ell(X|\mu,\sigma^2) = \frac{-n}{2\sigma^2} + \sum(x_j-\mu)^2/2(\sigma^2)^2 \]


\[ \mathbb{E}\frac{\partial^2}{\partial\mu^2} \ell(X|\mu,\sigma^2) = \mathbb{E}[-n/\sigma^2] = -n/\sigma^2 \\ \mathbb{E}\frac{\partial^2}{\partial\mu\partial\sigma^2} \ell(X|\mu,\sigma^2) = \mathbb{E}\left[-\sum(x_j-\mu)/(\sigma^2)^2\right] = 0 \\ \mathbb{E}\frac{\partial^2}{\partial(\sigma^2)^2} \ell(X|\mu,\sigma^2) = \mathbb{E}\left[\frac{n}{2(\sigma^2)^2} -\frac{\sum(x_j-\mu)^2}{(\sigma^2)^3} \right] = \\ \frac{n}{2(\sigma^2)^2} - \frac{2n\sigma^2}{2(\sigma^2)^3} = \frac{n-2n}{2(\sigma^2)^2} = \frac{-n}{2(\sigma^2)^2} \\ \]


\[ \mathcal{I}(\mu,\sigma^2) = \begin{pmatrix} n/\sigma^2 & 0 \\ 0 & \frac{n}{2(\sigma^2)^2} \end{pmatrix} \]

Any unbiased estimator of \(\mu\) will have variance at least \(\sigma^2/n\); the sample mean \(\bar X\) attains this bound.

Any unbiased estimator of \(\sigma^2\) will have variance at least \(2\sigma^4/n\).
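As in the one-dimensional examples, this matrix can be estimated by simulating the score vector \((\partial\ell/\partial\mu,\ \partial\ell/\partial\sigma^2)\) and taking its covariance. A minimal sketch (added for illustration; the parameter values are arbitrary):

```python
import numpy as np

# Estimate the Fisher information matrix for (mu, sigma^2) as the covariance of
# the score vector over many simulated samples of size n, and compare it with
# diag(n/sigma^2, n/(2 sigma^4)).
rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 2.0, 10, 200_000
x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

score_mu = np.sum(x - mu, axis=1) / sigma2                                       # d ell / d mu
score_s2 = -n / (2 * sigma2) + np.sum((x - mu) ** 2, axis=1) / (2 * sigma2**2)   # d ell / d sigma^2

print("simulated:\n", np.cov(np.vstack([score_mu, score_s2])))
print("exact:\n", np.diag([n / sigma2, n / (2 * sigma2**2)]))
```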