Estimators: Information, Efficiency

Mikael Vejdemo-Johansson

Information - the uncertainty of an MLE

It is quite reasonable to ask what the sampling distribution of a Maximum Likelihood Estimator is - whether the estimate tends to be tightly concentrated around some value, or ends up jumping around a lot.

To compute the standard error of an MLE, Fisher (1922) introduced a measure that ended up named for him.

Consider a simple case: we have a single observation \(x\sim\mathcal{N}(\mu,\sigma^2)\) and are estimating \(\mu\), with \(\sigma^2\) known. We want to define an amount of information \(\mathcal{I}_x(\mu)\) that measures the information content of this single sample.

Code
library(tidyverse)
library(ggformula)
theme_set(theme_light())
# Normal PDFs with the same mean but different standard deviations:
# the more spread out the distribution, the less a single draw tells us about the mean.
gf_function(dnorm, args=list(mean=0, sd=1.0), xlim=c(-2,2), color=~"sd=1.0") %>%
  gf_function(dnorm, args=list(mean=0, sd=0.5), xlim=c(-2,2), color=~"sd=0.5") %>%
  gf_function(dnorm, args=list(mean=0, sd=0.25), xlim=c(-2,2), color=~"sd=0.25") %>%
  gf_labs(y="PDF")

As the variance increases, the information content in any one observation decreases. One possibly useful choice could be \(\mathcal{I}_x(\mu)\propto 1/\sigma^2\).

Fisher Information

As you may have noticed by now, in Maximum Likelihood Estimation, we keep computing the same function over and over:

For a PDF or PMF \(f({x} | {\theta})\), we often find the MLE by computing \(\frac{d}{d\theta}\log(f(x|\theta))\)

Definition

The Fisher information \(\mathcal{I}(\theta)\) in a single observation is the variance of the random variable \(U=\frac{d}{d\theta}\log(f(X|\theta))\).

\[ \mathcal{I}(\theta) = \VV\left[\frac{d}{d\theta}\log(f(X|\theta))\right] \]

The Fisher information quantifies how sensitive the distribution of \(X\) is to changes in \(\theta\). If small changes in \(\theta\) lead to large changes in the distribution of \(X\), then the observed values of \(X\) carry a lot of information about \(\theta\) - but if \(\theta\) can change without changing the distribution much, there is not much information.

This quantity \(U = \frac{d}{d\theta}\ell(\theta) = \frac{d}{d\theta}\log(f(X|\theta))\) is called the score, and measures the sensitivity of \(\ell\) to changes in its parameters. The score has expected value \(0\) at the true value of \(\theta\):

\[ \begin{align*} \EE\left[\frac{d}{d\theta}\log(f(X|\theta))\middle|\theta\right] &= \int\frac{\frac{d}{d\theta}f(x|\theta)}{\color{DarkMagenta}{f(x|\theta)}}\color{DarkMagenta}{f(x|\theta)}dx && \text{since } (\log(f))' = f'/f \\ &= \frac{d}{d\theta}\int f(x|\theta)dx && \text{subject to some technical conditions} \\ &= \frac{d}{d\theta}1 = 0 && \text{since }f\text{ is a PDF} \end{align*} \]

The corresponding discrete argument is in the book.
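
A quick simulation sketch of this fact (not from the book; mu = 2 and sigma = 1.5 are arbitrary illustrative values), using the score \((x-\mu)/\sigma^2\) that we derive in the normal example below:

Code
# Simulate scores at the true mu; their sample mean should be close to 0
set.seed(42)
mu <- 2; sigma <- 1.5
x <- rnorm(1e5, mean = mu, sd = sigma)
score <- (x - mu) / sigma^2   # d/dmu log f(x | mu, sigma^2)
mean(score)                   # approximately 0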

Properties of Fisher Information

  • If \(X_1,\dots,X_n\sim\mathcal{D}(\theta)\) is an iid random sample, the Fisher information \(\mathcal{I}_n(\theta)\) of the entire sample is (using independence in the last step, since the variance of a sum of independent variables is the sum of their variances): \[ \mathcal{I}_n(\theta) = \VV\left[\frac{d}{d\theta}\log(PDF(X_1,\dots,X_n|\theta))\right] = \VV\left[\frac{d}{d\theta}\log(PDF(X_1|\theta)\cdot\dots\cdot PDF(X_n|\theta))\right] = \\ \VV\left[\frac{d}{d\theta}\sum_{i=1}^n\log(PDF(X_i|\theta))\right] = \VV\left[\sum_{i=1}^n\frac{d}{d\theta}\log(PDF(X_i|\theta))\right] = \sum_{i=1}^n\VV\left[\frac{d}{d\theta}\log(PDF(X_i|\theta))\right] = n\mathcal{I}(\theta) \]

  • If \(\log(PDF(x|\theta))\) is twice differentiable, and under certain regularity conditions, \[ \mathcal{I}(\theta) = -\EE\left[\frac{d^2}{d\theta^2}\log(PDF(X|\theta))\middle|\theta\right] \] Indeed, by using \(\ell' = PDF'/PDF\), the product rule and the chain rule: \[ \begin{align*} \frac{d^2}{d\theta^2}\ell(\theta) &= \frac{d}{d\theta}\left(\frac{1}{PDF(x|\theta)}\cdot\frac{d}{d\theta}PDF(x|\theta)\right) \\ &= \frac{d}{d\theta}\left(\frac{1}{PDF(x|\theta)}\right)\left(\frac{d}{d\theta}PDF(x|\theta)\right) + \left(\frac{1}{PDF(x|\theta)}\right)\left(\frac{d}{d\theta}\frac{d}{d\theta}PDF(x|\theta)\right) \\ &= -\frac{1}{PDF(x|\theta)^2}\left(\frac{d}{d\theta}PDF(x|\theta)\right)^2 + \frac{\frac{d^2}{d\theta^2}PDF(x|\theta)}{PDF(x|\theta)} \\ &= -\left(\frac{d}{d\theta}\log(PDF(x|\theta))\right)^2 + \frac{\frac{d^2}{d\theta^2}PDF(x|\theta)}{PDF(x|\theta)} \end{align*} \]

Taking expectations, and remembering that the score has expected value \(0\), we get

\[ \EE\left[\frac{d^2}{d\theta^2}\ell(\theta)\right] = -\mathcal{I}(\theta) + \int\frac{\frac{d^2}{d\theta^2}PDF(x|\theta)}{\color{DarkRed}{PDF(x|\theta)}}\color{DarkRed}{PDF(x|\theta)}dx = -\mathcal{I}(\theta) + \frac{d^2}{d\theta^2}\int PDF(x|\theta)dx = -\mathcal{I}(\theta) \]
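
A minimal numerical sketch of both properties, using a Poisson(\(\lambda\)) sample purely as an illustration (the values of \(\lambda\) and \(n\) are arbitrary; for the Poisson, the score is \(X/\lambda-1\), the second derivative of the log-PMF is \(-X/\lambda^2\), and \(\mathcal{I}(\lambda)=1/\lambda\)):

Code
# Check Var[score] = -E[second derivative] = 1/lambda, and that an iid sample has n times the information
set.seed(1)
lambda <- 3; n <- 5; reps <- 1e5
x <- rpois(reps, lambda)
var(x / lambda - 1)                 # variance of the score: close to 1/lambda = 0.333
mean(x / lambda^2)                  # -E[d^2/dlambda^2 log f]: also close to 1/lambda
samples <- matrix(rpois(reps * n, lambda), ncol = n)
var(rowSums(samples / lambda - 1))  # score of a sample of size n: close to n/lambda = 1.667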

Example: Bernoulli

Let \(X\sim Bernoulli(p)\), with \(PDF(x|p) = p^x(1-p)^{1-x}\).

\[ \frac{d}{dp}\log(PDF(X|p)) = \frac{d}{dp}\left(X\log p + (1-X)\log(1-p)\right) = \frac{X}{p} - \frac{1-X}{1-p} = \frac{X-p}{p(1-p)} \\ \mathcal{I}(p) = \VV\left[\frac{d}{dp}\log(PDF(X|p))\right] = \VV\left[\frac{X-p}{p(1-p)}\right] = \frac{\VV[X]}{(p(1-p))^2} = \frac{p(1-p)}{(p(1-p))^2} = \frac{1}{p(1-p)} \]

Alternatively

\[ \frac{d^2}{dp^2}\log(PDF(X|p)) = \frac{d}{dp}\left(\frac{X}{p}-\frac{1-X}{1-p}\right) = -\frac{X}{p^2}-\frac{1-X}{(1-p)^2} \\ \mathcal{I}(p) = -\EE\left[\frac{d^2}{dp^2}\log(PDF(X|p))\right] = \frac{p}{p^2}+\frac{1-p}{(1-p)^2} = \frac{1}{p}+\frac{1}{1-p} = \frac{1}{p(1-p)} \]
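
A quick simulation sketch (p = 0.3 is an arbitrary choice) confirming that the variance of the Bernoulli score is close to \(1/(p(1-p))\):

Code
# Simulated Bernoulli scores; their variance should be near 1/(p(1-p))
set.seed(1)
p <- 0.3
x <- rbinom(1e5, size = 1, prob = p)
var((x - p) / (p * (1 - p)))   # close to the theoretical value
1 / (p * (1 - p))              # 4.7619...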

Example: Normal, known variance

As in our first motivating example, let \(X\sim\mathcal{N}(\mu,\sigma^2)\) with known \(\sigma^2\).

The score is: \[ \frac{d}{d\mu}\log(PDF(x | \mu,\sigma^2)) = \frac{d}{d\mu}\left(-\log(\sqrt{2\pi\sigma^2}) - \frac{1}{2\sigma^2}(x-\mu)^2\right) = \frac{1}{\sigma^2}(x-\mu) \]

So the Fisher information is: \[ \mathcal{I}(\mu) = \VV\left[\frac{1}{\sigma^2}(x-\mu)\right] = \EE\left[\left(\frac{1}{\sigma^2}(x-\mu)\right)^2\right] = \frac{1}{\sigma^4}\EE[(x-\mu)^2] = \frac{1}{\sigma^4}\VV[x] = \frac{\sigma^2}{\sigma^4} = \frac{1}{\sigma^2} \]

This is exactly the information measure we proposed in the initial motivating example.
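
The same kind of simulation check works here (illustrative values again): the variance of the score \((x-\mu)/\sigma^2\) should be close to \(1/\sigma^2\).

Code
# Variance of the normal score should be near 1/sigma^2
set.seed(42)
mu <- 2; sigma <- 1.5
x <- rnorm(1e5, mean = mu, sd = sigma)
var((x - mu) / sigma^2)   # close to 1/sigma^2
1 / sigma^2               # 0.444...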

Cramér-Rao: How good can an estimator get?

Harald Cramér in Sweden and C.R. Rao in India independently derived this lower bound based on Fisher information:

Theorem

Let \(X_1,\dots,X_n\sim\mathcal{D}(\theta)\) iid where the support does not depend on \(\theta\). If the statistic \(T=t(X_1,\dots,X_n)\) is an unbiased estimator for \(\theta\), then

\[ \VV[T] \geq \frac{1}{\mathcal{I}_n(\theta)} \]

Proof

Consider the covariance \(Cov(T, \ell'(\theta))\) of the estimator with the score function.

\[ Cov(T(\boldsymbol{x}), \ell'(\theta|\boldsymbol{x})) = \EE[T(\boldsymbol{x})\cdot\ell'(\theta|\boldsymbol{x})] - \EE[T(\boldsymbol{x})]\cdot\EE[\ell'(\theta|\boldsymbol{x})] = \EE[T(\boldsymbol{x})\cdot\ell'(\theta|\boldsymbol{x})] \\ \begin{align*} \EE[T(\boldsymbol{x})\cdot\ell'(\theta|\boldsymbol{x})] &= \iiint T(x_1,\dots,x_n)\cdot\ell'(\theta|\boldsymbol{x})PDF(x_1,\dots,x_n|\theta)dx_1\dots dx_n \\ &= \iiint T(x_1,\dots,x_n)\cdot \left(\frac{\frac{d}{d\theta}PDF(x_1,\dots,x_n|\theta)} {\color{DarkRed}{PDF(x_1,\dots,x_n|\theta)}} \right) \color{DarkRed}{PDF(x_1,\dots,x_n|\theta)} dx_1\dots dx_n \\ &= \iiint T(x_1,\dots,x_n)\frac{d}{d\theta}PDF(x_1,\dots,x_n|\theta)dx_1\dots dx_n \\ &= \frac{d}{d\theta}\iiint T(x_1,\dots,x_n) PDF(x_1,\dots,x_n|\theta) dx_1\dots dx_n \\ &= \frac{d}{d\theta}\EE[T(\boldsymbol{x})] = \frac{d}{d\theta}\theta = 1 \end{align*} \]

Proof (cont…)

We established \(Cov(T, \ell') = \EE[T(\boldsymbol{x})\cdot\ell'(\theta|\boldsymbol{x})] = 1\).

Covariance is connected to the correlation \(\rho\) through \(\rho(X,Y) = \frac{Cov(X,Y)}{\sigma_X\sigma_Y}\). Remember that the correlation coefficient is bounded by \(\pm1\), so that \(-1\leq\rho\leq1\). Squaring this, we get

\[ 1 \geq \rho(T,\ell')^2 = \frac{Cov(T,\ell')^2}{\VV[T]\VV[\ell']} = \frac{1}{\VV[T]\VV[\ell']} \]

Multiplying by \(\VV[T]\) and identifying \(\VV[\ell']=\mathcal{I}_n(\theta)\) finishes the proof.
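
To see the bound as an inequality rather than an equality, here is an illustrative simulation (not from the book): for normal data the sample median is also an unbiased estimator of \(\mu\), but its variance sits above the Cramér-Rao bound \(\sigma^2/n\), which the sample mean attains.

Code
# Sample median vs. sample mean as unbiased estimators of mu for normal data
set.seed(7)
mu <- 0; sigma <- 1; n <- 25; reps <- 1e4
samples <- matrix(rnorm(reps * n, mu, sigma), ncol = n)
var(apply(samples, 1, median))  # roughly pi*sigma^2/(2n) = 0.063: above the bound
var(rowMeans(samples))          # close to the bound
sigma^2 / n                     # Cramer-Rao lower bound: 0.04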

Efficiency: How close are you to the Cramér-Rao lower bound?

Definition

The efficiency of an unbiased estimator \(T\) of a parameter \(\theta\) is the ratio \(\frac{1/\mathcal{I}_n(\theta)}{\VV[T]}\). An efficient estimator is one that achieves the Cramér-Rao lower bound (efficiency \(1\)), and is automatically a minimum variance unbiased estimator (MVUE).

Example: Bernoulli

Let \(X_1,\dots,X_n\sim Bernoulli(p)\) iid. Our derivation earlier of \(\mathcal{I}(p)=\frac{1}{p(1-p)}\) tells us \(\mathcal{I}_n(p)=\frac{n}{p(1-p)}\).

Let \(T=\hat p=\overline{X}=\sum X_i/n\). This is an unbiased estimator, and \[ \VV[T] = \VV\left[\sum X_i/n\right] = \sum\frac{1}{n^2}\VV[X_i] = n\frac{p(1-p)}{n^2} = \frac{p(1-p)}{n} = \frac{1}{\mathcal{I}_n(p)} \]

This proves that \(\hat{p}\) has efficiency \(1\), is an efficient estimator, and therefore is an MVUE.
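
A brief simulated check (n and p picked arbitrarily) that \(\VV[\hat p]\) matches the bound \(p(1-p)/n\):

Code
# Variance of the sample proportion versus the Cramer-Rao bound p(1-p)/n
set.seed(2)
p <- 0.4; n <- 50
p_hat <- rbinom(1e5, size = n, prob = p) / n   # each draw is one sample proportion
var(p_hat)          # close to the bound
p * (1 - p) / n     # 0.0048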

Example: Normal

Let \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) iid, with \(\sigma^2\) known. Our earlier derivation of \(\mathcal{I}(\mu)=\frac{1}{\sigma^2}\) tells us \(\mathcal{I}_n(\mu)=\frac{n}{\sigma^2}\).

Let \(T=\overline{X}\). This is an unbiased estimator, and \(\VV[T] = \frac{\sigma^2}{n} = \frac{1}{\mathcal{I}_n(\mu)}\), so \(\overline{X}\) is efficient and an MVUE.

Large Sample MLE

The Cramér-Rao lower bound helps us prove very nice properties for large sample MLE:

Theorem

Let \(X_1,\dots,X_n\sim\mathcal{D}(\theta)\) iid, and assume that the set of possible values does not depend on \(\theta\).

Then for large \(n\), the sampling distribution of the MLE \(\hat\theta\) is approximately a normal distribution with mean \(\theta\) and variance \(1/(n\mathcal{I}(\theta))\).

The limiting distribution of \(\sqrt{n}(\hat\theta-\theta)\) is \(\mathcal{N}(0,1/\mathcal{I}(\theta))\).

Proof

The derivative of the score function \(U(\theta)=\frac{d}{d\theta}\ell(\theta)\) at the true \(\theta\) is approximately equal to the difference quotient \[ U'(\theta) \approx \frac{U(\hat\theta)-U(\theta)}{\hat\theta-\theta} \] Because \(\hat\theta\) is consistent, \(\hat\theta\to\theta\) as \(n\to\infty\). Because \(\hat\theta\) is the MLE, \(U(\hat\theta)=0\), and so in the limit: \[ \begin{align*} \hat\theta-\theta &= \frac{U(\theta)}{-U'(\theta)} \\ \sqrt{n}(\hat\theta-\theta) &= \frac{\sqrt{n}U(\theta)}{-U'(\theta)} = \frac{\left(\sqrt{n}/(n\sqrt{\mathcal{I}(\theta)})\right)U(\theta)}{-\left(1/(n\sqrt{\mathcal{I}(\theta)})\right)U'(\theta)} = \frac{U(\theta)/\sqrt{n\mathcal{I}(\theta)}}{-(1/n)U'(\theta)/\sqrt{\mathcal{I}(\theta)}} \\ &= \frac {\frac{1}{n}\left(\sum\frac{d}{d\theta}\log(PDF(X_i|\theta))\right)/\sqrt{\mathcal{I}(\theta)/n}} {\frac{1}{n}\left(-\sum\frac{d^2}{d\theta^2}\log(PDF(X_i|\theta))\right)/\sqrt{\mathcal{I}(\theta)}} \end{align*} \]

Proof (cont…)

We established

\[ \sqrt{n}(\hat\theta-\theta) = \frac {\frac{1}{n}\left(\sum\frac{d}{d\theta}\log(PDF(X_i|\theta))\right)/\sqrt{\mathcal{I}(\theta)/n}} {\frac{1}{n}\left(-\sum\frac{d^2}{d\theta^2}\log(PDF(X_i|\theta))\right)/\sqrt{\mathcal{I}(\theta)}} \]

Each \(\frac{d}{d\theta}\log(PDF(X_i|\theta))\) is an iid random variable with mean \(0\) and variance \(\mathcal{I}(\theta)\). The entire numerator is a z-score normalization, and by the central limit theorem is approximately \(\mathcal{N}(0,1)\).

Each \(-\frac{d^2}{d\theta^2}\log(PDF(X_i|\theta))\) is an iid random variable with mean \(\mathcal{I}(\theta)\), so by the law of large numbers their average converges to \(\mathcal{I}(\theta)\), and the whole denominator converges to \(\mathcal{I}(\theta)/\sqrt{\mathcal{I}(\theta)}=\sqrt{\mathcal{I}(\theta)}\).

Dividing a random variable that is approximately \(\mathcal{N}(0,1)\) by a quantity converging to \(\sqrt{\mathcal{I}(\theta)}\) gives, by Slutsky's theorem, a limit distributed as \(\mathcal{N}(0,1/\mathcal{I}(\theta))\). QED.
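
A simulation sketch of the theorem for the Bernoulli MLE \(\hat p=\overline{X}\) (illustrative n and p): \(\sqrt{n}(\hat p-p)\) should look approximately \(\mathcal{N}(0, p(1-p)) = \mathcal{N}(0, 1/\mathcal{I}(p))\).

Code
# Sampling distribution of sqrt(n)*(p_hat - p) for the Bernoulli MLE
set.seed(3)
p <- 0.3; n <- 200; reps <- 1e4
p_hat <- rbinom(reps, size = n, prob = p) / n
z <- sqrt(n) * (p_hat - p)
c(mean(z), var(z), p * (1 - p))   # mean near 0, variance near 1/I(p) = p(1-p)
gf_dhistogram(~z, bins = 40) %>%
  gf_function(dnorm, args = list(mean = 0, sd = sqrt(p * (1 - p))), color = "red")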

Your Homework

7.8 - Nicholas Basile

  1. \(\hat{p} = \frac{80-12}{80} = \frac{68}{80} = 0.85\) - an estimated 85% of components are not defective.
  2. \(0.85\cdot 0.85 = 0.7225\) - with two independent components, an estimated 72.25% of all systems work properly.
  3.   \[ \begin{align*} \EE[\hat{p}^2] &= \VV[\hat{p}]+\EE[\hat{p}]^2 \\ \EE[\hat{p}] &= 0.85 \quad \text{by a.} \\ \VV[\hat{p}] &= \frac{p(1-p)}{n} = \frac{0.85\cdot0.15}{80}\approx 0.0016 \\ \EE[\hat{p}^2] &\approx 0.0016 + 0.85^2 \approx 0.7241 \\ \EE[\hat{p}^2] &\neq p^2 \end{align*} \] Because the variance is not 0, \(\EE[\hat{p}^2]\neq p^2\), therefore \(\hat{p}^2\) is a biased estimator of \(p^2\).
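
A quick R check of the arithmetic in part c (using the same values):

Code
p_hat <- 68 / 80
p_hat * (1 - p_hat) / 80             # V[p_hat], about 0.0016
p_hat * (1 - p_hat) / 80 + p_hat^2   # E[p_hat^2], about 0.7241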

7.10 - James Lopez

  1.   \[ \begin{align*} \EE[\overline{X}^2] &= \VV[\overline{X}] + \EE[\overline{X}]^2 \\ \VV[\overline{X}] &= \left(\frac{\sigma}{\sqrt{n}}\right)^2 = \frac{\sigma^2}{n} \\ \EE[\overline{X}]^2 &= \mu^2 \\ \EE[\overline{X}^2] &= \frac{\sigma^2}{n} + \mu^2 \\ \EE[\overline{X}^2] &\neq \mu^2 \qquad \text{so it is not unbiased.} \end{align*} \]

Since \(\EE[\overline{X}^2]\neq\mu^2\), \(\overline{X}^2\) is not an unbiased estimator of \(\mu^2\).

  2.   \[ \begin{align*} \EE[\overline{X}^2] &= \frac{\sigma^2}{n}+\mu^2 \\ \EE[S^2] &= \sigma^2 \\ \EE[\overline{X}^2-kS^2] &= \mu^2 \\ \frac{\sigma^2}{n}+\mu^2-k\sigma^2 &= \mu^2 \\ \frac{\sigma^2}{n} &= k\sigma^2 \\ \frac{1}{n} &= k \end{align*} \]

7.14 - Victoria Paukova

  1. If \(X_1=237, X_2=375, X_3=202, X_4=525, X_5=418\),

\[ \max(X_i)-\min(X_i)+1 = 525-202+1 = 324 \]

  2. The estimate will be exactly equal to the true value when \(\max(X_i)\) is actually the highest serial number issued and \(\min(X_i)\) is the lowest. Since we know the serial numbers are an inclusive set, it is not possible to overestimate the number of jet fighters. It is a biased estimator because it is very unlikely that \(\max(X_i)=\beta\) and \(\min(X_i)=\alpha\) in the sample.

7.21 a. - James Lopez

\(n=20\), \(x=3\), \(\log\) is base \(e\).

\[ \begin{align*} \ell(p) &= \log\left[{n\choose x}p^x(1-p)^{n-x}\right] & \log ab &= \log a + \log b \\ &= \log{n\choose x}+x\log p + (n-x)\log(1-p) & \log a^x &= x\log a \\ \ell'(p) &= \frac{d}{dp}\left[\log{n\choose x}+x\log p+(n-x)\log(1-p)\right] \\ &= 0 + x\frac{1}{p}\cdot 1 + (n-x)\frac{1}{1-p}\cdot (-1) \\ &= \frac{x}{p} - \frac{n-x}{1-p} \\ \frac{x}{p}-\frac{n-x}{1-p} &= 0 \\ \frac{x}{p} &= \frac{n-x}{1-p} \\ x(1-p) &= p(n-x) \\ x-px &= pn-px \\ x &= pn \\ p &= \frac{x}{n} \end{align*} \]

MLE of \(p\) is \(\hat{p} = \frac{x}{n} = \frac{3}{20} = 0.15\).
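
A small numerical cross-check (not part of the submitted solution): maximizing the binomial log-likelihood directly gives the same answer.

Code
# Numerically maximize the binomial log-likelihood for n = 20, x = 3
n <- 20; x <- 3
loglik <- function(p) dbinom(x, size = n, prob = p, log = TRUE)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum   # about 0.15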

7.21 - James Lopez

  1.   \[ \begin{align*} \EE[\hat{p}] &= \EE\left[\frac{x}{n}\right] \\ &= \frac{1}{n}\EE[x] \end{align*} \]

For the binomial distribution, \(\EE[x] = np\).

\[ \begin{align*} \EE[\hat{p}] &= \frac{1}{n}\cdot np \\ &= p \end{align*} \]

The estimator is unbiased.

  2.   By the invariance property of MLEs, \[ \begin{align*} \widehat{(1-p)^5} &= (1-\hat{p})^5 \\ &= (1-0.15)^5 \\ &= 0.85^5 \\ &\approx 0.443705 \approx 0.444 \end{align*} \]

7.24 - Maxim Kleyer

\[ \begin{align*} f(x_1,\dots,x_n,y_1,\dots,y_n | \lambda_1,\lambda_2) &= f(x_1|\lambda_1)\cdot\dots\cdot f(x_n|\lambda_1)\cdot f(y_1|\lambda_2)\cdot\dots\cdot f(y_n|\lambda_2) \\ &= \frac{e^{-n\lambda_1}\cdot\lambda_1^{\sum_{i=1}^nx_i}}{x_1!\cdot\dots\cdot x_n!} \cdot \frac{e^{-n\lambda_2}\cdot\lambda_2^{\sum_{i=1}^ny_i}}{y_1!\cdot\dots\cdot y_n!} = \frac{e^{-n\lambda_1}\cdot\lambda_1^{n\overline{x}}}{x_1!\cdot\dots\cdot x_n!} \cdot \frac{e^{-n\lambda_2}\cdot\lambda_2^{n\overline{y}}}{y_1!\cdot\dots\cdot y_n!} \\ \ell(\lambda_1,\lambda_2) &= \ln\left(e^{-n\lambda_1}\cdot\lambda_1^{n\overline{x}}\cdot e^{-n\lambda_2}\cdot\lambda_2^{n\overline{y}}\right) - \ln\left(x_1!\cdot\dots\cdot x_n!\cdot y_1!\cdot\dots\cdot y_n!\right) \\ &= n\overline{x}\ln\lambda_1 + n\overline{y}\ln\lambda_2 - n\lambda_1 - n\lambda_2 + [\dots] \\ \frac{\partial\ell}{\partial\lambda_1} &= \frac{n\overline{x}}{\lambda_1} - n = 0 \end{align*} \\ \Rightarrow\boxed{\hat\lambda_1 = \overline{x}} \]

Similarly \(\boxed{\hat\lambda_2=\overline{y}}\).

By the invariance property, the MLE of \(\lambda_1-\lambda_2\) is \(\hat\lambda_1-\hat\lambda_2=\boxed{\overline{x}-\overline{y}}\).

7.25 - James Lopez

\[ \begin{align*} \mathcal{L}(p) &= {x-1\choose r-1}p^r(1-p)^{x-r} \\ \ell(p) &= \log\left[{x-1\choose r-1}p^r(1-p)^{x-r}\right] \\ &= \log{x-1\choose r-1} + r\log(p) + (x-r)\log(1-p) \\ \ell'(p) &= \frac{d}{dp}\left[\log{x-1\choose r-1} + r\log(p) + (x-r)\log(1-p)\right] \\ &= 0 + \frac{r}{p}\cdot 1 + \frac{x-r}{1-p}\cdot(-1) = \frac{r}{p} - \frac{x-r}{1-p} \\ \end{align*} \]

\[ \begin{align*} 0 &= \frac{r}{p} - \frac{x-r}{1-p} \\ \frac{x-r}{1-p} &= \frac{r}{p} \\ p(x-r) &= r(1-p) \\ xp-rp &= r-rp \\ xp &= r \\ p &= \frac{r}{x} \end{align*} \]

The MLE is \(\hat{p} = \frac{r}{x} = \frac{3}{20} = 0.15\), which is the same as the MLE in 7.21.

The unbiased estimator from exercise 17, written in terms of the total number of trials \(x\), is \(\hat{p}=\frac{r-1}{x-1}=\frac{2}{19}\approx0.105\), which is not the same.
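
As with 7.21, a quick numerical cross-check (an illustration, not part of the submitted solution), using the log-likelihood written above with \(r=3\) and \(x=20\) total trials:

Code
# Numerically maximize the negative binomial log-likelihood l(p)
r <- 3; x <- 20
loglik <- function(p) lchoose(x - 1, r - 1) + r * log(p) + (x - r) * log(1 - p)
optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum   # about 0.15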