Estimators: Sufficiency

Mikael Vejdemo-Johansson

Sufficiency: is our estimator meaningful?

Let \(X_1,\dots,X_n\sim Poisson(\lambda)\) iid.

Here are four statistics we may (or may not) consider using to estimate \(\lambda\)

\[ \begin{align*} T_1 &= 4 & T_2 &= X_1 & T_3 &= \sum X_i & T_4 &= \{X_1,\dots,X_n\} \end{align*} \]

\(T_1\) is not going to be useful. For one thing, it is a constant: its value does not respond to \(\lambda\) (or to the data) at all.

\(T_2\) is somewhat useful. It is an unbiased estimator for \(\lambda\), but it has a relatively high variance. It also ignores a lot of the data.

\(T_3\) is definitely useful. By dividing by \(n\) we get \(\overline{X}\), which is an MVUE for \(\lambda\).

\(T_4\) certainly contains all the information in the data, but can hardly be said to summarize much of anything.
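To make the variance comparison concrete, here is a minimal simulation sketch (assuming NumPy; the values of \(\lambda\), \(n\) and the number of replications are arbitrary choices, not from the text). It estimates the sampling variance of \(T_2 = X_1\) and of \(T_3/n = \overline{X}\) as estimators of \(\lambda\).

```python
import numpy as np

# Minimal sketch: compare the sampling variance of T2 = X1 and T3/n = Xbar
# as estimators of lambda. (lambda, n and reps are arbitrary choices.)
rng = np.random.default_rng(0)
lam, n, reps = 3.0, 25, 100_000

samples = rng.poisson(lam, size=(reps, n))
T2 = samples[:, 0]                 # uses only the first observation
T3_over_n = samples.mean(axis=1)   # the sample mean

print("mean of T2, T3/n:", T2.mean(), T3_over_n.mean())  # both close to lambda
print("Var[T2]   ~", T2.var())          # close to lambda
print("Var[T3/n] ~", T3_over_n.var())   # close to lambda/n, much smaller
```

Both estimators are unbiased, but the variance of \(\overline{X}\) is smaller by a factor of \(n\).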

Sufficiency: what more information do we need?

Let \(X_1,\dots,X_n\sim Poisson(\lambda)\) iid.

\[ \begin{align*} T_1 &= 4 & T_2 &= X_1 & T_3 &= \sum X_i & T_4 &= \{X_1,\dots,X_n\} \end{align*} \]

Let’s look at the conditional joint distributions:

\[ PMF(x_1,\dots,x_n | \lambda) = \left(\frac{\lambda^{x_1}e^{-\lambda}}{x_1!}\right)\cdot\dots\cdot \left(\frac{\lambda^{x_n}e^{-\lambda}}{x_n!}\right) = \frac{\lambda^{\sum x_i}e^{-n\lambda}}{\prod x_i!} \]

What happens to this probability if we condition on any of the suggested statistics?

\[ PMF(x_1,\dots,x_n | \lambda, t_1) = \frac{\PP(X_1=x_1,\dots,X_n=x_n,T_1=t_1)}{\PP(T_1=t_1)} = \\ =\begin{cases} PMF(x_1,\dots,x_n|\lambda) & \text{if } t_1 = 4 \\ 0 & \text{otherwise} \end{cases} \]


Conditioning on \(T_2\) instead:

\[ PMF(x_1,\dots,x_n | \lambda, t_2) = \frac{\PP(X_1=x_1,\dots,X_n=x_n,T_2=t_2)}{\PP(T_2=t_2)} = \\ = \frac{\prod \PP(X_i=x_i)}{\PP(X_1=t_2)} = PMF(x_2,\dots,x_n|\lambda) \]

(whenever \(x_1 = t_2\); for \(x_1\neq t_2\) the conditional probability is \(0\))


For \(T_3\): sums of independent Poisson variables are Poisson with \(\lambda=\sum\lambda_i\), so \(T_3\) is Poisson distributed with intensity \(n\lambda\).

\[ PMF(x_1,\dots,x_n | \lambda, t_3) = \frac{\PP(X_1=x_1,\dots,X_n=x_n,T_3=t_3)}{\PP(T_3=t_3)} = \\ \frac{\prod PMF_{Poisson(\lambda)}(x_i)}{PMF_{Poisson(n\lambda)}(t_3)} = \frac{\frac{\lambda^{\sum x_i}e^{-n\lambda}}{\prod x_i!}} {\frac{(n\lambda)^{\sum x_i}e^{-n\lambda}}{(\sum x_i)!}} = \frac{(\sum x_i)!}{n^{\sum x_i}\prod x_i!} \]
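The ratio \(\frac{(\sum x_i)!}{n^{\sum x_i}\prod x_i!}\) is exactly the \(Multinomial\left(t_3; \frac{1}{n},\dots,\frac{1}{n}\right)\) probability of the outcome \((x_1,\dots,x_n)\), and it contains no \(\lambda\). A small simulation sketch (assuming NumPy; the values of \(n\), \(t\), and the two \(\lambda\) values are arbitrary) illustrates that the conditional behaviour of the data given \(T_3 = t\) is the same no matter which \(\lambda\) generated it:

```python
import numpy as np

# Minimal sketch: conditional on T3 = t, the behaviour of the data should
# not depend on lambda. Compare the conditional mean of X1 given T3 = t
# under two different lambda values. (n, t, reps, lambdas are arbitrary.)
rng = np.random.default_rng(1)
n, t, reps = 5, 10, 400_000

def conditional_mean_x1(lam):
    x = rng.poisson(lam, size=(reps, n))
    keep = x.sum(axis=1) == t          # keep only samples with T3 = t
    return x[keep, 0].mean()

print(conditional_mean_x1(1.5))   # both close to t/n = 2,
print(conditional_mean_x1(3.0))   # regardless of lambda
```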


Finally, \(T_4\) just repeats the entire dataset all over again, so conditional on \(T_4\) the data is completely determined: the conditional probability is \(1\), with no dependence on \(\lambda\).

Sufficiency: the statistic contains all information needed

Definition

A statistic \(T=t(X_1,\dots,X_n)\) is said to be sufficient for making inferences about a parameter \(\theta\) if the distribution of \(X_1,\dots,X_n\) given \(T=t\) does not depend on \(\theta\), no matter the value of \(t\).

\(T_1\) is not sufficient for \(\lambda\): conditioning on it leaves the joint PMF unchanged, and that PMF still depends on \(\lambda\).

\(T_2\) is not sufficient for \(\lambda\), because the conditional probability still depends on \(\lambda\) for \(X_2,\dots,X_n\).

\(T_3\) is sufficient for \(\lambda\). As complex as \(\frac{(\sum x_i)!}{n^{\sum x_i}\prod x_i!}\) is, it does not depend at all on \(\lambda\).

\(T_4\) is sufficient for \(\lambda\).

Neyman Factorization Theorem

Theorem

\(T = t(X_1,\dots,X_n)\) is a sufficient statistic for \(\theta\) if and only if the joint PDF/PMF can be written as a product of two factors, the first of which depends on the data only through \(t(x_1,\dots,x_n)\), and the second factor does not depend on \(\theta\):

\[ f(x_1,\dots,x_n | \theta) = \color{SlateBlue}{g(t(x_1,\dots,x_n), \theta)} \cdot \color{DarkMagenta}{h(x_1,\dots,x_n)} \]

Proof (discrete distribution)

\(\Rightarrow\): Suppose that \(T=t(\boldsymbol{x})\) is sufficient. Then \(\PP(\boldsymbol{X}=\boldsymbol{x} | T=t)\) does not depend on \(\theta\). Pick the value \(t=t(\boldsymbol{x})\) (for any other value of \(t\), the probability is \(0\)). Then the event \(\boldsymbol{X}=\boldsymbol{x}\) is identical to the event that both \(\boldsymbol{X}=\boldsymbol{x}\) and \(T=t\). Thus \[ PMF(\boldsymbol{x}|\theta) = \PP(\boldsymbol{X}=\boldsymbol{x}|\theta) = \PP(\boldsymbol{X}=\boldsymbol{x}, T=t|\theta) = \\ = \PP(\boldsymbol{X}=\boldsymbol{x}|T=t, \theta)\cdot\PP(T=t|\theta) = \color{DarkMagenta}{\PP(\boldsymbol{X}=\boldsymbol{x}|T=t)}\cdot\color{SlateBlue}{\PP(T=t|\theta)} \]

where the key step is \(\PP(\boldsymbol{X}=\boldsymbol{x}|T=t,\theta) = \PP(\boldsymbol{X}=\boldsymbol{x}|T=t)\), which follows from the definition of sufficiency. In the final factorization the first factor (acting as \(h\)) does not involve \(\theta\), and the second factor (acting as \(g(t,\theta)\)) depends on the data only through \(t\).


\(\Leftarrow\): Assume that \(PMF(\boldsymbol{x}|\theta) = g(t,\theta)\cdot h(\boldsymbol{x})\). We will prove that the conditional probability does not involve \(\theta\).

\[ \PP(\boldsymbol{X}=\boldsymbol{x}|T=t, \theta) = \frac{\PP(\boldsymbol{X}=\boldsymbol{x}, T=t | \theta)}{\PP(T=t | \theta)} = \frac{\PP(\boldsymbol{X}=\boldsymbol{x} | \theta)}{\PP(T=t | \theta)} = \\ = \frac{\color{SlateBlue}{g(t,\theta)}\cdot \color{DarkMagenta}{h(\boldsymbol{x})}}{\sum_{\boldsymbol{u}: t(\boldsymbol{u}) = t}\PP(\boldsymbol{X}=\boldsymbol{u}|\theta)} = \frac{\color{SlateBlue}{g(t,\theta)}\cdot \color{DarkMagenta}{h(\boldsymbol{x})}}{\sum_{\boldsymbol{u}: t(\boldsymbol{u}) = t}\color{SlateBlue}{g(t(\boldsymbol{u}),\theta)}\cdot \color{DarkMagenta}{h(\boldsymbol{u})}} = \frac{h(\boldsymbol{x})}{\sum_{\boldsymbol{u}: t(\boldsymbol{u}) = t} h(\boldsymbol{u})} \]

Sufficiency for \(Uniform(0,\theta)\)

Consider iid \(X_1,\dots,X_n\sim Uniform(0,\theta)\) with

\[ PDF(\boldsymbol{x}|\theta) = \begin{cases} 1/\theta^n & \text{if }0\leq x_1\leq \theta; 0\leq x_2\leq \theta; \dots; 0\leq x_n\leq\theta \\ 0 & \text{otherwise} \end{cases} \]

Using the set indicator function \[ \boldsymbol{1}(S)(s) = \begin{cases} 1 & \text{if }s\in S \\ 0 & \text{otherwise} \end{cases} \]

we can factor this function. The support of the function is the hypercube given by the conditions \(0 \leq x_i \leq \theta\) for each \(i\). Notice that

\(0\leq x_i\) for all \(i\) is equivalent to \(0\leq\min x_i\).

So we can write \(\boldsymbol{1}\{\boldsymbol{x} | 0\leq\min\boldsymbol{x}\}\) for this condition.

Similarly, we can write \(\boldsymbol{1}\{\boldsymbol{x} | \max\boldsymbol{x}\leq\theta\}\) for the joint upper limit condition.

So we can write:

\[ PDF(\boldsymbol{x}|\theta) = \frac{1}{\theta^n}\cdot \boldsymbol{1}\{\boldsymbol{x} | 0\leq\min\boldsymbol{x}\}\cdot \boldsymbol{1}\{\boldsymbol{x} | \max\boldsymbol{x}\leq\theta\} = \\ \color{SlateBlue}{\left(\frac{1}{\theta^n}\cdot \boldsymbol{1}\{\boldsymbol{x} | \max\boldsymbol{x}\leq\theta\}\right)}\cdot\color{DarkMagenta}{\boldsymbol{1}\{\boldsymbol{x} | 0\leq\min\boldsymbol{x}\}} \]

Conclusion: \(\max\boldsymbol{x}\) is sufficient for \(\theta\).
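A quick way to see this factorization at work: the likelihood is completely determined by \(\max\boldsymbol{x}\) (and \(n\)). The sketch below (assuming NumPy; the datasets and the grid of \(\theta\) values are arbitrary illustrations) evaluates the likelihood for two datasets that differ everywhere except in their maximum.

```python
import numpy as np

# Minimal sketch: for Uniform(0, theta) the likelihood depends on the data
# only through max(x). Two datasets with the same maximum give identical
# likelihood functions of theta. (Datasets and theta grid are arbitrary.)
def uniform_likelihood(x, theta):
    x = np.asarray(x)
    return (x.max() <= theta) / theta ** len(x)

thetas = np.linspace(2.0, 6.0, 5)
data_a = [0.3, 1.7, 2.5]
data_b = [2.5, 0.9, 0.1]   # different values, same maximum

print([uniform_likelihood(data_a, t) for t in thetas])
print([uniform_likelihood(data_b, t) for t in thetas])   # identical output
```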

Joint Sufficiency - Vector-valued Statistics

Many interesting distributions involve more than one parameter, and may need more than one statistic - calling for an expansion of our definition to a vector-valued notion of sufficiency.

Definition

Suppose the joint PMF or PDF involves unknown parameters \(\theta_1,\dots,\theta_k\). The \(m\) statistics \(T_1=t_1(X_1,\dots,X_n), \dots, T_m=t_m(X_1,\dots,X_n)\) are said to be jointly sufficient for the parameters if the conditional distribution given values for all the \(T_i\) does not depend on any of the \(\theta_j\) - for all possible values of the statistics.

Equivalently, we can simply consider \(\boldsymbol{\theta}\) a \(k\)-dimensional vector of parameters and \(\boldsymbol{T} = \boldsymbol{t}(\boldsymbol{x})\) a statistic with values in an \(m\)-dimensional space.

With vector arguments and vector-valued functions as appropriate, Neyman’s Factorization Theorem holds:

Theorem

\(\boldsymbol{T} = \boldsymbol{t}(\boldsymbol{X})\) is a sufficient statistic for \(\boldsymbol{\theta}\) if and only if the joint PDF/PMF can be written as a product of two factors, the first of which depends on the data only through \(\boldsymbol{t}(x_1,\dots,x_n)\), and the second factor does not depend on \(\boldsymbol{\theta}\):

\[ f(\boldsymbol{x} | \boldsymbol{\theta}) = \color{SlateBlue}{g(\boldsymbol{t}(\boldsymbol{x}), \boldsymbol{\theta})} \cdot \color{DarkMagenta}{h(\boldsymbol{x})} \]

Order Statistics sufficient for any parameters

Let \(X_1,\dots,X_n\sim\mathcal{D}(\boldsymbol{\theta})\) be an iid random sample from some continuous distribution with some parameter vector \(\boldsymbol{\theta}\).

Let \(T_1 = X_{(1)}, T_2 = X_{(2)}, \dots, T_n = X_{(n)}\). For any sorted sequence of \(n\) numbers \(t_1 < \dots < t_n\), the conditional probability is

\[ \PP(\boldsymbol{X}=\boldsymbol{x} | \boldsymbol{T} = \boldsymbol{t}) = \begin{cases} \frac{1}{n!} & \text{if }\boldsymbol{x}\text{ is a permutation of }\boldsymbol{t} \\ 0 & \text{otherwise} \end{cases} \]

None of the parameters show up in this conditional probability, so the collection of all order statistics is jointly sufficient for any set of parameters.

Normal Distribution

Let \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) iid.

The joint PDF is \[ \begin{align*} PDF(\boldsymbol{x}|\mu,\sigma^2) &= \prod_{i=1}^n\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-(x_i-\mu)^2/(2\sigma^2)\right] \\ &= \color{SlateBlue}{\left(\frac{1}{\sigma^n}\cdot \exp\left[-\left(\sum x_i^2-2\mu\sum x_i+n\mu^2\right)/(2\sigma^2)\right]\right)} \cdot \color{DarkMagenta}{\left(\frac{1}{2\pi}\right)^{n/2}} \end{align*} \]

and this factorization shows us that \(\sum X_i^2\) and \(\sum X_i\) are jointly sufficient for \(\mu\) and \(\sigma^2\).

Since \(\sum(X_i-\overline{X})^2 = \sum X_i^2-n\overline{X}^2 = \sum X_i^2-\left(\sum X_i\right)^2/n\), it is an easy calculation to recover \(\overline{X}, S^2\) from \(\sum X_i, \sum X_i^2\), and vice versa.

The two sets of jointly sufficient statistics can be converted to each other.
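The conversion between the two pairs of jointly sufficient statistics is a short calculation; here is a minimal sketch (assuming NumPy; the simulated data is an arbitrary example).

```python
import numpy as np

# Minimal sketch: (sum x, sum x^2) and (xbar, s^2) determine each other.
rng = np.random.default_rng(2)
x = rng.normal(loc=10.0, scale=2.0, size=50)   # arbitrary example data
n = len(x)

s1, s2 = x.sum(), (x ** 2).sum()

# (sum x, sum x^2)  ->  (xbar, s^2)
xbar = s1 / n
s_sq = (s2 - s1 ** 2 / n) / (n - 1)

# (xbar, s^2)  ->  (sum x, sum x^2)
s1_back = n * xbar
s2_back = (n - 1) * s_sq + n * xbar ** 2

print(np.isclose(xbar, x.mean()), np.isclose(s_sq, x.var(ddof=1)))
print(np.isclose(s1, s1_back), np.isclose(s2, s2_back))
```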

Exponential Families

Many commonly used probability distributions have likelihood functions with a very similar structure.

Definition

A parametrized probability distribution family is an exponential family if the PDF/PMF can be written:

\[ f(\boldsymbol{x}|\boldsymbol{\theta}) = \color{DarkGoldenrod}{h(\boldsymbol{x})}\cdot \exp\left[ \color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\cdot \color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})} - \color{Maroon}{A(\boldsymbol{\theta})} \right] \]

\(\color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\cdot\color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})}\) is a dot product between two vector-valued functions.

\(\color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})}\) is a sufficient statistic of the distribution. (Why? Can you write out the factorization?)

\(\color{CornflowerBlue}{\boldsymbol{\eta}}\) is called the natural parameter of the distribution.

\(\color{Maroon}{A(\boldsymbol{\theta})}\) is the logarithm of the normalization factor needed to make \(f\) a probability distribution in the first place. Moments of \(\color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})}\) can be found by differentiating \(\color{Maroon}{A}\) with respect to the natural parameter \(\color{CornflowerBlue}{\boldsymbol{\eta}}\).
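As a small worked example of the last point, take the Poisson distribution (see the Poisson row of the table below): in terms of the natural parameter \(\color{CornflowerBlue}{\eta}=\log\lambda\) we have \(\color{Maroon}{A}=\lambda=e^{\eta}\), and differentiating recovers the mean and variance of \(\color{DarkGreen}{T(x)}=x\):

\[ \frac{dA}{d\eta} = e^{\eta} = \lambda = \EE[T(X)], \qquad \frac{d^2A}{d\eta^2} = e^{\eta} = \lambda = \VV[T(X)] \]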

Exponential Families

\[ f(\boldsymbol{x}|\boldsymbol{\theta}) = \color{DarkGoldenrod}{h(\boldsymbol{x})}\cdot \exp\left[ \color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\cdot \color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})} - \color{Maroon}{A(\boldsymbol{\theta})} \right] \]

| Distribution | \(\color{CornflowerBlue}{\boldsymbol{\theta}}\) | \(\color{DarkGoldenrod}{h(\boldsymbol{x})}\) | \(\color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\) | \(\color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})}\) | \(\color{Maroon}{A(\boldsymbol{\theta})}\) |
|---|---|---|---|---|---|
| Bernoulli | \(p\) | \(1\) | \(\log\left(\frac{p}{1-p}\right)\) | \(x\) | \(-\log(1-p)\) |
| Binomial (\(n\) known) | \(p\) | \(n\choose x\) | \(\log\left(\frac{p}{1-p}\right)\) | \(x\) | \(-n\log(1-p)\) |
| Poisson | \(\lambda\) | \(\frac{1}{x!}\) | \(\log\lambda\) | \(x\) | \(\lambda\) |
| Exponential | \(\lambda\) | \(1\) | \(-\lambda\) | \(x\) | \(-\log\lambda\) |
| Chi-squared | \(\nu\) | \(\exp\left[-\frac{x}{2}\right]\) | \(\frac{\nu}{2}-1\) | \(\log x\) | \(\log\Gamma\left(\frac{\nu}{2}\right)+\frac{\nu}{2}\log 2\) |
| Normal (\(\sigma^2\) known) | \(\mu\) | \(\frac{\exp\left[-\frac{x^2}{2\sigma^2}\right]}{\sqrt{2\pi\sigma^2}}\) | \(\frac{\mu}{\sigma}\) | \(\frac{x}{\sigma}\) | \(\frac{\mu^2}{2\sigma^2}\) |
| Normal | \(\mu,\sigma^2\) | \(\frac{1}{\sqrt{2\pi}}\) | \(\begin{pmatrix}\frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2}\end{pmatrix}\) | \(\begin{pmatrix}x\\x^2\end{pmatrix}\) | \(\frac{\mu^2}{2\sigma^2}+\log\sigma\) |

Try it out yourself - write out the product and exponential and verify that you recover the PDF/PMF you know.
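For example, the Poisson row recovers the familiar PMF:

\[ \color{DarkGoldenrod}{\frac{1}{x!}}\cdot \exp\left[\color{CornflowerBlue}{(\log\lambda)}\cdot\color{DarkGreen}{x} - \color{Maroon}{\lambda}\right] = \frac{1}{x!}\cdot\lambda^{x}e^{-\lambda} = \frac{\lambda^{x}e^{-\lambda}}{x!} \]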

Exponential Families

\[ X_1,\dots,X_m\sim\mathcal{D}(\boldsymbol\theta) \\ f(\boldsymbol{x}|\boldsymbol{\theta}) = \color{DarkGoldenrod}{h(\boldsymbol{x})}\cdot \exp\left[ \color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\cdot \color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})} - \color{Maroon}{A(\boldsymbol{\theta})} \right] \]

| Distribution | \(\color{CornflowerBlue}{\boldsymbol{\theta}}\) | PDF/PMF | \(\color{DarkGoldenrod}{h(\boldsymbol{x})}\) | \(\color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\) | \(\color{DarkGreen}{\boldsymbol{T}(\boldsymbol{x})}\) | \(\color{Maroon}{A(\boldsymbol{\theta})}\) |
|---|---|---|---|---|---|---|
| Bernoulli | \(p\) | \(p^{\sum x_i}(1-p)^{m-\sum x_i}\) | \(1\) | \(\log\left(\frac{p}{1-p}\right)\) | \(\sum x_i\) | \(-m\log(1-p)\) |
| Binomial (\(n\) known) | \(p\) | \(\prod{n\choose x_i}p^{\sum x_i}(1-p)^{nm-\sum x_i}\) | \(\prod{n\choose x_i}\) | \(\log\left(\frac{p}{1-p}\right)\) | \(\sum x_i\) | \(-nm\log(1-p)\) |
| Poisson | \(\lambda\) | \(\frac{\lambda^{\sum x_i}e^{-m\lambda}}{\prod x_i!}\) | \(\prod\frac{1}{x_i!}\) | \(\log\lambda\) | \(\sum x_i\) | \(m\lambda\) |
| Exponential | \(\lambda\) | \(\lambda^me^{-\lambda\sum x_i}\) | \(1\) | \(-\lambda\) | \(\sum x_i\) | \(-m\log\lambda\) |
| Chi-squared | \(\nu\) | \(\frac{1}{(2^{\nu/2}\Gamma(\nu/2))^m}\left(\prod x_i\right)^{\nu/2-1}e^{-\sum x_i/2}\) | \(\exp\left[-\frac{\sum x_i}{2}\right]\) | \(\frac{\nu}{2}-1\) | \(\sum\log x_i\) | \(m\log\Gamma\left(\frac{\nu}{2}\right)+m\frac{\nu}{2}\log 2\) |
| Normal (\(\sigma^2\) known) | \(\mu\) | \(\frac{1}{\sqrt{2\pi\sigma^2}^m}e^{-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2}\) | \(\frac{\exp\left[-\frac{\sum x_i^2}{2\sigma^2}\right]}{\sqrt{2\pi\sigma^2}^m}\) | \(\frac{\mu}{\sigma}\) | \(\frac{\sum x_i}{\sigma}\) | \(m\frac{\mu^2}{2\sigma^2}\) |
| Normal | \(\mu,\sigma^2\) | \(\frac{1}{\sqrt{2\pi\sigma^2}^m}e^{-\frac{1}{2\sigma^2}\sum(x_i-\mu)^2}\) | \(\frac{1}{\sqrt{2\pi}^m}\) | \(\begin{pmatrix}\frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2}\end{pmatrix}\) | \(\begin{pmatrix}\sum x_i\\\sum x_i^2\end{pmatrix}\) | \(m\frac{\mu^2}{2\sigma^2}+m\log\sigma\) |

Notice: \(\color{DarkGoldenRod}{h(x_1,\dots,x_m)=\prod h(x_i)}\), \(\color{DarkGreen}{\boldsymbol{T}(x_1,\dots,x_m) = \sum\boldsymbol{T}(x_i)}\), and \(\color{Maroon}{A(\boldsymbol\theta) \mapsto mA(\boldsymbol\theta)}\). However, \(\color{CornflowerBlue}{\boldsymbol\eta(\boldsymbol\theta)}\) is unchanged.
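This pattern is just what happens when the single-observation form is multiplied across an iid sample:

\[ \prod_{i=1}^m \color{DarkGoldenrod}{h(x_i)}\exp\left[\color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\cdot\color{DarkGreen}{\boldsymbol{T}(x_i)} - \color{Maroon}{A(\boldsymbol{\theta})}\right] = \color{DarkGoldenrod}{\left(\prod_{i=1}^m h(x_i)\right)}\cdot \exp\left[\color{CornflowerBlue}{\boldsymbol{\eta}(\boldsymbol{\theta})}\cdot\color{DarkGreen}{\sum_{i=1}^m\boldsymbol{T}(x_i)} - \color{Maroon}{m\,A(\boldsymbol{\theta})}\right] \]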

Minimal Sufficient Statistic

As we saw with \(\sum X_i, \sum X_i^2\) and \(S^2, \overline{X}\) for \(\mathcal{N}(\mu,\sigma^2)\) earlier, one choice of sufficient statistics can sometimes be calculated from another.

Definition

We say that a (possibly vector-valued) statistic \(T\) is a minimal sufficient statistic if for every other sufficient statistic \(S\), there is a function \(f\) so that \(T=f(S)\).

A minimal sufficient statistic represents the greatest possible reduction of the data without information loss; typically it is also the sufficient statistic with the smallest dimensionality.

Rao-Blackwell: Improving an estimator

Since sufficient statistics contain all the information in the data about the value of a parameter \(\theta\), we should be able to use that to improve an estimator.

Theorem (Rao-Blackwell)

Suppose \(X_1,\dots,X_n\sim\mathcal{D}(\theta)\) iid, and that \(T\) is sufficient for \(\theta\). We are looking for an estimator for \(h(\theta)\) for some function \(h\).

If \(U\) is an estimator for \(h(\theta)\) that is not already a function of \(T\), then the estimator \(U^*=\EE[U | T]\) has MSE no greater than that of \(U\).

If \(U\) is unbiased, then \(U^*\) is unbiased.

Proof

First, \(U^*\) is an estimator (i.e. a statistic that does not depend on \(\theta\)): since \(T\) is sufficient, the distribution of any statistic conditional on \(T\) does not involve \(\theta\), and so neither does the conditional expected value.

We prove the reduced MSE for the case of an unbiased estimator. In this case, it is a consequence of the conditional expectation / conditional variance formula (the law of total variance) from Section 5.3:

\[ \VV[U] = \VV[\EE[U|T]] + \EE[\VV[U|T]] = \VV[U^*] + \EE[\VV[U|T]] \]

Since \(\VV[U|T]\) is a variance, it is non-negative, and \(\VV[U]\geq\VV[U^*]\) follows immediately.

Rao-Blackwell: Improving an estimator

Example

Let \(X_1,\dots,X_n\sim Poisson(\lambda)\) iid. We try to estimate \(e^{-\lambda}\) - the probability of observing 0.

(we might, for instance, be modeling production defects with a Poisson distribution model)

As a naïve first estimator, let's use \(U=\boldsymbol{1}\{X_1=0\}\) - simply estimate the probability as \(1\) if the first observation is free from defects, and as \(0\) if there is a defect in the first observation.

\[ \EE[U] = 1\cdot\PP(X_1=0) + 0\cdot\PP(X_1>0) = \PP(X_1=0) = e^{-\lambda}\cdot\lambda^0/0! = e^{-\lambda} \]

So this \(U\), while simplistic, is unbiased for estimating the probability.

For the Poisson distribution (…being an exponential family…) a sufficient statistic is \(T=\sum X_i\), and \(U\) is not a function of \(T\).

So we get a Rao-Blackwell improved estimator \(U^*=\EE[U|\sum X_i]=\PP(X_1=0 | \sum X_i)\). Suppose \(\sum X_i=t\). Then \(U^*\) is the probability that the first observation is 0 given that all \(n\) observations sum to \(t\) - equivalently, that \(X_1=0\) and the following \(n-1\) observations sum to \(t\) - so:

\[ \PP(X_1=0 | \sum X_i=t) = \frac{\PP(\{X_1=0\}\cap \left\{\sum X_i=t\right\})}{\PP\left(\sum X_i=t\right)} = \frac{\PP(\{X_1=0\}\cap \left\{\sum_{i=2}^n X_i=t\right\})}{\PP\left(\sum X_i=t\right)} \]

As stated earlier, the sum of \(n\) independent \(X_i\) is distributed as \(Poisson(n\lambda)\), and the sum of \(n-1\) of them is distributed as \(Poisson((n-1)\lambda)\). From independence it follows that the probability of the intersection is the product of the probabilities, so we get:

\[ \PP(X_1=0 | \sum X_i=t) = \frac{\frac{\color{DarkMagenta}{e^{-\lambda}}\lambda^0}{0!}\cdot\frac{\color{DarkMagenta}{e^{-(n-1)\lambda}}[(n-1)\color{CornflowerBlue}{\lambda}]^t}{\color{DarkGreen}{t!}}}{\frac{\color{DarkMagenta}{e^{-n\lambda}}(n\color{CornflowerBlue}{\lambda})^t}{\color{DarkGreen}{t!}}} = \left(\frac{n-1}{n}\right)^t = \left(1-\frac{1}{n}\right)^t \]
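A simulation sketch (assuming NumPy; \(\lambda\), \(n\) and the number of replications are arbitrary choices) compares the naive estimator \(U\) and its Rao-Blackwellized version \(U^* = (1-1/n)^{\sum X_i}\) against the target \(e^{-\lambda}\):

```python
import numpy as np

# Minimal sketch: Monte Carlo comparison of U = 1{X1 = 0} and the
# Rao-Blackwellized U* = (1 - 1/n)^(sum Xi) as estimators of exp(-lambda).
# (lambda, n, reps are arbitrary choices.)
rng = np.random.default_rng(3)
lam, n, reps = 2.0, 20, 200_000
target = np.exp(-lam)

x = rng.poisson(lam, size=(reps, n))
U = (x[:, 0] == 0).astype(float)
U_star = (1 - 1 / n) ** x.sum(axis=1)

print("target            :", target)
print("mean(U), mean(U*) :", U.mean(), U_star.mean())          # both ~unbiased
print("MSE(U)            :", ((U - target) ** 2).mean())
print("MSE(U*)           :", ((U_star - target) ** 2).mean())  # much smaller
```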

Rao-Blackwell and Maximum Likelihood

In the last example, since \(t = n\overline{x}\), we get \((1-1/n)^t = [(1-1/n)^{n}]^{\overline{x}}\to e^{-\overline{x}}\) as \(n\to\infty\), which is the MLE for \(e^{-\lambda}\).

In general, MLEs make full use of the sample information: if \(\boldsymbol{T}(\boldsymbol{x})\) is (jointly) sufficient for \(\boldsymbol{\theta}\), Neyman’s theorem tells us:

\[ f(\boldsymbol{x} | \boldsymbol{\theta}) = \color{SlateBlue}{g(\boldsymbol{T}(\boldsymbol{x}), \boldsymbol{\theta})} \cdot \color{DarkMagenta}{h(\boldsymbol{x})} \]

The \(\boldsymbol{\theta}\) that maximizes \(f(\boldsymbol{x} | \boldsymbol{\theta})\) will also maximize \(g(\boldsymbol{T}(\boldsymbol{x}), \boldsymbol{\theta})\), since as functions of \(\boldsymbol{\theta}\) these differ only by the factor \(h(\boldsymbol{x})\), which does not depend on \(\boldsymbol{\theta}\). Hence the MLE can be found using only the sufficient statistic and not the data itself.
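For the Poisson example running through this section, the log-likelihood makes this explicit: everything that involves \(\lambda\) enters only through \(T_3=\sum x_i\).

\[ \ell(\lambda) = \log f(\boldsymbol{x}|\lambda) = \left(\sum x_i\right)\log\lambda - n\lambda - \sum\log x_i!, \qquad \frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0 \implies \hat\lambda = \frac{\sum x_i}{n} = \overline{x} \]

The term \(-\sum\log x_i!\) is \(\log h(\boldsymbol{x})\) and does not affect the maximization.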