20 February, 2018

Rao-Blackwell

The Rao-Blackwell Theorem

Theorem Suppose \(T\) is a sufficient statistic for \(\theta\), \(\delta\) is an unbiased estimator for \(g(\theta)\). Define \(\eta(T)=\mathbb{E}[\delta(X)|T]\).

If the loss function is convex, then the risk of \(\eta\) is bounded above by the risk of \(\delta\). If the loss is strictly convex, then the inequality is strict unless \(\delta(X)=\eta(T)\).

Theorem Suppose \(\hat\theta\) is an unbiased estimator for \(\theta\) such that \(\mathbb{V}(\hat\theta)<\infty\). If \(T\) is a sufficient statistic for \(\theta\), define \(\hat\theta^* = \mathbb{E}[\hat\theta|T]\). Then for all \(\theta\), \[ \mathbb{E}(\hat\theta^*)=\theta \qquad\text{and}\qquad \mathbb{V}(\hat\theta^*)\leq\mathbb{V}(\hat\theta) \]

The equivalence of the two statements, for \(g(\theta)=\theta\) and squared error loss, follows from the bias/variance decomposition of the mean squared error.
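As a quick numerical sanity check, here is a minimal sketch in Python (NumPy assumed available; the values of \(p\) and \(n\) are arbitrary): for a Bernoulli sample, Rao-Blackwellizing the crude unbiased estimator \(\delta=X_1\) with \(T=\sum X_i\) gives \(\mathbb{E}[X_1|T]=T/n=\overline X\), which keeps the mean but shrinks the variance.

```python
import numpy as np

# Rao-Blackwell illustration for Bernoulli(p) samples (illustrative parameters).
rng = np.random.default_rng(0)
p, n, reps = 0.3, 20, 100_000

X = rng.binomial(1, p, size=(reps, n))     # reps independent samples of size n
delta = X[:, 0]                            # crude unbiased estimator: delta = X_1
T = X.sum(axis=1)                          # sufficient statistic T = sum X_i
eta = T / n                                # E[X_1 | T] = T/n = sample mean

print("mean of delta:", delta.mean(), " mean of eta:", eta.mean())   # both ~ p
print("var  of delta:", delta.var(),  " var  of eta:", eta.var())    # eta far smaller
```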

Convex loss

The theorem statement uses the term convex loss.

A set \(C\) is convex if for any \(t\in[0,1]\) and any points \(x, y\in C\), the point \(tx+(1-t)y\in C\) – i.e. the entire line segment from \(x\) to \(y\) is contained in \(C\).

A function \(f\) defined on a convex set \(C\) is convex if for any \(x, y\in C\) and any \(t\in[0,1]\), \[ f[tx + (1-t)y] \leq tf(x) + (1-t)f(y) \] \(f\) is strictly convex if \(\leq\) in this inequality can be replaced with \(<\) whenever \(x\neq y\) and \(t\in(0,1)\).

For a twice-differentiable one-dimensional \(f\), convexity is equivalent to \(f''(x)\geq0\) for all \(x\in C\).

Convex support

Theorem If \(f\) is convex on an open interval \(C=(a,b)\subset\mathbb{R}\) and \(t\in C\), then there is some constant \(c\) (depending on \(t\)) such that \[ f(t) + c(x-t) \leq f(x) \qquad \forall x\in C \] The inequality is strict for \(x\neq t\) if \(f\) is strictly convex.


Jensen's inequality

Theorem If \(C\) is an open interval, \(f\) convex on \(C\), \(\mathbb{P}(X\in C)=1\) and \(\mathbb{E}X\) is finite, then \[ f(\mathbb{E}X) \leq \mathbb{E}f(X) \]

If \(f\) is strictly convex, the inequality is strict unless \(X\) is almost surely constant.

Proof Set \(t=\mathbb{E}X\) in the preceding supporting-line theorem. \[ f(\mathbb{E}X) + c(x-\mathbb{E}X) \leq f(x) \qquad \forall x\in C, \qquad \text{so with prob. 1,} \\ f(\mathbb{E}X) + c(X-\mathbb{E}X) \leq f(X). \\ \text{Taking expectations, and noting } \mathbb{E}[c(X-\mathbb{E}X)]=0, \\ f(\mathbb{E}X) = \mathbb{E}\left[f(\mathbb{E}X) + c(X-\mathbb{E}X)\right] \leq \mathbb{E}f(X). \]
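A one-line numerical illustration of Jensen's inequality (a sketch only; the choices \(f(x)=x^2\) and \(X\sim\text{Exponential}(1)\) are arbitrary):

```python
import numpy as np

# Numerical check of Jensen's inequality with the strictly convex f(x) = x^2.
rng = np.random.default_rng(1)
X = rng.exponential(scale=1.0, size=1_000_000)   # X ~ Exponential(1)

lhs = np.mean(X) ** 2          # f(E X)
rhs = np.mean(X ** 2)          # E f(X)
print(lhs, "<=", rhs)          # roughly 1.0 <= 2.0 for Exponential(1)
```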

Rao-Blackwell's Theorem

Theorem Suppose \(T\) is a sufficient statistic for \(\theta\), \(\delta\) is an unbiased estimator for \(g(\theta)\). Define \(\eta(T)=\mathbb{E}[\delta(X)|T]\). If \(L\) is convex, then \[ R(\theta,\eta) \leq R(\theta,\delta) \]

Proof Since \(L(\theta,\cdot)\) is convex, Jensen's inequality (applied conditionally on \(T\)) tells us \[ L(\theta,\eta(T)) = L(\theta,\mathbb{E}[\delta(X)|T]) \leq \mathbb{E}[L(\theta, \delta(X))|T]. \text{ Taking expectations,}\\ \mathbb{E}L(\theta,\eta(T)) \leq \mathbb{E}\,\mathbb{E}[L(\theta,\delta(X))|T] = \mathbb{E}L(\theta,\delta(X)), \text{ so} \\ R(\theta,\eta) \leq R(\theta,\delta) \]

Rao-Blackwell's Theorem

Theorem Suppose \(\hat\theta\) is an unbiased estimator for \(\theta\) such that \(\mathbb{V}(\hat\theta)<\infty\). If \(T\) is a sufficient statistic for \(\theta\), define \(\hat\theta^* = \mathbb{E}[\hat\theta|T]\). Then for all \(\theta\), \[ \mathbb{E}(\hat\theta^*)=\theta \qquad\text{and}\qquad \mathbb{V}(\hat\theta^*)\leq\mathbb{V}(\hat\theta) \]

Proof Set \(\delta=\hat\theta\) and \(g(\theta)=\theta\), and use squared error loss \(L(\theta,d)=(\theta-d)^2\), which is convex. The previous theorem then gives \[ \mathbb{E}(\theta-\hat\theta^*)^2 \leq \mathbb{E}(\theta-\hat\theta)^2 \]

Now, since \(\hat\theta\) and \(\hat\theta^*\) are both unbiased, the bias/variance decomposition yields \(\mathbb{E}(\theta-\hat\theta^*)^2=\mathbb{V}\hat\theta^*\) and \(\mathbb{E}(\theta-\hat\theta)^2=\mathbb{V}\hat\theta\).

So what?

Rao-Blackwell implies a couple of really useful facts

  • Rao-Blackwell produces an estimator that only depends on the sufficient statistic chosen.
  • Since \(\mathbb{E}[h(T)|T] = h(T)\), Rao-Blackwell will not improve further by re-application.
  • If we pick a good sufficient statistic to start with, Rao-Blackwell returns the best estimator using that statistic.
  • Best here is realized with complete sufficient statistics: by the Lehmann-Scheffé theorem below, these produce minimum variance unbiased estimators (MVUE).

Example

\(X_1,\dots,X_n\sim\text{Bernoulli}(p)\). The joint probability is \[ \mathbb{P}(x_1,\dots,x_n|p) = p^{\sum x_i}(1-p)^{n-\sum x_i} \]

We know by now that \(\sum x_i\) is a minimal sufficient statistic. \(\mathbb{E}\sum x_i=np\), so \[ \mathbb{E}\overline X = \mathbb{E}\left[\frac{\sum x_i}{n}\right] = p \] Thus \(\overline X\) is an unbiased estimator of \(p\). It is already a function of \(\sum x_i\), so Rao-Blackwell cannot improve it further; since \(\sum x_i\) is also complete (as we note below), \(\overline X\) is an MVUE.

Return of Lehmann-Scheffé

In addition to the Lehmann-Scheffé criterion for minimal sufficiency, there is a Lehmann-Scheffé theorem:

Theorem Suppose \(X=(X_1,\dots,X_n)\) is a sample from a distribution with density function \(p(x|\theta)\), and \(Y=Y(X)\) is a complete sufficient statistic for \(\theta\).

If \(\phi\) is a function such that \(\mathbb{E}\phi(Y)=\theta\), then \(\phi(Y)\) is the unique MVUE of \(\theta\).

Proof If \(Z\) is any unbiased estimator of \(\theta\), then by Rao-Blackwell, \(\mathbb{E}[Z|Y]\) is an unbiased estimator of \(\theta\) with variance no larger than that of \(Z\). It is an unbiased function of \(Y\), so by completeness it coincides with \(\phi(Y)\) almost surely. Hence \(\mathbb{V}\phi(Y)\leq\mathbb{V}Z\) for every unbiased \(Z\): an MVUE exists, namely \(\phi(Y)\).

Lehmann-Scheffé

Theorem Suppose \(Y=Y(X)\) is a complete sufficient statistic for \(\theta\).

If \(\mathbb{E}\phi(Y)=\theta\) then \(\phi(Y)\) is the unique MVUE of \(\theta\).

Proof

If \(Z'\) is another MVUE, Rao-Blackwellize it: \(\mathbb{E}[Z'|Y]\) is an unbiased function of \(Y\), so \(\mathbb{E}\left[\phi(Y)-\mathbb{E}[Z'|Y]\right]=0\) for every \(\theta\), and by completeness \(\mathbb{E}[Z'|Y]=\phi(Y)\) almost surely. Since \(Z'\) already has minimum variance, the Rao-Blackwell inequality for squared error loss (which is strictly convex) can only be an equality if \(Z'=\mathbb{E}[Z'|Y]\) almost surely. So \(Z'=\phi(Y)\) almost always.

Example

\(X_1,\dots,X_n\sim\text{Uniform}((\theta,\theta+1))\). Recall \(R=X_{(n)}-X_{(1)}\) and \(M=(X_{(n)}+X_{(1)})/2\) together form a minimally sufficient statistic.

Recall that the joint PDF for \(R,M\) is \[ p(r,m|\theta) = \begin{cases} n(n-1)r^{n-2} & \text{if } \theta < m-r/2, m+r/2 < \theta+1 \\ 0 & \text{otherwise} \end{cases} \]

So \[ \mathbb{E}M = \int_{0}^{1}\int_{\theta+r/2}^{\theta+1-r/2}m\cdot n(n-1)r^{n-2}dmdr = \\ -\frac{n(n-1)}{2}\int_{0}^{1}(r-1)(2\theta+1)r^{n-2}dr = \theta + \frac{1}{2} \]
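A quick simulation check of \(\mathbb{E}M=\theta+1/2\) (a sketch; \(\theta\), \(n\) and the replication count are arbitrary choices, NumPy assumed):

```python
import numpy as np

# Check E[M] = theta + 1/2 by simulation for Uniform(theta, theta + 1) samples.
rng = np.random.default_rng(2)
theta, n, reps = 3.7, 10, 200_000

X = rng.uniform(theta, theta + 1, size=(reps, n))
M = (X.min(axis=1) + X.max(axis=1)) / 2    # midrange M = (X_(1) + X_(n)) / 2

print(M.mean(), "vs", theta + 0.5)         # should agree to a few decimals
```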

Example

\(X_1,\dots,X_n\sim\text{Uniform}((\theta,\theta+1))\). Recall \(R=X_{(n)}-X_{(1)}\) and \(M=(X_{(n)}+X_{(1)})/2\) together form a minimally sufficient statistic.

We calculated \(\mathbb{E}M=\theta+1/2\). So \(\mathbb{E}(M-1/2)=\theta\).

So \(M-1/2\) is an unbiased estimator of \(\theta\). It is already a function of the minimal sufficient statistic \((R,M)\), so Rao-Blackwell cannot improve it further; in this sense it is the natural unbiased estimator based on \((R,M)\). (Note that \((R,M)\) is not complete here, so Lehmann-Scheffé does not apply directly.)

MVUE from an estimator

If \(\delta\) is any unbiased estimator of \(g(\theta)\) and \(T\) is a complete sufficient statistic, Rao-Blackwell (combined with Lehmann-Scheffé) tells us that \(\mathbb{E}[\delta|T]\) will be an MVUE.

Example \(X_1,\dots,X_n\sim\text{Bernoulli}(p)\). The joint probability is \[ p(x_1,\dots,x_n|p) = p^{\sum x_i}(1-p)^{n-\sum x_i} \]

This is an exponential family with full rank, and \(\sum x_i\) is a complete sufficient statistic.

MVUE from an estimator

\(X_i\sim\text{Bernoulli}(p)\). Let's estimate \(g(p)=p^2\). One unbiased estimator is \(\delta=X_1\cdot X_2\).

The joint probability for \(X_1, X_2, T\) is \[ \mathbb{P}(X_1=X_2=1,T=t) = \mathbb{P}\left(X_1=X_2=1, \sum_{i=3}^n X_i=t-2\right) = \\ p^2\cdot{n-2\choose t-2}p^{t-2}(1-p)^{n-t} \]

MVUE from an estimator

\(X_i\sim\text{Bernoulli}(p)\). Let's estimate \(g(p)=p^2\). One unbiased estimator is \(\delta=X_1\cdot X_2\). The joint probability is \(\mathbb{P}(X_1=X_2=1,T=t) = p^2\cdot{n-2\choose t-2}p^{t-2}(1-p)^{n-t}\).

\[ \eta(t) = \mathbb{E}[X_1\cdot X_2|T=t] = \frac{\mathbb{P}(X_1=X_2=1, T=t)}{\mathbb{P}(T=t)} = \\ \frac{p^2{n-2\choose t-2}p^{t-2}(1-p)^{n-t}} {{n\choose t}p^t(1-p)^{n-t}} = \frac{t}{n}\cdot\frac{t-1}{n-1} \]

This gives us the MVUE: \(T(T-1)/(n^2-n)\) is MVUE for \(p^2\).
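A simulation sketch of this Rao-Blackwellization (arbitrary \(p\) and \(n\); NumPy assumed): both \(X_1X_2\) and \(T(T-1)/(n(n-1))\) average to roughly \(p^2\), but the latter has far smaller variance.

```python
import numpy as np

# Sanity check: T(T-1)/(n(n-1)) is unbiased for p^2 and beats the crude X_1 * X_2.
rng = np.random.default_rng(3)
p, n, reps = 0.4, 15, 200_000

X = rng.binomial(1, p, size=(reps, n))
T = X.sum(axis=1)
crude = X[:, 0] * X[:, 1]                  # delta = X_1 X_2
rb = T * (T - 1) / (n * (n - 1))           # Rao-Blackwellized estimator eta(T)

print("target p^2    :", p**2)
print("mean of delta :", crude.mean(), " var:", crude.var())
print("mean of eta(T):", rb.mean(),    " var:", rb.var())
```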

Are biased estimators bad?

Example

\(X_1,\dots,X_n\sim\text{Uniform}(0,\theta)\) with \(T=\max X_i\) a complete sufficient statistic.

\[ \mathbb{E}(T) = \int_0^\theta tnt^{n-1}\theta^{-n}dt = n\theta^{-n}\int_0^\theta t^ndt = n\theta^{-n}\left[\frac{t^{n+1}}{n+1}\right]_0^\theta = \\ \frac{n}{n+1}\cdot\frac{\theta^{n+1}}{\theta^n} = \frac{n}{n+1}\theta \]

So \(\mathbb{E}[(n+1)T/n] = \theta\), which gives our MVUE \((n+1)T/n\).

Example

\(X_1,\dots,X_n\sim\text{Uniform}(0,\theta)\) with \(T=\max X_i\) a complete sufficient statistic. \((n+1)T/n\) is the MVUE.

Is \((n+1)/n\) the constant multiplier with universally lowest squared error loss?

We'll calculate first \[ \mathbb{E}T^2 = \int_0^\theta t^2nt^{n-1}\theta^{-n}dt = n\theta^{-n}\int_0^\theta t^{n+1}dt = n\theta^{-n}\left[\frac{t^{n+2}}{n+2}\right]_0^\theta = \\ \frac{n}{n+2}\cdot\frac{\theta^{n+2}}{\theta^n} = \frac{n}{n+2}\theta^2 \]

Example

\(X_1,\dots,X_n\sim\text{Uniform}(0,\theta)\) with \(T=\max X_i\) a complete sufficient statistic. \((n+1)T/n\) is the MVUE.

The risk of \(\delta_a = aT\) under squared error loss is \[ R(\theta,\delta_a) = \mathbb{E}(aT-\theta)^2 = \\ a^2\mathbb{E}T^2 - 2a\theta\mathbb{E}T + \theta^2 = a^2\frac{n}{n+2}\theta^2 -2a\theta\frac{n}{n+1}\theta + \theta^2 \]

The factor \[ \frac{na^2}{n+2}-\frac{2na}{n+1} + 1 \] is minimized for \(a=(n+2)/(n+1)\). The estimator \((n+2)T/(n+1)\) is biased.
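A Monte Carlo sketch of the risk comparison (arbitrary \(\theta\) and \(n\); NumPy assumed): the biased multiplier \((n+2)/(n+1)\) should come out with the smallest mean squared error.

```python
import numpy as np

# Risk comparison for estimators a*T of theta, with T = max X_i, X_i ~ Uniform(0, theta).
rng = np.random.default_rng(4)
theta, n, reps = 2.0, 5, 500_000

T = rng.uniform(0, theta, size=(reps, n)).max(axis=1)
for a, label in [(1.0, "T itself"),
                 ((n + 1) / n, "MVUE (n+1)T/n"),
                 ((n + 2) / (n + 1), "biased (n+2)T/(n+1)")]:
    mse = np.mean((a * T - theta) ** 2)
    print(f"{label:22s} MSE ~ {mse:.5f}")
# The biased choice a = (n+2)/(n+1) shows the smallest MSE.
```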

Normal and Standard Normal Distributions

The Normal Distribution

We have seen it repeatedly before: the distribution \(\mathcal{N}(\mu,\sigma^2)\) has density function \[ p_\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \]

The normal family is closed under sums of independent variables: if \(X_1\sim\mathcal{N}(\mu_1,\sigma_1^2)\) and \(X_2\sim\mathcal{N}(\mu_2,\sigma_2^2)\) are independent, then \(X_1+X_2\sim\mathcal{N}(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)\).

The parameters are what we expect from linearity of \(\mathbb{E}\) and additivity of \(\mathbb{V}\) for independent variables; the content of the claim is that the sum is again normal.

Moment generating functions

A powerful method to prove equality of probability distributions is the moment generating function of a distribution (a close relative of its Laplace and Fourier transforms).

Definition The moment generating function \(\mathbb{M}_X(t)\) of a random variable \(X\) is defined as \[ \mathbb{M}_X(t) = \mathbb{E}\exp[tX] \]

The moments of a random variable are \(m_n=\mathbb{E}X^n\). The coefficient of \(t^n\) in the Taylor expansion of \(\mathbb{M}_X(t)\) around \(0\) is \(m_n/n!\).

The moment generating function doesn't necessarily exist. The characteristic function \(\phi_X(t)=\mathbb{M}_X(it)\) always does.
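A small symbolic sketch (using SymPy, assumed available): differentiating an MGF at \(t=0\) recovers the moments; here the normal MGF listed in the table on the next slide is used.

```python
import sympy as sp

# Recover moments by differentiating an MGF at t = 0 (normal MGF as the example).
t, mu, sigma = sp.symbols('t mu sigma', real=True)
M = sp.exp(t * mu + sigma**2 * t**2 / 2)

m1 = sp.diff(M, t, 1).subs(t, 0)               # E X   = mu
m2 = sp.diff(M, t, 2).subs(t, 0)               # E X^2 = mu^2 + sigma^2
m3 = sp.expand(sp.diff(M, t, 3).subs(t, 0))    # E X^3 = mu^3 + 3 mu sigma^2
print(m1, m2, m3)
```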

Some distributions

Distribution \(\mathbb{M}_X(t)\)
Binomial\((n,p)\) \((1-p+pe^t)^n\)
Poisson\((\lambda)\) \(\exp[\lambda(e^t-1)]\)
Uniform (continuous on \([a,b]\)) \(\frac{e^{tb}-e^{ta}}{t(b-a)}\)
Uniform (discrete on \(\{a,\dots,b\}\)) \(\frac{e^{at}-e^{(b+1)t}}{(b-a+1)(1-e^t)}\)
Normal\((\mu,\sigma^2)\) \(\exp[t\mu+\sigma^2t^2/2]\)
Chi-squared (\(k\) d.f.) \((1-2t)^{-k/2}\)
Exponential (rate \(\lambda\)) \((1-t\lambda^{-1})^{-1}\)

Equality of distributions

Theorem If \(\mathbb{M}_X(u)\) and \(\mathbb{M}_Y(u)\) are finite and agree for all \(u\) in some set with nonempty interior, then \(X\) and \(Y\) have the same distribution.

Linearity of the normal distribution

Now we will prove the linearity claims from before, along with a handful of other useful facts.

Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(aX+b\sim\mathcal{N}(a\mu+b, a^2\sigma^2)\) for constants \(a>0\) and \(b\).

Proof The distribution function is \(\mathbb{P}(aX+b\leq y) =\) \[ \mathbb{P}(X \leq (y-b)/a) = \int_{-\infty}^{(y-b)/a} \frac{\exp[-(x-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}dx \] Substitute \(u=ax+b\). Then \(dx=du/a\) and \((x-\mu)^2/2\sigma^2 = (u-b-a\mu)^2/(2\sigma^2a^2)\)

Linearity of the normal distribution

Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(aX+b\sim\mathcal{N}(\color{green}{a\mu+b}, \color{orange}{a^2\sigma^2})\) for constants \(a>0\) and \(b\).

Proof… The distribution function for \(aX+b\) is \[ \mathbb{P}(aX+b\leq y) = \int_{-\infty}^{(y-b)/a} \frac{\exp[-(x-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}dx =\\ \int_{-\infty}^{y} \frac{\exp[-(u-(\color{green}{a\mu+b}))^2/2\color{orange}{a^2\sigma^2}]}{\sqrt{2\pi\color{orange}{\sigma^2}}}\frac{du}{\color{orange}{\sqrt{a^2}}} \] which is the distribution function of \(\mathcal{N}(a\mu+b, a^2\sigma^2)\).

Linearity of the normal distribution

To deal with negative \(a\), let's look specifically at the case of \(a=-1\) and \(b=0\).

Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(-X\sim\mathcal{N}(-\mu,\sigma^2)\).

Proof The distribution function is \(\mathbb{P}(-X\leq x) =\) \(\mathbb{P}(X \geq -x) =\) \[ \int_{-x}^\infty \frac{\exp[-(t-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}dt = [u=-t, du=-dt] = \\ \int_{x}^{-\infty} \frac{\exp[-(-u-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}(-1)du = \\ \int_{-\infty}^{x} \frac{\exp[-(u-(-\mu))^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}du \] which is the \(\mathcal{N}(-\mu,\sigma^2)\) distribution function.

Moment generating functions

Claim If \(Z\sim\mathcal{N}(0,1)\) then \(\mathbb{M}_Z(t)=e^{t^2/2}\).

Proof \[ \mathbb{E}e^{tZ} = \int e^{tz}\frac{e^{-z^2/2}}{\sqrt{2\pi}}dz \] Note \(tz-z^2/2 = -(z-t)^2/2 + t^2/2\) so \[ \mathbb{E}e^{tZ} = e^{t^2/2}\int\frac{e^{-(z-t)^2/2}}{\sqrt{2\pi}}dz \]

The integrand is a normal density (the density of \(\mathcal{N}(t,1)\)), so it integrates to 1.

Moment generating functions

Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(\mathbb{M}_X(t)=e^{t\mu+t^2\sigma^2/2}\).

Proof Set \(Z=(X-\mu)/\sigma\). Then \(X=\mu+\sigma Z\) and \[ \mathbb{M}_X(t) = \mathbb{E}e^{tX} = \mathbb{E}e^{t(\mu+\sigma Z)} = e^{t\mu}\mathbb{E}e^{t\sigma Z} = \\ e^{t\mu}\mathbb{M}_Z(t\sigma) = e^{t\mu}e^{t^2\sigma^2/2} = e^{t\mu+t^2\sigma^2/2} \]

Linearity of the normal distribution

Now for the sum of two different normal variables.

Claim If \(X_1\sim\mathcal{N}(\mu_1,\sigma_1^2)\) and \(X_2\sim\mathcal{N}(\mu_2,\sigma_2^2)\) are independent, then \(X_1+X_2\sim\mathcal{N}(\color{green}{\mu_1+\mu_2},\color{orange}{\sigma_1^2+\sigma_2^2})\).

Proof We compare moment generating functions: \[ \mathbb{M}_{X_1+X_2}(t) = \mathbb{E}e^{t(X_1+X_2)} = \mathbb{E}\left[e^{tX_1}e^{tX_2}\right] = \mathbb{M}_{X_1}(t)\cdot\mathbb{M}_{X_2}(t) = \\ e^{t\mu_1+t^2\sigma_1^2/2}\cdot e^{t\mu_2+t^2\sigma_2^2/2} = e^{t(\color{green}{\mu_1+\mu_2})+t^2(\color{orange}{\sigma_1^2+\sigma_2^2})/2} \] where the factorization of the expectation uses independence.
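A simulation sketch of the claim (arbitrary parameters; NumPy and SciPy assumed): the sum of two independent normals matches the predicted normal in its first two moments and in a Kolmogorov-Smirnov comparison.

```python
import numpy as np
from scipy import stats

# Check that X1 + X2 is N(mu1+mu2, sigma1^2+sigma2^2) for independent normals.
rng = np.random.default_rng(5)
mu1, s1, mu2, s2, reps = 1.0, 2.0, -0.5, 0.5, 200_000

S = rng.normal(mu1, s1, reps) + rng.normal(mu2, s2, reps)
target = stats.norm(mu1 + mu2, np.sqrt(s1**2 + s2**2))

print("sample mean/var:", S.mean(), S.var(), "  theory:", mu1 + mu2, s1**2 + s2**2)
print("KS statistic vs target normal:", stats.kstest(S, target.cdf).statistic)
```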

Sampling statistics

We know from previous work that \(T=(\sum X_i, \sum X_i^2)\) is a complete sufficient statistic for \(\mathcal{N}(\mu,\sigma^2)\) with both parameters unknown. Note:

\[ \overline{X} = T_1/n \qquad\qquad T_1 = n\overline{X} \\ S^2 = \left(\sum X_i^2-n\overline{X}^2\right)\Big{/}(n-1) = (T_2-T_1^2/n)/(n-1) \\ T_2 = (n-1)S^2+n\overline{X}^2 \]

It follows that the sample mean \(\overline{X}\) and the sample variance \(S^2\) are also complete sufficient for the normal distribution.
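A small sketch of this reparametrization (an arbitrary normal sample; NumPy assumed): \(\overline X\) and \(S^2\) computed from \(T=(T_1,T_2)\) agree with the usual formulas.

```python
import numpy as np

# Recover (Xbar, S^2) from the complete sufficient statistic T = (sum X_i, sum X_i^2).
rng = np.random.default_rng(6)
x = rng.normal(10.0, 3.0, size=50)          # an arbitrary normal sample
n = x.size

T1, T2 = x.sum(), (x**2).sum()
xbar = T1 / n
s2 = (T2 - T1**2 / n) / (n - 1)

print(xbar, np.mean(x))                     # identical
print(s2, np.var(x, ddof=1))                # identical
```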

Sampling distributions

If \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) then the previous slides prove

Theorem \(\overline{X}\sim\mathcal{N}(\mu,\sigma^2/n)\).

Definition The \(\chi^2\) (chi-square) distribution on \(p\) degrees of freedom is the distribution of the sum \(Z_1^2+\dots+Z_p^2\) for independent \(Z_i\sim\mathcal{N}(0,1)\).

Theorem \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \]

This last theorem requires significant work with properties of the Gamma distribution. We will skip it.
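We can at least check the claim empirically (a sketch with arbitrary \(\mu\), \(\sigma\), \(n\); NumPy assumed): the simulated \((n-1)S^2/\sigma^2\) should have mean about \(n-1\) and variance about \(2(n-1)\), as a \(\chi^2(n-1)\) variable does.

```python
import numpy as np

# Empirical check that (n-1) S^2 / sigma^2 behaves like chi-square(n-1).
rng = np.random.default_rng(7)
mu, sigma, n, reps = 0.0, 2.0, 8, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
S2 = X.var(axis=1, ddof=1)                  # sample variance of each row
W = (n - 1) * S2 / sigma**2

print("mean:", W.mean(), " expected:", n - 1)
print("var :", W.var(),  " expected:", 2 * (n - 1))
```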

The Central Limit Theorem

Theorem Let \(X_1,X_2,\dots\) be iid random variables with \(\mathbb{E}X_i=\mu\) and \(\mathbb{V}X_i=\sigma^2<\infty\). Then writing \(U_n=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\), \[ \lim_{n\to\infty} \mathbb{P}\left(U_n\leq x\right) = \int_{-\infty}^x \frac{e^{-t^2/2}}{\sqrt{2\pi}}dt, \] the standard normal distribution function; that is, \(U_n\) converges in distribution to \(\mathcal{N}(0,1)\).

But how quickly does it converge?

Berry-Esseen-Shevtsova theorem

Theorem Let \(X_1,X_2,\dots\) be iid with \(\mathbb{E}X_i=0\), \(0<\mathbb{E}X_i^2=\sigma^2<\infty\) and \(\mathbb{E}|X_i|^3=\rho<\infty\). Then there is a universal constant \(C\) such that \[ \left|\mathbb{P}\left(\frac{\overline X\sqrt{n}}{\sigma}\leq x\right) - \mathbb{P}_{\mathcal{N}(0,1)}(Y\leq x)\right| \leq \frac{C\rho}{\sigma^3\sqrt{n}} \]

As of 2012, the best estimate is \(C<0.4748\). This bound follows from Shevtsova: \[ \left|\mathbb{P}\left(\frac{\overline X\sqrt{n}}{\sigma}\leq x\right) - \mathbb{P}_{\mathcal{N}(0,1)}(Y\leq x)\right| \leq \frac{0.33554(\rho+.415\sigma^3)}{\sigma^3\sqrt{n}} \]
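The table on the next slide appears to be \(100\cdot C\rho/(\sigma^3\sqrt n)\) with \(C=0.4748\); a short sketch (NumPy assumed) that reproduces those numbers:

```python
import numpy as np

# Berry-Esseen bound C*rho/(sigma^3*sqrt(n)) with C = 0.4748, in percentage points.
C = 0.4748
ns = np.array([1, 5, 10, 20, 30, 40])

for sigma, rho in [(1, 0.1), (10, 0.1), (1, 1), (10, 1), (0.1, 10), (1, 10), (10, 10)]:
    bound = 100 * C * rho / (sigma**3 * np.sqrt(ns))
    print(f"sigma={sigma:>4}, rho={rho:>4}:", np.round(bound, 2))
```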

Berry-Esseen's bound

Estimated error bounds in percentage points, i.e. \(100\cdot C\rho/(\sigma^3\sqrt{n})\) with \(C=0.4748\), for various \(\sigma\) and \(\rho\):

n                    1          5         10         20        30        40
σ=1,   ρ=0.1      4.75       2.12       1.50       1.06      0.87      0.75
σ=10,  ρ=0.1      0.00       0.00       0.00       0.00      0.00      0.00
σ=1,   ρ=1       47.48      21.23      15.01      10.62      8.67      7.51
σ=10,  ρ=1        0.05       0.02       0.02       0.01      0.01      0.01
σ=0.1, ρ=10  474800.00  212337.02  150144.94  106168.51  86686.22  75072.47
σ=1,   ρ=10     474.80     212.34     150.14     106.17     86.69     75.07
σ=10,  ρ=10       0.47       0.21       0.15       0.11      0.09      0.08

Third moment

Distribution \(\mu\) \(\sigma^2\) \(\rho\)
Binomial\((n, p)\) \(np\) \(npq\) \((1-2p)npq\)
Poisson\((\lambda)\) \(\lambda\) \(\lambda\) \(\lambda\)
Exponential\((\lambda)\) \(1/\lambda\) \(1/\lambda^2\) \(2/\lambda^3\)
Normal\((\mu, \sigma^2)\) \(\mu\) \(\sigma^2\) \(0\)
Uniform\((a,b)\) \((a+b)/2\) \((b-a)^2/12\) \(0\)
Student's t\((\nu)\) \(0\) \(\nu/(\nu-2)\) \(0\)