20 February, 2018
Theorem Suppose \(T\) is a sufficient statistic for \(\theta\), \(\delta\) is an unbiased estimator for \(g(\theta)\). Define \(\eta(T)=\mathbb{E}[\delta(X)|T]\).
If the loss function is convex, then the risk of \(\eta\) is bounded above by the risk of \(\delta\). If the loss is strictly convex, then the inequality is strict unless \(\delta(X)=\eta(T)\).
Theorem Suppose \(\hat\theta\) is an unbiased estimator for \(\theta\) such that \(\mathbb{V}(\hat\theta)<\infty\). If \(T\) is a sufficient statistic for \(\theta\), define \(\hat\theta^* = \mathbb{E}[\hat\theta|T]\). Then for all \(\theta\), \[ \mathbb{E}(\hat\theta^*)=\theta \qquad\text{and}\qquad \mathbb{V}(\hat\theta^*)\leq\mathbb{V}(\hat\theta) \]
The second theorem is the special case \(g(\theta)=\theta\) of the first; it follows from the bias/variance decomposition of squared error loss.
The theorem statement uses the term convex loss.
A set \(C\) is convex if for any \(t\in[0,1]\), and any points \(x, y\in C\), the point \(tx+(1-t)y\in C\) – i.e., the entire line segment from \(x\) to \(y\) is contained in \(C\).
A function \(f\) defined on a convex set \(C\) is convex if for any \(t\in[0,1]\) and any \(x, y\in C\), \[ f[tx + (1-t)y] \leq tf(x) + (1-t)f(y) \] \(f\) is strictly convex if \(\leq\) in this inequality can be replaced with \(<\) whenever \(x\neq y\) and \(t\in(0,1)\).
Convexity for twice differentiable one-dimensional \(f\) means that \(f''(x)\geq0\) for all \(x\in C\).
Theorem If \(f\) is convex on \(C=(a,b)\subset\mathbb{R}\) and \(t\in C\), then there is some constant \(c\) depending on \(t\) such that \[ f(t) + c(x-t) \leq f(x) \qquad \forall x\in C \] The inequality is strict for \(x\neq t\) if \(f\) is strictly convex.
Theorem If \(C\) is an open interval, \(f\) convex on \(C\), \(\mathbb{P}(X\in C)=1\) and \(\mathbb{E}X\) is finite, then \[ f(\mathbb{E}X) \leq \mathbb{E}f(X) \]
If \(f\) is strictly convex, the inequality is strict unless \(X\) is almost surely constant.
Proof Set \(t=\mathbb{E}X\) in the supporting-line theorem. \[ f(\mathbb{E}X) + c(x-\mathbb{E}X) \leq f(x) \qquad \forall x\in C \qquad \text{so with prob. 1} \\ f(\mathbb{E}X) + c(X-\mathbb{E}X) \leq f(X) \\ \mathbb{E}\left[f(\mathbb{E}X) + c(X-\mathbb{E}X)\right] \leq \mathbb{E}f(X) \\ \] Since \(\mathbb{E}[c(X-\mathbb{E}X)]=0\), the left-hand side equals \(f(\mathbb{E}X)\), proving the claim.
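Jensen's inequality is easy to check numerically. A minimal Python sketch with the strictly convex \(f(x)=x^2\) and uniform draws (the interval, sample size, and seed are arbitrary choices):

```python
import random

random.seed(0)

# X uniform on (0, 2); f(x) = x^2 is strictly convex on this interval.
xs = [random.uniform(0, 2) for _ in range(100_000)]
mean_x = sum(xs) / len(xs)

f_of_mean = mean_x ** 2                       # f(E X), roughly 1
mean_of_f = sum(x * x for x in xs) / len(xs)  # E f(X), roughly 4/3

assert f_of_mean < mean_of_f  # Jensen: f(E X) <= E f(X), strict here
```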
Theorem Suppose \(T\) is a sufficient statistic for \(\theta\), \(\delta\) is an unbiased estimator for \(g(\theta)\). Define \(\eta(T)=\mathbb{E}[\delta(X)|T]\). If \(L\) is convex, then \[ R(\theta,\eta) \leq R(\theta,\delta) \]
Proof Jensen's inequality tells us \[ L(\theta,\eta(T)) = L(\theta,\mathbb{E}[\delta(X)|T]) \leq \mathbb{E}[L(\theta, \delta(X))|T] \text{ take expectations}\\ \mathbb{E}L(\theta,\eta(T)) \leq \mathbb{E}\mathbb{E}[L(\theta,\delta(X))|T] = \mathbb{E}(L(\theta,\delta(X))) \text{ so} \\ R(\theta,\eta(T)) \leq R(\theta,\delta(X)) \]
Theorem Suppose \(\hat\theta\) is an unbiased estimator for \(\theta\) such that \(\mathbb{V}(\hat\theta)<\infty\). If \(T\) is a sufficient statistic for \(\theta\), define \(\hat\theta^* = \mathbb{E}[\hat\theta|T]\). Then for all \(\theta\), \[ \mathbb{E}(\hat\theta^*)=\theta \qquad\text{and}\qquad \mathbb{V}(\hat\theta^*)\leq\mathbb{V}(\hat\theta) \]
Proof Set \(\delta=\hat\theta\) and \(g(\theta)=\theta\). Use squared error loss, which is a convex loss function. Then \[ \mathbb{E}(\theta-\hat\theta^*)^2 \leq \mathbb{E}(\theta-\hat\theta)^2 \]
Now, since \(\hat\theta\) and \(\hat\theta^*\) are both unbiased, the bias/variance decomposition yields \(\mathbb{E}(\theta-\hat\theta^*)^2=\mathbb{V}\hat\theta^*\) and \(\mathbb{E}(\theta-\hat\theta)^2=\mathbb{V}\hat\theta\).
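To see the variance reduction concretely, here is a Monte Carlo sketch in Python (the choices \(n=20\), \(p=0.3\), and the seed are arbitrary) comparing the crude unbiased estimator \(\delta=X_1\) with its Rao-Blackwellization \(\eta(T)=\mathbb{E}[X_1|T]=T/n=\overline X\):

```python
import random
from statistics import variance

random.seed(1)
n, p, reps = 20, 0.3, 20_000

deltas, etas = [], []
for _ in range(reps):
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    deltas.append(xs[0])       # delta = X_1, unbiased for p
    etas.append(sum(xs) / n)   # eta(T) = E[X_1 | T] = T/n, also unbiased

# Both estimators have mean p, but conditioning on T shrinks the variance
# (from p(1-p) down to p(1-p)/n).
assert abs(sum(etas) / reps - p) < 0.01
assert variance(etas) < variance(deltas)
```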
Rao-Blackwell implies a couple of really useful facts
\(X_1,\dots,X_n\sim\text{Bernoulli}(p)\). The joint probability is \[ \mathbb{P}(x_1,\dots,x_n|p) = p^{\sum x_i}(1-p)^{n-\sum x_i} \]
We know by now that \(\sum x_i\) is a minimally sufficient statistic. \(\mathbb{E}\sum x_i=np\), so \[ \mathbb{E}\overline X = \mathbb{E}\left[\frac{\sum x_i}{n}\right] = p \] Thus \(\overline X\) is an unbiased estimator of \(p\). It is already a function of the minimal sufficient statistic, so it cannot be further improved by Rao-Blackwell. Therefore it is an MVUE.
In addition to the Lehmann-Scheffé criterion for minimal sufficiency, there is a Lehmann-Scheffé theorem:
Theorem Suppose \(X_i\) is a sample from a distribution with density function \(p(x|\theta)\), \(Y=Y(X)\) is a sufficient statistic for \(\theta\) and \(Y\) complete.
If \(\phi\) is a function such that \(\mathbb{E}\phi(Y)=\theta\), then \(\phi(Y)\) is the unique MVUE of \(\theta\).
Proof If \(Z\) is an unbiased estimator of \(\theta\), then by Rao-Blackwell, \(\phi(Y)=\mathbb{E}[Z|Y]\) defines an unbiased estimator of \(\theta\) with variance no larger than that of \(Z\). This tells us an MVUE exists.
If \(Z'\) is another MVUE, then since they estimate the same quantity, \(\mathbb{E}[\phi(Y)-Z']=0\). By completeness, \(Z'=\phi(Y)\) almost surely.
\(X_1,\dots,X_n\sim\text{Uniform}((\theta,\theta+1))\). Recall \(R=X_{(n)}-X_{(1)}\) and \(M=(X_{(n)}+X_{(1)})/2\) together form a minimally sufficient statistic.
Recall that the joint PDF for \(R,M\) is \[ p(r,m|\theta) = \begin{cases} n(n-1)r^{n-2} & \text{if } \theta < m-r/2, m+r/2 < \theta+1 \\ 0 & \text{otherwise} \end{cases} \]
So \[ \mathbb{E}M = \int_{0}^{1}\int_{\theta+r/2}^{\theta+1-r/2}m\cdot n(n-1)r^{n-2}dmdr = \\ -\frac{n(n-1)}{2}\int_{0}^{1}(r-1)(2\theta+1)r^{n-2}dr = \theta + \frac{1}{2} \]
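A quick simulation agrees with the computed \(\mathbb{E}M=\theta+1/2\). A Python sketch (\(\theta\), \(n\), and the seed are arbitrary choices):

```python
import random

random.seed(2)
theta, n, reps = 1.7, 10, 50_000

ms = []
for _ in range(reps):
    xs = [random.uniform(theta, theta + 1) for _ in range(n)]
    ms.append((max(xs) + min(xs)) / 2)  # the midrange M

# E M = theta + 1/2
assert abs(sum(ms) / reps - (theta + 0.5)) < 0.01
```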
We calculated \(\mathbb{E}M=\theta+1/2\). So \(\mathbb{E}(M-1/2)=\theta\).
So \(M-1/2\) is an unbiased estimator of \(\theta\). It is a function of the minimal sufficient statistic \((R,M)\). So it is an MVUE of \(\theta\).
If \(T\) is any unbiased estimator of \(\theta\), Rao-Blackwell suggests that the conditional expectation of \(T\) given a complete sufficient statistic will be an MVUE.
Example \(X_1,\dots,X_n\sim\text{Bernoulli}(p)\). The joint density function is \[ p(x_1,\dots,x_n|p) = p^{\sum x_i}(1-p)^{n-\sum x_i} \]
This is an exponential family with full rank, and \(\sum x_i\) is a complete sufficient statistic.
\(X_i\sim\text{Bernoulli}(p)\). Let's estimate \(g(p)=p^2\). One unbiased estimator is \(\delta=X_1\cdot X_2\).
The joint probability for \(X_1, X_2, T\) is \[ \mathbb{P}(X_1=X_2=1,T=t) = \mathbb{P}\left(X_1=X_2=1, \sum_{i=3}^n X_i=t-2\right) = \\ p^2\cdot{n-2\choose t-2}p^{t-2}(1-p)^{n-t} \]
\[ \eta(t) = \mathbb{E}[X_1\cdot X_2|T=t] = \frac{\mathbb{P}(X_1=X_2=1, T=t)}{\mathbb{P}(T=t)} = \\ \frac{p^2{n-2\choose t-2}p^{t-2}(1-p)^{n-t}} {{n\choose t}p^t(1-p)^{n-t}} = \frac{t}{n}\cdot\frac{t-1}{n-1} \]
This gives us the MVUE: \(T(T-1)/(n^2-n)\) is MVUE for \(p^2\).
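A simulation sketch in Python confirming that \(T(T-1)/(n^2-n)\) is unbiased for \(p^2\) (the choices \(n=15\), \(p=0.4\), and the seed are arbitrary):

```python
import random

random.seed(3)
n, p, reps = 15, 0.4, 100_000

ests = []
for _ in range(reps):
    t = sum(1 for _ in range(n) if random.random() < p)  # T ~ Binomial(n, p)
    ests.append(t * (t - 1) / (n * (n - 1)))             # the MVUE for p^2

assert abs(sum(ests) / reps - p ** 2) < 0.005
```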
\(X_1,\dots,X_n\sim\text{Uniform}(0,\theta)\) with \(T=\max X_i\) a complete sufficient statistic.
\[ \mathbb{E}(T) = \int_0^\theta tnt^{n-1}\theta^{-n}dt = n\theta^{-n}\int_0^\theta t^ndt = n\theta^{-n}\left[\frac{t^{n+1}}{n+1}\right]_0^\theta = \\ \frac{n}{n+1}\cdot\frac{\theta^{n+1}}{\theta^n} = \frac{n}{n+1}\theta \]
So \(\mathbb{E}[(n+1)T/n] = \theta\), which gives our MVUE \((n+1)T/n\).
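Checking both \(\mathbb{E}T=n\theta/(n+1)\) and the unbiasedness of \((n+1)T/n\) by simulation, as a Python sketch (\(\theta\), \(n\), and the seed are arbitrary choices):

```python
import random

random.seed(4)
theta, n, reps = 2.5, 8, 50_000

# T = max of n draws from Uniform(0, theta), repeated many times
ts = [max(random.uniform(0, theta) for _ in range(n)) for _ in range(reps)]
mean_t = sum(ts) / reps

assert abs(mean_t - n / (n + 1) * theta) < 0.01   # E T = n theta / (n+1)
assert abs((n + 1) / n * mean_t - theta) < 0.01   # (n+1)T/n is unbiased
```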
Is \((n+1)/n\) the constant multiplier of \(T\) with the lowest squared error risk for every \(\theta\)?
First we calculate \[ \mathbb{E}T^2 = \int_0^\theta t^2nt^{n-1}\theta^{-n}dt = n\theta^{-n}\int_0^\theta t^{n+1}dt = n\theta^{-n}\left[\frac{t^{n+2}}{n+2}\right]_0^\theta = \\ \frac{n}{n+2}\cdot\frac{\theta^{n+2}}{\theta^n} = \frac{n}{n+2}\theta^2 \]
The risk of \(\delta_a = aT\) under squared error loss is \[ R(\theta,\delta_a) = \mathbb{E}(aT-\theta)^2 = \\ a^2\mathbb{E}T^2 - 2a\theta\mathbb{E}T + \theta^2 = a^2\frac{n}{n+2}\theta^2 -2a\theta\frac{n}{n+1}\theta + \theta^2 \]
The factor \[ \frac{na^2}{n+2}-\frac{2na}{n+1} + 1 \] is minimized for \(a=(n+2)/(n+1)\). The estimator \((n+2)T/(n+1)\) is biased, yet has uniformly lower risk than the MVUE.
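Since the risk factor is an exact polynomial in \(a\), the two multipliers can be compared directly. A Python sketch (\(n=10\) is an arbitrary choice):

```python
def risk_factor(a, n):
    # R(theta, aT) = theta^2 * (a^2 n/(n+2) - 2 a n/(n+1) + 1)
    return a * a * n / (n + 2) - 2 * a * n / (n + 1) + 1

n = 10
a_unbiased = (n + 1) / n       # the MVUE multiplier
a_optimal = (n + 2) / (n + 1)  # the risk-minimizing (biased) multiplier

# The biased choice has strictly smaller risk, for every theta.
assert risk_factor(a_optimal, n) < risk_factor(a_unbiased, n)
```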
We have seen it repeatedly before: the distribution \(\mathcal{N}(\mu,\sigma^2)\) has density function \[ p_\mathcal{N}(x|\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \]
The normal family is closed under sums of independent variables: if \(X_1\sim\mathcal{N}(\mu_1,\sigma_1^2)\) and \(X_2\sim\mathcal{N}(\mu_2,\sigma_2^2)\) are independent, then \(X_1+X_2\sim\mathcal{N}(\mu_1+\mu_2,\sigma_1^2+\sigma_2^2)\).
This is what we expect from linearity of \(\mathbb{E}\) and not-quite-linearity of \(\mathbb{V}\).
A powerful method to prove equality of probability distributions is the moment generating function (or its relative, the characteristic function, i.e. the Fourier transform) of a distribution.
Definition The moment generating function \(\mathbb{M}_X(t)\) of a random variable \(X\) is defined as \[ \mathbb{M}_X(t) = \mathbb{E}\exp[tX] \]
The moments of a random variable are \(m_n=\mathbb{E}X^n\). The coefficient of \(t^n\) in the Taylor expansion of \(\mathbb{M}_X(t)\) is \(m_n/n!\).
The moment generating function doesn't necessarily exist. The characteristic function \(\phi_X(t)=\mathbb{M}_X(it)\) always does.
Distribution | \(\mathbb{M}_X(t)\) |
---|---|
Binomial | \((1-p+pe^t)^n\) |
Poisson | \(\exp[\lambda(e^t-1)]\) |
Uniform (continuous) | \((e^{tb}-e^{ta})/[t(b-a)]\) |
Uniform (discrete) | \((e^{at}-e^{(b+1)t})/[(b-a+1)(1-e^t)]\) |
Normal | \(\exp[t\mu+\sigma^2t^2/2]\) |
Chi-squared | \((1-2t)^{-k/2}\) |
Exponential | \((1-t/\lambda)^{-1}\) |
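Any row of the table can be spot-checked empirically by averaging \(e^{tX}\) over samples. A Python sketch for the normal row (the parameter values and seed are arbitrary choices):

```python
import math
import random

random.seed(5)
mu, sigma, t, reps = 0.5, 1.2, 0.3, 200_000

# Empirical E exp(tX) for X ~ N(mu, sigma^2)
emp = sum(math.exp(t * random.gauss(mu, sigma)) for _ in range(reps)) / reps
# Tabulated formula exp(t mu + sigma^2 t^2 / 2)
thy = math.exp(t * mu + sigma ** 2 * t ** 2 / 2)

assert abs(emp - thy) < 0.01
```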
Theorem If \(\mathbb{M}_X(u)\) and \(\mathbb{M}_Y(u)\) are finite and agree for all \(u\) in some set with nonempty interior, then \(X\) and \(Y\) have the same distribution.
Now we will prove the earlier claims about sums of normals, along with a handful of other useful claims.
Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(aX+b\sim\mathcal{N}(a\mu+b, a^2\sigma^2)\) for constants \(a> 0, b\).
Proof The distribution function is \(\mathbb{P}(aX+b\leq y) =\) \[ \mathbb{P}(X \leq (y-b)/a) = \int_{-\infty}^{(y-b)/a} \frac{\exp[-(x-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}dx \] Substitute \(u=ax+b\). Then \(dx=du/a\) and \((x-\mu)^2/2\sigma^2 = (u-b-a\mu)^2/(2\sigma^2a^2)\)
Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(aX+b\sim\mathcal{N}(\color{green}{a\mu+b}, \color{orange}{a^2\sigma^2})\) for constants \(a> 0, b\).
Proof… The distribution function for \(aX+b\) is \[ \mathbb{P}(aX+b\leq y) = \int_{-\infty}^{(y-b)/a} \frac{\exp[-(x-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}dx =\\ \int_{-\infty}^{y} \frac{\exp[-(u-(\color{green}{a\mu+b}))^2/2\color{orange}{a^2\sigma^2}]}{\sqrt{2\pi\color{orange}{a^2\sigma^2}}}du \]
To deal with negative \(a\), let's look specifically at the case of \(a=-1\) and \(b=0\).
Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(-X\sim\mathcal{N}(-\mu,\sigma^2)\).
Proof The distribution function is \(\mathbb{P}(-X\leq x) =\) \(\mathbb{P}(X \geq -x) =\) \[ \int_{-x}^\infty \frac{\exp[-(s-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}ds = [u=-s,\ du=-ds] = \\ \int_{x}^{-\infty} \frac{\exp[-(-u-\mu)^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}(-1)du = \\ \int_{-\infty}^{x} \frac{\exp[-(u-(-\mu))^2/2\sigma^2]}{\sqrt{2\pi\sigma^2}}du \]
Claim If \(Z\sim\mathcal{N}(0,1)\) then \(\mathbb{M}_Z(t)=e^{t^2/2}\).
Proof \[ \mathbb{E}e^{tZ} = \int e^{tz}\frac{e^{-z^2/2}}{\sqrt{2\pi}}dz \] Note \(tz-z^2/2 = -(z-t)^2/2 + t^2/2\) so \[ \mathbb{E}e^{tZ} = e^{t^2/2}\int\frac{e^{-(z-t)^2/2}}{\sqrt{2\pi}}dz \]
The integrand is a normal density (the density for \(\mathcal{N}(t,1)\)) so integrates to 1.
Claim If \(X\sim\mathcal{N}(\mu,\sigma^2)\) then \(\mathbb{M}_X(t)=e^{t\mu+t^2\sigma^2/2}\).
Proof Set \(Z=(X-\mu)/\sigma\). Then \(X=\mu+\sigma Z\) and \[ \mathbb{M}_X(t) = \mathbb{E}e^{tX} = \mathbb{E}e^{t(\mu+\sigma Z)} = e^{t\mu}\mathbb{E}e^{t\sigma Z} = \\ e^{t\mu}\mathbb{M}_Z(t\sigma) = e^{t\mu}e^{t^2\sigma^2/2} = e^{t\mu+t^2\sigma^2/2} \]
Now for the sum of two different normal variables.
Claim If \(X_1\sim\mathcal{N}(\mu_1,\sigma_1^2)\) and \(X_2\sim\mathcal{N}(\mu_2,\sigma_2^2)\) are independent, then \(X_1+X_2\sim\mathcal{N}(\color{green}{\mu_1+\mu_2},\color{orange}{\sigma_1^2+\sigma_2^2})\).
Proof We compare moment generating functions \[ \mathbb{M}_{X_1+X_2}(t) = \mathbb{E}e^{t(X_1+X_2)} = \mathbb{E}\left[e^{tX_1}e^{tX_2}\right] = \mathbb{M}_{X_1}(t)\cdot\mathbb{M}_{X_2}(t) = \\ e^{t\mu_1+t^2\sigma_1^2/2}\cdot e^{t\mu_2+t^2\sigma_2^2/2} = e^{t(\color{green}{\mu_1+\mu_2})+t^2(\color{orange}{\sigma_1^2+\sigma_2^2})/2} \]
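The claim's mean and variance can be checked by simulation. A Python sketch (the parameters and seed are arbitrary choices; matching the first two moments is of course weaker than matching the whole distribution):

```python
import random
from statistics import fmean, variance

random.seed(6)
reps = 100_000
sums = [random.gauss(1.0, 2.0) + random.gauss(-0.5, 1.5) for _ in range(reps)]

# X1 + X2 ~ N(1.0 + (-0.5), 2.0^2 + 1.5^2) = N(0.5, 6.25)
assert abs(fmean(sums) - 0.5) < 0.05
assert abs(variance(sums) - 6.25) < 0.2
```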
We know from previous work that \(T=(\sum X_i, \sum X_i^2)\) is a complete sufficient statistic for the normal family. Note:
\[ \overline{X} = T_1/n \qquad\qquad T_1 = n\overline{X} \\ S^2 = \left(\sum X_i^2-n\overline{X}^2\right)\Big{/}(n-1) = (T_2-T_1^2/n)/(n-1) \\ T_2 = (n-1)S^2+n\overline{X}^2 \]
It follows that the sample mean \(\overline{X}\) and the sample variance \(S^2\) are also complete sufficient for the normal distribution.
If \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) then the previous slides prove
Theorem \(\overline{X}\sim\mathcal{N}(\mu,\sigma^2/n)\).
Definition The \(\chi^2\) (chi-square) distribution on \(p\) degrees of freedom is the distribution of the sum \(Z_1^2+\dots+Z_p^2\) for iid \(Z_i\sim\mathcal{N}(0,1)\).
Theorem \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2(n-1) \]
This last theorem requires significant work with properties of the Gamma distribution. We will skip it.
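Although we skip the proof, the claim is easy to probe by simulation: \(\chi^2(n-1)\) has mean \(n-1\) and variance \(2(n-1)\). A Python sketch (the parameters and seed are arbitrary choices):

```python
import random
from statistics import fmean, variance

random.seed(7)
n, mu, sigma, reps = 6, 3.0, 2.0, 50_000

stats = []
for _ in range(reps):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # sample variance S^2
    stats.append((n - 1) * s2 / sigma ** 2)

assert abs(fmean(stats) - (n - 1)) < 0.05        # mean of chi^2(n-1)
assert abs(variance(stats) - 2 * (n - 1)) < 0.5  # variance of chi^2(n-1)
```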
Theorem Let \(X_1,\dots,X_n\) be iid random variables with \(\mathbb{E}X_i=\mu\) and \(\mathbb{V}X_i=\sigma^2<\infty\). Then writing \(U_n=\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\), \[ \lim_{n\to\infty} \mathbb{P}\left(U_n\leq x\right) = \int_{-\infty}^x \frac{e^{-t^2/2}}{\sqrt{2\pi}}dt \] is a standard normal distribution function.
But how quickly does it converge?
Theorem Let \(X_1, X_2, \dots\) be iid with \(\mathbb{E}X_i=0\), \(0<\mathbb{E}X_i^2=\sigma^2<\infty\) and \(\mathbb{E}|X_i|^3=\rho<\infty\). Then there is a constant \(C\) such that \[ \left|\mathbb{P}\left(\frac{\overline X\sqrt{n}}{\sigma}\leq x\right) - \mathbb{P}_{\mathcal{N}(0,1)}(Y\leq x)\right| \leq \frac{C\rho}{\sigma^3\sqrt{n}} \]
As of 2012, the best estimate is \(C<0.4748\). This bound follows from Shevtsova's estimate: \[ \left|\mathbb{P}\left(\frac{\overline X\sqrt{n}}{\sigma}\leq x\right) - \mathbb{P}_{\mathcal{N}(0,1)}(Y\leq x)\right| \leq \frac{0.33554(\rho+0.415\sigma^3)}{\sigma^3\sqrt{n}} \]
Estimated percentage-point error bounds for various \(\sigma\) and \(\rho\)

n | 1 | 5 | 10 | 20 | 30 | 40 |
---|---|---|---|---|---|---|
\(\sigma=1,\ \rho=0.1\) | 4.75 | 2.12 | 1.50 | 1.06 | 0.87 | 0.75 |
\(\sigma=10,\ \rho=0.1\) | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
\(\sigma=1,\ \rho=1\) | 47.48 | 21.23 | 15.01 | 10.62 | 8.67 | 7.51 |
\(\sigma=10,\ \rho=1\) | 0.05 | 0.02 | 0.02 | 0.01 | 0.01 | 0.01 |
\(\sigma=0.1,\ \rho=10\) | 474800.00 | 212337.02 | 150144.94 | 106168.51 | 86686.22 | 75072.47 |
\(\sigma=1,\ \rho=10\) | 474.80 | 212.34 | 150.14 | 106.17 | 86.69 | 75.07 |
\(\sigma=10,\ \rho=10\) | 0.47 | 0.21 | 0.15 | 0.11 | 0.09 | 0.08 |
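The tabulated values appear to be \(100\cdot C\rho/(\sigma^3\sqrt{n})\) with \(C=0.4748\), i.e. the Berry-Esseen bound expressed in percentage points. A Python sketch reproducing a few of them:

```python
import math

C = 0.4748  # best known constant as of 2012

def bound_pct(sigma, rho, n):
    # Berry-Esseen bound C*rho/(sigma^3 sqrt(n)), in percentage points
    return 100 * C * rho / (sigma ** 3 * math.sqrt(n))

assert round(bound_pct(1, 0.1, 1), 2) == 4.75
assert round(bound_pct(1, 1, 5), 2) == 21.23
assert round(bound_pct(10, 10, 1), 2) == 0.47
```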
Distribution | \(\mu\) | \(\sigma\) | \(\rho\) |
---|---|---|---|
Binomial \(n, p\) | \(np\) | \(\sqrt{npq}\) | \((1-2p)npq\) |
Poisson \(\lambda\) | \(\lambda\) | \(\sqrt{\lambda}\) | \(\lambda\) |
Exponential \(\lambda\) | \(1/\lambda\) | \(1/\lambda\) | \(2/\lambda^3\) |
Normal \(\mu, \sigma^2\) | \(\mu\) | \(\sigma\) | \(0\) |
Uniform \(a,b\) | \((a+b)/2\) | \((b-a)/\sqrt{12}\) | \(0\) |
Student's t \(\nu\) | \(0\) | \(\sqrt{\nu/(\nu-2)}\) | \(0\) |