7 February, 2018

Smoothing

The book uses smoothing from chapter 1.10 a lot.

Here are some ways to work with it. Let \(A, C\) be events and \(B_j\) a family of disjoint events whose union is the whole sample space.

\[ \mathbb{P}(A) = \sum_j \mathbb{P}(A | B_j)\mathbb{P}(B_j) \\ \mathbb{P}(A|C) = \sum_j\mathbb{P}(A|B_j,C)\mathbb{P}(B_j|C) \]

\(X, Y\) random variables. \(\mathbb{E}_Xf(X,Y)=\int f(x,Y) d\mathbb{P}_X(x)\) \[ \mathbb{E}X = \mathbb{E}_Y\left[\mathbb{E}_X(X|Y)\right] \qquad \mathbb{E}X = \sum_j \mathbb{E}(X|B_j) \mathbb{P}(B_j) \]
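
For a quick numerical sanity check of the last identity, here is a minimal Monte Carlo sketch (the toy variables \(Z\), \(X=Z^2\) and the grouping \(Y=\lfloor Z\rfloor\) are made up for illustration): the overall mean of \(X\) matches the conditional means averaged over the partition defined by \(Y\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative only): Z uniform on [0, 3), X = Z^2,
# and Y = floor(Z) defines a partition B_0, B_1, B_2 of the sample space.
z = rng.uniform(0, 3, size=1_000_000)
x = z ** 2
y = np.floor(z).astype(int)

# Left-hand side: the plain expectation of X.
lhs = x.mean()

# Right-hand side: E[X | B_j] weighted by P(B_j), summed over the partition.
rhs = sum(x[y == j].mean() * (y == j).mean() for j in np.unique(y))

print(lhs, rhs)  # the two values agree up to Monte Carlo error
```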

Recall

An estimator \(T(X)\) is a statistic intended to estimate some \(g(\theta)\) for a parameter \(\theta\) from a sample \(X\).

The loss of an estimator is a measure of error as a function of \(\theta\); the risk is the expectation of the loss. Estimators have a variance/bias tradeoff: the squared error risk decomposes as \(\mathbb{V}(T(X)) + \text{Bias}(T(X),\theta)^2\).
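
For squared error loss, the decomposition follows by adding and subtracting \(\mathbb{E}T(X)\) inside the risk; the cross term vanishes because \(\mathbb{E}[T(X)-\mathbb{E}T(X)]=0\):

\[ \mathbb{E}\big(T(X)-g(\theta)\big)^2 = \mathbb{E}\big(T(X)-\mathbb{E}T(X)\big)^2 + \big(\mathbb{E}T(X)-g(\theta)\big)^2 = \mathbb{V}(T(X)) + \text{Bias}(T(X),\theta)^2 \]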

The likelihood of a sample \(x\) given a parameter \(\theta\) is the conditional probability (or density) \(\mathbb{P}(x|\theta)\) seen as a function of \(\theta\).

A statistic is sufficient if it can losslessly replace the data: \[ \mathbb{P}(x|T(x),\theta) = \mathbb{P}(x|T(x)) \quad \mathbb{P}(\theta|T(x),x) = \mathbb{P}(\theta|T(x)) \\ \mathbb{P}(\theta,x|T(x)) = \mathbb{P}(\theta|T(x))\mathbb{P}(x|T(x)) \quad \mathbb{P}(x|\theta) = g(T(x),\theta)h(x) \]
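
For instance, for an iid sample \(x_1,\dots,x_n\sim\text{Bernoulli}(\theta)\) (the family used again below), the factorization criterion with \(T(x)=\sum x_i\) reads

\[ \mathbb{P}(x|\theta) = \prod_i \theta^{x_i}(1-\theta)^{1-x_i} = \underbrace{\theta^{T(x)}(1-\theta)^{n-T(x)}}_{g(T(x),\theta)} \cdot \underbrace{1}_{h(x)} \]

so \(\sum x_i\) is sufficient for \(\theta\).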

Minimal sufficiency

The data itself is always a sufficient statistic, but that is not a helpful reduction.

We call a statistic \(T(X)\) minimally sufficient if it is sufficient and can be computed from every other sufficient statistic (almost always).

Definition \(T(X)\) is a minimally sufficient statistic for \(\theta\) if it is sufficient and for any other sufficient statistic \(T'(X)\), there is some function \(f\) such that \(T(X)=f(T'(X))\) (for almost every \(X\))

This means that any other sufficient statistic \(T'(X)\) is (potentially) redundant: we could reduce anything we do with \(T'(X)\) to something we do with \(T(X)\).

Example

The Pareto distribution \(\text{Pareto}(\alpha,\beta)\) has a density function that vanishes if \(y<\beta\); for \(y\geq\beta\), \[ p(y|\alpha,\beta) = \alpha\beta^\alpha y^{-\alpha-1} \]

Claim Let \(Y_1,\dots,Y_n\sim\text{Pareto}(\alpha,\beta)\). If \(\beta\) is known, \(\prod_{i=1}^n Y_i\) is a sufficient statistic for \(\alpha\).

The joint density is \[ \prod p(y_i|\alpha)=\alpha^n\beta^{n\alpha}\left(\prod y_i\right)^{-\alpha-1} \]

We can take \(h(y)=1\) on the support (which is determined by the known \(\beta\)); the rest of the expression depends on the \(y_i\) only through the statistic \(\prod y_i\), so the factorization criterion shows sufficiency.
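
As a small sketch of using the statistic in practice (illustrative numbers; the formula \(\hat\alpha = n/(\sum\log y_i - n\log\beta)\) comes from maximizing the likelihood above with \(\beta\) known): the maximum likelihood estimate of \(\alpha\) depends on the data only through \(\sum\log y_i = \log\prod y_i\).

```python
import numpy as np

def pareto_alpha_mle(y, beta):
    """MLE of the Pareto shape alpha when the scale beta is known.

    Maximizing n*log(alpha) + n*alpha*log(beta) - (alpha + 1)*sum(log(y))
    over alpha gives alpha_hat = n / (sum(log(y)) - n*log(beta)); the data
    enter only through the sufficient statistic sum(log(y)) = log(prod(y)).
    """
    y = np.asarray(y)
    n = len(y)
    return n / (np.log(y).sum() - n * np.log(beta))

# Usage sketch: sample Pareto(alpha=3, beta=2) by inverting the CDF, then recover alpha.
rng = np.random.default_rng(1)
alpha, beta = 3.0, 2.0
y = beta * (1 - rng.uniform(size=10_000)) ** (-1 / alpha)
print(pareto_alpha_mle(y, beta))  # close to 3
```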

Example

The Pareto distribution (with \(\beta\) known) is an exponential family, with

\[ p(y|\alpha,\beta) = \alpha\beta^\alpha y^{-\alpha-1} = \exp[\color{blue}{\log\alpha + \alpha\log\beta} - \color{green}{\alpha}\color{orange}{\log y}\color{black} - \log y] = \\ \exp[\color{blue}{A(\alpha,\beta)} + \color{green}{\eta(\alpha,\beta)}\color{orange}{T(y)}\color{black} - B(y)] \]

For any exponential family, we can read off a sufficient statistic: \[ p(x_1,\dots,x_n|\theta) = \prod_i\exp[\color{blue}{A(\theta)} + \color{green}{\eta(\theta)}\color{orange}{T(x_i)}\color{black} - B(x_i)] = \\ \exp\left[\sum_i(\color{blue}{A(\theta)} + \color{green}{\eta(\theta)}\color{orange}{T(x_i)}\color{black} - B(x_i))\right] = \\ \exp\left[\color{blue}{nA(\theta)} + \color{green}{\eta(\theta)}\color{orange}{\sum_i T(x_i)}\color{black}\right]\exp\left[-\sum_i B(x_i)\right] \]

So \(\sum_i T(x_i)\) is a sufficient statistic. If \(T\) is linear, then \(\sum_i T(x_i)=T\left(\sum_i x_i\right)\), so \(T\left(\sum_i x_i\right)\) is sufficient.
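
As a one-parameter illustration, the exponential distribution (which reappears below as the constant-failure-rate model) fits the same pattern with \(B(x)=0\):

\[ p(x|\lambda) = \lambda e^{-\lambda x} = \exp[\color{blue}{\log\lambda}\color{black} - \color{green}{\lambda}\,\color{orange}{x}\color{black}] \]

so \(T(x)=x\) and \(\sum_i x_i\) is a sufficient statistic for \(\lambda\).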

Example

The Pareto distribution (with \(\beta\) known) is an exponential family, with \[ p(x|\alpha,\beta) = \alpha\beta^\alpha x^{-\alpha-1} = \exp[\color{blue}{\log\alpha + \alpha\log\beta} - \color{green}{\alpha}\color{orange}{\log x}\color{black} - \log x] \]

Both of these are sufficient statistics: \[ \prod x_i \qquad\qquad \sum \log x_i \]

Notice that each is a function of the other: \(\sum\log x_i = \log\prod x_i\) and \(\prod x_i = \exp\left[\sum\log x_i\right]\). If \(T\) is a sufficient statistic and \(T=f(T')\), then \(T'\) must also be a sufficient statistic.

Recognizing minimal sufficiency

Theorem Suppose \(\mathcal{P}=\{p(x|\theta)\}\) is a family of distributions with \(p(x|\theta)=g(T(x),\theta)h(x)\), with \(T\) a sufficient statistic for \(\theta\).

If \(p(x|\theta)\propto p(y|\theta)\) implies \(T(x)=T(y)\) then \(T\) is minimally sufficient.

In other words, if two data sets whose likelihood curves have the same shape (up to rescaling) necessarily share the same value of the statistic, then the statistic is minimally sufficient.

The proportionality constant is allowed to vary with \(x\) and \(y\), but not with \(\theta\).

Recognizing minimal sufficiency

If \(p(x|\theta)\propto p(y|\theta)\) implies \(T(x)=T(y)\) then \(T\) is min. suff.

Proof Suppose \(T'\) is also sufficient, with \(p(x|\theta)=g'(T'(x),\theta)h'(x)\) for all \(x\). If \(T\) is not a function of \(T'\) (that is, no \(f\) satisfies \(T(x)=f(T'(x))\) for almost every \(x\)), then there must be two data sets \(x,y\) such that \(T'(x)=T'(y)\) but \(T(x)\neq T(y)\).

But then, since \(g'(T'(x),\theta)\) depends on the data set only through \(T'\), and \(T'(x)=T'(y)\), \[ p(x|\theta) = g'(T'(x),\theta)h'(x) = g'(T'(y),\theta)h'(x) = \\ \frac{h'(x)}{h'(y)}g'(T'(y),\theta)h'(y) \propto \\ g'(T'(y),\theta)h'(y) = p(y|\theta) \]

Recognizing minimal sufficiency

If \(p(x|\theta)\propto p(y|\theta)\) implies \(T(x)=T(y)\) then \(T\) is min. suff.

Proof … \(T(x)\neq T(y)\), \(T'(x)=T'(y)\)

Using \(T'(x)=T'(y)\) we showed \(p(x|\theta)\propto p(y|\theta)\). But this implies, by assumption, that \(T(x)=T(y)\).

Contradiction finishes the proof: \(T\) has to be \(f(T')\) for some \(f\).

Example

Let \(X_1,\dots,X_n; Y_1,\dots,Y_m\sim\text{Bernoulli}(p)\) be iid samples.

\[ p(X|p) = p^{\sum X_i}(1-p)^{n-\sum X_i} \qquad p(Y|p) = p^{\sum Y_j}(1-p)^{m-\sum Y_j} \]

If, for a sufficient statistic \(T\), \(p(x|p)\propto p(y|p)\) implies \(T(x)=T(y)\), then \(T\) is minimally sufficient. Here, \(\sum X_i\) is a candidate for a minimal sufficient statistic.

\(p(x|p)\propto p(y|p)\) means that \(\frac{p(x|p)}{p(y|p)}\) is a constant that may depend on \(x\) and \(y\) but not on \(p\).

\[ \frac{p(x|p)}{p(y|p)} = \frac{p^{\sum X_i}(1-p)^{n-\sum X_i}}{p^{\sum Y_j}(1-p)^{m-\sum Y_j}} = \frac{p^{\sum X_i - \sum Y_j}}{(1-p)^{m-n+\sum X_i-\sum Y_j}} \]

Example

\(X_1,\dots,X_n, Y_1,\dots,Y_m\sim\text{Bernoulli}(p)\). \(T\) is minimally sufficient if \(p(x|p)\propto p(y|p)\) implies \(T(x)=T(y)\).

\[ \frac{p(x|p)}{p(y|p)} = \frac{p^{\sum X_i - \sum Y_j}}{(1-p)^{m-n+\sum X_i-\sum Y_j}} \]

The numerator ceases to depend on \(p\) only if \(\sum X_i=\sum Y_j\); given that, the denominator additionally requires \(m=n\) to be free of \(p\).

Thus \(p(x|p)\propto p(y|p)\) precisely when \(m=n\) and \(\sum X_i=\sum Y_j\).

Hence \(\sum X_i\) is minimally sufficient for equally sized Bernoulli samples.
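
A minimal numerical sketch of this argument (the data sets below are made up for illustration): when two equal-sized samples share the same sum, their likelihood ratio is flat in \(p\); change one sum and it is not.

```python
import numpy as np

def bernoulli_lik(x, p):
    """Likelihood p^sum(x) * (1 - p)^(n - sum(x)) of an iid Bernoulli sample x."""
    x = np.asarray(x)
    return p ** x.sum() * (1 - p) ** (len(x) - x.sum())

p_grid = np.linspace(0.1, 0.9, 9)

x = np.array([1, 0, 1, 1, 0])  # n = 5, sum = 3
y = np.array([0, 1, 1, 0, 1])  # n = 5, sum = 3: same statistic as x
z = np.array([1, 1, 1, 1, 0])  # n = 5, sum = 4: different statistic

print(bernoulli_lik(x, p_grid) / bernoulli_lik(y, p_grid))  # constant in p (all ones)
print(bernoulli_lik(x, p_grid) / bernoulli_lik(z, p_grid))  # varies with p
```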

Lehmann-Scheffé

The theorem and example suggest a generic method for deriving minimally sufficient statistics.

If for a likelihood ratio \[ \frac{\mathcal{L}(\theta|x)}{\mathcal{L}(\theta|y)} \] we can find a function \(g(x)\) such that the ratio is independent of \(\theta\) if and only if \(g(x)=g(y)\), then \(g(x)\) is a minimally sufficient statistic.

Example

The exponential distribution models time to failure when the failure rate is constant. If the failure rate changes over time – increasing or decreasing as a power of the elapsed time – the exponential distribution generalizes to the Weibull distribution.

This distribution has

  • Parameters \(k\) (shape) and \(\lambda\) (scale)
  • Density \(\frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1}\exp\left[-\left(\frac{x}{\lambda}\right)^k\right]\) for non-negative \(x\), and 0 for negative \(x\).
  • Mean and Variance given in terms of the Gamma-function.

Example

The distribution \(\text{Weibull}(2,\theta)\) – parametrizing the second argument as \(\theta=\lambda^2\), the squared scale – has likelihood \(\mathcal{L}(\theta|x) = 2x\exp[-x^2/\theta]/\theta\). The joint likelihood for \(X_1,\dots,X_n\sim\text{Weibull}(2,\theta)\) is \[ \mathcal{L}(\theta|x_1,\dots,x_n) = 2^n\theta^{-n}\prod x_i\cdot\exp\left[-\sum x_i^2/\theta\right] \] The likelihood ratio for two samples is \[ \frac{\mathcal{L}(\theta|x)}{\mathcal{L}(\theta|y)} = \frac{2^n\theta^{-n}\prod x_i\cdot\exp\left[-\sum x_i^2/\theta\right]} {2^n\theta^{-n}\prod y_i\cdot\exp\left[-\sum y_i^2/\theta\right]} = \frac{\prod x_i\cdot\exp\left[-\sum x_i^2/\theta\right]} {\prod y_i\cdot\exp\left[-\sum y_i^2/\theta\right]} \]

Since the only dependence on \(\theta\) is in the exponential, the ratio is independent of \(\theta\) if and only if \(\sum x_i^2 = \sum y_i^2\). Hence \(\sum x_i^2\) is a minimally sufficient statistic for \(\theta\).
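
A small sketch with made-up numbers of how this statistic shows up in estimation: in this parametrization the maximum likelihood estimate works out to \(\hat\theta=\frac1n\sum x_i^2\), a function of the minimally sufficient statistic alone.

```python
import numpy as np

rng = np.random.default_rng(2)

# Weibull with shape 2 in the parametrization above: density 2x/theta * exp(-x^2/theta).
# Then X^2 is exponential with mean theta, which gives an easy way to sample.
theta = 4.0
x = np.sqrt(rng.exponential(scale=theta, size=50_000))

# Maximizing n*log(2) + sum(log(x)) - n*log(theta) - sum(x**2)/theta over theta
# gives theta_hat = sum(x**2)/n, a function of sum(x**2) only.
theta_hat = np.mean(x ** 2)
print(theta_hat)  # close to 4
```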

Lehmann-Scheffé

Claim If \(\mathcal{L}(\theta|x)/\mathcal{L}(\theta|y)\) is independent of \(\theta\) if and only if \(\tau(x)=\tau(y)\) for some \(\tau\), then \(\tau\) is minimally sufficient.

Proof \(\tau\) is sufficient.

For each fibre \(\tau^{-1}(\gamma)\), across all values of \(\gamma\), pick a single representative \(x_\gamma\). Then \(x_{\tau(x)}\) is the representative assigned to \(x\), and \(\tau(x_{\tau(x)})=\tau(x)\), so by assumption \(p(x|\theta)/p(x_{\tau(x)}|\theta)\) is independent of \(\theta\). Define \(h(x)=p(x|\theta)/p(x_{\tau(x)}|\theta)\).

Now, \(p(x|\theta)=\color{green}{p(x_{\tau(x)}|\theta)}\color{blue}{\frac{p(x|\theta)}{p(x_{\tau(x)}|\theta)}} = \color{green}{p(x_{\tau(x)}|\theta)}\color{blue}{h(x)}\) is a factorization of the required form, since the green factor depends on \(x\) only through \(\tau(x)\); this proves sufficiency.

Lehmann-Scheffé

Claim If \(\mathcal{L}(\theta|x)/\mathcal{L}(\theta|y)\) is independent of \(\theta\) if and only if \(\tau(x)=\tau(y)\) for some \(\tau\), then \(\tau\) is minimally sufficient.

Proof \(\tau\) is minimally sufficient.

Let \(T\) be any other sufficient statistic. We have factorizations \(g(\tau(x),\theta)h(x)\) and \(g'(T(x),\theta)h'(x)\) of \(p(x|\theta)\). Let \(x,y\) be samples with \(T(x)=T(y)\). \[ \frac{p(x|\theta)}{p(y|\theta)} = \frac{g'(T(x),\theta)h'(x)}{g'(T(y),\theta)h'(y)} = \frac{h'(x)}{h'(y)} \] This ratio doesn't depend on \(\theta\), which by assumption implies \(\tau(x)=\tau(y)\).

Lehmann-Scheffé

Claim If \(\mathcal{L}(\theta|x)/\mathcal{L}(\theta|y)\) is independent of \(\theta\) if and only if \(\tau(x)=\tau(y)\) for some \(\tau\), then \(\tau\) is minimally sufficient.

Proof \(\tau\) is minimally sufficient.

Since \(T(x)=T(y)\) implies \(\tau(x)=\tau(y)\), we can set \(f(T(x)) = \tau(x)\). This is well defined: if we pick a different preimage \(y\) of \(T(x)\), then \(\tau(y)=\tau(x)\), so the value does not depend on the choice. Hence \(\tau=f(T)\), which proves minimality.

Ancillary statistics

The polar opposite of a sufficient statistic is an ancillary statistic. \(T\) is ancillary for \(\theta\) if its distribution \(\mathbb{P}(T|\theta)\) does not depend on \(\theta\).

Example: Consider \(X_1,\dots,X_n\sim\text{Bernoulli}(p)\). Suppose that \(n\) is chosen at random – for instance, we keep running trials until a separate coin toss comes up heads. Since \(n\) is generated independently of the value of \(p\), \(n\) is an ancillary statistic for \(p\).

A natural estimator for \(p\) is \(\hat p =\frac{\sum X_i}{n}\) – the proportion of successes. \(\hat p\) is not a sufficient statistic.

\[ \mathbb{P}(x|\hat p,p) = p^{n\hat p}(1-p)^{n-n\hat p} \] cannot be factored in a way that separates the information depending on \(x\) (i.e., on \(n\)) from a factor depending only on \(p\) and \(\hat p\).
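
A minimal simulation sketch (the fair-coin stopping rule and the numbers are illustrative assumptions): however we set \(p\), the empirical distribution of \(n\) looks the same, which is exactly what it means for \(n\) to be ancillary.

```python
import numpy as np

rng = np.random.default_rng(3)

def one_experiment(p):
    """Run Bernoulli(p) trials, with the number of trials n chosen by flipping
    a fair coin until it lands heads; return (n, p_hat)."""
    n = rng.geometric(0.5)          # n is drawn without reference to p
    x = rng.binomial(1, p, size=n)  # the Bernoulli(p) sample itself
    return n, x.mean()

for p in (0.2, 0.8):
    ns = np.array([one_experiment(p)[0] for _ in range(20_000)])
    # relative frequencies of n = 1, ..., 5: roughly 0.5, 0.25, 0.125, ... for either p
    print(p, np.bincount(ns, minlength=6)[1:6] / len(ns))
```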

Ancillary statistics

\(X_1,\dots,X_n\sim\text{Bernoulli}(p)\), \(n\) chosen at random. \(\mathbb{P}(x|\hat p,p) = p^{n\hat p}(1-p)^{n-n\hat p}\) shows us that \(\hat p\) is not sufficient for \(p\).

The pair \((n, \hat p)\), however, is sufficient.

If \(T\) is not sufficient on its own, and \(U\) is an ancillary statistic such that the pair \((T, U)\) is sufficient, then \(U\) is called an ancillary complement to \(T\).