5 February, 2018

Estimators

A population is the space of all possible items under study. We will usually encounter the population through the population distribution or true distribution of some random vector of interest.

A parameter is some quantity of interest defined from the population.

A model is a map from a parameter \(\theta\) to a probability distribution \(P_\theta\). Often we use this interchangeably with the set of all possible distributions in some (restricted) collection of distributions. Once a model space is determined, we also determine a set \(\Omega\) of admissible parameter values.

A statistic is a function of a set of data points: for any sample \(X=\{X_1,\dots,X_N\}\), a statistic \(\delta(X)\) is a random vector dependent on \(X\).

Example

Ezekiel (1930) includes data collected on stopping distance and speed for cars.

Population? Sample? Useful model?

Example

Ezekiel (1930) includes data collected on stopping distance and speed for cars.

One candidate for a model would be linear with normal noise: \(\text{dist} \sim \mathcal{N}(\beta_0 + \beta_1\cdot\text{speed}, \sigma^2)\).

Parameters?
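As a sketch of this model (NumPy assumed), we can simulate (speed, dist) pairs from made-up parameter values \(\beta_0,\beta_1,\sigma^2\) and recover the slope and intercept by least squares; the numbers below are illustrative assumptions, not the Ezekiel (1930) data.

```python
import numpy as np

# Hypothetical parameter values for the linear-with-normal-noise model.
rng = np.random.default_rng(3)
beta0, beta1, sigma2 = -17.0, 4.0, 15.0**2   # assumed, for the demo only

# Simulate speeds, then distances from dist ~ N(beta0 + beta1*speed, sigma2).
speed = rng.uniform(4, 25, size=500)
dist = beta0 + beta1 * speed + rng.normal(0.0, np.sqrt(sigma2), size=500)

# Least-squares fit recovers (beta0, beta1) approximately.
b1_hat, b0_hat = np.polyfit(speed, dist, deg=1)
print(b0_hat, b1_hat)
```

Here \((\beta_0,\beta_1,\sigma^2)\) plays the role of the parameter \(\theta\).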

Example

We flip a coin 100 times and count heads.

  • Population?
  • Sample?
  • Useful model?
  • Parameter?
  • Interesting statistic?

Fundamental task in Statistics

For a population parameter \(\theta\), find a statistic \(\hat\theta\) such that \(\hat\theta\) helps determine \(\theta\).

Example: \(\overline X = \frac{\sum X_i}{N}\) is a statistic, as are the sample variance and sample correlation.

Estimator

To retain generality, we consider functions of the population parameter when defining what makes a statistic useful.

An estimator \(\delta(X)\) of a quantity \(g(\theta)\) depending on a population parameter \(\theta\) is a statistic such that \(\delta(X)\) is close to \(g(\theta)\).

\(g(\theta) = \theta\) or \(g(\theta)=\theta_k\) are common choices for \(g\). For instance, the linear regression example has \(\theta = (\beta_0, \beta_1, \sigma^2)^T\); many linear regression methods focus on estimating \((\beta_0,\beta_1)^T\).

Notice that we allow for arbitrarily bad estimators right now; a constant function \(\delta(X)=1\) is an estimator of any one-dimensional parameter.

Example

A coin toss with an uneven distribution: for instance, stand the coin on edge, head facing you, balanced with a finger on top; flick it to send it spinning, and let it spin and fall.

This possibly has \(p\neq0.5\).

A binomial model suggests itself: let \(X\) be the number of heads produced. Then \(X\sim\text{Binomial}(100,p)\).

We set our model space to be \(P_p=\text{Binomial}(100,p)\) with our parameter space \(p\in\Omega=[0,1]\).

Repeating the spin 100 times produces a sample. A natural candidate for an estimator is \(\delta(X) = X/100\).
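A quick simulation sketch (NumPy assumed; the true \(p\) is a made-up value, unknown in practice) shows the estimator \(\delta(X)=X/100\) in action:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.3                          # assumed true bias, for the demo only

# One experiment: 100 spins, count heads, estimate p.
X = rng.binomial(n=100, p=p_true)     # X ~ Binomial(100, p_true)
delta = X / 100                       # the estimator delta(X)

# Averaging many repeated experiments shows the estimator centres on p.
estimates = rng.binomial(n=100, p=p_true, size=10_000) / 100
print(delta, estimates.mean())
```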

Loss functions

To evaluate the quality of \(\delta(X)\) as an estimator of \(p\), we need to establish a way of measuring how close \(\delta(X)\) is to \(p\).

The loss function \(L(\theta, d)\) measures the loss when estimating \(g(\theta)\) with the value \(d\).

We may assume \(L(\theta, g(\theta))=0\), i.e. the correct answer incurs no loss. For all other answers, \(L(\theta, d)\geq 0\): we never have negative loss.

Common loss functions:

  • squared error loss \(L(\theta, d) = (\theta-d)^2\)
  • absolute error loss \(L(\theta, d) = |\theta-d|\)

Risk function

Since \(X\) is random, \(L(\theta, \delta(X))\) is random. We could get huge loss even for a good estimator.

Instead, we can evaluate an estimator by its risk function \(R(\theta, \delta(X)) = \mathbb{E}\left( L(\theta,\delta(X)) \middle| \theta \right)\).

Example

Our coin toss has \(X\sim\text{Binomial}(100,p)\), \(\delta(X)=X/100\), \(g(p)=p\), \(L(p, d)=(p-d)^2\).

From the Binomial distribution, we know \(\mathbb{E}X=100p\) and \(\mathbb{V}X=\mathbb{E}(X^2)-(\mathbb{E}X)^2=100p(1-p)\). Hence, \(\mathbb{E}\delta(X)=p\) and \(\mathbb{V}\delta(X)=\frac{p(1-p)}{100}\).

The risk function is \[ R(p, \delta(X)) = \mathbb{E}(p-\delta(X))^2 = \mathbb{E}(\delta(X) - p)^2 = \\ \mathbb{E}(\delta(X) - \mathbb{E}\delta(X))^2 = \mathbb{V}\delta(X) = \frac{p(1-p)}{100} \]

Example

Another candidate is \(\delta_1(X)=\frac{X+3}{100} = \delta(X)+\frac{3}{100}\).

\[ R(p, \delta_1(X)) = \mathbb{E}(\delta_1(X) - p)^2 = \mathbb{E}\left[\left(\delta(X)+\frac{3}{100}-p\right)^2\right] = \\ \mathbb{E}\left[\left(\delta(X)-p\right)^2 + \left(\frac{3}{100}\right)^2 + 2\delta(X)\frac{3}{100} - 2p\frac{3}{100}\right] = \\ \mathbb{V}\delta(X) + \frac{9}{100^2} + 2p\frac{3}{100} - 2p\frac{3}{100} = \frac{100p(1-p) + 9}{100^2} \]

Example

Another candidate is \(\delta_2(X)=\frac{X+3}{106} = \frac{100}{106}\cdot(\delta(X)+3/100)\).

The book states

\[ R(p,\delta_2(X)) = \frac{(9-8p)(1+8p)}{106^2} \]

Example

[Plot: the three risk functions \(R(p,\delta)\), \(R(p,\delta_1)\), \(R(p,\delta_2)\) as functions of \(p\).] Which estimator is better?
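To make the comparison concrete, a short sketch (NumPy assumed) evaluates the three closed-form risks on a grid of \(p\); no estimator dominates across the whole parameter space:

```python
import numpy as np

p = np.linspace(0, 1, 101)

# Closed-form risks derived above.
R0 = p * (1 - p) / 100                    # delta(X)  = X/100
R1 = (100 * p * (1 - p) + 9) / 100**2     # delta1(X) = (X+3)/100
R2 = (9 - 8 * p) * (1 + 8 * p) / 106**2   # delta2(X) = (X+3)/106

# delta dominates delta1 everywhere (R1 = R0 + 9/100^2), but delta2
# has lower risk than delta near p = 1/2 and higher risk near p = 0.
print(R0[50], R1[50], R2[50])             # risks at p = 0.5
```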

Bias and Variance

In addition to low risk, two common measures of estimator quality are the bias and the variance of the estimator. Let \(\hat\theta\) be an estimator of \(\theta\) and consider the expected squared error. Here, \(\theta\) is treated as a constant.

\[ \mathbb{E}(\hat\theta-\theta)^2 = \mathbb{E}\hat\theta^2 - 2\theta\mathbb{E}\hat\theta + \theta^2 = \\ \mathbb{E}\hat\theta^2 - (\mathbb{E}\hat\theta)^2 + (\mathbb{E}\hat\theta)^2 - 2\theta\mathbb{E}\hat\theta + \theta^2 = \\ \mathbb{V}\hat\theta + (\mathbb{E}\hat\theta-\theta)^2 \]

We define the bias of \(\hat\theta\) with respect to \(\theta\) to be \(\text{Bias}(\hat\theta,\theta) = (\mathbb{E}\hat\theta-\theta)\).

Bias and Variance

We found the bias by deriving that the mean square error decomposes as

\[ \mathbb{E}(\theta-\hat\theta)^2 = \mathbb{V}\hat\theta + \text{Bias}(\hat\theta,\theta)^2 \]

This is a fundamental result and lies at the core of many tradeoffs when picking an estimator. We will return to this several times.

Example

\[ \text{Bias}(\delta(X),p) = p-p = 0 \\ \text{Bias}(\delta_1(X),p) = p+\frac{3}{100}-p = \frac{3}{100} \\ \text{Bias}(\delta_2(X),p) = \frac{100p+3}{106} - p = \frac{100p + 3 - 106p}{106} = \frac{3-6p}{106} \]
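The decomposition \(\mathbb{E}(\hat\theta-\theta)^2 = \mathbb{V}\hat\theta + \text{Bias}^2\) can be checked by Monte Carlo; a sketch (NumPy assumed, with a made-up \(p\)) for \(\delta_1(X)=(X+3)/100\):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.4                                   # assumed true value, for the demo

# Many realisations of delta1(X) = (X+3)/100 with X ~ Binomial(100, p).
d1 = (rng.binomial(100, p, size=200_000) + 3) / 100

mse = np.mean((d1 - p) ** 2)              # Monte Carlo mean squared error
bias = d1.mean() - p                      # should be close to 3/100
var = d1.var()

# The identity MSE = Var + Bias^2 holds for the sample moments too.
print(mse, var + bias**2)
```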

Useful estimators

Sample(s) \(X_1,\dots,X_n\) and \(Y_1,\dots,Y_m\), both i.i.d.

| Parameter | Estimator | Expectation | Variance |
| --- | --- | --- | --- |
| \(\mu\) | \(\overline{X}=\frac{1}{n}\sum_j X_j\) | \(\mu\) | \(\sigma^2/n\) |
| \(p\) | \(\hat{p}=X/n\) | \(p\) | \(pq/n\) |
| \(\mu_X-\mu_Y\) | \(\overline{X}-\overline{Y}\) | \(\mu_X-\mu_Y\) | \(\frac{\sigma_X^2}{n}+\frac{\sigma_Y^2}{m}\) |
| \(p_X-p_Y\) | \(\hat{p}_X-\hat{p}_Y\) | \(p_X-p_Y\) | \(\frac{p_Xq_X}{n}+\frac{p_Yq_Y}{m}\) |
| \(\sigma^2\) | \(\overline{(X-\overline{X})^2}\) | \(\frac{n-1}{n}\sigma^2\) | \(\frac{(n-1)^2\kappa}{n^3}-\frac{(n-1)(n-3)\sigma^4}{n^3}\) |
| \(\sigma^2\) | \(S^2 = \frac{\sum_j(X_j-\overline{X})^2}{n-1}\) | \(\sigma^2\) | \(\frac{\kappa}{n}-\frac{(n-3)\sigma^4}{n(n-1)}\) |

where \(\kappa=\mathbb{E}(X-\mu)^4\) is the fourth central moment.
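A quick simulation sketch (NumPy assumed; normal data with made-up \(n\) and \(\sigma^2\)) checks the expectations of the two variance estimators:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, reps = 10, 4.0, 100_000        # assumed values, for the demo

# reps independent samples of size n from N(0, sigma2).
X = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))

m2 = X.var(axis=1, ddof=0)   # mean((X - Xbar)^2): E = (n-1)/n * sigma^2
S2 = X.var(axis=1, ddof=1)   # sum((X - Xbar)^2)/(n-1): E = sigma^2
print(m2.mean(), S2.mean())  # approx 3.6 and 4.0
```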

Sufficiency

\(X_1,\dots,X_n \sim \text{Bernoulli}(p)\). One statistic we could consider is \(Y=\sum X_i\), the number of successes.

With \(Y\) in hand, can we infer \(p\) as easily as if we had all the \(X_i\)s?


Consider the conditional probability:

\[ \mathbb{P}(x | y) = \frac{\mathbb{P}(x,y)}{\mathbb{P}(y)} \]

Here, if \(\sum x_i \neq y\), the numerator is 0; otherwise it is \(p^y(1-p)^{n-y}\). The denominator is a binomial probability, \({n\choose y}p^y(1-p)^{n-y}\). Their quotient is

\[ \mathbb{P}(x | y) = \frac{\mathbb{P}(x,y)}{\mathbb{P}(y)} = \frac{p^y(1-p)^{n-y}}{\binom{n}{y}p^y(1-p)^{n-y}} = \frac{1}{\binom{n}{y}} \]

\(X_1,\dots,X_n \sim \text{Bernoulli}(p)\). \(Y=\sum X_i\), the number of successes.

\[ \mathbb{P}(x|y) = \frac{1}{\binom{n}{y}} \]

Notice this conditional probability is independent of \(p\). Anything we can say about \(\mathbb{P}(x)\) once \(\sum x_i\) is known can be said using only \(\sum x_i\).
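A small numerical sketch (Python standard library only; the sequence \(x\) is arbitrary) confirms that \(\mathbb{P}(x|y)\) comes out to \(1/\binom{n}{y}\) no matter which \(p\) we use:

```python
from math import comb

n = 5
x = [1, 0, 1, 1, 0]          # an arbitrary Bernoulli sequence, for the demo
y = sum(x)                   # number of successes, here y = 3

ratios = []
for p in (0.2, 0.5, 0.9):
    px = p**y * (1 - p)**(n - y)   # P(X = x)
    py = comb(n, y) * px           # P(Y = y) = C(n,y) p^y (1-p)^(n-y)
    ratios.append(px / py)         # always 1 / C(5,3) = 0.1

print(ratios)
```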

Sufficient Statistic

We call a statistic sufficient if we no longer need the data to understand the parameter we are estimating once we have access to the statistic.

Formally, we could define \(t = T(X)\) to be a sufficient statistic for a parameter \(\theta\) if \(\mathbb{P}(\theta | x, t)\) does not depend on \(x\).

The book provides an equivalent definition: that \(t\) is sufficient if \(\mathbb{P}(x | t, \theta)\) does not depend on \(\theta\).

Sufficient Statistic

To see the equivalence, note that by the definition of conditional probability \[ \mathbb{P}(x | t, \theta)\mathbb{P}(\theta|t) = \mathbb{P}(\theta, x | t) = \mathbb{P}(\theta | t, x) \mathbb{P}(x|t) \]

If \(t\) is sufficient, then both outer sides reduce to \(\mathbb{P}(\theta|t)\mathbb{P}(x|t)\).

If one of the conditions holds, say \(\mathbb{P}(x | t, \theta)=\mathbb{P}(x|t)\), then we could cancel \(\mathbb{P}(x|t)\) from both sides to get \(\mathbb{P}(\theta|t) = \mathbb{P}(\theta|t,x)\).

Likelihood

One tool we will use a lot is the likelihood function. Viewed as a function of the parameter \(\theta\), the likelihood \(\mathcal{L}(\theta|x)\) measures the probability of having seen the specific sample \(x\) if \(\theta\) were the true value of the parameter. In the continuous case, we use the density instead.

\[ \mathcal{L}(\theta|x) = \mathbb{P}(x|\theta) \qquad \mathcal{L}(\theta|x) = p(x|\theta) \]

Theorem \(t=T(x)\) is a sufficient statistic for a parameter \(\theta\) if and only if the likelihood factors into two nonnegative functions \[ \mathcal{L}(\theta|x) = g(t,\theta)\cdot h(x) \] where \(g\) depends only on \(t\) and \(\theta\), and \(h\) depends only on the sample \(x\).

Factorization theorem

Proof Sufficiency implies factorization

Since \(T\) is a function of \(x\), the joint probability is
\(\mathbb{P}(x,T(x)|\theta) = \mathbb{P}(x|\theta)\).

By the definition of conditional probability,
\(\mathbb{P}(x,T(x)|\theta) = \mathbb{P}(x|T(x),\theta)\mathbb{P}(T(x)|\theta)\)

If \(T\) is sufficient, \(\mathbb{P}(x|T(x),\theta)\) does not depend on \(\theta\), so it is some function \(h(x)\). The probability \(\mathbb{P}(T(x)|\theta)\) depends only on \(T(x)\) and \(\theta\), and corresponds to \(g(T(x),\theta)\).

Factorization theorem

Proof Factorization implies sufficiency

Factorization tells us
\(\mathbb{P}(x|\theta) = g(T(x),\theta)\cdot h(x)\).

Consider the conditional probability \[ \mathbb{P}(x|T(x),\theta) = \frac{\mathbb{P}(x,T(x)|\theta)}{\mathbb{P}(T(x)|\theta)} = \frac{\mathbb{P}(x|\theta)}{\mathbb{P}(T(x)|\theta)} \]

For a general value \(t\), the conditional probability \(\mathbb{P}(x|t,\theta)\) is 0 if \(t\neq T(x)\); for \(t = T(x)\), \[ \frac{\mathbb{P}(x|\theta)}{\mathbb{P}(T(x)|\theta)} = \frac{g(T(x),\theta)\cdot h(x)}{\mathbb{P}(T(x)|\theta)} \]

Factorization theorem

Proof Factorization implies sufficiency…

By summing disjoint outcomes \[ \mathbb{P}(T(x)|\theta) = \sum_{x':T(x')=T(x)} \mathbb{P}(x'|\theta) = g(T(x),\theta)\sum_{x':T(x')=T(x)} h(x') \]
Thus, writing \(\sum_{x'}\) for the sum over \(x'\) with \(T(x')=T(x)\), \[ \mathbb{P}(x|T(x),\theta) = \frac{g(T(x),\theta)\cdot h(x)}{\mathbb{P}(T(x)|\theta)} = \frac{g(T(x),\theta)\cdot h(x)}{g(T(x),\theta)\sum_{x'} h(x')} = \\ \frac{h(x)}{\sum_{x'}h(x')} \qquad \text{This does not depend on $\theta$.} \]

Example

\(X_1,\dots,X_n \sim \text{Bernoulli}(p)\). \(Y=\sum X_i\), the number of successes.

\[ \mathbb{P}(x|p) = p^{y}(1-p)^{n-y} \]

Pick \(g(y,p)=p^{y}(1-p)^{n-y}\) and \(h(x)=1\).

Example

\(X_1,\dots,X_n\sim\text{Uniform}(a,b)\). Each \(X_i\) has density \(\frac{1}{b-a}\) on \((a,b)\), so the joint density is 0 if any \(X_i\not\in(a,b)\) and \(\frac{1}{(b-a)^n}\) if all \(X_i\in(a,b)\).

\[ \mathbb{P}(x|a,b) = \prod_j \frac{\mathbb{1}_{(a,b)}(x_j)}{b-a} = \frac{1}{(b-a)^n}\mathbb{1}_{(a,b)}(\min x_i)\mathbb{1}_{(a,b)}(\max x_i) \]

Pick \(g((\min x_i,\max x_i)^T, (a,b)^T)\) to be this; \(h(x)=1\).
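A sketch of this sufficiency (NumPy assumed; the sample values are made up): two samples with the same \((\min, \max)\) have identical \(\text{Uniform}(a,b)\) likelihoods for every \((a,b)\), since here \(h(x)=1\).

```python
import numpy as np

def lik(x, a, b):
    """Uniform(a, b) likelihood of an i.i.d. sample x."""
    x = np.asarray(x)
    if x.min() <= a or x.max() >= b:   # some point outside (a, b)
        return 0.0
    return (b - a) ** (-len(x))

x1 = [0.2, 0.5, 0.8, 0.3]
x2 = [0.2, 0.7, 0.8, 0.6]   # same min and max, different interior points

# Likelihoods agree for every candidate (a, b).
vals = [(lik(x1, a, b), lik(x2, a, b)) for a, b in [(0, 1), (-1, 2), (0.1, 0.9)]]
print(vals)
```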

Example

\(X_1\dots X_n\sim\text{Normal}(\mu,\sigma^2)\). Suppose \(\sigma^2\) is known.

\[ p(x|\mu) = \prod_j \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(x_j-\mu)^2}{2\sigma^2}\right] = \\ \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left[-\frac{1}{2\sigma^2}\sum_j(x_j^2-2x_j\mu+\mu^2)\right] \propto \\ \color{blue}{\exp\left[-\frac{\sum_j x_j^2}{2\sigma^2}\right]} \color{green}{\exp\left[-\frac{n\mu^2-2\mu\sum_j x_j}{2\sigma^2}\right]}\color{black} = \color{blue}{h(x)}\color{green}{g\left(\sum_j x_j, \mu\right)} \]

If \(\sigma^2\) had been unknown, notice that the proportionality constant does not depend on \(x\) and can be absorbed into \(g\); the pair \(\left(\sum_j x_j, \sum_j x_j^2\right)\) is then sufficient for \((\mu,\sigma^2)\).
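A numerical sketch of the known-\(\sigma^2\) case (NumPy assumed; the sample values are made up): two samples sharing the same \(\sum_j x_j\) have a log-likelihood difference that is constant in \(\mu\), because the \(\mu\)-dependence enters only through \(g\left(\sum_j x_j,\mu\right)\).

```python
import numpy as np

sigma2 = 1.0   # assumed known variance

def loglik(x, mu):
    """Normal(mu, sigma2) log-likelihood of an i.i.d. sample x."""
    x = np.asarray(x)
    return (-0.5 * len(x) * np.log(2 * np.pi * sigma2)
            - np.sum((x - mu) ** 2) / (2 * sigma2))

x1 = [1.0, 2.0, 3.0]   # sum = 6
x2 = [0.5, 2.5, 3.0]   # same sum, different values

# The difference is the same for every mu: only h(x) distinguishes them.
diffs = [loglik(x1, mu) - loglik(x2, mu) for mu in (-1.0, 0.0, 2.0)]
print(diffs)
```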