We then talk about upper or lower confidence bounds. These are computed using \(z_\alpha = CDF^{-1}(1-\alpha)\) instead of \(z_{\alpha/2}\), and are derived from a pivot expression involving just one inequality.
Small Samples: Student’s T Distribution
Normal Random Sample Distributions (Ch. 6.4)
Chapter 6.4 introduces several important distributions relating directly to the behavior of normal random samples. Suppose here that \(Z, Z_1,\dots,Z_n\sim\mathcal{N}(0,1)\) iid.
\(Z^2 \sim \chi^2(1)\)
If \(X_1\sim\chi^2(\nu_1)\) and \(X_2\sim\chi^2(\nu_2)\) then \(X_1+X_2\sim\chi^2(\nu_1+\nu_2)\).
So \(Z_1^2+\dots+Z_n^2\sim\chi^2(n)\)
If \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) iid, then the MLE \(\hat\sigma^2=\frac{1}{n}\sum(X_i-\mu)^2\) (with \(\mu\) known) is connected to a chi-squared random variable: \((X_i-\mu)/\sigma \sim \mathcal{N}(0,1)\), and so we get a sum of squares of standard normal random variables: \[
\frac{n\hat\sigma^2}{\sigma^2} = \sum\left(\frac{X_i-\mu}{\sigma}\right)^2 \sim\chi^2(n)
\]
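A quick simulation sketch in R illustrates this (the choices of \(\mu\), \(\sigma\), \(n\) and the replication count are arbitrary):

```r
# Simulation sketch: check that n * sigma_hat^2 / sigma^2 follows chi^2(n)
# (mu, sigma, n and the number of replications are arbitrary choices)
set.seed(1)
mu <- 3; sigma <- 2; n <- 10
stat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  sum(((x - mu) / sigma)^2)   # = n * sigma_hat^2 / sigma^2 with mu known
})
# Compare simulated quantiles to the chi-squared(n) quantiles
round(rbind(simulated = quantile(stat, c(.25, .5, .75, .95)),
            chisq     = qchisq(c(.25, .5, .75, .95), df = n)), 2)
```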
It is even true that if \(X_1\sim\chi^2(\nu_1)\) and \(X_3 = X_1+X_2\) with \(X_3\sim\chi^2(\nu_3)\), \(\nu_3>\nu_1\), and \(X_1,X_2\) independent, then \(X_2\sim\chi^2(\nu_3-\nu_1)\).
We can also consider \[
\begin{align*}
\sum(X_i-\mu)^2 &= \sum[(X_i-\overline{X})+(\overline{X}-\mu)]^2 \\
&= \sum(X_i-\overline{X})^2 + \color{DarkMagenta}{2(\overline{X}-\mu)\sum(X_i-\overline{X})} + \sum(\overline{X}-\mu)^2
\end{align*}
\] The middle term vanishes because \(\sum(X_i-\overline{X}) = \sum X_i-n\overline{X} = n\overline{X}-n\overline{X}=0\). Divide through by \(\sigma^2\): \[
\sum\left(\frac{X_i-\mu}{\sigma}\right)^2 =
\sum\left(\frac{X_i-\overline{X}}{\sigma}\right)^2 + n\left(\frac{\overline{X}-\mu}{\sigma}\right)^2 =
\frac{(n-1)S^2}{\sigma^2} + \left(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\right)^2
\]
Since \(\overline{X}\sim\mathcal{N}(\mu,\sigma^2/n)\), with \(X_3=\sum\left(\frac{X_i-\mu}{\sigma}\right)^2\sim\chi^2(n)\) and \(X_1=\left(\frac{\overline{X}-\mu}{\sigma/\sqrt{n}}\right)^2\sim\chi^2(1)\) we get that \(X_2=(n-1)S^2/\sigma^2\sim\chi^2(n-1)\).
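A simulation sketch (again with arbitrary parameter choices) comparing \((n-1)S^2/\sigma^2\) to the \(\chi^2(n-1)\) distribution:

```r
# Simulation sketch: (n-1) S^2 / sigma^2 should follow chi^2(n-1)
# (parameter values are arbitrary illustration choices)
set.seed(2)
mu <- 3; sigma <- 2; n <- 10
stat <- replicate(10000, (n - 1) * var(rnorm(n, mu, sigma)) / sigma^2)
c(simulated_mean = mean(stat), theoretical_mean = n - 1)  # chi^2(n-1) has mean n-1
# Graphical check against the chi^2(n-1) quantiles
qqplot(qchisq(ppoints(10000), df = n - 1), stat,
       xlab = "chi^2(n-1) quantiles", ylab = "simulated quantiles")
```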
Normal Random Sample Distributions (Ch. 6.4)
In the previous two slides we have tacitly used a result telling us that \(\overline{X}\) is independent of \(S^2\), which holds for normal distributions but not necessarily for non-normal distributions.
Suppose \(Z\sim\mathcal{N}(0,1)\) and \(X\sim\chi^2(\nu)\) are independent random variables. Student’s T-distribution with \(\nu\) degrees of freedom is defined as the distribution of the ratio
\[
T = \frac{Z}{\sqrt{X/\nu}}
\]
It turns out that this distribution has PDF given by (Helmert (1875-1876), Lüroth (1876), Pearson (1895), Student (1908)) \[
PDF(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\Gamma\left(\frac{\nu}{2}\right)}\left(1+\frac{t^2}{\nu}\right)^{-(\nu+1)/2}
\]
The resulting curve looks like a heavy-tailed normal distribution, and can be accessed in R using dt, pt, qt, rt and in Python using scipy.stats.t.
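As a sanity check, the closed-form PDF above can be compared against R's built-in dt (the choice of \(\nu\) and of the evaluation grid is arbitrary):

```r
# Compare the closed-form T-distribution PDF to R's built-in dt()
t_pdf <- function(t, nu) {
  gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
    (1 + t^2 / nu)^(-(nu + 1) / 2)
}
nu <- 5                            # arbitrary degrees of freedom
t_grid <- seq(-4, 4, by = 0.5)     # arbitrary evaluation points
max(abs(t_pdf(t_grid, nu) - dt(t_grid, df = nu)))   # essentially zero
```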
Sidebar: The Student in Student’s T
William Sealy Gosset worked at the Guinness brewery and was studying the barley sources used for beer production. He collaborated with Ronald Fisher to develop a method that could work for potentially very small sample sizes, and published his method.
At the time, the brewery had had relatively recent problems with industrial secrets leaking through scientific publication, and had a blanket ban on publishing for their staff. Gosset convinced the company leadership that these results contained no industrial secrets, but to prevent envy among the other staff engineers, they required him to publish pseudonymously.
So in the journal Biometrika in 1908 appeared the article The Probable Error of a Mean, written by “Student” developing the T-distribution and a corresponding set of statistical tests.
Student’s T and CIs of normal variables
Theorem
If \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\) iid, then \[
T = \frac{\overline{X}-\mu}{S/\sqrt{n}} \sim T(n-1)
\]
since \(\overline{X}\sim\mathcal{N}(\mu,\sigma^2/n)\), the quantity \((\overline{X}-\mu)/(\sigma/\sqrt{n})\) is a standard normal random variable.
as previously derived, \((n-1)S^2/\sigma^2\sim\chi^2(n-1)\)
So with \(Z=(\overline{X}-\mu)/(\sigma/\sqrt{n})\), \(X=(n-1)S^2/\sigma^2\) and \(\nu=n-1\), we have fully matched the ratio form defining a T-distribution variable.
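A simulation sketch (arbitrary \(\mu\), \(\sigma\), \(n\)) comparing the studentized mean to the \(T(n-1)\) quantiles:

```r
# Simulation sketch: the studentized mean should follow T(n-1)
# (mu, sigma and n are arbitrary illustration choices)
set.seed(3)
mu <- 10; sigma <- 4; n <- 8
T_stat <- replicate(10000, {
  x <- rnorm(n, mu, sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})
round(rbind(simulated = quantile(T_stat, c(.025, .5, .975)),
            t_dist    = qt(c(.025, .5, .975), df = n - 1)), 2)
```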
Confidence Intervals for normal distributions
We have concluded that \(T = (\overline{X}-\mu)/(S/\sqrt{n})\sim T(n-1)\). Notice that even without knowing \(\sigma^2\), this forms a pivot quantity for \(\mu\).
Just like the standard normal, each T-distribution is symmetric around 0 and unimodal. The limit as the DoF\(\to\infty\) is the standard normal distribution.
Write \(t_{\alpha,\nu}=CDF^{-1}_{T(\nu)}(1-\alpha)\) for the critical value with an upper tail of probability mass \(\alpha\).
We get, for \(X_1,\dots,X_n\sim\mathcal{N}(\mu,\sigma^2)\): \[
\PP(-t_{\alpha/2, n-1}<T<t_{\alpha/2,n-1}) = 1-\alpha
\] And thus have to solve the simultaneous inequalities \[
\begin{align*}
-t_{\alpha/2, n-1}&<\frac{\overline{X}-\mu}{S/\sqrt{n}} &
\frac{\overline{X}-\mu}{S/\sqrt{n}} &< t_{\alpha/2,n-1} \\
-t_{\alpha/2, n-1}S/\sqrt{n} &< \overline{X}-\mu &
\overline{X}-\mu &< t_{\alpha/2, n-1}S/\sqrt{n} \\
-(\overline{X}+t_{\alpha/2, n-1}S/\sqrt{n}) &<-\mu &
-\mu &< -(\overline{X}-t_{\alpha/2, n-1}S/\sqrt{n})
\end{align*}
\]
This gives us the confidence interval \(\mu\in\overline{X}\pm t_{\alpha/2, n-1}S/\sqrt{n}\)
Alert: Our derivation assumes the \(X_i\) are normally distributed. Check your data with e.g. a QQ-plot (especially for small \(n\)) before applying this CI.
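A minimal sketch of the computation in R, on made-up data assumed to be approximately normal, compared against the built-in t.test:

```r
# Sketch: 95% t-based CI for the mean of a hypothetical sample x
x <- c(4.2, 5.1, 3.8, 4.9, 5.6, 4.4)        # made-up illustration data
n <- length(x); alpha <- 0.05
ci <- mean(x) + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * sd(x) / sqrt(n)
ci
t.test(x, conf.level = 1 - alpha)$conf.int   # built-in computation agrees
```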
Example (Illustration II from Student (1908))
Sample seed corn from both hard (glutinous) wheat and soft (starchy) wheat were picked in three successive years and planted in heavy and light soil.
Voelcker, who ran the initial experiment, was investigating whether heavy soil would consistently produce hard corn, and was able to show this, but Student noticed that the soft seeds tended to produce higher yields. He gives the following data extracted from Voelcker’s reports (yields in grammes per pot):
Student studies the differences in yield between soft seed and hard seed, and gets the values:
|                         | Average   | Std.dev.  | 95% CI                   |
|-------------------------|-----------|-----------|--------------------------|
| Corn yield difference   | 0.685     | 0.7664628 | (-0.1193534, 1.4893534)  |
| Straw yield difference  | 1.4816667 | 1.1707478 | (0.2530422, 2.7102912)   |
(There are typos in this table in the published paper that influence the computed averages and standard deviations, and the published paper divides by \(n\), not \(n-1\), to get the sample std.dev.)
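The intervals in the table can be reproduced from the summary statistics alone; a quick sketch in R, taking \(n=6\) pots per difference (three years times two soil types, consistent with the 5 degrees of freedom used below):

```r
# Reproducing the 95% CIs in the table from the summary statistics,
# assuming n = 6 observations per difference (3 years x 2 soils)
xbar <- c(corn = 0.685, straw = 1.4816667)
s    <- c(corn = 0.7664628, straw = 1.1707478)
n    <- 6
half_width <- qt(0.975, df = n - 1) * s / sqrt(n)
cbind(lower = xbar - half_width, upper = xbar + half_width)
# reproduces the intervals shown in the table
```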
Example (Illustration II from Student (1908))
In order to use the method, we should really inspect the values we work with for normality. This is most easily done with a QQ-plot (called probability plot in our book):
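In R such a check could look like the following sketch; `corn_diff` stands in for the six corn yield differences, which are not reproduced on this slide, so a made-up placeholder vector is used to keep the code runnable:

```r
# Sketch of a normality check via QQ-plot; corn_diff is a placeholder
# for the six corn yield differences (illustration values only)
corn_diff <- c(0.2, 1.1, -0.3, 0.9, 1.5, 0.7)
qqnorm(corn_diff, main = "Normal QQ-plot of corn yield differences")
qqline(corn_diff)
```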
For a glimpse of future topics, Student reports an \(x/s\) ratio of \(0.88\) for a \(T\)-score of \(0.88\sqrt{5}\) and a p-value of \(0.9468846\) for the corn, and an \(x/s\) ratio of \(1.20\) for a \(T\)-score of \(1.20\sqrt{5}\) and a p-value of \(0.9781759\) for the straw. There are plenty of small oddities with the published paper: I need to compute pt(0.88*sqrt(5), 5) and pt(1.20*sqrt(5), 5) to get even close to Student’s reported values, but arguably I should be computing pt(0.75*sqrt(6), 5) and pt(1.05*sqrt(6), 5).
Prediction vs. Estimation
Suppose instead of estimating \(\mu\), we wanted to predict a likely next observed value.
Setup: \(X_1,\dots,X_{n+1}\sim\mathcal{N}(\mu,\sigma^2)\) iid. We can create a point predictor \(\overline{X}=\frac{1}{n}\sum_{i=1}^nX_i\) (note the summation bounds!), which on its own gives no information about the degree of certainty.
The expected prediction error is \[
\EE[\overline{X}-X_{n+1}] = \EE[\overline{X}]-\EE[X_{n+1}]=\mu-\mu=0
\]
\(X_{n+1}\) is independent of \(X_1,\dots,X_n\), so there is no covariance term in the variance computation. The variance of the prediction error is \[
\VV[\overline{X}-X_{n+1}] =
\VV[\overline{X}] + \VV[X_{n+1}] =
\frac{\sigma^2}{n} + \sigma^2 = \sigma^2\left(1+\frac{1}{n}\right)
\]
Since prediction error is a linear combination of independent normally distributed variables, it is itself normally distributed, so \[
Z = \frac{(\overline{X}-X_{n+1})-0}{\sqrt{\sigma^2(1+1/n)}} =
\frac{\overline{X}-X_{n+1}}{\sqrt{\sigma^2(1+1/n)}} \sim\mathcal{N}(0,1)
\]
Using this as a pivot quantity, we get a prediction interval \(X_{n+1}\in\overline{X}\pm z_{\alpha/2}\sigma\sqrt{1+1/n}\).
By analogy with the derivation of the T-distribution CI, we can also derive a small sample prediction interval \(X_{n+1}\in\overline{X}\pm t_{\alpha/2, n-1}s\sqrt{1+1/n}\).
Notice how with large enough samples, any confidence interval can shrink arbitrarily small, whereas the width of a prediction interval is lower bounded by \(2z_{\alpha/2}\sigma\).
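A sketch contrasting the two intervals on made-up data (all parameter choices arbitrary):

```r
# Sketch: 95% confidence interval for mu vs. 95% prediction interval
# for the next observation, on made-up normal data
set.seed(4)
x <- rnorm(50, mean = 20, sd = 3)
n <- length(x); xbar <- mean(x); s <- sd(x); tcrit <- qt(0.975, df = n - 1)
conf_int <- xbar + c(-1, 1) * tcrit * s / sqrt(n)         # shrinks like 1/sqrt(n)
pred_int <- xbar + c(-1, 1) * tcrit * s * sqrt(1 + 1 / n) # width bounded below
rbind(confidence = conf_int, prediction = pred_int)
```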
The variance of the estimator is equal to the lower bound, so the estimator is efficient.
by Victoria Paukova: The theorem states that the normal distribution has mean \(\lambda\) and variance \(1/(n\mathcal{I}(\theta))\), which we saw, as \(\lambda=\overline{x}\) and the variance was equal to \(\lambda/n\).
This meant that \(-\theta\leq\min(x_i)\leq\theta\), \(-\theta\leq\max(x_i)\leq\theta\).
So \(-\theta\leq\min(x_i)\) and \(\theta\geq\max(x_i)\).
Since we have \(-\min(x_i)\leq\theta\) and \(\max(x_i)\leq\theta\)
However \(f\) is a decreasing function of \(\theta\), so \(\max(x_i)\) is the MLE.
MVJ: What if your sample was \(-5, -2, 1, 2\)?
You need \(\theta\geq\max(x_i)\) as well as \(\theta\geq-\min(x_i)\), so you might want \(\hat\theta=\max(|\max(x_i)|, |\min(x_i)|)\) or something like that.
It is biased because it will always underestimate \(\theta\). \(\theta\) can be bigger than \(\hat\theta\), but not vice versa. Hence \(\EE[\hat\theta] < \theta\).
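On the sample from the question above, a one-line sketch of the suggested estimator:

```r
# MLE sketch for Uniform(-theta, theta) on the sample from the question:
# theta_hat must cover both the largest and the most negative observation
x <- c(-5, -2, 1, 2)
theta_hat <- max(abs(x))   # equivalently max(-min(x), max(x))
theta_hat                  # 5
```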
\(f(x_1,\dots,x_n|\theta) = g(\min(x_i), \max(x_i); \theta)\). With \(h(x_1,\dots,x_n)=1\) in the factorization theorem, \(\min x_i\) and \(\max x_i\) are sufficient statistics.
7.50 - Mikael Vejdemo-Johansson
For the order statistics, Section 5.5 suggests that \[
\begin{align*}
PDF(x_{(1)}, x_{(n)} | \theta) &=
\iiint n!\cdot PDF(x_{(1)})\cdot PDF(x_{(2)})\cdot\dots\cdot PDF(x_{(n)}) dx_{(2)}\dots dx_{(n-1)} \\
&= \iiint n!\frac{1}{(2\theta)^n} dx_{(2)}\dots dx_{(n-1)} \\
&= n!\frac{(2\theta)^{n-2}}{(2\theta)^n} = \frac{n!}{4\theta^2}
\qquad\text{if}\,-\theta\leq x_{(1)}\leq x_{(n)}\leq\theta
\end{align*}
\]
Figure 1: The region of the \(x_{(1)}-x_{(n)}\)-plane decomposed to compute expected values.
We split the \(x_{(1)}\)-\(x_{(n)}\) plane into four segments, as shown in Figure 1.
The four segments have expected values \(\frac{\theta n!}{4}, \frac{\theta n!}{12}, \frac{\theta n!}{4}, \frac{\theta n!}{12}\). These sum up (disjoint integrals add) to the expected value \(\theta n!\left(\frac{1}{4}+\frac{1}{12}+\frac{1}{4}+\frac{1}{12}\right) = \theta n!\frac{8}{12}\).
\(\frac{12}{8n!}\max(-x_{(1)}, x_{(n)})\) should be unbiased based on these calculations.
\(\sum x_i^2-\sum y_i^2 = 0\), so \(\sum x_i^2=\sum y_i^2\).
MVJ: Note that this is enough. Since \(\exp\left[\sum x_i-\sum y_i\right]\) doesn’t involve \(\theta\), the sums can differ as long as the sums of squares agree.
A minimal sufficient statistic is \(\sum x_i^2\).
7.60 c. - MVJ
Note that now \(\sigma^2 = \theta^2\) and not \(\theta\).
This expression requires both \(\sum x_i^2=\sum y_i^2\) and \(\sum x_i=\sum y_i\) to remove all instances of \(\theta\), hence a minimal sufficient statistic is \((\sum x_i, \sum x_i^2)\).