In the previous handout we learned briefly how to simulate values of random variables. In this project, we want to continue looking at the distribution of some common statistics.

You may wish to investigate these statistics side by side. For example, if we create 100 rows and 10 columns of random data ($n=10$), then you might put $\bar{y}$ in column 11, the median in column 12, and the midrange (below) in column 13. To make a side-by-side boxplot, you need to “stack” these columns. The command to do so here would look like:

MTB> stack c11 c12 c13 c14;
Subscripts c15.

The ; and the . are important: the semicolon tells Minitab that a subcommand follows, and the period ends the command. This stacks the first 3 columns into the fourth (c14) and provides subscripts in column c15. Now we can do a side-by-side boxplot where the y variable is column c14 and the x variable for categories is column c15.
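If it helps to see what stacking does to the data, here is a minimal Python sketch of the same idea. It is not Minitab code: the function `stack` and the tiny three-value columns are made up for illustration, and it assumes the subscripts are the integers 1, 2, 3 marking which column each value came from.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical stand-ins for worksheet columns c11, c12, c13 (three values each).
c11 = [random.gauss(10, 5) for _ in range(3)]
c12 = [random.gauss(10, 5) for _ in range(3)]
c13 = [random.gauss(10, 5) for _ in range(3)]

def stack(*columns):
    """Concatenate the columns and record, for each value, which column it came from."""
    values, subscripts = [], []
    for index, column in enumerate(columns, start=1):
        values.extend(column)
        subscripts.extend([index] * len(column))
    return values, subscripts

c14, c15 = stack(c11, c12, c13)
print(c15)  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

The stacked values (c14) are what the boxplot draws, and the subscripts (c15) tell it which box each value belongs to.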

We start with $\bar{y}$, the sample mean, which is given by the formula

$$\bar{y} = \frac{X_1 + \cdots + X_n}{n}$$

If the distribution of each $X$ is normal with mean $\mu$ and standard deviation $\sigma$, we know that the distribution of $\bar{y}$ is also normal. It has the same mean, $\mu_{\bar{y}} = \mu$, but a smaller standard deviation: $\sigma_{\bar{y}} = \sigma/\sqrt{n}$.
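A quick simulation can make this concrete. The sketch below is in Python rather than Minitab, and the values $\mu = 10$, $\sigma = 5$, $n = 10$, and the 2000 repetitions are choices made for this illustration. The helper `simulate_means` is a hypothetical name, not something from the previous handout.

```python
import random
import statistics

def simulate_means(draw, n, reps=2000):
    """Return `reps` sample means, each computed from n values produced by draw()."""
    return [statistics.fmean(draw() for _ in range(n)) for _ in range(reps)]

random.seed(2)              # fixed seed so reruns give the same numbers
mu, sigma, n = 10, 5, 10    # illustrative values chosen for this sketch

means = simulate_means(lambda: random.gauss(mu, sigma), n)

print("mean of the simulated y-bars:", round(statistics.fmean(means), 3))
print("sd of the simulated y-bars:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):             ", round(sigma / n ** 0.5, 3))
```

The first number should land near $\mu$ and the second near $\sigma/\sqrt{n}$, up to simulation noise.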

What about when each $X$ is not normal? Suppose $X$ is exponential with mean $10$. For $n=10$ and $n=100$, describe what you can about the distribution of $\bar{y}$.
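One way to explore this outside Minitab is sketched below in Python. The repetition count and the helper names `skewness` and `exponential_means` are assumptions of this sketch. Note that `random.expovariate` takes the rate, which is 1 divided by the mean. The skewness helper is a simple average of cubed z-scores, used here only as a rough indicator of asymmetry.

```python
import random
import statistics

def skewness(values):
    """Average cubed z-score: near 0 for symmetric data, positive for a long right tail."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)

def exponential_means(n, reps=4000, mean=10):
    """Return `reps` sample means of n exponential values with the given mean."""
    rate = 1 / mean   # random.expovariate is parameterized by the rate, 1 / mean
    return [statistics.fmean(random.expovariate(rate) for _ in range(n))
            for _ in range(reps)]

random.seed(3)
summary = {}
for n in (10, 100):
    means = exponential_means(n)
    # (mean, standard deviation, skewness) of the simulated sample means
    summary[n] = (statistics.fmean(means), statistics.stdev(means), skewness(means))
    print(n, [round(x, 3) for x in summary[n]])
```

Comparing the three numbers for $n=10$ and $n=100$ gives a starting point for describing the center, spread, and shape of $\bar{y}$.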

The median of $(X_1,X_2,...,X_n)$ is an interesting statistic. It tells us about the center of the data in a robust way. What is its distribution?

First, let’s suppose each $X$ is normal with mean 10 and standard deviation 5. For $n=10$ and $n=100$, describe the distribution of the median. (What are its mean and median? Is it normal, or approximately normal?)
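The same simulate-and-summarize pattern works for the median. The Python sketch below is one possible way to set it up. The function name `simulate_medians` and the 2000 repetitions are choices made here for illustration. The last entry in each summary is $\sigma/\sqrt{n}$, the standard deviation of $\bar{y}$, printed only as a reference point for comparison.

```python
import random
import statistics

def simulate_medians(n, reps=2000, mu=10, sigma=5):
    """Return `reps` sample medians, each from n normal(mu, sigma) values."""
    return [statistics.median([random.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]

random.seed(4)
median_summary = {}
for n in (10, 100):
    medians = simulate_medians(n)
    median_summary[n] = {
        "mean": statistics.fmean(medians),
        "median": statistics.median(medians),
        "sd": statistics.stdev(medians),
        "sd of y-bar (theory)": 5 / n ** 0.5,
    }
    print(n, {k: round(v, 3) for k, v in median_summary[n].items()})
```

A histogram or normal probability plot of the simulated medians would be the natural next step for judging normality.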

The midrange of $(X_1,X_2,...,X_n)$ is $(Q_1 + Q_3)/2$, the halfway mark of the box in the boxplot. It too is a measure of the center of a distribution. Extra bonus points if you can figure out how to store the midrange in a column and look at its distribution.
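As a rough illustration of the computation (in Python, not Minitab), the sketch below takes the midrange to be the midpoint of the box, $(Q_1 + Q_3)/2$, since it is described as a measure of center. The name `box_midpoint` is hypothetical. Quartiles are computed with `statistics.quantiles` using its default method, and other software may use a slightly different quartile convention.

```python
import random
import statistics

def box_midpoint(row):
    """Return (Q1 + Q3) / 2, the midpoint of the box in a boxplot of `row`."""
    q1, _q2, q3 = statistics.quantiles(row, n=4)   # the three quartile cut points
    return (q1 + q3) / 2

print(box_midpoint([1, 2, 3, 4, 5, 6, 7]))   # Q1 = 2 and Q3 = 6, so this prints 4.0

random.seed(5)
rows = [[random.gauss(10, 5) for _ in range(10)] for _ in range(100)]  # 100 rows, n = 10
c13 = [box_midpoint(row) for row in rows]   # plays the role of worksheet column 13
print(round(statistics.fmean(c13), 3), round(statistics.stdev(c13), 3))
```

The list `c13` holds one value per row, just as column 13 would in the worksheet, so it can be summarized or plotted like the other statistics.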

In practice the population mean may be known (or hypothesized) while the variability is not. To test whether your sample average is far from the true mean of the population, a $t$-test is performed. It looks at the following statistic:

$$T= \frac{\bar{y} - \mu}{s/\sqrt{n}}$$

(This is like standardizing $\bar{y}$, which has mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, except that we have to estimate $\sigma$ by $s$, the sample standard deviation.)

Investigate the distribution of $T$ under the following assumptions:
- First, let $X$ be normal with mean $10$ and standard deviation 5. Let $n=5$. (Note: if we divided by $\sigma$ instead of $s$, the result would be exactly normal with mean 0 and standard deviation 1.)
- Same scenario as above, but $n=15$.
- Now let $n=35$.
- Okay, let’s see what happens if $X$ is not normal. Suppose $X$ is uniform on the interval $[-1,1]$, so that $\mu = 0$ and $\sigma = 1/\sqrt{3}$. Repeat the above for the 3 values of $n$.

In each case, you may want to compare $T$ to the $Z$ statistic, which is

$$Z = \frac{\bar{y} - \mu}{\sigma/\sqrt{n}}.$$
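To compare $T$ and $Z$ across all the scenarios at once, something like the following Python sketch could be used. It is only one possible setup. The helper names, the 4000 repetitions, and the cutoff 1.96 (the value beyond which a standard normal falls about 5% of the time in absolute value) are choices made for this illustration. For the uniform case it uses $\mu = 0$ and $\sigma = 1/\sqrt{3}$, the mean and standard deviation of a uniform distribution on $[-1,1]$.

```python
import math
import random

def t_and_z(sample, mu, sigma):
    """Return the T and Z statistics for one sample."""
    n = len(sample)
    ybar = sum(sample) / n
    s = math.sqrt(sum((x - ybar) ** 2 for x in sample) / (n - 1))  # sample sd
    t = (ybar - mu) / (s / math.sqrt(n))
    z = (ybar - mu) / (sigma / math.sqrt(n))
    return t, z

def tail_fractions(draw, mu, sigma, n, reps=4000, cutoff=1.96):
    """Fraction of simulated |T| and |Z| values that exceed `cutoff`."""
    t_hits = z_hits = 0
    for _ in range(reps):
        t, z = t_and_z([draw() for _ in range(n)], mu, sigma)
        t_hits += abs(t) > cutoff
        z_hits += abs(z) > cutoff
    return t_hits / reps, z_hits / reps

random.seed(6)
scenarios = {
    "normal": (lambda: random.gauss(10, 5), 10, 5),
    "uniform": (lambda: random.uniform(-1, 1), 0, 1 / math.sqrt(3)),
}
tails = {}
for name, (draw, mu, sigma) in scenarios.items():
    for n in (5, 15, 35):
        tails[(name, n)] = tail_fractions(draw, mu, sigma, n)
        print(name, n, tails[(name, n)])
```

Each printed pair is (fraction for $T$, fraction for $Z$). Watching how the first number moves toward the second as $n$ grows is one way to see the effect of estimating $\sigma$ by $s$.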