Histograms and Empirical Distributions

In real life, we are presented with discrete data. For example, the average daily temperatures for a month might look like this:

$$67, 71, 65, 70, 74, 69, 68, 74, 74, 66, 72, 70, 70, 69, 74, 73, 73, 67, 71, 72, 68, 74, 72, 67, 74, 72, 74, 69, 68, 70.$$

We can represent this data graphically in several ways. One of the most common is the histogram; others are stem-and-leaf diagrams and box plots. We’ll look at the histogram, as it is related to a probability distribution that we will talk about later.

Basically, a histogram is a bar plot where the heights of the bars have a special meaning. First off, we need to set up bins. These are simply intervals that lie on the $x$ axis.

The area of a bar is proportional to the number of observations that fall in the bin corresponding to the bar.

The constant of proportionality is fixed by demanding that the total area be 1. (Well, not quite, as we’ll see.) So it is easy to find the height. If there are $n$ observations and the bin is the interval $(a,b]$, then the height $h$ of the bar above the bin is given by the formula

$$\text{Area of bar} = h (b - a) = \frac{1}{n} \#\{i : x_i \text{ is in } (a,b]\}$$

where $x_i$ is the $i$th observation.

For the example above, if we wanted a bin at $(65,70]$ then the height would satisfy

$$h (70-65) = \frac{14}{30} \approx 0.4667, \quad\text{so}\quad h \approx 0.0933,$$

as there are 14 observations in the range. (Notice that 65 is not in the bin, but 70 is.)

How can we get MATLAB to do this for us? Well, MATLAB has a command that will do this automatically. Its output is given in this picture:

As an example, though, we will write our own. These commands could easily be turned into an m-file.

>> x = [67,71,65,70,74,69,68,74,74,66,72,70,70,69,74,73,73,67,71,72,68,74,72,67,74,72,74,69,68,70];
                                % our data vector
>> a = [60,65,70,75];           % our bin edges: (60,65], (65,70], (70,75]
>> n = length(x); k = length(a);   % lengths of vectors
>> hold on                      % for plotting bar by bar
>> for i = 1:(k-1)              % loop over the bins
     count = sum(x > a(i)) - sum(x > a(i+1));   % observations in (a(i), a(i+1)]
     h(i) = count/(n*(a(i+1) - a(i)));          % height = area / width
     plot([a(i),a(i),a(i+1),a(i+1)], [0,h(i),h(i),0]);   % plot bar
   end                          % end for loop

The whole point of writing this is to show how comparing a vector to a value lets us count the number of times something happens.

Notice what is returned with

>> x > 70
 Columns 1 through 26:

  0  1  0  0  1  0  0  1  1  0  1  0  0  0  1  1  1  0  1  1  0  1  1  0  1  1

 Columns 27 through 30:

  1  0  0  0

That is, we get a 0 when that value of x is not greater than 70 and a 1 when it is. If we sum this, we count the number of times x is bigger than 70. Neat!
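The counting trick can be checked directly on the temperature data. The following is a minimal sketch; the variable names `count_above_70`, `by_difference`, and `by_mask` are chosen here for illustration only.

```matlab
% Temperature data from the example above.
x = [67,71,65,70,74,69,68,74,74,66,72,70,70,69,74,73,73,67,71,72, ...
     68,74,72,67,74,72,74,69,68,70];

% Summing a logical (0/1) vector counts its 1s.
count_above_70 = sum(x > 70);

% Observations in (65,70]: those above 65 minus those above 70.
by_difference = sum(x > 65) - sum(x > 70);

% The same count from one combined comparison.
by_mask = sum(x > 65 & x <= 70);

disp([count_above_70, by_difference, by_mask])
```

Both ways of counting the bin $(65,70]$ give 14, the value used in the worked example, and 15 observations exceed 70.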

Notice we said that the area adds to 1. Well, this isn’t quite the case if the bins don’t cover all the data.
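To see the total area in both situations, the sketch below computes the bar heights for two choices of bin edges. The helpers `bar_heights` and `total_area` are hypothetical names introduced here, not built-in commands; they simply apply the area formula to each bin $(a_i, a_{i+1}]$.

```matlab
% Temperature data from the example above.
x = [67,71,65,70,74,69,68,74,74,66,72,70,70,69,74,73,73,67,71,72, ...
     68,74,72,67,74,72,74,69,68,70];

% Hypothetical helper: height of the bar over each bin (a(i), a(i+1)].
bar_heights = @(x, a) arrayfun(@(i) ...
    sum(x > a(i) & x <= a(i+1)) / (numel(x) * (a(i+1) - a(i))), ...
    1:(numel(a) - 1));

% Total area = sum over bins of height times width.
total_area = @(x, a) sum(bar_heights(x, a) .* diff(a));

area_covering = total_area(x, [60, 65, 70, 75]);   % bins cover every observation
area_missing  = total_area(x, [65, 70, 75]);       % the observation 65 is in no bin

disp([area_covering, area_missing])
```

With edges 60, 65, 70, 75 the areas sum to 1. Dropping the first bin leaves out the single observation equal to 65, so the total is 29/30.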

Exercise: This data is the daily close of the stock market for two weeks:

$$8390,8173,8331,8107,8020,8089,8091,8262,8258,8212,8204,7970,8169,8217$$

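Make a histogram for this data with 5 equally spaced bins between the smallest and largest values. (Hint: If $x$ is your data then smallest=min(x) and largest=max(x) will help you find the range, and a = linspace(smallest,largest,6) will create the 6 equally spaced edges that bound 5 bins. Note that with bins of the form $(a,b]$ the smallest observation sits on the left edge of the first bin and is not counted.)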
Redo your histogram with 10 bins. Does the data have any sort of shape?
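As a reminder of how `linspace` relates edges to bins, here is a small sketch on made-up endpoints (0 and 10, not the stock data). `linspace(lo, hi, m)` returns `m` equally spaced points, which mark off `m - 1` bins.

```matlab
% Made-up endpoints, for illustration only.
lo = 0; hi = 10;

edges = linspace(lo, hi, 6);    % 6 equally spaced edges: 0 2 4 6 8 10
num_bins = numel(edges) - 1;    % 5 bins between consecutive edges
widths = diff(edges);           % every bin has width (hi - lo)/5 = 2

disp(edges)
disp(num_bins)
```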

Empirical Distributions. The histogram is a way of representing graphically something called an empirical distribution. This is a fancy name for the frequency with which the observations fall in some set. Here is a definition:

$$P_n((a,b)) = \frac{1}{n} \#\{i \leq n: x_i \text{ is in } (a,b)\}$$

Notice it depends on $n$ and the interval or “bin” that you are interested in. Many theorems in statistics describe what happens to this distribution, which is random as it depends on the data, as $n$ gets large.
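The definition translates directly into the vector comparisons used earlier. The sketch below uses the temperature data rather than the stock data. `P` is a hypothetical helper name, and it assumes the count is divided by $n$ so that the result is a proportion, in line with the factor $1/n$ in the histogram formula.

```matlab
% Temperature data from the first example.
x = [67,71,65,70,74,69,68,74,74,66,72,70,70,69,74,73,73,67,71,72, ...
     68,74,72,67,74,72,74,69,68,70];

% Hypothetical helper: proportion of the first n observations
% lying in the open interval (a,b). Dividing by n is an assumption.
P = @(x, n, a, b) sum(x(1:n) > a & x(1:n) < b) / n;

p10 = P(x, 10, 65, 70);   % first 10 observations only
p30 = P(x, 30, 65, 70);   % all 30 observations

disp([p10, p30])
```

Here `p10` is 4/10 and `p30` is 10/30, which shows how the value changes as more observations are included.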

Exercise: For the stock market data, find the following:

$P_7((8200,8300))$
$P_{14}((8200,8300))$