Using MATLAB in probability and statistics

Using MATLAB in probability and statistics

Department of Mathematics, CSI

Last updated on Sep 3, 1998


A much more complete tutorial on using MATLAB in Calculus may be found on-line at

http://www.math.csi.cuny.edu/matlab/tutorial/

CAVEAT: This is just a sketch of what it should be. If you would like see something appearing here, please let me know verzani@math.csi.cuny.edu.

1  Examples

This section simply records some examples. Eventually, they will be merged into the main body of the text.


Example: coin tossing
Often we will want to simulate coin tossing. Numerically we need to generate one of two numbers with equal probability. This does it for 0 and 1. How would you modify it so that you generated -1 and 1?

>> x = rand(1)                  % 1 random number in [0,1]. Uniform
>> x = floor(x+.5)              % [0,.5) -> 0 and [.5,1] -> 1

That's it. To do lots at a time, the rand function can be given an argument:

>> x = rand(1,100)              % 100 random numbers in a row
>> x = floor(x+.5)              % the same


Example: average coin tossing
The law of large numbers asserts that as the number of trials tends to infinity the average will tend to the true probability. More formally if X1, X2,... is a sequence of i.i.d. events then the average number of times Xi is in an event A will tend towards the probability of A. In symbols


lim
n- > ¥ 
#{i £ n : Xi Î A}
n
= P(A)
We can illustrate this with the following eexample of tossing coins.

If you assign heads to be a 1 tails to be a 0, then a sequence of coin tosses looks like this

(0,1,1,0,0,0,1,0,1,1,....) = (T,H,H,T,T,T,H,T,H,H,...)
The average number of heads for the first n trials is precisely

0
1
, 1
2
, 2
3
, 2
4
, 2
5
, 2
6
,...
We want to recreate this, and then plot it with MATLAB. Here we use the cumsum command which gives the cumulative sum. For the sequence of 0's and 1's above we have
cumsum([0,1,1,0,0,0,1,0,1,1]) = 0,1,2,2,2,2,3,3,4,5

Here we go:

>> n = 100;                     % number of random coin tosses
>> x = rand(1,n);               % x is in [0,1]
>> x = floor(x+.5);             % x is in {0,1}
>> y = cumsum(x);               % the numerator
>> z = y ./ (1:n);              % the denominator
>> plot(z)                      %  Should be converging to 0.5

The graphs you generate will be different each time, but they should all be converging to 0.5 by the end. The amount of difference is controlled by the central limit theorem


Example: Histograms and Empirical Distributions

In real life, we are presented with discrete data. For example the averaged daily temperature for a month may look like this:

67, 71, 65, 70, 74, 69, 68, 74, 74, 66, 72, 70, 70, 69, 74, 73, 73, 67, 71, 72, 68, 74, 72, 67, 74, 72, 74, 69, 68, 70.
We can represent this data graphically in several ways. One of the more common is a histogram, other common ones are stem-and-leaf diagrams, or box plots. We'll look at the histogram, as it is related to a probability distribution that we will talk about later.

Basically a histogram is a bar plot where the heights of the bars have a special meaning. First off, we need to set up bins. These are simply intervals that lie on the x axis.

The area of a bar is proportional to the frequency of times the observation is in the bin corresponding to the bar.

The proportion is fixed by demanding that the total area be 1. (Well not quite as we'll see). So it is easy to find the height. If there are n observations, and the bin is the interval (a,b] then the height of the bar above b would be given by the formula

\textArea of bar = h (b - a) = 1
n
#{i : xi \text is in (a,b]}
where xi is the ith observation.

For the example above if we wanted a bin at (65,70] then the height would be

h (70-65) = 1
30
14 = 0.4667
as there are 14 observations in the range. (notice 65 is not, but 70 is.)

How can we get MATLAB to do this for us? Well MATLAB has a command hist that will do this automatically. Its output is given in this picture

Figure

As an example though we will write our own. These commands could easily be turned into a m-file.

>> x =
[67,71,65,70,74,69,68,74,74,66,72,70,70,69,74,73,73,67,71,72,68,74,72,67,74,72,74,69,68,70]
                                % our data vector
>> a = [60,65,70,75]            % our bins.
>> n = length(x);k=length(a);   % lengths of vectors
>> hold on                      % for plotting line by line
>> for i=1:(k-1)                % a loop
>> hist(i) = (1/(n*(a(i+1)-a(i)))*(sum(x > a(i)) - sum(x>a(i+1)));
                                % This last line uses fancy stuff
>> plot([a(i),a(i),a(i+1),a(i+1)],[0,hist(i),hist(i),0]); % plot bar
>> end                          % end for loop
The whole point of writing this is to show the use of comparing a vector to a value to count the number of times something happens.

Notice what is returned with

>> x > 70
 Columns 1 through 26:

  0  1  0  0  1  0  0  1  1  0  1  0  0  0  1  1  1  0  1  1  0  1  1  0  1  1

 Columns 27 through 30:

  1  0  0  0

That is a 0 when that value of x is not greater than 70 and a 1 when it is. If we sum this we count the number of times x is bigger than 70. Neat!

Notice we said that the area adds to 1. Well this isn't quite the case if the bins aren't chosen big enough to count all the data.

Exercise: This data is the daily close of the stock market for two weeks

8390,8173,8331,8107,8020,8089,8091,8262,8258,8212,8204,7970,8169,8217

  1. Make a histogram for this data with 5 equally spaced bins between the smallest and largest values. (Hint: If x is your data then smallest=min(x) and largest=max(x) will help you find the range, and a = linspace(smallest,largest,5) will create 5 equally spaced bins.)
  2. Redo you histogram with 10 bins. Does the data have any sort of shape?

Empirical Distributions. The histogram is a way of representing graphically something called an empirical distribution. This is a fancy name for a frequency of times the observation is in some set. Here is a definition

Pn((a,b)) = #{i £ n: xi \text is in (a,b)}
Notice it depends on n and the interval or ``bin'' that you are interested in. Many theorems in statistics describe what happens to this distribution, which is random as it depends on the data, as n gets large.

Exercise: For the stock market data, Find the following:

  1. P7((8200,8300))
  2. P14((8200,8300))

2  Working with data

In MATLAB We use store data in a vector or a matrix . In order to use MATLAB effectively, we need to learn how to store data, and access the data once we have stored it.

2.1  storing data

2.1.1  storing a data set

Suppose we run an experiment and we get a bunch of data. HOw can we store thisin MATLAB so that we can then manipulate it with the computer rather than our pencil and paper?

As an example, suppose we had data on the age of students in a classroom:

21,19,20,26,20,21,21,28,19,19,35,20,21,23
To put this into MATLAB we will store it as a list or vector . The MATLAB command to do this is
  ages = [21,19,20,26,20,21,21,28,19,19,35,20,21,23];

What did we do here?

Here are some more examples of storing data:

Suppose we flip a coin 10 times and get sequence of heads and tails as follows:

H, T, T, H, T, H, H, H, T, H

We could store this in in a data vector by arbitrarily saying a Heads is a 1 and a Tails is a 0 giving

>> cointoss = [1,0,0,1,0,1,1,1,0,1]; % 1 for heads, 0 for tails

If we recorded the amount of gasoling we buy at each fillup we might get a sequence of data such as:

10.9, 10.8, 11.3, 13.2, 9.8, 9.7, 6.5 11.4

This goes into a data vector in the same way:

>> gas = [ 10.9, 10.8, 11.3, 13.2, 9.8, 9.7, 6.5 11.4];

We see that storing a given list of data is not too hard. It just takes some typing.

2.1.2  automatically generated lists of numbers

It may be that our data is not random in nature but rather something we can describe mathematically. For example we might wish to store the integers between 1 and 10. This could be done as

>>one_to_ten=[1,2,3,4,5,6,7,8,9,10]; % too much typing

Or this could be achiebed using MATLAB with the colon-operator :

>> one_to_ten=1:10;

In general the colon operator has a step size. So the command

>> x = a:h:b;

will assign to the variable x the list of values

a, a+h,a+2*h ... a+k*h

where k is the largest integer with a+k(h) £ b. The difference between successive numbers is h hence the name step-size.

Another way to do a similar thing is to specify the number of values you want between a and b. This is done with the linspace command:

>> x = linspace(a,b,n);         % n numbers between a and b

For those unafraid of algebra, you can show that this is the same as:

>> x = a:(a-b)/(n-1):b;         % same a linspace(a,b,n)

Or as:

>> x = (0:(n-1))*(b-a)/n+a;         % again the same

These two help you generate evenly space data. If you want to generate exponentially spaced data it is also easy:

>> n= 1:5;
>> x = 10.^n;                   % 10^1, 10^2,... 10^5

The only part to remember is you need the ``.^'' as MATLAB otherwise would not know how to raise 10 to a list of numbers.

Here is a list of a bunch of different ways to generate non-random data:

>> x = 1:3;                     % easy as 1,2,3
>> x = 10 - (1:10);             % counts down to 0. Need the
                                % parantheses
>> x = 1:2:99;                  % just the odd numbers
>> x = (-1).^(0:9);             % 1,-1,1,-1,...
>> x = 10.^(-(1:5));            % 10^(-1), 10^(-2), ..., 10^(-5)
>> x = sin((1:n)*(2*pi/n));     % sample the sine wave n times

2.1.3  two-dimensional data

To store more complex data, such as a joint distribution, or correlated data, we need to use matrices in place of vectors.

For example, suppose we have data given in a table for height and weight of a basketball team:
Height 72 76 80 81 85
Weight 180 205 245 250 300

We could store this with two different variables:

>> height = [72,76,80,81,85];
>> weight = [180,205,245,250,300];

Or we could combine the data into an array:

>> combined_data = [72,180; 76,205; 80,245; 81,250;85,300]
combined_data =

   72  180
   76  205
   80  245
   81  250
   85  300

The semicolon tells MATLAB to start a new row. Be careful, you need to have the same number of elements in each row for MATLAB to understand what you are doing.

This presents the data in a columnar form. To change this around use the transpose operator ':

>> combined_data'
ans =

   72   76   80   81   85
  180  205  245  250  300

To summarize:

With these basics, we should be able to use the already defined variables height and weight to define the array containing both:

>> [height',weight']
ans =

   72  180
   76  205
   80  245
   81  250
   85  300

>> [height,weight]
ans =

   72   76   80   81   85  180  205  245  250  300

>> [height;weight]
ans =

   72   76   80   81   85
  180  205  245  250  300

>> [height;weight]'
ans =

   72  180
   76  205
   80  245
   81  250
   85  300

Try to figure out exactly why each statement produced the given output.

2.1.4  Reading data from a file

You may want to read or write data to a file on a diskette, or the hard drive. To do this is easy. But you need to know the commands.

The save command saves data to a file. The load command loads data from a file.

For example suppose you have data in a vector x:

>> x= [1,2,1,4,2,5,1,3,2,1,4,2]; % some precious data

With save you can put this data into a file. You can give the file the name you wish, and also specify how the format is stored. Your options are: matlab binary (the default), ascii (plain text), tabs (tab seperated), and double (double precision).

Here are some examples:

>> save file.mat x;             % stores x in binary mode into
                                % file.mat incurrent directory
>> save file.txt x -ascii       % stores x as  text file
>> save a:\file.txt x -tab      % store as tab seperated on a floppy
                                % disk
>> save c:\matlab\file.txt x y z - ascii % save 3 variable to hard
                                         % drive

You can load these back in with the load command:

>> load file.mat;               % loads x back in
>> load a:\file.txt;            % stores x in a new variable called
                                % file
>> temp='file.mat'; load(temp); % you can use a variable

2.2  accessing data

If we are going to store data in a vector or list, then we need to know how to access the data for it to be of any use.

Accessing the entire list is easy. Suppose our list contains the grades on an exam:

>> grades =[78,98,76,85,45,89,84,68,95];

Then to access the entire list we simply have to type the name without the semicolon:

>> grades                       % no semicolon!
grades =

  78  98  76  85  45  89  84  68  95

This is how we can give this data to a function. For example, we might wish to sort the data:

>> sort_grades = sort(grades);  % sort in increasing order
sort_grades =

  45  68  76  78  84  85  89  95  98

We might want to know what the largest grade is,or the smallest grade. Forthe sorted data these are just the first and last grades. It would be nice to know how to access these. MATLAB keeps a list as an ordered set of numbers, the first element in the list is ordered with a 1, the nth element with an n. So

>> sort_grades(1);              % first element is 45
>> sort_grades(2);              % returns second element or 68
>> sort_grades(length(sort_grades)); % returns the last element 98.

The last line used the length command to get the last element. If the list has lenght n, then there are n entries, and so the construct returns the last one.

We can change an individual grade. Suppose we mistyped a grade: the 76 shlud have beenan 86. To change the 3rd entry we reference it and then define it:

>> grades(3) = 86;              % change the third element.

We would still have to reevalute the sort line to update the variable sort_grades.

2.3  Tricky MATLAB operations

3  Generating random data

4  Functions of the data and functions of probability

5  Plotting Data

Index (showing section)


File translated from TEX by TTH, version 1.50.