Random Data Previous Up Next

6  Random Data

Although Einstein said that god does not play dice, R can. For example


> sample(1:6,10,replace=T)
 [1] 6 4 4 3 5 2 3 3 5 4
or with a function

> RollDie = function(n) sample(1:6,n,replace=T)
> RollDie(5)
[1] 3 6 1 2 2
In fact, R can create lots of different types of random numbers ranging from familiar families of distributions to specialized ones.

6.1  Random number generators in R-- the ``r'' functions.

As we know, random numbers are described by a distribution. That is, some function which specifies the probability that a random number is in some range. For example P(a < X b). Often this is given by a probability density (in the continuous case) or by a function P(X=k) = f(k) in the discrete case. R will give numbers drawn from lots of different distributions. In order to use them, you only need familiarize yourselves with the parameters that are given to the functions such as a mean, or a rate. Here are examples of the most common ones. For each, a histogram is given for a random sample of size 100, and density (using the ``d'' functions) is superimposed as appropriate.
Uniform.
Uniform numbers are ones that are "equally likely" to be in the specified range. Often these numbers are in [0,1] for computers, but in practice can be between [a,b] where a,b depend upon the problem. An example might be the time you wait at a traffic light. This might be uniform on [0,2].

> runif(1,0,2)                  # time at light
[1] 1.490857                    # also runif(1,min=0,max=2)
> runif(5,0,2)                  # time at 5 lights
[1] 0.07076444 0.01870595 0.50100158 0.61309213 0.77972391
> runif(5)                      # 5 random numbers in [0,1]
[1] 0.1705696 0.8001335 0.9218580 0.1200221 0.1836119
The general form is runif(n,min=0,max=1) which allows you to decide how many uniform random numbers you want (n), and the range they are chosen from ([min,max])

To see the distribution with min=0 and max=1 (the default) we have

> x=runif(100)                 # get the random numbers
> hist(x,probability=TRUE,col=gray(.9),main="uniform on [0,1]")    
> curve(dunif(x,0,1),add=T)
  


Figure 25: 100 uniformly random numbers on [0,1]


The only tricky thing was plotting the histogram with a background ``color''. Notice how the dunif function was used with the curve function.
Normal.
Normal numbers are the backbone of classical statistical theory due to the central limit theorem The normal distribution has two parameters a mean and a standard deviation s. These are the location and spread parameters. For example, IQs may be normally distributed with mean 100 and standard deviation 16, Human gestation may be normal with mean 280 and standard deviation about 10 (approximately). The family of normals can be standardized to normal with mean 0 (centered) and variance 1. This is achieved by "standardizing" the numbers, i.e. Z=(X-)/s.

Here are some examples

> rnorm(1,100,16)               # an IQ score
[1] 94.1719
> rnorm(1,mean=280,sd=10)
[1] 270.4325                    # how long for a baby (10 days early)    
  
Here the function is called as rnorm(n,mean=0,sd=1) where one specifies the mean and the standard deviation.

To see the shape for the defaults (mean 0, standard deviation 1) we have (figure 26)

> x=rnorm(100)
> hist(x,probability=TRUE,col=gray(.9),main="normal mu=0,sigma=1")
> curve(dnorm(x),add=T)
## also for IQs using rnorm(100,mean=100,sd=16)
  

  
Figure 26: Normal(0,1) and normal(100,16)


Binomial.
The binomial random numbers are discrete random numbers. They have the distribution of the number of successes in n independent Bernoulli trials where a Bernoulli trial results in success or failure, success with probability p.

A single Bernoulli trial is given with n=1 in the binomial

> n=1, p=.5                     # set the probability
> rbinom(1,n,p)                 # different each time
[1] 1
> rbinom(10,n,p)                # 10 different such numbers
 [1] 0 1 1 0 1 0 1 0 1 0
A binomially distributed number is the same as the number of 1's in n such Bernoulli numbers. For the last example, this would be 5. There are then two parameters n (the number of Bernoulli trials) and p (the success probability).

To generate binomial numbers, we simply change the value of n from 1 to the desired number of trials. For example, with 10 trials:

> n = 10; p=.5
> rbinom(1,n,p)                 # 6 successes in 10 trials
[1] 6                           
> rbinom(5,n,p)                 # 5 binomial number
[1] 6 6 4 5 4
The number of successes is of course discrete, but as n gets large, the number starts to look quite normal. This is a case of the central limit theorem which states in general that (X-- - )/s is normal in the limit (note this is standardized as above) and in our specific case that
^
p
 
- p
( pq/n )
1/2
 
 
is approximately normal, where p^ = (number of successes)/n.

The graphs (figure 27) show 100 binomially distributed random numbers for 3 values of n and for p=.25. Notice in the graph, as n increases the shape becomes more and more bell-shaped. These graphs were made with the commands

> n=5;p=.25                    # change as appropriate
> x=rbinom(100,n,p)            # 100 random numbers
> hist(x,probability=TRUE,)
## use points, not curve as dbinom wants integers only for x
> xvals=0:n;points(xvals,dbinom(xvals,n,p),type="h",lwd=3) 
> points(xvals,dbinom(xvals,n,p),type="p",lwd=3) 
...    repeat with n=15, n=50


Figure 27: Random binomial data with the theoretical distribution




Exponential
The exponential distribution is important for theoretical work. It is used to describe lifetimes of electrical components (to first order). For example, if the mean life of a light bulb is 2500 hours one may think its lifetime is random with exponential distribution having mean 2500. The one parameter is the rate = 1/mean. We specify it as follows rexp(n,rate=1). Here is an example with the rate being 1/2500 (figure 28).

> x=rexp(100,1/2500)
> hist(x,probability=TRUE,col=gray(.9),main="exponential mean=2500")
> curve(dexp(x,1/2500),add=T)


Figure 28: Random exponential data with theoretical density


There are others of interest in statistics. Common ones are the Poisson, the Student t-distribution, the F distribution, the beta distribution and the c2 (chi squared) distribution.

6.2  Sampling with and without replacement using sample

R has the ability to sample with and without replacement. That is, choose at random from a collection of things such as the numbers 1 through 6 in the dice rolling example. The sampling can be done with replacement (like dice rolling) or without replacement (like a lottery). By default sample samples without replacement each object having equal chance of being picked. You need to specify replace=TRUE if you want to sample with replacement. Furthermore, you can specify separate probabilities for each if desired.

Here are some examples


## Roll a die
> sample(1:6,10,replace=TRUE)
 [1] 5 1 5 3 3 4 5 4 2 1        # no sixes!
## toss a coin
> sample(c("H","T"),10,replace=TRUE)
 [1] "H" "H" "T" "T" "T" "T" "H" "H" "T" "T"
## pick 6 of 54 (a lottery)
> sample(1:54,6)                # no replacement
[1]  6 39 23 35 25 26
## pick a card. (Fancy! Uses paste, rep)
> cards = paste(rep(c("A",2:10,"J","Q","K"),4),c("H","D","S","C"))
> sample(cards,5)               # a pair of jacks, no replacement
[1] "J D" "5 C" "A S" "2 D" "J H"
## roll 2 die. Even fancier
> dice = as.vector(outer(1:6,1:6,paste))
> sample(dice,5,replace=TRUE)   # replace when rolling dice
[1] "1 1" "4 1" "6 3" "4 4" "2 6"
  
The last two illustrate things that can be done with a little typing and a lot of thinking using the fun commands paste for pasting together strings, rep for repeating things and outer for generating all possible products.

6.3  A bootstrap sample

Bootstrapping is a method of sampling from a data set to make statistical inference. The intuitive idea is that by sampling, one can get an idea of the variability in the data. The process involves repeatedly selecting samples and then forming a statistic. Here is a simple illustration on obtaining a sample.

The built in data set faithful has a variable ``eruptions'' that measures the time between eruptions at Old Faithful. It has an unusual distribution. A bootstrap sample is just a sample with replacement from the given values. It can be found as follows

> data(faithful)                # part of R's base
> names(faithful)               # find the names for faithful
 [1] "eruptions" "waiting" 
> eruptions = faithful[['eruptions']] # or attach and detach faithful
> sample(eruptions,10,replace=TRUE)
 [1] 2.03 4.37 4.80 1.98 4.32 2.18 4.80 4.90 4.03 4.70
> hist(eruptions,breaks=25)     # the dataset
## the bootstrap sample
> hist(sample(eruptions,100,replace=TRUE),breaks=25)
  


Figure 29: Bootstrap sample


Notice that the bootstrap sample has a similar histogram, but it is different (figure 29).

6.4  d, p and q functions

The d functions were used to plot the theoretical densities above. As with the ``r'' functions, you need to specify the parameters, but differently, you need to specify the x values (not the number of random numbers n).

    
Figure 30: Illustration of 'p' and 'q' functions


The p and q functions are for the cumulative distribution functions and the quantiles. As mentioned, the distribution of a random number is specified by the probability that the number is between a and b for arbitrary a and b, P(a < X b). In fact, the value F(x) = P(X b) is enough.

The p functions answer what is the probability that a random variable is less than x. Such as for a standard normal, what is the probability it is less than .7?

> pnorm(.7)                     # standard normal
[1] 0.7580363 
> pnorm(.7,1,1)                 # normal mean 1, std 1
[1] 0.3820886
Notationally, these answer P(Z .7) where Z is a standard normal or normal(1,1). To answer P(Z > .7) is also easy. You can do the work by noting this is 1 - P(Z .7) or let R do the work, by specifying lower.tail=F as in:

> pnorm(.7,lower.tail=F)  
[1] 0.2419637
The q function are inverse to this. They ask, what value corresponds to a given probability. This the quantile or point in the data that splits it accordingly. For example, what value of z has .75 of the area to the right for a standard normal? (This is Q3)

> qnorm(.75)
[1] 0.6744898
Notationally, this is finding z which solves 0.75 = P(Z z).

6.5  Standardizing, scale and z scores

To standardize a random variable you subtract the mean and then divide by the standard deviation. That is
Z =
X -
s
.
To do so requires knowledge of the mean and standard deviation.

You can also standardize a sample. There is a convenient function scale that will do this for you. This will make your sample have mean 0 and standard deviation 1. This is useful for comparing random variables which live on different scales.

Normal random variables are often standardized as the distribution of the standardized normal variable is again normal with mean 0 and variance 1. (The ``standard'' normal.) The z-score of a normal number is the value of it after standardizing.

If we have normal data with mean 100 and standard deviation 16 then the following will find the z-scores

> x = rnorm(5,100,16)
> 
> x
[1] 93.45616 83.20455 64.07261 90.85523 63.55869
> z = (x-100)/16
> z
[1] -0.4089897 -1.0497155 -2.2454620 -0.5715479 -2.2775819    
  
The z-score is used to look up the probability of being to the right of the value of x for the given random variable. This way only one table of normal numbers is needed. With R, this is not necessary. We can use the pnorm function directly

> pnorm(z)
[1] 0.34127360 0.14692447 0.01236925 0.28381416 0.01137575
> pnorm(x,100,16)               # enter in parameters
[1] 0.34127360 0.14692447 0.01236925 0.28381416 0.01137575    
  

6.6  Problems

6.1
Generate 10 random numbers from a uniform distribution on [0,10]. Use R to find the maximum and minimum values.x
6.2
Generate 10 random normal numbers with mean 5 and standard deviation 5 (normal(5,5)). How many are less than 0? (Use R)
6.3
Generate 100 random normal numbers with mean 100 and standard deviation 10. How many are 2 standard deviations from the mean (smaller than 80 or bigger than 120)?
6.4
Toss a fair coin 50 times (using R). How many heads do you have?
6.5
Roll a ``die'' 100 times. How many 6's did you see?
6.6
Select 6 numbers from a lottery containing 49 balls. What is the largest number? What is the smallest? Answer these using R.
6.7
For normal(0,1), find a number z* solving P(Z z*) = .05 (use qnorm).
6.8
For normal(0,1), find a number z* solving P(-z* Z z*) = .05 (use qnorm and symmetry).
6.9
How much area (probability) is to the right of 1.5 for a normal(0,2)?
6.10
Make a histogram of 100 exponential numbers with mean 10. Estimate the median. Is it more or less than the mean?
6.11
Can you figure out what this R command does?

> rnorm(5,mean=0,sd=1:5)
6.12
Use R to pick 5 cards from a deck of 52. Did you get a pair or better? Repeat until you do. How long did it take?
Copyright © John Verzani, 2001-2. All rights reserved.

Previous Up Next