Multivariate Data
5 Multivariate Data
Getting comfortable with viewing and manipulating multivariate data
forces you to be organized about your data.
R uses data frames to
help organize big data sets and you should learn how to as well.
5.1 Storing multivariate data in data frames
Often in statistics, data is presented in a tabular format similar
to a spreadsheet. The columns are for different variables, and each
row is a different measurement or variable for the same person or
thing. For example, the dataset
home which accompanies these
notes contains two columns, the 1970 assessed value of a home and
the year 2000 assessed value for the same home.
R uses
data frames to store these variables together and
R has many shortcuts for using data stored this way.
If you are using a dataset which is
built-in to
R or comes from a spreadsheet or other data source,
then chances are the data is available already as a data frame. To
learn about importing outside data into
R look at the
``Entering Data into
R'' appendix and the
document
R Data Import/Export which accompanies the
R
software.
You can make your own data frames of course and may need to.
To make data into a data frame you first need a data set that is
an appropriate candidate: it will fit into a rectangular array. If
so, then the
data.frame command will do the work for
you. As an example, suppose 4 people are asked three questions:
their weight, height and gender and the data is entered into
R
as separate variables as follows:
> weight = c(150, 135, 210, 140)
> height = c(65, 61, 70, 65)
> gender = c("Fe","Fe","M","Fe")
> study = data.frame(weight,height,gender) # make the data frame
> study
weight height gender
1 150 65 Fe
2 135 61 Fe
3 210 70 M
4 140 65 Fe
Notice, the columns inherit the variable names. Different names
are possible if desired. Try
> study = data.frame(w=weight,h=height,g=gender)
for example to shorten them.
You can give the rows names as well. Suppose the subjects were Mary, Alice,
Bob and Judy, then the
row.names command will either list
the row names or set them. Here is how to set them
> row.names(study)<-c("Mary","Alice","Bob","Judy")
The
names command will give the column names and you can
also use this to adjust them.
5.2 Accessing data in data frames
The
study data frame has three variables. As before, these can be
accessed individually after attaching the data frame to your
R
session with the
attach command:
> study
weight height gender
1 150 65 Fe
2 135 61 Fe
3 210 70 M
4 140 65 Fe
> rm(weight) # clean out an old copy
> weight
Error: Object "weight" not found
> attach(study)
> weight
[1] 150 135 210 140
However, attaching and detaching the data frame can be a chore if
you want to access the data only once. Besides, if you attach the
data frame, you can't readily make changes to the original data frame.
To access the data it helps to know that data frames can be thought
of as lists or as arrays and accessed accordingly.
-
To access as an array
-
An array is a way of storing data so that it can be accessed with a
row and column. Like a spreadsheet, only technically the entries
must all be of the same type and one can have more than rows and
columns.
Data frames are arrays as they have columns which are the
variables and rows which are for the experimental unit. Thus we
can access the data by specifying a row and a column. To access
an array we use single brackets ([row,column]). In
general there is a row and column we can access. By letting one
be blank, we get the entire row or column. As an example these
will get the weight variable
> study[,'weight'] # all rows, just the weight column
[1] 150 135 210 140
> study[,1] # all rows, just the first column
Array access allows us much more flexibility though. We can get
both the weight and height by taking the first and second
columns at once
> study[,1:2]
weight height
Mary 150 65
Alice 135 61
Bob 210 70
Judy 140 65
Or, we can get all of Mary's info by looking just at her row or
just her weight if desired
> study['Mary',]
weight height gender
Mary 150 65 Fe
> study['Mary','weight']
[1] 150
- To access as a list
-
A list is a more general storage concept than a data frame. A
list is a set of objects, each of which can be any other
object. A data frame is a list, where the objects are the
columns as vectors.
To access a list, one uses either a dollar sign, $, or double
brackets and a number or name. So for our study variable
we can access the weight (the first column) as a list all of
these ways
> study$weight # using $
[1] 150 135 210 140
> study[['weight']] # using the name.
> study[['w']] # unambiguous shortcuts are okay
> study[[1]] # by position
These two can be combined as in this example. To get just the
females information. These are the rows where gender is 'Fe' so we
can do this
> study[study$gender == 'Fe', ] # use $ to access gender via a list
weight height gender
Mary 150 65 Fe
Alice 135 61 Fe
Judy 140 65 Fe
5.3 Manipulating data frames: stack and unstack
In some instances, there are two different ways to store data. The
data set
PlantGrowth looks like
> data(PlantGrowth)
> PlantGrowth
weight group
1 4.17 ctrl
2 5.58 ctrl
3 5.18 ctrl
4 6.11 ctrl
...
There are 3 groups a control and two treatments. For each group,
weights are recorded. The data is generated this way, by recording a
weight and group for each plant. However, you may want to plot
boxplots for the data broken down by their group. How to do this?
A brute force way is do as follows for each value of
group
> attach(PlantGrowth)
> weight.ctrl = weight[group == "ctrl"]
This quickly grows tiresome. The
unstack function will do this
all at once for us. If the data is structured correctly, it will
create a data frame with variables corresponding to the levels of the
factor.
> unstack(PlantGrowth)
ctrl trt1 trt2
1 4.17 4.81 6.31
2 5.58 4.17 5.12
3 5.18 4.41 5.54
4 6.11 3.59 5.50
...
Thus, to do a boxplot of the three groups, one could use this command
> boxplot(unstack(PlantGrowth))
5.4 Using R's model formula notation
The
model formula notation that
R uses allows this to be done in a systematic
manner. It is a bit confusing to learn, but this flexible notation is used by
most of
R's more advanced functions.
To illustrate, the above could be done by (if the data frame
PlantGrowth is attached)
> boxplot(weight ~ group)
What does this do? It breaks the weight variable down by values of
the group factor and hands this off to the boxplot command. One
should read the line
weight ~ group as ``model weight
by the variable group''. That is, break weight down by
the values of group.
When there are two variables involved things are pretty
straightforward. The response variable is on the left hand side and
the predictor on the right:
response ~ predictor (when two variables).
When there are more than two predictor variables things get a little
confusing. In particular, the usual mathematical operators do not do
what you may think. Here are a few different possibilities that will
suffice for these notes.
9
Suppose the variables are generically named
Y, X1, X2
formula |
meaning |
Y ~ X1 |
Y is modeled by X1 |
Y ~ X1 + X2 |
Y is modeled by X1 and
X2 as in multiple regression |
Y ~ X1 * X2 |
Y is modeled by X1,
X2 and X1*X2 |
(Y ~ (X1 + X2)^2) |
Two-way interactions. Note usual powers |
Y ~ X1+ I((X2^2) |
Y is modeled by X1 and X22 |
Y ~ X1 | X2 |
Y is modeled by X1
conditioned on X2 |
The exact interpretation of ``modeled by'' varies depending upon the
usage. For the
boxplot command it is different than the
lm command. Also notice that usual mathematical meanings are
available, but need to be included inside the
I function.
5.5 Ways to view multivariate data
Now that we can store and access multivariate data, it is time to see
the large number of ways to visualize the datasets.
-
n-way contingency tables
-
Two-way contingency tables were formed with the table
command and higher order ones are no exception. If w,x,y,z
are 4 variables, then the command table(x,y) creates a
two-way table, table(x,y,z) creates two-way tables x
versus y for each value of z. Finally
x,y,z,w will do the same for each combination of values of
z and w. If the variables are stored in a data frame,
say df then the command table(df) will behave as
above with each variable corresponding to a column in the given order.
To illustrate let's look at some relationships in the dataset
Cars93 found in the MASS library.
> library(MASS);data(Cars93);attach(Cars93)
## make some categorical variables using cut
> price = cut(Price,c(0,12,20,max(Price)))
> levels(price)=c("cheap","okay","expensive"))
> mpg = cut(MPG.highway,c(0,20,30,max(MPG.highway)))
> levels(mpg) = c("gas guzzler","okay","miser"))
## now look at the relationships
> table(Type)
Type
Compact Large Midsize Small Sporty Van
16 11 22 21 14 9
> table(price,Type)
Type
price Compact Large Midsize Small Sporty Van
cheap 3 0 0 18 1 0
okay 9 3 8 3 9 8
expensive 4 8 14 0 4 1
> table(price,Type,mpg)
, , mpg = gas guzzler
Type
price Compact Large Midsize Small Sporty Van
cheap 0 0 0 0 0 0
okay 0 0 0 0 0 2
expensive 0 0 0 0 0 0
...
See the commands xtabs and ftable for more
sophisticated usages.
- barplots
- Recall, barplots work on summarized data. First you
need to run your data through the table command or something
similar. The barplot command plots each column as a variable
just like a data frame. The output of table when called with
two variables uses the first variable for the row. As before
barplots are stacked by default: use the argument
beside=TRUE to get side-by-side barplots.
> barplot(table(price,Type),beside=T) # the price by different types
> barplot(table(Type,price),beside=T) # type by different prices
- boxplots
- The boxplot command is easily used for all the
types of data storage. The command boxplot(x,y,z) will
produce the side by side boxplots seen previously. As well, the
simpler usages boxplot(df) and boxplot(y ~ x)
will also work. The latter using the model formula notation.
Example: Boxplot of samples of random data
Here is an example, which will print out 10 boxplots of normal data
with mean 0 and standard deviation 1. This uses the
rnorm
function to produce the random data.
> y=rnorm(1000) # 1000 random numbers
> f=factor(rep(1:10,100)) # the number 1,2...10 100 times
> boxplot(y ~ f,main="Boxplot of normal random data with model notation")
Note the construction of
f. It looks like 1 through 10
repeated 100 times to make a
factor of the same length of
x. When the model notation is used, the boxplot of the
y data is done for each level
of the factor
f. That is, for each value of
y when
f is 1
and then 2 etc. until 10.
Figure 18: Boxplot made with boxplot(y ~ f)
- stripcharts
- The side-by-side boxplots are useful for displaying
similar distributions for comparison -- especially if there is a lot
of data in each variable. The stripchart can do a similar
thing, and is useful if there isn't too much data. It plots the
actual data in a manner similar to rug which is used with
histograms. Both stripchart(df) and
stripchart(y ~ x) will work, but not
stripchart(x,y,z).
For example, as above, we will generate 10 sets of random normal
numbers. Only this time each will contain only 10 random numbers.
> x = rnorm(100)
> y = factor(rep(1:10,10))
> stripchart(x ~ y)
Figure 19: A stripchart
- violinplots and densityplots
- The functions simple.violinplot and simple.densityplot
can be used in place of side-by-side boxplots to compare
different distributions.
Both use the empirical density found by the density function
to illustrate a variables distribution. The density may be thought
of like a histogram, only there is much less ``chart junk'' (extra
lines) so more can effectively be placed on the same graph.
A violinplot is very similar to a boxplot, only the box is replaced
by a density which is given a mirror image for clarity. A
densityplot plots several densities on the same scale. Multiple
histograms would look really awful, but multiple densities are
manageable.
As an illustration, we show for the same dataset all three in
figure 20. The density plot
looks a little crowded, but you can clearly see that there are two
different types of distributions being considered here. Notice,
that we use the functions in an identical manner to the boxplot.
> par(mfrow=c(1,3)) # 3 graphs per page
> data(InsectSprays) # load in the data
> boxplot(count ~ spray, data = InsectSprays, col = "lightgray")
> simple.violinplot(count ~ spray, data = InsectSprays, col = "lightgray")
> simple.densityplot(count ~ spray, data = InsectSprays)
Figure 20: Compare boxplot, violinplot, densityplot for same data
- scatterplots
- Suppose x is a predictor and both
y and z are response variables. If you want to plot
them on the same graph but with a different character you can do
so by setting the pch (plot character) command. Here is a
simple example
> plot(x,y) # simple scatterplot
> points(x,z,pch="2") # plot these with a triangle
Notice, the second command is not plot but rather
points which adds to the current plot unlike plot which
draws a new plot.
Sometimes you have x and y data that is also broken down by a
given factor. You may wish to plot a scatterplot of the x and y
data, but use different plot characters for the different levels of
the factors. This is usually pretty easy to do. We just need to use
the levels of the factor to give the plotting character. These levels are
store internally as numbers, and we use these for the value of pch
Example: Tooth growth
The built-in
R dataset
ToothGrowth has data from a
study that measured tooth growth as a function of amount of Vitamin
C. The source of the Vitamin C came from orange juice or a vitamin
supplement. The scatterplot of dosage vs. length is given
below. Notice the different plotting figures for the 2 levels of
the factor of which type of vitamin C.
> data("ToothGrowth")
> attach(ToothGrowth)
> plot(len ~ dose,pch=as.numeric(supp))
## click mouse to add legend.
> tmp = levels(supp) # store for a second
> legend(locator(1),legend=tmp,pch=1:length(tmp))
> detach(ToothGrowth)
Figure 21: Tooth growth as a function of vitamin C dosage
From the graph it
appears that for all values of dose, the vitamin form (VC) was less
effective.
Sometimes you want a to look at the distribution of x and the
distribution of y and also look at their relationship
with a scatterplot. (Not the case above, as the x distribution is
trivial) This is easier if you can plot multiple graphs
at once. This is implemented in the function
simple.scatterplot (taken from the
layout help page).
Example: GDP vs. CO2 emissions
The question of CO
2 emissions is currently a ``hot'' topic due
to their influence on the greenhouse effect. The dataset
emissions
contains data on the Gross Domestic Product and
CO
2 emissions for several European countries and the United
States for the year 1999. A scatterplot of the data
is
interesting:
> data(emissions) # or read in from dataset
> attach(emissions)
> simple.scatterplot(perCapita,CO2)
> title("GDP/capita vs. CO2 emissions 1999")
> detach(emissions)
Figure 22: Per capita GDP vs. CO2 emissions
Notice, with the additional information of this scatter plot, we can
see that the distribution of GDP/capita is fairly spread out unlike
the distribution of the CO
2 emissions which has the lone outlier.
- paired scatterplots
- If the 3 variables hold numeric values,
then scatterplots are appropriate. The pairs command will
produce scatterplots for each possible pair. It can be used as
follows pairs(cbind(x,y,z)), pairs(df) or if in the
factor form pairs(data.frame(split(times,week))). Of these,
the easiest is if the data is in a data frame. If not, notice the
use of cbind which binds the variables together as columns.
(This is how data frames work.)
Figure 23 is an example using
the same
emissions
data set. Can you spot the previous
plot?
> pairs(emissions)
Figure 23: Using pairs with emissions data
The pairs command has many options to customize the
graphs. The help page has two nice examples.
- others
- The
Ggobi package and
accompanying software, allows you to manipulate the data in the plot
and do such things as brush data in one plot and have it highlighted
in another.
5.6 The lattice package
The add-on package
lattice implements the Trellis graphics
concepts of Bill Cleveland. It and the accompanying
grid
package allow a different way to view multivariate data than
described above. As of version 1.5.0 these are recommended packages
but not part of the base version of
R.
Some useful graphs are easy to create and are shown below. Many
other usages are possible. Both packages are well described in
Volume 2/2 of the
R News newsletter
(
http://cran.r-project.org/doc/Rnews)).
Let's use the data set
Cars93 to illustrate. We assume
this has been loaded, but not attached to illustrate the use of
data = below.
The basic idea is that the graphic consists of a number of panels.
Typically these correspond to some value of a conditioning variable.
That is, a different graphic for each level of the factor used to
condition, or if conditioning by a numeric variable for ``shingles''
which are ranges of the conditioning variable. The functions are
called with the model formula notation. For univariate graphs such
as histograms, the response variable, the left side, is empty. For
bivariate graphs it is given. Notice below that the names for the
functions are natural, yet different from the usual ones. For
example,
histogram is used instead of
hist.
-
histograms
- Histograms are univariate. The following command
shows a histogram of the maximum price conditioned on the number
of cylinders. Note the response variable is left blank.
> histogram( ~ Max.Price | Cylinders , data = Cars93)
- Boxplots
- Boxplots are also univariate. Here is the same
information, only summarized using boxplots. The command is
bwplot.
> bwplot( ~ Max.Price | Cylinders , data = Cars93)
- scatterplots
- Scatterplots are available as well. The function
is xyplot and not simply plot. As these are
bivariate a response variable is needed. The following will plot
the relationship between MPG and tank size. We expect that cars
with better mileage can have smaller tanks. This type of plot
allows us to check if it is the same for all types of cars.
> attach(Cars93) # don't need data = Cars93 now
> xyplot(MPG.highway ~ Fuel.tank.capacity | Type)
## plot with a regression line
## first define a regression line drawing function
> plot.regression = function(x,y) {
+ panel.xyplot(x,y)
+ panel.abline(lm(y~x))
+ }
> trellis.device(bg="white") # set background to white.
> xyplot(MPG.highway ~ Fuel.tank.capacity | Type, panel = plot.regression)
This results in figure 24.
Notice we can see some trends from the figure. The slope appears
to become less steep as the size of the car increases.
Figure 24: Example of lattice graphics: relation of m.p.g. and fuel tank size.
Notice the trellis.device command setting the background
color to white. The default colors are a bit dark. The figure
drawn includes a regression line. This was achieved by specifying
a function to create the panel. By default, the xyplot
will use the panel.xyplot function to plot a scatterplot.
To get more, we defined a function of x and y that
plotted a scatterplot (again with panel.xyplot) and also
added a regression line using panel.abline) and the
lm function. Many more possibilities are possible.
5.7 Problems
-
5.1
- For the
emissions
dataset there is an outlier for the CO2
emissions. Find this value using identify and then redo the
plot without this point.
- 5.2
- The Simple data set chips contains data on
thickness of integrated chips. There are data for two chips, each
measured at 4 different places. Create a side-by-side boxplot of the
thickness for each place of measurement. (There should be 8 boxplots
on the same graph). Do the means look the same? The variances?
- 5.3
- The Simple data set chicken contains weights of
chickens who are given 1 of 3 different food rations. Create a
boxplot of all 3 rations. Does there appear to be a difference in
mean?
- 5.4
- The Simple data set
WeightData
contains information
on weights for children aged 0 to 144 months. Make a side-by-side
boxplot of the weights broken down by age in years. What kind of
trends do you see? (The variable age is in months. To
convert to years can be done using cut as follows
> age.yr = cut(age,seq(0,144,by=12),labels=0:11)
assuming the dataset has been attached.)
- 5.5
- The Simple data set carbon contains carbon monoxide
levels at 3 different industrial sites. The data has two variables: a
carbon monoxide reading, and a factor variable to keep track of the site.
Create side-by-side boxplots of the monoxide
levels for each site. Does there appear to be a difference? How so?
- 5.6
- For the data set
babies
make a pairs plot
(pairs(babies)) to investigate the relationships between the
variables. Which variables seem to have a linear relationship? For
the variables for birthweight and gestation make a scatter plot
using different plotting characters (pch) depending on the
level of the factor smoke.
Copyright © John Verzani, 2001-2. All rights reserved.