Bivariate Data

## 4  Bivariate Data

The relationship between 2 variables is often of interest. For example, are height and weight related? Are age and heart rate related? Are income and taxes paid related? Is a new drug better than an old drug? Does a batter hit better as a switch hitter or not? Does the weather depend on the previous days weather? Exploring and summarizing such relationships is the current goal.

### 4.1  Handling bivariate categorical data

The table command will summarize bivariate data in a similar manner as it summarized univariate data.

Suppose a student survey is done to evaluate if students who smoke study less. The data recorded is
 Person Smokes amount of Studying 1 Y less than 5 hours 2 N 5 - 10 hours 3 N 5 - 10 hours 4 Y more than 10 hours 5 N more than 10 hours 6 Y less than 5 hours 7 Y 5 - 10 hours 8 Y less than 5 hours 9 N more than 5 hours 10 Y 5 - 10 hours
We can handle this in R by creating two vectors to hold our data, and then using the table command.

> smokes = c("Y","N","N","Y","N","Y","Y","Y","N","Y")
> amount = c(1,2,2,3,3,1,2,1,3,2)
> table(smokes,amount)
amount
smokes 1 2 3
N 0 2 2
Y 3 2 1

We see that there may be some relationship7

What would be nice to have are the marginal totals and the proportions. For example, what proportion of smokers study 5 hours or less. We know that this is 3 /(3+2+1) = 1/2, but how can we do this in R?

The command prop.table will compute this for us. It needs to be told the table to work on, and a number to indicate if you want the row proportions (a 1) or the column proportions (a 2) the default is to just find proportions.


> tmp=table(smokes,amount)      # store the table
> old.digits = options("digits") # store the number of digits
> options(digits=3)             # only print 3 decimal places
> prop.table(tmp,1)             # the rows sum to 1 now
amount
smokes   1     2     3
N 0.0 0.500 0.500
Y 0.5 0.333 0.167
> prop.table(tmp,2)             # the columns sum to 1 now
amount
smokes 1   2     3
N 0 0.5 0.667
Y 1 0.5 0.333
> prop.table(tmp)
amount                    # all the numbers sum to 1
smokes   1   2   3
N 0.0 0.2 0.2
Y 0.3 0.2 0.1
> options(digits=old.digits)    # restore the number of digits


### 4.2  Plotting tabular data

You might wish to graphically represent the data summarized in a table. For the smoking example, you could plot the amount variable for each of No or Yes, or the No and Yes variable for each level of smoking. In either case, you can use a barplot. We simply call it in the appropriate manner.

> barplot(table(smokes,amount))
> barplot(table(amount,smokes))
> smokes=factor(smokes)        # for names
> barplot(table(smokes,amount),
+ beside=TRUE,                  # put beside not stacked
>
> barplot(table(amount,smokes),main="table(amount,smokes)",
+ beside=TRUE,
+ legend.text=c("less than 5","5-10","more than 10"))


Figure 10: 4 barplots of same data

Notice in figure 10 the importance of order when making the table. Essentially, barplot plots each row of data. It can do it in a stacked manner (the default), or besides (by setting beside=TRUE). The attribute legend.text adds the legend to the graph. You can change the names, but the default of legend.text=T is easiest if you have a factor labeling the rows of the table command.

Some Extra Insight: Conditional proportions
You may also want to know about the conditional proportions. For example, among the smokers what are the proportions. To answer this, we need to divide the second row by 6. One or two rows is easy to do by hand, but how do we automate the work? The function apply will apply a function to rows or columns of a matrix. In this case, we need a function to find the proportions of a vector. This is as easy as

> prop = function(x) x/sum(x)

To apply this function to the matrix x is easy. First the columns (index 2) are done by

> apply(x,2,prop)
amount
1   2         3
N 0 0.5 0.6666667
Y 1 0.5 0.3333333

Index 1 is for the rows, however, we need to use the transpose function, t() to make the result look right.

> t(apply(x,1,prop))

smokes   1         2         3
N 0.0 0.5000000 0.5000000
Y 0.5 0.3333333 0.1666667


### 4.3  Handling bivariate data: categorical vs. numerical

Suppose you have numerical data for several categories. A simple example might be in a drug test, where you have data (in suitable units) for an experimental group and for a control group.


experimental:  5 5  5 13 7 11 11 9 8  9
control:      11 8  4  5 9  5 10 5 4 10

You can summarize the data separately and compare, but how can you view the data together? A side by side boxplot is a good place to start. To generate one is simple:


> x = c(5, 5, 5, 13, 7, 11, 11, 9, 8, 9)
> y = c(11, 8, 4, 5, 9, 5, 10, 5, 4, 10)
> boxplot(x,y)


Figure 11: Side-by-side boxplots

From this comparison (figure 11), we see that the y variable (the control group, labeled 2 on the graph) seems to be less than that of the x variable (the experimental group).

Of course, you may also receive this data in terms of the numbers and a variable indicating the category as follows

amount:   5 5 5 13 7 11 11 9 8 9 11 8 4 5 9 5 10 5 4 10
category: 1 1 1  1 1  1  1 1 1 1  2 2 2 2 2 2  2 2 2  2

To make a side by side boxplot is still easy, but only if you use the model syntax as follows

> amount = scan()
1: 5 5 5 13 7 11 11 9 8 9 11 8 4 5 9 5 10 5 4 10
21:
>category = scan()
1: 1 1 1  1 1  1  1 1 1 1  2 2 2 2 2 2  2 2 2  2
21:
> boxplot(amount ~ category)    # note the tilde ~

Read the part amount ~ category as breaking up the values in amount, by the categories in category and displaying each one. Verbally, you might read this as amount by category''. More on this syntax will appear in the section on multivariate data.

### 4.4  Bivariate data: numerical vs. numerical

Comparing two numerical variables can be done in different ways. If the two variables are thought to be independent samples you might like to compare their distributions in some manner. However, if you expect a relationship between the variables, you might like to look for that by plotting pairs of points.

### 4.5  Comparing two distributions with plots

If we wish to compare two distributions, we can do so with side-by-side boxplots, However, we may wish to compare histograms or some other graphs to see more of the data. Here are several different ways to do so.
Side by side boxplots with rug
By using the rug command we can see all the data. It works best with smallish data sets (otherwise use the jitter command to break ties).

> library("Simple");data(home)  # read in dataset home
> attach(home)
> names(home)
[1] "old" "new"
> boxplot(scale(old),scale(new)) #make boxplot after scaling each
> detach(home)

This example, introduced the scale function. This puts the two data sets on the same scale so they can sensibly be compared.

If you make this boxplot, you will see that the two distributions look quite a bit different. The full dataset homedata will show this even more.
Using stripcharts or dotplots
The stripchart (a dotplot) will plot all the data in a way that makes it relatively easy to compare the distributions. For the data frame hd this is done with

> stripchart(scale(old),scale(new))

Comparing shapes of distributions
Using the density function allows us to compare a distributions shape on the same graph. This is hard to do with histograms. The function simple.violinplot compares densities by creating violin plots. These are similar to boxplots, only instead of a box, the density is drawn with it's mirror image.

Try this command to see what the graphs look like

> simple.violinplot(scale(old),scale(new))


### 4.6  Using scatterplots to compare relationships

Often we wish to investigate one numerical variable against another. For example the height of a father compared to their sons height. The plot command will gladly display two variables in a scatterplot.

Example: Home data
The home data example of the previous section shows old assessed value (1970) versus new assessed value (2000). There should be some relationship. Let's investigate with a scatterplot (figure 12).


> data(home);attach(home)
> plot(old,new)
> detach(home)


Figure 12: Scatterplot of home data with a sample and full dataset

The second graph is drawn from the entire data set. This should be available as a data set through the command data(). Here we plot it using attach:

> data(homedata)
> attach(homedata)
> plot(old,new)
> detach(homedata)

The graphs seem to illustrate a strong linear trend, which we will investigate later.

R Basics: What does attaching do?
You may have noticed that when we attached home and homedata we have the same variable names: old and new. What exactly does attaching do? When you ask R to use a value of a variable or a function it needs to find it. R searches through several enviroments'' for these variables. By attaching a data frame, you put the names into the second environment searched (the name of the dataframe is in the first). These are masked by any variables which already have the same name. There are consequences to this to be aware of. First, you might be confused about which variable you are using. And most importantly, you can't change the values of the variables in the data frame without referencing the data frame. For example, we create a data frame df below with variables x and y.

> x = 1:2;y = c(2,4);df = data.frame(x=x,y=y)
> ls()                          # list all the varibles known
[1] "df" "x"  "y"
> rm(y)                         # delete the y variable
> attach(df)                    # attach the data frame
> ls()
[1] "df" "x"                    # y is visible, but doesn't show up
> ls(pos=2)                     # y is in position 2 from being attached
[1] "x" "y"
> y                             # y is visible because df is attached
[1] 2 4
> x                             # which x did we find, x or df[['x']]
[1] 1 2
> x=c(1,3)                      # assign to x
> df                            # not the x in df
x y
1 1 2
2 2 4
> detach(df)
> x                             # assigned to real x variable
[1] 1 3
> y

It is important to remember to detach the dataset between uses of these variables, or you may forget which variable you are referring to.

We see in these examples relationships between the data. Both were linear relationships. The modeling of such relationships is a common statistical practice. It allows us to make predictions of the y variable based on the value of the x variable.

### 4.7  Linear regression.

Linear regression is the name of a procedure that fits a straight line to the data. The idea is that the x value is something the experimenter controls, the y value one the experimenter measures. The line is used to predict the value of y for a known value of x. The variable x is the predictor variable and y the response variable.

Suppose we write the equation of the line as
 ^ y
= b0 + b1 x.
Then, for each xi the predicted value would be
 ^ yi
= b0 + b1 xi.

But the measured value is yi, the difference is called the residual and is simply
ei = yi -
 ^ yi
.

The method of least squares is used to choose the values of b0 and b1 that minimize the sum or the squares of the residual errors. Mathematically this is
 n å i=1
(yi -
 ^ yi
)2.

Solving, gives
b1 =
sxy
sx2
=
å(xi -
 -- x
)(yj-
 -- y
)
å(xi-
 -- x
)2
,    b0 =
 -- y
- b1
 -- x
.

That is, a line with slope given by b1 going through the point (x--,y--).

R plots these in 3 steps: plot the points, find the values of b0, b1, add a line to the graph:


> data(home);attach(home)
> x = old                       # use generic variable names
> y = new                       # for illustration only.
> plot(x,y)
> abline(lm(y ~ x))
> detach(home)


Figure 13: Home data with regression line

The abline command is a little tricky (and hard to remember). The abline function prints lines on the current graph window and is generally a useful function. The line it prints is coming from the lm functions. This is the function for a linear model. The funny syntax y ~ x tells R to model the y variable as a linear function of x. This is the model formula syntax of R which can be tricky, but is fairly straightforward in this situation.

As an alternative to the above, the function simple.lm, provided with these notes, will make this same plot and return the regression coefficients


> data(home);attach(home)
> x = old; y = new
> simple.lm(x,y)

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
-2.121e+05    6.879e+00
> detach(home)

You can also access the coefficients directly with the function coef. The above ones would be found with

> lm.res = simple.lm(x,y)      # store the answers in lm.res
> coef(lm.res)
Coefficients:
(Intercept)            x
-2.121e+05    6.879e+00
> coef(lm.res)[1]               # first one, use [2] for second
(Intercept)
-2.121e+05


### 4.8  Residual plots

Another worthwhile plot is of the residuals. This can also be done with the simple.lm, but you need to ask. Continuing the above example


simple.lm(x,y,show.residuals=TRUE)

Which produces the plot shown in figure 14.

Figure 14: Plot of residuals for regression model

There are 3 new plots. The normal plot will be explained later. The upper right one is a plot of residuals versus the fitted values (y^'s). If the standard statistical model is to apply, then the residuals should be scattered about the line y=0 with normally'' distributed values. The lower left is a histogram of the residuals. If the standard model is applicable, then this should appear bell'' shaped.

For this data, we see a possible outlier that deserves attention. This data set has a few typos in it.

To access residuals directly, you can use the command resid on your lm result. This will make a plot of the residuals

> lm.res = simple.lm(x,y)
> the.residuals = resid(lm.res) # how to get residuals
> plot(the.residuals)


### 4.9  Correlation coefficients

A valuable numeric summary of the strength of the linear relationship is the Pearson correlation coefficient, R, defined by
R =
å(Xi -
 -- X
)(Yi-
 -- Y
)
æ
ç
ç
è
å(Xi-
 -- X
)2 å(Yi-
 -- Y
)2 ö
÷
÷
ø
 1/2

This is a scaled version of the covariance between X and Y. This measures how one variable varies as the other does. The correlation is scaled to be in the range [-1,1]. Values or R2 close to 1 indicate a strong linear relationship, values close to 0 a weak one. (There still may be a relationship, just not a linear one.) In R the correlation coefficient is found with the cor function

> cor(x,y)                      # to find R
[1] 0.881
> cor(x,y)^2                    # to find R^2
[1] 0.776

This is also found by R when it does linear regression, but it doesn't print it by default. We just need to ask though using summary(lm(y ~ x)).

The Spearman rank correlation is the same thing only applied to the ranks of the data. The rank of a data set is simply another vector giving the relative rank in terms of size. An example might make it clearer

> rank(c(2,3,5,7,11))           # already in order
[1] 1 2 3 4 5
> rank(c(5,3,2,7,11))           # for example, 5 is 3rd largest
[1] 3 2 1 4 5
> rank(c(5,5,2,7,5))            # ties have ranks averaged (2+3+4)/3=3
[1] 3 3 1 5 3

To find the Spearman rank correlation, we simply apply cor() to the ranked data

> cor(rank(x),rank(y))
[1] 0.925

This number is close to 1 (or -1) if there is a strong increasing (decreasing) trend in the data. (The trend need not be linear.)

As a reminder, you can make a function to do this calculation for you. For example,

> cor.sp <- function(x,y) cor(rank(x),rank(y))

Then you can use this as

> cor.sp(x,y)
[1] 0.925


### 4.10  Locating points

R currently has a few methods to interact with a graph. Some important ones allow us to identify and locate points on the graph.

Example: Presidential Elections: Florida

Consider this data set from the 2000 United States presidential election in the state of Florida. 8 It records the number of votes each candidate received by county. We wish to investigate the relationship between the number of votes for Bush against the number of votes for Buchanan.


> data("florida")              # or read.table on florida.txt
> names(florida)
[1] "County"     "V2"         "GORE"       "BUSH"       "BUCHANAN"
[6] "NADER"      "BROWNE"     "HAGELIN"    "HARRIS"     "MCREYNOLDS"
> attach(florida)               # so we can get at the names BUSH, ...
> simple.lm(BUSH,BUCHANAN)
...
Coefficients:
(Intercept)            x
45.28986      0.00492
> detach(florida)               # clean up


We see a strong linear relationship, except for two "outliers". How can we identify these points?

One way is to search through the data to find these values. This works fine for smaller data sets, for larger ones, R provides a few useful functions: identify to find index of the closest (x,y) coordinates to the mouse click and locator to find the (x,y) coordinates of the mouse click.

To identify the outliers, we need their indices which are provided by identify:


> identify(BUSH,BUCHANAN,n=2)   # n=2 gives two points
[1] 13 50

Click on the two outliers and find the corresponding indices are 13 and 50. The values would be found by taking the 13th or 50th value of the vectors:

> BUSH[50]
[1] 152846
> BUCHANAN[50]
[1] 3407
> florida[50,]
County V2   GORE   BUSH BUCHANAN NADER BROWNE HAGELIN HARRIS MCREYNOLDS
50     50 39 268945 152846     3407  5564    743     143     45        302
50       103      188 432286

The latter shows the syntax to slice out the entire row for county 50.

County 50 is not surprisingly Miami-Dade county, the home of the infamous (well maybe) butterfly ballot that caused great confusion among the voters. The location of Buchanan on the ballot was in some sense where Gore's position should have been. How many votes did this give Buchanan that should have been Gore's? One way to answer this is to find the regression line for the data without this data point and then to use the number of Bush votes to predict the number of Buchanan votes.

To eliminate one point from a data vector can be done with fancy indexing, by using a minus sign (BUSH[50] is the 50th element, BUSH[-50] is all but the 50th element).


>  simple.lm(BUSH[-50],BUCHANAN[-50])
...
Coefficients:
(Intercept)            x
65.57350      0.00348

Notice the fit is much better. Also notice that the new regression line is y^ = 65.57350 + 0.00348 x instead of y^=45.28986+ 0.00492 x. How much difference does this make? Well the regression line predicts the value for a given x. If Bush received 152,846 votes (BUSH[50]) then we expect Buchanan to have received

> 65.57350 + 0.00348 * BUSH[50]
[1] 597

and not 3407 (BUCHANAN[50]) as actually received. (This difference is much larger than the statewide difference that gave the 2000 U.S. presidential election to Bush over Gore.)

Some Extra Insight: Using simple.lm to predict
We could do this prediction with the simple.lm function which calls the R function predict appropriately. Here is how

> simple.lm(BUSH[-50],BUCHANAN[-50],pred=BUSH[50])
[1] 597.7677
...


### 4.11  Resistant regression

This example also illustrates another important point. That is, like the mean and standard deviation the regression line is very sensitive to outliers. Let's see what the regression line looks like for the data with and without the points. Since we already have the equation for the line without the point, the simplest way to do so is to first draw the line for all the data, and then add in the line without Miami-Dade. This is done with the abline function.

> simple.lm(BUSH,BUCHANAN)
> abline(65.57350,0.00348)      # numbers from above

Figure 16 shows how sensitive the regression line is.

Figure 16: Regression lines for data with and without Miami-Dade outlier

### 4.12  Using rlm or lqs for resistant regression

Resistance in statistics means the procedure is resistant to some percentage of arbitrarily large outliers, robustness means the procedure is not greatly affected by slight deviations in the assumptions. There are various ways to create a resistant regression line. In R there are two in the package MASS that are used in a manner similar to the lm function (but not the simple.lm function). The function lqs works with a simple principle (by default). Rather than minimize the sum of the squared residuals for all residuals, it does so for just a percentage of them. The rlm function uses something known as an M-estimator. Both give similar results, but not identical. In what follows, we will use rlm, but we could have used lqs provided we load the library first (library('lqs')).

Let's apply rlm to the Florida election data. We will plot both the regular regression line and the resistant regression line (fig 17).

> library(MASS)                 # read in the external library
> attach(florida)
> plot(BUSH,BUCHANAN)           # a scatter plot
> abline(lm(BUCHANAN ~ BUSH),lty="1") # lty sets line type
> abline(rlm(BUCHANAN ~ BUSH),lty="2")
> detach(florida)               # tidy up

Notice a few things. First, we used the model formula notation lm(y ~ x) as this is how rlm expects the function to be called. We also illustrate how to change the line type (lty) and how to include a legend with legend.

Figure 17: Voting data with resistant regression line

As well, you may plot the resistant regression line for the data, with and without the outliers as below, you will find as expected that the lines are the same.

> plot(BUSH,BUCHANAN)
> abline(rlm(BUCHANAN ~ BUSH),lty='1')
> abline(rlm(BUCHANAN[-50] ~ BUSH[-50]),lty='2')

This graph will show that removing one point makes no difference to the resistant regression line (as expected).

R Basics: Plotting graphs using R
In this section, we used the plot command to make a scatterplot and the abline command to add a line to it. There are other ways to manipulate plots using R that are useful to know.

It helps to know that R has different functions to create an initial graph and to add to an existing graph.
Creating new plots with plot and curve.
The plot function will plot points as already illustrated. In addition, it can be told to plot the points and connect them with straight lines. These commands will plot a parabola. Notice how we need to first create the values on the x axis to plot

> x=seq(0,4,by=.1)              # create the x values
> plot(x,x^2,type="l")          # type="l" to make line

The convenient curve function will plot functions (of x) in an easier manner. The above plotted the function y=x2 over the interval [0,4]. This is done with curve all at once with

> curve(x^2,0,4)

Notice as illustrated, both plot and curve create new graph windows.
Adding to a graph with points, abline, lines and curve.
We can add to the exiting graph window the several different functions. To add points we use the points command which is similar to the plot command. We've seen that to add a straight line, the abline function is available. The lines function is used to add more general lines. It plots the points specified and connects them with straight lines. Similar to adding type=''l'' in the plot function. Finally, curve will add to a graph if the additional argument add=TRUE is given.

To illustrate, if we have the dataset
 mileage 0 4 8 12 16 20 24 28 32 tread wear 394 329 291 255 229 204 179 163 150
Then the regression line has intercept 360 and slope -7.3. Here are three ways to plot the data and the regression line:

> miles = (0:8)*4               # 0 4 8 ... 32
1: 394 329 291 255 229 204 179 163 150
10:
> plot(miles,tread)             # make the scatterplot
## or as we know the intercept and slope
> abline(360,-7.3)
## or using points
> points(miles,360 - 7.3*miles,type="l")
## or using lines
> lines(miles,360 - 7.3*miles)
## or using curve


### 4.13  Problems

4.1
A student evaluation of a teacher is on a 1-5 Leichert scale. Suppose the answers to the first 3 questions are given in this table
 Student Ques. 1 Ques. 2 Ques. 3 1 3 5 1 2 3 2 3 3 3 5 1 4 4 5 1 5 3 2 1 6 4 2 3 7 3 5 1 8 4 5 1 9 3 4 1 10 4 2 1
Enter in the data for question 1 and 2 using c(), scan(), read.table or data.entry()
1. Make a table of the results of question 1 and question 2 separately.
2. Make a contingency table of questions 1 and 2.
3. Make a stacked barplot of questions 2 and 3.
4. Make a side-by-side barplot of all 3 questions.
4.2
In the library MASS is a dataset UScereal which contains information about popular breakfast cereals. Attach the data set as follows

> library('MASS')
> data('UScereal')
> attach(UScereal)
> names(UScereal)               # to see the names

Now, investigate the following relationships, and make comments on what you see. You can use tables, barplots, scatterplots etc. to do you investigation.
1. The relationship between manufacturer and shelf
2. The relationship between fat and vitamins
3. the relationship between fat and shelf
4. the relationship between carbohydrates and sugars
5. the relationship between fibre and manufacturer
6. the relationship between sodium and sugars
Are there other relationships you can predict and investigate?
4.3
The built-in data set mammals contains data on body weight versus brain weight. Use the cor to find the Pearson and Spearman correlation coefficients. Are they similar? Plot the data using the plot command and see if you expect them to be similar. You should be unsatisfied with this plot. Next, plot the logarithm (log) of each variable and see if that makes a difference.
4.4
For the data set on housing prices, homedata , investigate the relationship between old assessed value and new. Use old as the predictor variable. Does the data suggest a linear relationship?Are there any outliers? What may have caused these outliers? What is the predicted new assessed value for a \$75,000 house in 1970.
4.5
For the florida dataset of Bush vs. Buchanan, there is another obvious outlier that indicated Buchanan received fewer votes than expected. If you remove both the outliers, what is the predicted value for the number of votes Buchanan would get in Miami-Dade county based on the number of Bush votes?
4.6
For the data set emissions plot the per-Capita GDP (gross domestic product) as a predictor for the response variable CO2 emissions. Identify the outlier and find the regression lines with this point, and without this point.
4.7
Attach the data set babies :

> library("Simple")
> data("babies")
> attach(babies)

This data set contains much information about babies and their mothers for 1236 observations. Find the correlation coefficient (both Pearson and Spearman) between age and weight. Repeat for the relationship between height and weight. Make scatter plots of each pair and see if your answer makes sense.
4.8
Find a dataset that is a candidate for linear regression (you need two numeric variables, one a predictor and one a response.) Make a scatterplot with regression line using R.
4.9
The built-in data set mtcars contains information about cars from a 1974 Motor Trend issue. Load the data set (data(mtcars)) and try to answer the following:
1. What are the variable names? (Try names.)
2. what is the maximum mpg
3. Which car has this?
4. What are the first 5 cars listed?
5. What horsepower (hp) does the Valiant'' have?
6. What are all the values for the Mercedes 450slc (Merc 450SLC)?
7. Make a scatterplot of cylinders (cyl) vs. miles per gallon (mpg). Fit a regression line. Is this a good candidate for linear regression?
4.10
Find a graphic of bivariate data from the newspaper or other media source. Use R to generate a similar figure.