Lesson 2

Some R commands

Lesson 2 covers statistical material for performing two-sample $t$ tests and chi-squared goodness of fit tests. It also covers the R commands and idioms necessary to pull this off. Here are some questions on the R commands.

EDA or exploratory data analysis is a term to describe the
exploration of a data set prior to any formal model fitting. Such
explorations can be via statistical summaries or via graphics. For
this topic it is useful to know many different ways that such
activities can be done.

Let's begin with the simple dataset used in the notes:

bottom <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)

We can use these variables separately or combine them into a data frame:

DF <- data.frame(bottom = bottom, surface = surface)

First some questions about data frames. If you are confused, check the comments when you guess wrong.

Using your version of R, make the above data frame and tell me what the output of
nrow(DF) is:

6 ## the number of rows in the data frame
2 ## the number of columns in the data frame, not rows

Is there a difference between colnames(DF) and names(DF)?

Yes No

Which of these commands returns the rows where the bottom value is 0.430 or less?

DF[DF$bottom <= 0.430]
DF[DF$bottom <= 0.430, ]
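The difference between the two indexing forms can be sketched on a small made-up data frame (toy and its columns are hypothetical, not part of the lesson data). With a comma, the logical vector picks rows; without one, R treats the data frame as a list and matches the logical vector against columns, which usually fails or gives something unintended.

```r
# Hypothetical toy data, used only for illustration
toy <- data.frame(x = c(1, 5, 2), y = c(10, 20, 30))
keep <- toy$x <= 2            # logical vector, one entry per row

# With the comma: keep selects rows, and all columns are returned
rows <- toy[keep, ]

# Without the comma: keep is matched against columns instead.
# tryCatch stores the error message (if any) rather than stopping.
cols <- tryCatch(toy[keep], error = function(e) conditionMessage(e))

rows
cols
```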

Numeric summaries

Okay, let's use DF to look at numeric summaries. In the notes we see
that summary will summarize a numeric variable with its so-called
5-number summary (well, technically not if you are pedantic) and also
its mean. We can call this same method for a data frame:
summary(DF).

Do so. Which variable has the largest maximum?

bottom surface

Calling mean(DF) causes a warning, calling median(DF) an error. Depending on your version of R, the warning for mean may suggest using sapply. What is the output of sapply(DF, median)?

0.5361667 0.4445000
0.5490 0.4125

(The sapply function iterates over the object given as its first
argument and applies the function given as its second argument to each
element. For data frames, it iterates over each column variable, so the
above takes the median of each column. The sapply function then tries
to put the output into a nice format.)
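A short sketch of what "a nice format" can mean in practice, using the lesson's data frame (repeated here so the example runs on its own). The choice of sd, range, and the anonymous filter function is just for illustration.

```r
# The lesson's data frame, repeated so the example is self-contained
DF <- data.frame(
  bottom  = c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716),
  surface = c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
)

# One number per column: sapply simplifies to a named numeric vector
sds <- sapply(DF, sd)

# Two numbers per column: sapply simplifies to a matrix, one column each
ranges <- sapply(DF, range)

# Results of unequal length cannot be simplified, so a list comes back
big <- sapply(DF, function(x) x[x > 0.5])

sds
ranges
big
```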

Graphical summaries

The two sample t-test is about comparing means. A good graphic to
investigate is the parallel or side-by-side boxplot. These can be made
many different ways in R. We use the boxplot function.

Issue the command boxplot(DF). Do you get side-by-side boxplots?

Yes No

Well, if you answered “Yes”, good. This is because data frames are
lists and boxplot will do the “right thing” for lists.

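Data frames can also be treated like matrices. (Huh?) They are not
matrices internally, but they have rows and columns and convert with
as.matrix. Will boxplot do the right thing for matrices? To check, look
at the output of boxplot(as.matrix(DF)).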

Yes No

The above two questions show that for rectangular data, the boxplot
function does what we would like with minimal fuss. Good. However,
a lot of two-sample data will not fit into a data frame with one
column per sample, for instance when the two samples have different
sizes. The alternative storage is to have one column for the values
and one column indicating which group. (This generalizes to more than two samples, which leads to ANOVA.)

The stack command is used to make this format. (More generally there
is the reshape function for this type of work and the reshape2 package.)
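As a sketch of this layout, here are two made-up samples of unequal size (a and b are hypothetical, not from the notes), stored first by hand and then with stack, which also accepts a named list.

```r
# Hypothetical samples of unequal size
a <- c(1.2, 3.4, 2.2)
b <- c(2.8, 4.1)

# Built by hand: one column of values, one column of group labels
long1 <- data.frame(
  values = c(a, b),
  ind    = factor(rep(c("a", "b"), times = c(length(a), length(b))))
)

# The same layout from stack(), applied to a named list
long2 <- stack(list(a = a, b = b))

long1
long2
```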

Run the command

st <- stack(DF)

What type of storage does R use for ind? (Use class(st$ind))

character factor logical numeric

The stacked format works with R's formula interface. We can more or
less avoid this when working with two samples, but it is a huge
advantage when working with multivariate data. It is one area where R
shines compared to other languages when doing statistics.
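To give a feel for the formula interface outside of plotting, here is a minimal sketch with aggregate, a base R function that also accepts a formula. Read values ~ ind as "values, split by ind". The use of aggregate is an illustration chosen here, not something taken from the notes.

```r
# The lesson's data, repeated so the example is self-contained
DF <- data.frame(
  bottom  = c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716),
  surface = c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
)
st <- stack(DF)

# Mean of values within each level of ind
group_means <- aggregate(values ~ ind, data = st, FUN = mean)
group_means
```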

Does the following notation make the same side-by-side boxplot: boxplot(values ~ ind, data=st)?

Yes No

The t-test can be run in many ways with t.test. Do all of these produce the same output?

t.test(bottom, surface)
t.test(DF$bottom, DF$surface)
with(DF, t.test(bottom, surface))
t.test(values ~ ind, data=st)

Yes No
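One way to compare calls like these is to save the returned objects and look at their components rather than at the printed output. The sketch below does this for two of the calls; the names res_default and res_formula are made up for the example.

```r
bottom  <- c(0.43, 0.266, 0.567, 0.531, 0.707, 0.716)
surface <- c(0.415, 0.238, 0.39, 0.41, 0.605, 0.609)
st <- stack(data.frame(bottom, surface))

res_default <- t.test(bottom, surface)
res_formula <- t.test(values ~ ind, data = st)

# Numeric parts of the two results
c(res_default$statistic, res_formula$statistic)
c(res_default$p.value, res_formula$p.value)

# The label describing the data, which is part of the printed output
res_default$data.name
res_formula$data.name
```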

As mentioned in the notes, R uses “generic functions” to allow one
function name to dispatch to different functions depending on the
arguments you supply. In computer science terms, this kind of dispatch
is a form of polymorphism in the object-oriented literature. (A point I
make for those of you who already know that.) Base R has three different
ways to achieve this, and there are others provided in add-on
packages. The simplest and most common is S3. There the class of the
first argument to a function is considered. This is why both
t.test(bottom, surface) and t.test(values ~ ind, data=st) work, as
different functions are ultimately consulted. (The first has a numeric
variable for the first argument, the second a formula.)
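A minimal sketch of S3 dispatch, using a made-up generic called describe (hypothetical, not a base R function). The generic calls UseMethod, and R then looks for a function named describe.<class> based on the class of the first argument, falling back to describe.default.

```r
# Hypothetical generic, for illustration only
describe <- function(x, ...) UseMethod("describe")

# Methods are ordinary functions named generic.class
describe.numeric <- function(x, ...) "numeric method"
describe.formula <- function(x, ...) "formula method"
describe.default <- function(x, ...) "default method"

describe(c(1, 2, 3))   # first argument is numeric
describe(y ~ x)        # first argument is a formula
describe("text")       # no character method, so the default is used
```

This mirrors how t.test can accept either a numeric vector or a formula as its first argument.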

The methods function will list the different “methods” registered for a generic function. How many are there for t.test?

The term “Non-visible” means what? Well, the function is there but can't be seen without extra help. Which of these will find the definition (a bunch of code) for the formula implementation of t.test?

t.test.formula
stats:::t.test.formula
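Apart from typing a name directly, the utils package offers getS3method for fetching a registered method by generic and class. A brief sketch (the variable name f is arbitrary):

```r
# Look up the method registered for generic "t.test" and class "formula"
f <- getS3method("t.test", "formula")

is.function(f)      # the lookup returns a function object
names(formals(f))   # its argument names
```

Printing f shows the code of the method; getAnywhere("t.test.formula") is another base R tool that also reports where the function was found.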
