Lesson 2

The questions

This quiz covers the questions in the notes for week 2 of the R Intro Statistics Dot Com class. If you have questions, please discuss them on the online forum there.

What gets plotted by:

plot(values ~ ind, data=stacked)

(Here stacked is stacked data, so values is numeric and ind is a factor.)

side-by-side boxplots
a scatter plot

Two versions of a test are given within a class. Suppose the students
were selected at random to take the first or second test. The data are:

test 1 scores  75  85  78  82  65  85
--------------------------------------
test 2 scores  90  95  87  92  94  95

If we view the test 1 scores as a random sample from a population, describe the population:

Geez, this is hard to know without more background. But the sampling is done from the class, so that would be the population.
All statistics.com students who have ever taken this class

Let us enter the data on tests with:

t1 <- c(75, 85, 78, 82, 65, 85)
t2 <- c(90, 95, 87, 92, 94, 95)

Which command will do a two-sample \( t \)-test with an assumption of equal variances?

t.test(t1, t2)
t.test(t1, t2, var.equal=TRUE)
two.sample.t.test(t1,t2)

The ToothGrowth data set has tooth measurements for various
dosages of two supplements. This command compares the two
supplements for the smallest dose:

t.test(len ~ supp, ToothGrowth, subset = dose == 0.5)

## 
##  Welch Two Sample t-test
## 
## data:  len by supp 
## t = 3.17, df = 14.97, p-value = 0.006359
## alternative hypothesis: true difference in means is not equal to 0 
## 95 percent confidence interval:
##  1.719 8.781 
## sample estimates:
## mean in group OJ mean in group VC 
##            13.23             7.98 
##

Repeat the above with the dose value 2.0. What is the p-value?

This shows the useful subset argument, which restricts the cases (rows) considered when a model formula is used to specify a problem.
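As a rough sketch of what restricting the rows amounts to, the following compares the subset argument with filtering the data frame first using subset(). It reuses the dose 0.5 comparison already shown; the names via_argument and via_filter are only illustrative.

```r
# Restrict the rows through the subset argument of the formula interface ...
via_argument <- t.test(len ~ supp, data = ToothGrowth, subset = dose == 0.5)

# ... or filter the data frame first and pass the result as data.
via_filter <- t.test(len ~ supp, data = subset(ToothGrowth, dose == 0.5))

# Both routes use the same rows, so the p-values agree.
all.equal(via_argument$p.value, via_filter$p.value)
```

Either form should work for the other dose values as well; only the logical condition changes.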

This command computes a subset of the morley data set for experiments 1 and 5.

d <- subset(morley, subset = Expt == 1 | Expt == 5)

A t-test is then done with:

t.test(Speed ~ Expt, data=d)

If instead of 1 and 5 you used 2 and 4, what would be the \( p \)-value?

In the above we use a logical operator to subset (it is in 1 *or* it is in 5). It can also be done using the %in% operator, as with Expt %in% c(1,5).
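A small check, assuming only the morley data set that ships with R, suggests the two ways of writing the condition pick out the same rows. The names d_or and d_in are made up for this sketch.

```r
# Subset with the logical "or" operator, as in the command above.
d_or <- subset(morley, Expt == 1 | Expt == 5)

# Subset with %in%, which tests membership in a vector of values.
d_in <- subset(morley, Expt %in% c(1, 5))

# Same rows, same columns, same row names.
identical(d_or, d_in)
```

The %in% form tends to read better once more than two experiments are wanted, since the vector on the right simply grows.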

You were asked the following: For the home data set from
UsingR, make side-by-side boxplots of the two variables. Make them. Which of the following values is the best estimate for Q3 of the new variable?

1.4e+05 (140,000)
2.8e+05 (280,000)
3.2e+05 (320,000)
3.6e+05 (360,000)
4.2e+05 (420,000)
6e+05 (600,000)

Does this graphic show similar distributions?

require("UsingR")

## Loading required package: UsingR

## Loading required package: MASS

with(homedata, qqplot(y1970, y2000))

[Figure: quantile-quantile plot of y1970 against y2000 from homedata]

Well, no. Similar distributions show up more or less as straight lines in this graphic.
Well, obviously. They are measuring the same thing: home prices, just in different years.

The twins data set from UsingR has IQ scores for identical twins
raised under different circumstances. The assumption of independent
samples should be clearly wrong for such data and the idea of pairing
should hopefully be natural. You are asked to perform a two-sided
paired \( t \)-test for equivalence of means for the Foster and
Biological data.

What is the \( p \)-value?
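To illustrate what pairing does, here is a sketch on a different data set: the built-in sleep data, used purely as a stand-in for twins. It assumes the rows of sleep are ordered by subject within each group, and the names drug1, drug2, paired and on_diffs are illustrative. A paired test is the same as a one-sample test on the within-pair differences.

```r
# sleep (package datasets): extra hours of sleep for the same 10 subjects
# under two drugs, ordered by subject within each group.
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired two-sided t-test on the two measurements per subject.
paired <- t.test(drug1, drug2, paired = TRUE)

# Equivalent: one-sample t-test that the mean within-pair difference is 0.
on_diffs <- t.test(drug1 - drug2, mu = 0)

all.equal(paired$p.value, on_diffs$p.value)
```

The same pattern should carry over to any two numeric columns measured on the same pairs.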

You have this question

For more than a century, the three species of large fish – gumpies, sticklebarbs, and spotheads – that are native to a certain river have been observed to co-exist in equal proportions of one-third each. But now a random sample of 300 large fish drawn from a standard fish-sampling location has turned up numbers and proportions suggesting that something has occurred to upset the natural ecology of the river. If the three fish species still inhabited the river in equal proportions, we would expect to find about 100 instances of each in a sample of size N=300; whereas what we actually observe are 89 gumpies, 120 sticklebarbs, and 91 spotheads.
Taken from http://faculty.vassar.edu/lowry/ch8pt1.html
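The arithmetic behind the "about 100 of each" remark can be sketched by hand. The block below computes the Pearson chi-squared statistic directly from the counts quoted above; the variable names (observed, expected, statistic, p_value) are chosen just for this illustration.

```r
# Observed counts from the sample of 300 fish.
observed <- c(gumpies = 89, sticklebarbs = 120, spotheads = 91)

# Equal proportions of one-third each give an expected count of 100 per species.
expected <- sum(observed) * rep(1/3, 3)

# Pearson statistic: sum of (observed - expected)^2 / expected,
# here (121 + 400 + 81) / 100 = 6.02.
statistic <- sum((observed - expected)^2 / expected)

# Upper-tail p-value with 3 categories - 1 = 2 degrees of freedom.
p_value <- pchisq(statistic, df = length(observed) - 1, lower.tail = FALSE)
```

This is only the by-hand version of the goodness-of-fit calculation; in practice a built-in test function would normally be used instead.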

If we set the data to be f:

f <- c(89, 120, 91)

Which of these is the appropriate command to use to do the test?

chisq.test(f, p=c(1,1,1))
chisq.test(f, c(1,1,1)/3)
chisq.test(f, p=c(1,1,1)/3)
chisq.test(f, p=c(1,1,1), simulate.p.value=TRUE)
