Lab 9
Tobias Johnson
4/19/2018
Getting Started
A few quick reminders:
The console lets you try commands and see the results right away, but nothing you do there is saved.
To create the file that you’ll eventually hand in, go to File -> New File -> R Script. Save the file as lab9.R or something like that. In this file, put your answers like this:

    # Exercise 1
    mean(dataset$left.bicep.thickness)
    mean(dataset$right.bicep.thickness)
    # On average, people's right biceps are thicker than their left biceps

    # Exercise 2
To compile your R file, press Ctrl-Shift-K, or click on the little notebook icon in the toolbar. When it asks for the report output format, choose html.
If you load a dataset in the console, this doesn’t make it available in your R file. If you load a dataset in your R file, this doesn’t make it available in the console.
To run a line of code from your R file in the console without having to type it in again, put your cursor on the line and press Ctrl-Enter.
You can look up a command in help by putting the cursor on it and pressing F1. Or, in the console, enter a question mark followed by the name of the command, like ?mean.
Goals
Today’s lab is about finding confidence intervals and doing tests of significance based on the t-distribution, using the t.test command in R.
Exercise 1
We’ll again use data from the 2010 General Social Survey, though with more variables than last time. You can download the data at http://www.math.csi.cuny.edu/~maher/teaching/2019/spring/stats/labs/gss_expanded.csv
Each observation in the dataset represents an individual who responded to the poll. The variables in the dataset are:
variable | definition |
---|---|
sibs | number of siblings |
age | age |
physhlth | number of days out of the last 30 the respondent experienced problems with physical health |
mntlhlth | number of days out of the last 30 the respondent experienced depression or other mental health problems |
sex | the sex of the respondent, coded as "female" or "male" |
income | the annual household income of the respondent |
Task. Load the dataset into an object called gss.
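As a minimal sketch of how read.csv loads a CSV file into a data frame, the example below writes a tiny made-up file to a temporary location and reads it back. The values and the object name toy are hypothetical and are not part of the GSS data; read.csv also accepts a URL in place of the file path, assuming the file has a header row.

```r
# Write a tiny made-up CSV to a temporary file (hypothetical values, not GSS data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("sibs,age,income",
             "2,34,42000",
             "5,51,38000",
             "0,27,55000"), tmp)

# read.csv takes a file path (or a URL) and returns a data frame
toy <- read.csv(tmp)

head(toy)  # first few rows
str(toy)   # variable names and types
```

Checking the result with head or str right after loading is a quick way to confirm that the variable names came through as expected.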
Exercise 2
Do people from very large families have different incomes from people from smaller families? We’ll investigate this question now. We consider people to be from a large family if they have four or more siblings, to match what we looked at in Lab 8.
Task. Using the subset command, split the dataset into two new datasets called gss.large and gss.small, where gss.large has all observations with four or more siblings and gss.small has all observations with fewer than four siblings.
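The sketch below illustrates splitting a data frame with subset, using a small made-up data frame. The names toy, toy.large, and toy.small and all values are hypothetical, chosen only to show the pattern.

```r
# Made-up data frame (hypothetical values, not the GSS data)
toy <- data.frame(sibs   = c(0, 1, 4, 7, 3, 5),
                  income = c(50, 42, 38, 30, 61, 45))

# subset(data, condition) keeps the rows where the condition is TRUE
toy.large <- subset(toy, sibs >= 4)  # four or more siblings
toy.small <- subset(toy, sibs < 4)   # fewer than four siblings

toy.large
toy.small
```

Note that subset treats a missing (NA) condition as false, so a row whose sibs value is missing would end up in neither of the two new data frames.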
Exercise 3
Report the sizes of your two samples. (Remember the nrow command!)
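As a quick reminder of how nrow behaves, here is a small sketch on a made-up data frame (the name toy and its values are hypothetical):

```r
# Made-up data frame with five observations (hypothetical values)
toy <- data.frame(sibs   = c(0, 1, 4, 7, 3),
                  income = c(50, 42, 38, 30, 61))

n.all   <- nrow(toy)                     # number of rows (observations) in toy
n.large <- nrow(subset(toy, sibs >= 4))  # rows with four or more siblings

n.all
n.large
```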
Exercise 4
Here is a summary of the conditions in which inference using the t-distribution is valid for two-sample procedures:
Simple random samples: The two datasets must be simple random samples from different populations; or, for an experiment, the two datasets must consist of a simple random sample from the population with each individual then assigned at random to one of two groups.
Skew and outlier conditions:
- If the combined sample size, \(n_1+n_2\), is smaller than 15: inference is valid only if the data from both populations is very close to normal.
- If the combined sample size, \(n_1+n_2\), is at least 15 but smaller than 40: inference is valid only if there are no outliers or strong skew in either dataset.
- If the combined sample size, \(n_1+n_2\), is 40 or larger: inference is valid unless there are extreme skew or outliers in either of the datasets.
Task. Judge if conditions for inference are satisfied for comparing the mean income of people from large and small families. You’ll need to make histograms of the income variable for both samples.
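One way to check the skew and outlier conditions is to draw the two histograms side by side and compute the combined sample size. The sketch below does this on simulated, right-skewed values; the names group.a and group.b and the simulated numbers are hypothetical stand-ins, not GSS data.

```r
set.seed(1)

# Simulated right-skewed "incomes" (hypothetical, not GSS data)
group.a <- rexp(60, rate = 1 / 40000)
group.b <- rexp(45, rate = 1 / 35000)

# Draw the two histograms side by side
par(mfrow = c(1, 2))
hist(group.a, main = "Group A", xlab = "income")
hist(group.b, main = "Group B", xlab = "income")
par(mfrow = c(1, 1))  # restore the default single-plot layout

# Combined sample size, which determines which condition applies
n.total <- length(group.a) + length(group.b)
n.total
```

With real data you would pass the income column of each sample to hist instead of the simulated vectors.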
Exercise 5
Task. Regardless of whether conditions for inference are satisfied, find a t-confidence interval for the difference in mean income in the two populations. Use a 95% confidence level.
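As a sketch of how t.test reports a two-sample confidence interval, the example below uses two simulated samples. The vectors x and y are hypothetical; for the lab you would pass the income columns of your two datasets instead.

```r
set.seed(2)

# Two simulated samples (hypothetical values)
x <- rnorm(30, mean = 50, sd = 10)
y <- rnorm(25, mean = 45, sd = 12)

# Two-sample t procedure; by default t.test does not assume equal variances (Welch)
result <- t.test(x, y, conf.level = 0.95)

result$conf.int  # 95% confidence interval for (mean of x) - (mean of y)
```

Printing result itself shows the full output, including the interval, the t statistic, and the degrees of freedom.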
Exercise 6
Task. Carry out a test of significance at the 0.05 level for whether the two populations have different mean incomes. In a comment, state the null and alternative hypotheses for your test. In another comment, report the p-value for your test, and state the conclusion (i.e., reject or fail to reject the null hypothesis).
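The same t.test command also carries out the test of significance. Here is a minimal sketch on simulated data, assuming a two-sided alternative; the vectors x and y are hypothetical and stand in for the two income samples.

```r
set.seed(3)

# Two simulated samples (hypothetical values)
x <- rnorm(30, mean = 50, sd = 10)
y <- rnorm(25, mean = 45, sd = 12)

# H0: the two population means are equal
# Ha: the two population means are different (two-sided)
result <- t.test(x, y, alternative = "two.sided")

result$p.value          # p-value of the test
result$p.value < 0.05   # TRUE means reject H0 at the 0.05 level
```

Since "two.sided" is the default for the alternative argument, writing it out is optional, but it makes the hypotheses explicit.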
Exercise 7
Task. Come up with a research question of your own comparing an average value for a variable in this dataset between two different groups. Write your question in a comment.
Exercise 8
Task. Find a confidence interval and perform a test of significance addressing your research question.