For the second report, your work starts by creating a list of interesting hypotheses to study. These should be of the following kinds, to match up with the tests you will be learning.
I want you to make your hypotheses based on a subset of your data. If your dataset is called dataset, then you will subset your data with the following code:
set.seed(Last4DigitsFromYourStudentIDNumber)
datasubset = dataset[sample(nrow(dataset))[1:(nrow(dataset)/2)],]
and then base your hypotheses on the values from that subset, datasubset.
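As a minimal sketch of what this subsetting does, the example below uses R's built-in mtcars data as a stand-in for your own dataset and 1234 as a placeholder seed. Both are assumptions made only for illustration.

```r
# Stand-in for your own data; mtcars ships with R (assumption for illustration).
dataset <- mtcars

# Placeholder seed; use the last 4 digits of your own student ID instead.
set.seed(1234)

# Shuffle the row indices, then keep the first half of them.
datasubset <- dataset[sample(nrow(dataset))[1:(nrow(dataset)/2)], ]

# The subset has half as many rows and the same columns as the original.
nrow(dataset)
nrow(datasubset)
ncol(datasubset) == ncol(dataset)
```

Because the seed is fixed, rerunning these lines selects the same rows every time, so your hypotheses stay tied to one reproducible subset.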
Notice that even if you suspect two values to be almost the same, the test you would perform is to test for inequality, and then take a failure to reject the null hypothesis as support for the values being similar (but not necessarily equal).
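The logic of testing for inequality can be sketched as follows. The data here are simulated, and the vector x, the seed, and the reference value 5 are all made up for illustration; t.test is the base R function for this kind of comparison, and it may not be the exact procedure you end up using in the course.

```r
# Simulated measurements (hypothetical data, not from any course dataset).
set.seed(1)
x <- rnorm(40, mean = 5, sd = 1)

# Two-sided test: null hypothesis "mean equals 5",
# alternative hypothesis "mean is not equal to 5".
result <- t.test(x, mu = 5, alternative = "two.sided")

# If the p-value is large, we fail to reject the null hypothesis. That
# supports the mean being close to 5, but does not prove it equals 5.
result$p.value

# The confidence interval shows which values of the mean are plausible.
result$conf.int
```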
You should make at least 5 different kinds of hypotheses from this list.
- mean(datasubset$Variable) is [not equal to] / [larger than] / [smaller than] M for some specific value M. Example hypothesis: mean(beer$PercentAlcohol) > 5.
- mean(datasubset$Variable1[SomeCondition]) is [not equal to] / [larger than] / [smaller than] mean(datasubset$Variable1[SomeOtherCondition]), where Variable1 may be the same as Variable2, or different. This is a type of hypothesis that can be inspired by plots like ggplot(datasubset, aes(x=Variable1, color=Variable2)) + geom_freqpoly() and seeing different values of the (categorical) Variable2 producing different or similar distributions. Example hypothesis: mean(beer$PercentAlcohol[beer$Brewery=="Sierra Nevada"]) > mean(beer$PercentAlcohol[beer$Brewery=="Flying Dog Brewery "])
- The proportion of some label of a categorical variable is [not equal to] / [larger than] / [smaller than] P for some specific value P. Example hypothesis: At least 45% of likely voters plan to vote for Trump.
- The proportion of one specific label is [not equal to] / [larger than] / [smaller than] the proportion of another specific label. Example hypothesis: A larger proportion of likely voters plan to vote for Clinton than for Trump.
- There is a relation between the rows and the columns of a two-way table. Example hypothesis: Gender affects the choice of lunch food.
- The frequencies of labels in one particular variable follow a particular distribution. Example hypothesis: this six-sided die is fair: each number has equal probability of occurring.
- There is a linear relationship between a specific pair of variables. Example hypothesis: Calories and alcohol content of beer are related through a linear equation: PercentAlcohol = b1*Calories + b0 where b1 ≠ 0.
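To get from the subset to concrete hypotheses of these kinds, it can help to compute the corresponding summary values first. The following is a minimal sketch, assuming the built-in mtcars data as a stand-in dataset and a placeholder seed. The column choices (mpg, am, cyl, wt) are illustrative only, and these are exploratory summaries for formulating hypotheses, not the tests themselves.

```r
# Stand-in data and placeholder seed (assumptions for illustration).
dataset <- mtcars
set.seed(1234)
datasubset <- dataset[sample(nrow(dataset))[1:(nrow(dataset)/2)], ]

# Mean of one variable, to compare against a specific value M.
mean(datasubset$mpg)

# Means of one variable under two different conditions.
mean(datasubset$mpg[datasubset$am == 0])
mean(datasubset$mpg[datasubset$am == 1])

# Proportion of one label of a categorical variable, to compare against P.
mean(datasubset$cyl == 4)

# Proportions of all labels, to compare labels with each other
# or with a proposed distribution.
label_props <- prop.table(table(datasubset$cyl))
label_props

# Two-way table of two categorical variables (rows: am, columns: cyl).
two_way <- table(datasubset$am, datasubset$cyl)
two_way

# Fitted line mpg = b1 * wt + b0, to see whether b1 looks different from 0.
fit <- lm(mpg ~ wt, data = datasubset)
coef(fit)
```

Each printed value suggests a candidate hypothesis: for example, a subset mean near some round number suggests a choice of M, and clearly different proportions between two labels suggest a hypothesis about which is larger.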
Due dates
To support you in formulating hypotheses, you will read an assigned fellow student's report and suggest ideas to them based on what you read. These ideas are due October 24.
You then need to formulate your hypotheses and register them with me through Blackboard. Your hypotheses are due October 31.
A draft of your report for feedback, as with the first report, is welcome by December 5.
Your report is due through Blackboard timestamped no later than 14.30 on December 12.