For the second report, your work starts by creating a list of interesting hypotheses to study. These should be of the following kinds, to match up with the tests you will be learning.

I want you to make your hypotheses based on the same subset of data you used for the first report. If your dataset is called dataset, then you will subset your data with the following code:

datasubset = subset(dataset, 1:nrow(dataset) %in% sample.int(nrow(dataset), nrow(dataset)/2))

and then base your hypotheses on the values from that dataset.
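
Because sample.int draws rows at random, running the subsetting line again will normally pick a different half. If you want the draw to be repeatable, one option is to fix the random seed first. The sketch below is only an illustration: the toy data frame and the seed value 2024 are placeholders, not part of the assignment.

```r
# Toy data frame standing in for your own dataset (hypothetical values).
dataset <- data.frame(id = 1:10, value = (1:10)^2)

# Fixing the seed makes the random draw repeatable; 2024 is an arbitrary choice.
set.seed(2024)
keep <- sample.int(nrow(dataset), nrow(dataset) / 2)
datasubset <- subset(dataset, 1:nrow(dataset) %in% keep)

# Resetting the same seed reproduces the same row selection.
set.seed(2024)
keep_again <- sample.int(nrow(dataset), nrow(dataset) / 2)
same_rows <- identical(keep, keep_again)

print(same_rows)
print(nrow(datasubset))
```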

Note that even if you suspect two values are almost the same, the test you perform is still a test for inequality. You then take a failure to reject the null hypothesis as support for the values being similar (but not necessarily equal).
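
As a rough illustration of that logic, here is a two-sided test on simulated data. The two samples, the use of t.test, and the 0.05 significance level are all assumptions made for this sketch; use whichever test and level the course prescribes.

```r
set.seed(1)

# Simulated measurements (hypothetical): both groups come from the same distribution.
a <- rnorm(40, mean = 5, sd = 1)
b <- rnorm(40, mean = 5, sd = 1)

# Null hypothesis: the two means are equal. Alternative: they are not equal.
result <- t.test(a, b, alternative = "two.sided")
print(result$p.value)

# Assumed significance level for this sketch.
alpha <- 0.05
if (result$p.value < alpha) {
  message("Reject the null: the data suggest the means differ.")
} else {
  message("Fail to reject the null: no evidence of a difference at this level.")
}
```

A failure to reject here does not prove the means are equal; it only says this sample gives no evidence of a difference.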

You should make at least 5 different kinds of hypotheses from this list.

  1. mean(datasubset$Variable) is [not equal to] / [larger than] / [smaller than] M for some specific value M. Example hypothesis: mean(beer$PercentAlcohol) > 5.
  2. mean(datasubset$Variable1[SomeCondition]) is [not equal to] / [larger than] / [smaller than] mean(datasubset$Variable1[SomeOtherCondition]). This is a type of hypothesis that can be inspired by plots like ggplot(datasubset, aes(x=Variable1, color=Variable2)) + geom_freqpoly() and seeing different values of the (categorical) Variable2 producing different or similar distributions. Example hypothesis: mean(beer$PercentAlcohol[beer$Brewery=="Sierra Nevada"]) > mean(beer$PercentAlcohol[beer$Brewery=="Flying Dog Brewery"]).
  3. mean(datasubset$Variable1) is [not equal to] / [larger than] / [smaller than] mean(datasubset$Variable2). Example hypothesis: mean(beer$sales.2016) != mean(beer$sales.2017).
  4. The proportion of some label of a categorical variable is [not equal to] / [larger than] / [smaller than] P for some specific value P. Example hypothesis: At least 10% of all beer types are Ales.
  5. median(datasubset$Variable) is [not equal to] / [larger than] / [smaller than] M for some specific value M. Example hypothesis: median(beer$PercentAlcohol) > 5.
  6. The proportion of one specific label is [not equal to] / [larger than] / [smaller than] the proportion of another specific label. Example hypothesis: A larger proportion of beer brands are IPAs than are Lagers.
  7. There is a relation between the rows and the columns of a two-way table. Example hypothesis: Gender affects the choice of lunch food.
  8. The frequencies of labels in one particular variable follow a particular distribution. Example hypothesis: this six-sided die is fair: each number has equal probability of occurring.
  9. There is a linear relationship between a specific pair of variables. Example hypothesis: Calories and alcohol content of beer are related through a linear equation: PercentAlcohol = b1*Calories + b0 where b1≠0.
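
To give a feel for how some of these kinds could later be tested, here is a sketch using base R functions that are commonly applied to such hypotheses (t.test, prop.test, chisq.test, lm). It covers kinds 1, 2, 4, 7, 8 and 9 only. The beer data frame is simulated, every column name and value in it is made up, and the tests taught in the course may differ from the ones chosen here.

```r
set.seed(42)
n <- 200

# Simulated stand-in for a beer dataset; all columns and values are hypothetical.
beer <- data.frame(
  Calories = rnorm(n, mean = 150, sd = 30),
  Type     = sample(c("Ale", "IPA", "Lager"), n, replace = TRUE),
  Brewery  = sample(c("A", "B"), n, replace = TRUE)
)
beer$PercentAlcohol <- 1 + 0.03 * beer$Calories + rnorm(n, sd = 0.5)

# Kind 1: one-sample t-test of the mean against M = 5.
t1 <- t.test(beer$PercentAlcohol, mu = 5, alternative = "greater")

# Kind 2: two-sample t-test comparing the mean between two groups.
t2 <- t.test(beer$PercentAlcohol[beer$Brewery == "A"],
             beer$PercentAlcohol[beer$Brewery == "B"])

# Kind 4: one-sample proportion test against P = 0.10.
p4 <- prop.test(sum(beer$Type == "Ale"), n, p = 0.10, alternative = "greater")

# Kind 7: chi-squared test of independence on a two-way table.
c7 <- chisq.test(table(beer$Type, beer$Brewery))

# Kind 8: chi-squared goodness-of-fit test against equal probabilities.
c8 <- chisq.test(table(beer$Type), p = rep(1/3, 3))

# Kind 9: linear regression; the Calories row tests whether the slope b1 is 0.
fit <- lm(PercentAlcohol ~ Calories, data = beer)
slope_row <- summary(fit)$coefficients["Calories", ]

# Collect the p-values for a quick overview.
pvals <- c(kind1 = t1$p.value, kind2 = t2$p.value, kind4 = p4$p.value,
           kind7 = c7$p.value, kind8 = c8$p.value,
           kind9 = unname(slope_row["Pr(>|t|)"]))
print(round(pvals, 4))
```

Each hypothesis in your list should be written down before you run anything like this, since the point of this step is to decide what to test, not to see which tests come out significant.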