# First report: descriptive statistics

Your first report will describe your dataset in detail, with special focus on a handful of variables in the data.

Subset your data to a random half of its rows using the following code:

```
# Replace the placeholder with the last four digits of your student ID number
set.seed(Last4DigitsFromYourStudentIDNumber)
data = subset(data, 1:nrow(data) %in% sample.int(nrow(data), nrow(data)/2))
```
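To see what the snippet does, here is a minimal sketch that applies it to the built-in `iris` data. Both the stand-in dataset and the seed `1234` are assumptions made for illustration; in your report, use your own data and your own digits.

```r
# Stand-in dataset (assumption): use your own data here
data <- iris
n_before <- nrow(data)

# 1234 is a made-up placeholder for the last four digits of a student ID
set.seed(1234)
data <- subset(data, 1:nrow(data) %in% sample.int(nrow(data), nrow(data)/2))

# A random half of the rows remains, still in the original row order
n_after <- nrow(data)
c(before = n_before, after = n_after)
```

Because the seed is fixed, knitting the report again selects the same rows every time.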

## Criteria for F

A report receives an F if any of the following applies:

- The report is not handed in on time.
- One of the instructed tasks is omitted completely.
- Knitting errors in the report prove difficult or time-consuming to correct.
- Four or more of the criteria for C have minor errors.

## Criteria for D

A report receives a D if it has minor errors in the criteria for C. For example: the report file does not knit, but the errors are relatively easy to fix; the report has grammatical or spelling errors; etc.

## Criteria for C

Your report will be readable, written in English, without grammatical or spelling errors. It will be submitted as an **RMarkdown** file, together with all data files needed to knit the report into a finished text. Your RMarkdown file will run on the lab computers without errors.

Your report will subset the data using the code snippet on this page before any visualization or computation takes place.

Your report will describe general information about the dataset, including:

- Who collected the data?
- When?
- Where?
- Why?

You will suggest what kinds of questions the data was collected to answer. For instance, the *Iris* data we have looked at could be said to have been collected to find methods for determining the species of flower specimens.

You will describe the layout of the data:

- How many cases (observations)?
- How many variables? What are they? What do they contain? Are they numeric, ordinal, categorical or something else? (dates? times?)
- For each numeric variable, give an appropriate measure of center and an appropriate measure of spread.
- For each categorical variable, give an overview of the possible values. If there are only a few categories, list them all; if there are many categories, tell us how many there are.
- Are there missing values? How are they encoded? How common are they? If there are missing values, include some number (count or percentage) that describes how much is missing in your description of each variable.
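The layout questions above can be answered with a few base R calls. The following is a minimal sketch, assuming the built-in `airquality` data as a stand-in (chosen only because it contains missing values) and treating `Month` as a categorical variable; which measures of center and spread are appropriate for your own variables is for you to decide.

```r
# Stand-in dataset with missing values (assumption): use your own data here
data <- airquality

# Number of cases and variables, and the type of each variable
n_cases <- nrow(data)
n_vars <- ncol(data)
str(data)

# Numeric variable: one possible choice of center and spread
# (na.rm = TRUE drops missing values before computing)
ozone_median <- median(data$Ozone, na.rm = TRUE)
ozone_iqr <- IQR(data$Ozone, na.rm = TRUE)

# Categorical variable: the possible values and how often each occurs
month_counts <- table(factor(data$Month))
n_categories <- length(month_counts)

# Missing values: count and percentage per variable
na_counts <- colSums(is.na(data))
na_percent <- round(100 * colMeans(is.na(data)), 1)
na_counts
na_percent
```

Note that `is.na()` only finds values encoded as `NA`. If your dataset encodes missing values differently (for example as `-99` or an empty string), you will need to check for that encoding separately.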

You will pick a handful of variables for more careful study. You should pick no more than 5 each of numeric and categorical variables. Fewer is fine, even recommended.

For each of the picked variables, produce a detailed description of its distribution. Include an appropriate plot to show how the values of the variable are distributed. Where appropriate, evaluate whether you think the variable has a normal distribution. Explain your reasoning.
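As one possible starting point for describing distributions, the sketch below uses the built-in `iris` data as a stand-in (an assumption; substitute your own variables). It draws a histogram and a normal Q-Q plot for a numeric variable, runs a Shapiro-Wilk test as one optional piece of evidence about normality, and draws a bar plot for a categorical variable.

```r
# Stand-in dataset (assumption): use your own data here
data <- iris
x <- data$Sepal.Width

# Histogram: overall shape of the distribution
hist(x, main = "Sepal width", xlab = "Sepal width (cm)")

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(x)
qqline(x)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_test <- shapiro.test(x)
sw_test$p.value

# Categorical variable: a bar plot of the counts per category
barplot(table(data$Species), main = "Species", ylab = "Count")
```

A test result alone is not an explanation. Your reasoning should also refer to what the plots show.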

For each pair of picked variables, describe how they relate to each other. Where appropriate, produce plots, correlations, or two-way tables.
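Which tool fits a pair depends on the types of the two variables. Here is a minimal sketch, again assuming the built-in `iris` data as a stand-in. The column `LongPetal` is a hypothetical second categorical variable created only so that the two-way table has something to show.

```r
# Stand-in dataset (assumption): use your own data here
data <- iris

# Numeric vs. numeric: scatter plot and correlation
plot(data$Petal.Length, data$Petal.Width,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
r <- cor(data$Petal.Length, data$Petal.Width)
r

# Numeric vs. categorical: side-by-side box plots
boxplot(Petal.Length ~ Species, data = data,
        ylab = "Petal length (cm)")

# Categorical vs. categorical: two-way table
# LongPetal is a hypothetical variable made up for this illustration
data$LongPetal <- data$Petal.Length > median(data$Petal.Length)
two_way <- table(Species = data$Species, LongPetal = data$LongPetal)
two_way
```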

## Criteria for B

For a B, your report fulfills **all** the criteria for a C, and has minor errors in the criteria for A.

## Criteria for A

For every relationship between two numeric variables, discuss whether it is appropriate to fit a linear model. If it is appropriate, do so, and evaluate the quality of the model. Provide plots to justify your statements.
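Fitting and checking a linear model might look like the sketch below. It assumes the built-in `iris` data as a stand-in and one particular pair of variables; the diagnostics shown are common choices, not a complete evaluation.

```r
# Stand-in dataset (assumption): use your own data here
data <- iris

# Fit a simple linear model: petal width explained by petal length
fit <- lm(Petal.Width ~ Petal.Length, data = data)
summary(fit)

# Scatter plot with the fitted line
plot(data$Petal.Length, data$Petal.Width,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")
abline(fit)

# Residuals vs. fitted values: look for curvature or changing spread
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals
qqnorm(resid(fit))
qqline(resid(fit))

# Proportion of variance explained by the model
r_squared <- summary(fit)$r.squared
r_squared
```

If the scatter plot or the residual plot shows a clear curve, clusters, or a fan shape, that is a reason to argue that a linear model is not appropriate.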

Include a discussion that compares and contrasts possible presentations for your data: what options were you choosing between for plotting, for choosing measures of spread and center, and why did you pick the ones you used?
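One way to ground such a discussion is to compute the alternatives side by side. The sketch below assumes the built-in `airquality` data as a stand-in and uses its right-skewed `Ozone` variable to contrast mean and standard deviation with median and interquartile range, and a histogram with a box plot.

```r
# Stand-in dataset (assumption): use your own data here
x <- airquality$Ozone

# Two candidate pairs of center and spread
ozone_mean <- mean(x, na.rm = TRUE)
ozone_sd <- sd(x, na.rm = TRUE)
ozone_median <- median(x, na.rm = TRUE)
ozone_iqr <- IQR(x, na.rm = TRUE)

# With a long right tail, the mean is pulled above the median
c(mean = ozone_mean, median = ozone_median)
c(sd = ozone_sd, IQR = ozone_iqr)

# Two candidate plots of the same variable
hist(x, main = "Ozone", xlab = "Ozone (ppb)")
boxplot(x, horizontal = TRUE, main = "Ozone", xlab = "Ozone (ppb)")
```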

Include a critique of the original data collection. Given your understanding of **why** the data was collected, does the data help answer the questions you believe it was collected to answer? What would you like to see included in the dataset to better address these questions? Leave feasibility aside: what is your wish list for extending the data?