MVJ
12 April, 2018
We have a tutor! Maria Zakharycheva will be available once a week as a tutor for MTH214: both statistics and RStudio / RMarkdown.
Preferred time?
Day | Time |
---|---|
Monday | 4.30 - 6.25 |
Wednesday | 4.30 - 6.25 |
Thursday | 2.30 - 4.25 |
By next week, you need to have chosen a data set to work on.
Find suggested data sources on the course webpage.
Data can fail in several ways
Missing data is recorded as `NA` in R. Missing data is contagious: almost anything you compute with `NA` gives the result `NA`.
Data type errors can be visible when loading data into R: if you expected a numeric column but got a factor column, there might be something non-numeric in the data file.
Data entry errors can be visible as outliers: remarkably large or remarkably small observations. You must always inspect outliers – do not ignore data just because it weakens the analysis.
Missing data can often be handled by instructing functions to ignore the missing entries.
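A minimal sketch of how `NA` spreads through a computation, and how `na.rm = TRUE` asks a function to skip missing entries. The vector `heights` is made up for illustration.

```r
# Hypothetical measurements with one missing entry
heights <- c(162, 175, NA, 181)

mean(heights)                # NA: the missing value is contagious
mean(heights, na.rm = TRUE)  # mean of the three observed values
sum(is.na(heights))          # number of missing entries
```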
Numerical summaries are single numbers, or a few numbers, that summarize a dataset more compactly than graphs. Because they are numbers, they can be further analyzed mathematically.
What numbers that summarize a dataset do you know?
Center | Definition | Spread | Definition |
---|---|---|---|
mean | sum divided by count | standard deviation | root-mean-square deviation from the mean |
median | middlemost value | inter-quartile range | third quartile minus first quartile |
mid-point | (max + min) / 2 | range | max minus min |
mode | most common value | | |
The mean of a collection of numbers is their sum divided by the number of numbers:
\[ \overline x = \frac{\sum_{i=1}^N x_i}{N} = \frac{x_1+x_2+\dots+x_N}{N} \]
Side note: the notation \(\sum_{i=a}^b x_i\) means that you let \(i\) take on every value from \(a\) to \(b\), and you add up the results: so \(\sum_{i=1}^3 i^2\) would come out as \(1+4+9\).
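The summation in the side note can be checked directly in R. This is only a sketch; the name `i` mirrors the index in the formula.

```r
# Let i take every value from 1 to 3, square it, and add up the results
i <- 1:3
sum(i^2)   # 1 + 4 + 9
```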
Exercise What is the mean of
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 9 |
The sum is 62. There are 10 numbers in the list. This gives us a mean of 6.2.
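The same calculation as a short R sketch, once from the definition and once with the built-in `mean()`. The variable name `x` is chosen here for illustration.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

sum(x)               # the sum of the numbers
length(x)            # how many numbers there are
sum(x) / length(x)   # the mean, straight from the definition
mean(x)              # the built-in function gives the same value
```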
The median is the middlemost observation once the values are sorted.
If the number of elements in the list is even, the middle falls between two numbers. In this case the convention is to take the mean of those two numbers.
Exercise
Find the median of the list
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 |
Exercise
Find the median of the list
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 9 |
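One way to check your answers to both exercises in R. The names `odd_list` and `even_list` are made up here; the manual step spells out the even-length convention described above.

```r
odd_list  <- c(1, 3, 5, 6, 7, 7, 7, 8, 9)      # 9 values: a single middle value
even_list <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)   # 10 values: the middle falls between two

median(odd_list)

# Even length: take the mean of the two middle values of the sorted list
sorted <- sort(even_list)
n <- length(sorted)
manual_median <- mean(sorted[c(n / 2, n / 2 + 1)])
manual_median

median(even_list)   # the built-in function follows the same convention
```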
The mid-point is the mean of the maximum and the minimum: the value halfway through the full range of the numbers.
Exercise
Find the mid-point of
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 9 |
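Base R has no dedicated mid-point function as far as this sketch assumes, but it follows in one line from `min()` and `max()`, or from `range()`, which returns both.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

(max(x) + min(x)) / 2   # straight from the definition
mean(range(x))          # range(x) returns c(min, max), so its mean is the mid-point
```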
The mode is the most common value. The mode may well not be unique: several numbers could occur equally often.
Exercise
Find the mode of
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 9 |
One of the candidates for a mode is 7. Can you see any other?
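R's built-in `mode()` reports a storage type, not the statistical mode, so a frequency table is one way to look for the most common value or values. This is a sketch; `counts` and `modes` are names chosen here.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

counts <- table(x)   # how often each distinct value occurs
counts

# Keep every value that reaches the highest count: there may be more than one
modes <- as.numeric(names(counts)[as.vector(counts) == max(counts)])
modes
```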
A measure is robust if it is resistant to outliers: does it change much if the largest value is made much, much larger?
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 9 |
Measure | Value |
---|---|
Mean | 6.2 |
Median | 7 |
Mid-point | 5 |
Mode | 7 |
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 900 |
Measure | Value |
---|---|
Mean | |
Median | |
Mid-point | |
Mode |
Which of the measures do you think change with the change in data?
1 | 3 | 5 | 6 | 7 | 7 | 7 | 8 | 9 | 900 |
Measure | Value |
---|---|
Mean | 95.3 |
Median | 7 |
Mid-point | 450.5 |
Mode | 7 |
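The comparison can be reproduced with a small helper. `summarize_center` is a hypothetical function written for this sketch, not part of any package.

```r
x     <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)
x_big <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 900)   # largest value made much larger

# Hypothetical helper: three measures of center for one vector
summarize_center <- function(v) {
  c(mean = mean(v), median = median(v), midpoint = mean(range(v)))
}

# One row per data set: robust measures stay put, the others move
rbind(original = summarize_center(x), with_outlier = summarize_center(x_big))
```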
The variance of a variable is, up to the divisor \(N-1\) in place of \(N\), the mean squared deviation from the mean:
\[ s^2 = \frac{(x_1-\overline x)^2 + \dots + (x_N-\overline x)^2}{N-1} \]
The standard deviation of a variable is the square root of the variance. This makes the standard deviation into something similar to the distance from the data to its mean.
These measure spread about the mean, so they should not be paired with any other measure of center.
These are not robust.
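A sketch that computes the variance from the formula and compares it with R's `var()` and `sd()`, which also divide by \(N-1\). The names `dev` and `s2` are chosen here.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)
n <- length(x)

dev <- x - mean(x)            # deviations from the mean
s2  <- sum(dev^2) / (n - 1)   # variance, dividing by N - 1

s2
var(x)      # built-in variance: same value
sqrt(s2)
sd(x)       # built-in standard deviation: same value
```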
The quartiles of a variable are values such that 25% / 50% / 75% / 100% of the data is no larger than that value. The median is the second quartile. The maximum is the fourth.
The interquartile range is the difference between the third and the first quartile. Because of how quartiles work, we know that about 50% of the data falls between the first and the third quartile.
The IQR should be paired with the median.
The IQR is robust.
The range of a variable is the difference between the minimum and the maximum.
The range is not robust.
Quartiles form an example of quantiles: the \(p\)th quantile is a value such that a fraction \(p\) of the data is smaller than the quantile: the first quartile is the 0.25th quantile.
Outliers are points that are unusually large or small, and likely to influence mean/sd too much. One common definition is Tukey’s rule: \(x_k\) is an outlier if it falls more than \(1.5\times\)IQR above the third or below the first quartile.
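A sketch of Tukey's rule in R, assuming the default interpolation that `quantile()` and `IQR()` use (other quantile conventions give slightly different quartiles). The names `lower`, `upper` and `outliers` are chosen here.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 900)

q   <- unname(quantile(x, c(0.25, 0.75)))   # first and third quartile
iqr <- IQR(x)                               # third quartile minus first quartile

lower <- q[1] - 1.5 * iqr   # anything below this is flagged
upper <- q[2] + 1.5 * iqr   # anything above this is flagged

outliers <- x[x < lower | x > upper]
outliers
```

Note that the rule is symmetric: small values can be flagged as well as large ones, so it is worth inspecting everything the rule returns.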
Median, IQR and outliers are often described with boxplots:
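The boxplot has components: a box spanning the first to the third quartile, a line inside the box at the median, whiskers reaching out to the most extreme points that are not outliers, and outliers drawn as individual points.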
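As a sketch in base R, `boxplot()` draws the plot and `boxplot.stats()` returns the numbers behind it. Base R places the box edges at Tukey's hinges, which can differ slightly from the quartiles reported by `quantile()`, so the points it flags may differ from a hand computation.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 900)

boxplot(x, horizontal = TRUE)   # draws the box, whiskers and outlier points

parts <- boxplot.stats(x)
parts$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
parts$out     # points drawn individually as outliers
```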
Mean/sd | Median/IQR |
---|---|
Closely related to the normal distribution | Robust |
Easy to use in theoretical statistics | Difficult in theory |
Basis for many simple tests | |
Consider temperatures:
\[ {}^\circ C = \frac{5}{9}({}^\circ F-32) \]
How much can we say about the distribution of temperatures in Celsius if we know the distribution in Fahrenheit?
Overall shape stays the same: since \(5/9 > 0\), right-skewed remains right-skewed, etc.
Actual values for measures of center and spread change.
Measure | \({}^\circ F\to{}^\circ C\) | \(x\to a+bx\) |
---|---|---|
Mean \(\overline x\) | \(\frac{5}{9}(\overline x-32)\) | \(a+b\overline x\) |
Median \(m\) | \(\frac{5}{9}(m-32)\) | \(a+bm\) |
Mode \(m\) | \(\frac{5}{9}(m-32)\) | \(a+bm\) |
Variance \(s^2\) | \(\left(\frac{5}{9}\right)^2s^2\) | \(b^2s^2\) |
Standard Deviation \(s\) | \(\frac{5}{9}s\) | \(|b|s\) |
IQR \(iqr\) | \(\frac{5}{9}iqr\) | \(|b|\cdot iqr\) |
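The rules in the table can be checked numerically. The temperatures below are invented for this sketch; `a` and `b` are the intercept and slope of the Fahrenheit-to-Celsius conversion written as \(a+bx\).

```r
# Hypothetical temperatures in Fahrenheit
fahrenheit <- c(50, 59, 68, 77, 104)

b <- 5 / 9          # slope of the conversion
a <- -32 * 5 / 9    # intercept: (5/9)(F - 32) = a + b * F
celsius <- a + b * fahrenheit

# Center moves with the full transformation a + b * x
all.equal(mean(celsius),   a + b * mean(fahrenheit))
all.equal(median(celsius), a + b * median(fahrenheit))

# Spread ignores the shift a and only scales with b
all.equal(var(celsius), b^2 * var(fahrenheit))
all.equal(sd(celsius),  abs(b) * sd(fahrenheit))
all.equal(IQR(celsius), abs(b) * IQR(fahrenheit))
```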
favstats(~Time, data=time)
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
2 | 6 | 12 | 24 | 208 | 23.96 | 40.77221 | 25 | 0 |
time %>% filter(Time > 150) %>% kable
CountryName | CountryCode | Time |
---|---|---|
Suriname | SUR | 208 |
favstats(~Time, data=time %>% filter(Time <= 150))
min | Q1 | median | Q3 | max | mean | sd | n | missing |
---|---|---|---|---|---|---|---|---|
2 | 5.75 | 11.5 | 20.25 | 53 | 16.29167 | 14.16511 | 24 | 0 |
68 - 95 - 99.7 rule
For normally distributed data (details in two weeks), approximately:
% of the data | in the range |
---|---|
68% | \(\overline x \pm s\) |
95% | \(\overline x \pm 2s\) |
99.7% | \(\overline x \pm 3s\) |
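The percentages can be recovered from the normal distribution itself. In this sketch `pnorm()` is R's standard normal cumulative distribution function, and `k` counts standard deviations away from the mean.

```r
k <- 1:3   # one, two and three standard deviations

# Fraction of a normal distribution within k standard deviations of the mean
coverage <- pnorm(k) - pnorm(-k)
round(coverage, 3)
```

The rule only describes data that is close to normally distributed; for skewed data, such as a data set with a large outlier, the fractions can be quite different.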