Lecture 3

MVJ

12 April, 2018

Tutoring

We have a tutor! Maria Zakharycheva will be available once a week as a tutor for MTH214: both statistics and RStudio / RMarkdown.

Preferred time?

| Day | Time |
|---|---|
| Monday | 4.30 - 6.25 |
| Wednesday | 4.30 - 6.25 |
| Thursday | 2.30 - 4.25 |

Data set choice

By next week, you need to have chosen a data set to work on.

Find suggested data sources on the course webpage.

Data Failure

Data can fail in several ways:

Data type errors can be visible when loading data into R: if you expected a numeric column, but got a factor column, there might be something non-numeric in the data file.

Data entry errors can be visible as outliers: remarkably large or remarkably small observations. You must always inspect outliers – do not ignore data just because it weakens the analysis.

Missing data can often be handled by instructing functions to ignore the missing entries.
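As a rough illustration of the first and last points, the sketch below uses a made-up column (the values and the typo `"12o"` are assumptions, not course data). It shows how a non-numeric entry can be located and how `na.rm = TRUE` tells a function to skip missing entries.

```r
# Hypothetical raw column: one entry was typed as "12o" instead of a number,
# and one entry is missing altogether.
raw <- c("98", "104", "12o", NA, "101")

# Converting to numeric turns anything non-numeric into NA (normally with a
# warning), which makes the bad entries easy to locate.
values <- suppressWarnings(as.numeric(raw))
bad_rows <- which(is.na(values) & !is.na(raw))
bad_rows                      # position of the entry that failed to convert

# Many summary functions accept na.rm = TRUE to ignore missing entries.
mean(values)                  # NA, because some entries are missing
mean(values, na.rm = TRUE)    # mean of the three usable values
```

Ignoring the entries is only a way to get a number out; the bad row should still be inspected and, if possible, corrected in the data file.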

Summary Statistics

A summary statistic is a single number, or a few numbers, that summarizes a dataset more compactly than a graph. Because summary statistics are numbers, they can be further analyzed mathematically.

What numbers that summarize a dataset do you know?

Summary statistics: center and spread

| Center | Definition | Spread | Definition |
|---|---|---|---|
| mean | sum divided by count | standard deviation | root of the mean squared deviation from the mean |
| median | middlemost value | inter-quartile range | 75th percentile minus 25th percentile |
| mid-point | (max + min) / 2 | range | max minus min |
| mode | most common value | | |

Measure of Center: Mean

The mean of a collection of numbers is their sum divided by the number of numbers:

\[ \overline x = \frac{\sum_{i=1}^N x_i}{N} = \frac{x_1+x_2+\dots+x_N}{N} \]

Side note: the notation \(\sum_{i=a}^b x_i\) means that you let \(i\) take on every value from \(a\) to \(b\), and you add up the results: so \(\sum_{i=1}^3 i^2\) would come out as \(1+4+9\).

Exercise

What is the mean of

1 3 5 6 7 7 7 8 9 9

The sum is 62. There are 10 numbers in the list. This gives us a mean of 6.2.
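The same calculation can be checked in R. This is a minimal sketch; the variable names `x`, `total` and `n` are chosen here for illustration.

```r
# The list from the exercise.
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

total <- sum(x)      # add up all the numbers
n <- length(x)       # count how many numbers there are
total / n            # the mean, computed from the definition

mean(x)              # the built-in function gives the same value
```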

Measure of Center: Median

The median is the middlemost observation:

  1. Sort the list of numbers.
  2. Take the middle-most number – if the list has \(N\) numbers, the middle is in row number \((N+1)/2\).

If the number of elements in the list is even, the middle falls between two numbers. In this case the convention is to take the mean of those two numbers.

Exercise

Find the median of the list

1 3 5 6 7 7 7 8 9

Exercise

Find the median of the list

1 3 5 6 7 7 7 8 9 9
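The two-step recipe for the median can be sketched as a small R function. The name `median_by_hand` is made up for this example; in practice the built-in `median()` does the same job.

```r
# Median following the recipe: sort, then take the middle position (N + 1) / 2.
median_by_hand <- function(x) {
  s <- sort(x)
  n <- length(s)
  mid <- (n + 1) / 2
  # For odd n, mid is a whole number and floor/ceiling pick the same entry.
  # For even n, mid falls between two positions and we average the neighbors.
  mean(s[c(floor(mid), ceiling(mid))])
}

odd_list  <- c(1, 3, 5, 6, 7, 7, 7, 8, 9)      # 9 numbers
even_list <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)   # 10 numbers

median_by_hand(odd_list)
median_by_hand(even_list)

# Compare with the built-in function.
median(odd_list)
median(even_list)
```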

Measure of Center: Mid-point

The mid-point is the mean of the maximum and the minimum: the value halfway through the full range of the numbers.

Exercise

Find the mid-point of

1 3 5 6 7 7 7 8 9 9
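Base R has no dedicated mid-point function as far as this sketch assumes, but the definition translates directly:

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

# Mean of the minimum and the maximum.
midpoint <- (min(x) + max(x)) / 2
midpoint

# range(x) returns c(min, max), so this is an equivalent one-liner.
mean(range(x))
```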

Measure of Center: Mode

The mode is the most common value. The mode may well not be unique: several numbers could occur equally often.

Exercise

Find the mode of

1 3 5 6 7 7 7 8 9 9

One of the candidates for a mode is 7. Can you see any other?
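R's built-in `mode()` reports how an object is stored, not the statistical mode, so the most common value has to be found by counting. The helper `modes` below is a hypothetical name for this sketch; it returns every value that ties for most common.

```r
# Return all values that occur most often (there may be more than one).
modes <- function(x) {
  counts <- table(x)                       # how often each value occurs
  as.numeric(names(counts)[counts == max(counts)])
}

modes(c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9))     # the exercise list
modes(c(1, 1, 2, 2, 3))                    # a made-up list with a tie
```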

Properties of measures of center

A measure is robust if it is resistant to outliers: does it change much if the largest value is made much, much larger?

1 3 5 6 7 7 7 8 9 9

| Measure | Value |
|---|---|
| Mean | 6.2 |
| Median | 7 |
| Mid-point | 5 |
| Mode | 7 |

1 3 5 6 7 7 7 8 9 900

Which of the measures do you think will change when the largest value changes from 9 to 900?

1 3 5 6 7 7 7 8 9 900

| Measure | Value |
|---|---|
| Mean | 95.3 |
| Median | 7 |
| Mid-point | 450.5 |
| Mode | 7 |
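The comparison can be reproduced with a few lines of R. The helper `center_measures` is a name invented for this sketch, and for simplicity it reports only the first mode if several values tie.

```r
# Compute the four measures of center for one list of numbers.
center_measures <- function(x) {
  counts <- table(x)
  c(mean     = mean(x),
    median   = median(x),
    midpoint = mean(range(x)),
    # which.max picks the first most common value (ties are not reported).
    mode     = as.numeric(names(counts)[which.max(counts)]))
}

original <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)
changed  <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 900)

# One row per data set: the median and mode stay put,
# the mean and mid-point follow the outlier.
comparison <- rbind(original = center_measures(original),
                    changed  = center_measures(changed))
comparison
```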

Measures of spread: Variance and Standard Deviation

The variance of a variable is, up to the choice of divisor, the mean squared deviation from the mean. For a sample we divide by \(N-1\) rather than \(N\):

\[ s^2 = \frac{(x_1-\overline x)^2 + \dots + (x_N-\overline x)^2}{N-1} \]

The standard deviation of a variable is the square root of the variance. Taking the square root puts the standard deviation back in the same units as the data, so it can be read as a typical distance from an observation to the mean.

Both measure spread about the mean, so they should be paired with the mean and not with any other measure of center.

These are not robust.
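To see the formula at work, the sketch below computes the variance and standard deviation step by step for the list used in the earlier exercises and compares them with R's `var()` and `sd()`, which also divide by \(N-1\).

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

deviations <- x - mean(x)                      # distance of each value from the mean
s2 <- sum(deviations^2) / (length(x) - 1)      # variance: divide by N - 1
s  <- sqrt(s2)                                 # standard deviation

c(variance = s2, sd = s)

# The built-in functions use the same N - 1 convention.
var(x)
sd(x)
```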

Measures of spread: Interquartile range

The quartiles of a variable are the values such that 25%, 50%, 75% and 100% of the data, respectively, lies at or below them. The median is the second quartile. The maximum is the fourth.

The interquartile range (IQR) is the difference between the third and the first quartile. Because of how quartiles work, we know that 50% of the data falls between the first and the third quartile.

The IQR should be paired with the median.

The IQR is robust.
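In R, quartiles come from `quantile()` and the interquartile range from `IQR()`. Note that software packages and textbooks use slightly different interpolation conventions for quartiles, so hand calculations may differ a little from R's default; the sketch below only shows the default.

```r
x <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 9)

# First quartile, median and third quartile (R's default interpolation rule).
q <- quantile(x, c(0.25, 0.5, 0.75))
q

# The IQR is the third quartile minus the first.
iqr_by_hand <- unname(q[3] - q[1])
iqr_by_hand
IQR(x)
```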

Measures of spread: Range

The range of a variable is the difference between the minimum and the maximum.

The range is not robust.

Quartiles, quantiles, outliers

Quartiles are a special case of quantiles. The \(p\)th quantile is a value such that a fraction \(p\) of the data is smaller than it; the first quartile, for example, is the 0.25 quantile.

Outliers are points that are unusually large or small, and likely to influence mean/sd too much. One common definition is Tukey’s rule: \(x_k\) is an outlier if it falls more than \(1.5\times\)IQR above the third or below the first quartile.
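Tukey's rule is short enough to write out directly. The function name `tukey_outliers` is hypothetical, and the quartiles use R's default convention, so borderline points could be classified differently under another quartile rule.

```r
# Return the observations that Tukey's rule flags as outliers.
tukey_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)   # first and third quartile
  fence <- 1.5 * (q[2] - q[1])                     # 1.5 times the IQR
  x[x < q[1] - fence | x > q[2] + fence]
}

changed <- c(1, 3, 5, 6, 7, 7, 7, 8, 9, 900)
flagged <- tukey_outliers(changed)
flagged
```

With this list the rule flags 900, and also 1, because the middle half of the data is tightly packed. Flagged points are candidates for inspection, not automatic deletions.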

Boxplots

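Median, IQR and outliers are often described with boxplots.

The boxplot has components: a box from the first to the third quartile, a line inside the box at the median, whiskers reaching out to the most extreme observations that are not outliers, and individual points for the outliers.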

Mean/sd or median/IQR?

| Mean/sd | Median/IQR |
|---|---|
| Closely related to the normal distribution | Robust |
| Easy to use in theoretical statistics | Difficult in theory |
| Basis for many easy tests | |

Transformed data

Consider temperatures:

\[ {}^\circ C = \frac{5}{9}({}^\circ F-32) \]

How much can we say about the distribution of temperatures in Celsius if we know the distribution in Fahrenheit?

Overall shape stays the same: since \(5/9 > 0\), right-skewed remains right-skewed, etc.

Actual values for measures of center and spread change.

Transformed data

| Measure | Original | \({}^\circ F\to{}^\circ C\) | \(x\to a+bx\) |
|---|---|---|---|
| Mean | \(\overline x\) | \(\frac{5}{9}(\overline x-32)\) | \(a+b\overline x\) |
| Median | \(m\) | \(\frac{5}{9}(m-32)\) | \(a+bm\) |
| Mode | \(m\) | \(\frac{5}{9}(m-32)\) | \(a+bm\) |
| Variance | \(s^2\) | \(\frac{25}{81}s^2\) | \(b^2s^2\) |
| Standard Deviation | \(s\) | \(\frac{5}{9}s\) | \(|b|s\) |
| IQR | \(iqr\) | \(\frac{5}{9}iqr\) | \(|b|\cdot iqr\) |
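The rules in the table can be checked numerically. The temperatures below are made up for the illustration; any numeric vector should behave the same way.

```r
# Hypothetical temperatures in Fahrenheit.
fahrenheit <- c(50, 59, 68, 68, 77, 95)
celsius <- 5 / 9 * (fahrenheit - 32)

# Measures of center: apply the same conversion to the summary.
center_ok <- c(
  mean   = isTRUE(all.equal(mean(celsius),   5 / 9 * (mean(fahrenheit) - 32))),
  median = isTRUE(all.equal(median(celsius), 5 / 9 * (median(fahrenheit) - 32)))
)

# Measures of spread: the shift by 32 drops out, only the factor 5/9 matters.
spread_ok <- c(
  variance = isTRUE(all.equal(var(celsius), (5 / 9)^2 * var(fahrenheit))),
  sd       = isTRUE(all.equal(sd(celsius),  5 / 9 * sd(fahrenheit))),
  iqr      = isTRUE(all.equal(IQR(celsius), 5 / 9 * IQR(fahrenheit)))
)

center_ok
spread_ok
```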

Example of outlier: Business starting times

favstats(~Time, data=time)

min Q1 median Q3 max mean sd n missing
2 6 12 24 208 23.96 40.77221 25 0

Example of outlier: Business starting times

time %>% filter(Time > 150) %>% kable

CountryName CountryCode Time
Suriname SUR 208

favstats(~Time, data=time %>% filter(Time <= 150))

min Q1 median Q3 max mean sd n missing
2 5.75 11.5 20.25 53 16.29167 14.16511 24 0

Why use \(\overline x \pm 2s\)?

68 - 95 - 99.7 rule

With the normal distribution (details in two weeks):

| % of the data | in the range |
|---|---|
| 68% | \(\overline x \pm s\) |
| 95% | \(\overline x \pm 2s\) |
| 99.7% | \(\overline x \pm 3s\) |
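These percentages are rounded values of normal-distribution probabilities. As a quick check, assuming an exactly normal distribution, `pnorm()` gives the fraction within \(k\) standard deviations of the mean:

```r
k <- 1:3

# Probability of falling within k standard deviations of the mean
# for a normal distribution.
coverage <- pnorm(k) - pnorm(-k)

round(100 * coverage, 1)   # percentages for k = 1, 2, 3
```

Real data is never exactly normal, so the rule is a guide for roughly bell-shaped data and not a guarantee.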