Big Data Analytics
Lecture 1

Mikael Vejdemo-Johansson
Ping Ji

2019-01-31

Welcome to Big Data Analytics

Semester Overview

  • Foundations of Big Data
  • Statistical presentation, communication, inference, learning
  • Different large scale data types, and how to use them

Structure of the course

  • Weekly lectures: Mikael Vejdemo-Johansson, Ping Ji
  • Guest lecturers: Denis Khryashchev, Sara-Jayne Terp, Joshua Brown, Mario Gonzalez
  • Lab / homework tasks: 2 programming tasks, using Kaggle
  • Term report:
    Either a deep analysis of a dataset, or an in-depth explanation of additional techniques. End of semester: in-class presentation.

Contact Information

Course information will come through Blackboard. A detailed syllabus is available now.

Course Literature

  • Efron & Hastie: Computer age statistical inference
  • James, Witten, Hastie, Tibshirani: An introduction to statistical learning
    Free as ebook
  • Stephens-Davidowitz: Everybody Lies

Read now: Efron & Hastie, Part I.

Big Data

Big Data is when you have to think about handling your data:

  • …how will you fit this in memory?
  • …how will you fit this on disk(s)?
  • …how will you compute summaries quickly enough?
  • …how will you find what features to focus on?

Five Vs of big data

Three Vs:

  • Volume - large data sets (Google; Wikipedia; CERN; High-throughput screening, …)
  • Velocity - fast data, real-time processing (high speed trading; Twitter firehose; networks)
  • Variety - complex data (image data; graph data; video data; data fusion; shape of data)

Fourth V:

  • Veracity - reliable data (biases; ethical data analysis; data cleaning)

Fifth V:

  • Value - valid and relevant data (applicability; relevance; actionable; impactful)

Data Analytics

  • Draw from past data to predict future behavior
  • Multidisciplinary - extensive use of computation, mathematics, statistics
  • Connected to business / consumer needs

Classical Statistics

Based on observations \(x_i\) from a random variable \(X\), describe \(X\) sufficiently well to enable inferences and predictions.

Mean of a variable

Estimate the mean \(\mu_X\) based on repeated observations \(x_1,\dots,x_n\):

  1. Calculate \(\overline{x}\) and \(s\) from the data
  2. Calculate the confidence interval \[ \overline{x} - t_{1-\alpha/2}\cdot s/\sqrt{n} \leq \mu_X \leq\overline{x} + t_{1-\alpha/2}\cdot s/\sqrt{n} \]
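The two steps above can be sketched in R. This is a minimal illustration, not part of the original slides: the helper name `t_confidence_interval` is hypothetical, and the sample data are the city mileage (cty) values of the 18 Audi rows from the mpg table that appears later in this lecture, used purely as example input.

```r
# Hypothetical helper: t-based confidence interval for a mean.
t_confidence_interval <- function(x, alpha = 0.05) {
  n <- length(x)
  xbar <- mean(x)
  s <- sd(x)                               # sample standard deviation (n - 1 denominator)
  t_crit <- qt(1 - alpha / 2, df = n - 1)  # t quantile with n - 1 degrees of freedom
  half_width <- t_crit * s / sqrt(n)
  c(lower = xbar - half_width, upper = xbar + half_width)
}

# cty values of the Audi rows in the mpg table (example data only)
cty_audi <- c(18, 21, 20, 21, 16, 18, 18, 18, 16,
              20, 19, 15, 17, 17, 15, 15, 17, 16)

ci <- t_confidence_interval(cty_audi)
print(ci)

# Cross-check against the interval reported by R's built-in t.test
print(t.test(cty_audi)$conf.int)
```

The hand-computed interval and the one from `t.test` should agree, since both use the same formula.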

Check whether a mean \(\mu_X\) is significantly different from a hypothesized mean \(\mu_0\), based on repeated observations \(x_1,\dots,x_n\):

  1. Calculate \(\overline{x}\) and \(s\) from the data
  2. Calculate the test statistic \[ T = \frac{\overline{x}-\mu_0}{s/\sqrt{n}} \sim T(n-1) \]
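A minimal R sketch of this one-sample test, under stated assumptions: the helper name `one_sample_t` and the hypothesized mean `mu0 = 92` are made up for illustration, and the data are the twelve voltmeter readings quoted later in this lecture.

```r
# Hypothetical helper: one-sample t statistic and two-sided p-value.
one_sample_t <- function(x, mu0) {
  n <- length(x)
  t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))
  p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided, T(n - 1) reference
  list(statistic = t_stat, df = n - 1, p.value = p_value)
}

# Voltmeter readings from the meter reader example (example data only)
volts <- c(92.50, 91.12, 92.37, 92.17, 93.95, 92.16,
           90.43, 92.33, 92.45, 91.99, 92.05, 92.68)

res_one <- one_sample_t(volts, mu0 = 92)   # mu0 = 92 is an illustrative choice
print(res_one)

# Cross-check against R's built-in implementation
builtin_one <- t.test(volts, mu = 92)
print(builtin_one$statistic)
```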

Check whether two means \(\mu_X\) and \(\mu_Y\) are significantly different from each other, based on repeated observations \(x_1,\dots,x_n\) and \(y_1,\dots,y_m\):

  1. Calculate \(\overline{x}, \overline{y}, s_x, s_y\) from the data
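  2. Calculate the test statistic \[ T = \frac{\overline{x}-\overline{y}}{\sqrt{s_x^2/n+s_y^2/m}} \sim T(\nu) \] approximately, where the degrees of freedom \(\nu\) are given by the Welch-Satterthwaite approximation. (The pooled variant instead uses \(s_p\sqrt{1/n+1/m}\) in the denominator, with \(n+m-2\) degrees of freedom.)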
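A sketch of the unpooled (Welch) form of the two-sample statistic, whose denominator is \(\sqrt{s_x^2/n+s_y^2/m}\) and whose degrees of freedom come from the Welch-Satterthwaite approximation. The helper name `welch_t` is hypothetical, and splitting the voltmeter readings from later in this lecture into two groups is done only to have some numbers to feed in.

```r
# Hypothetical helper: Welch two-sample t statistic with approximate df.
welch_t <- function(x, y) {
  n <- length(x)
  m <- length(y)
  vx <- var(x) / n                      # s_x^2 / n
  vy <- var(y) / m                      # s_y^2 / m
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (n - 1) + vy^2 / (m - 1))
  p_value <- 2 * pt(-abs(t_stat), df = df)
  list(statistic = t_stat, df = df, p.value = p_value)
}

# Two illustrative groups (the two rows of the voltmeter readings)
group_x <- c(92.50, 91.12, 92.37, 92.17, 93.95, 92.16)
group_y <- c(90.43, 92.33, 92.45, 91.99, 92.05, 92.68)

res_two <- welch_t(group_x, group_y)
print(res_two)

# t.test uses the Welch form by default (var.equal = FALSE)
builtin_two <- t.test(group_x, group_y)
print(builtin_two$parameter)
```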

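Fundamental issue: as data sizes grow, the factor \(1/\sqrt{n}\) in the standard error shrinks toward zero and dominates everything else, so even a negligible difference between means produces a huge test statistic.

Everything is statistically significant.

Standard suggestion from Stats 101: look at effect sizes, and look at domain-specific concepts of significance!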
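To make the \(1/\sqrt{n}\) effect concrete, here is a small R sketch. It assumes a large-sample normal approximation for a two-sample comparison with \(n\) observations per group and unit standard deviation; the function name `p_for_difference` and the effect size `d = 0.01` are illustrative choices, not values from the lecture.

```r
# Two-sided p-value for a fixed observed standardized difference d between two
# group means, using the normal approximation with n observations per group.
p_for_difference <- function(d, n) 2 * pnorm(-abs(d) * sqrt(n / 2))

d <- 0.01            # hypothetical, practically negligible effect size
n <- 10^(2:7)        # sample sizes per group, from 100 to 10 million

p_table <- data.frame(n = n, p = p_for_difference(d, n))
print(p_table)
```

The same tiny difference goes from clearly non-significant at small \(n\) to overwhelmingly significant at large \(n\), which is why effect sizes matter more than p-values at scale.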

Power analysis, large sample sizes

Denote:

  • \(F_{\mathcal{N}}(z) = \mathbb{P}(Z \leq z)\) for \(Z\sim\mathcal{N}(0,1)\)
  • \(z_{1-\alpha/2} = F_{\mathcal{N}}^{-1}(1-\alpha/2)\)
  • \(\Delta\) the standardized smallest difference between means to be detected

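Then the power of the t-test is the probability of detecting an effect of size \(\Delta\). For a one-sample test with \(n\) observations:

\[ 1-\beta \approx F_\mathcal{N}(\Delta\sqrt{n}-z_{1-\alpha/2}) \]

For a two-sample test with \(n\) observations in each group, \(\Delta\sqrt{n}\) is replaced by \(\Delta\sqrt{n/2}\).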

With a large sample size, we can solve for \(\alpha\) to find a significance cutoff value.
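Solving the power approximation for \(\alpha\) can be sketched in R as follows. The function names are hypothetical. The `two_sample` switch replaces \(\sqrt{n}\) by \(\sqrt{n/2}\), which is the standard adjustment when \(n\) is the size of each of two groups. The inputs (\(\Delta = 0.1\), \(n = 10^5\) per group, power \(0.8\)) are the ones used in the `power.t.test` call later in this lecture, so the result should land close to the cutoff reported there, up to the normal approximation.

```r
# Approximate power of the test (normal approximation to the t-test).
approx_power <- function(delta, n, alpha, two_sample = FALSE) {
  n_eff <- if (two_sample) n / 2 else n
  # qnorm(alpha / 2, lower.tail = FALSE) equals z_{1 - alpha/2} but stays
  # accurate for extremely small alpha, where 1 - alpha/2 would round to 1.
  pnorm(delta * sqrt(n_eff) - qnorm(alpha / 2, lower.tail = FALSE))
}

# Invert the approximation: which alpha gives the requested power?
alpha_for_power <- function(delta, n, power, two_sample = FALSE) {
  n_eff <- if (two_sample) n / 2 else n
  z <- delta * sqrt(n_eff) - qnorm(power)  # required value of z_{1 - alpha/2}
  2 * pnorm(-z)                            # alpha such that z = z_{1 - alpha/2}
}

alpha_cut <- alpha_for_power(delta = 0.1, n = 1e5, power = 0.8, two_sample = TRUE)
print(alpha_cut)

# Round trip: plugging alpha_cut back in should recover the requested power
print(approx_power(delta = 0.1, n = 1e5, alpha = alpha_cut, two_sample = TRUE))
```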

100k samples each from two normal distributions:

Quantity          Estimate
\(\overline x\)   \(-0.0141262\)
\(\overline y\)   \(0.0085098\)
\(s_X\)           \(1.0032245\)
\(s_Y\)           \(1.0022946\)
t-test \(p\)      \(4.4774123\times 10^{-7}\)

We consider a difference to be practically significant only if it is greater than \(0.1\).

# Fix n, the effect size and the power; sig.level=NULL asks R to solve for the significance level
power.t.test(n=100000, delta=0.1, sd=1, power=0.8, sig.level=NULL)
## 
##      Two-sample t test power calculation 
## 
##               n = 1e+05
##           delta = 0.1
##              sd = 1
##       sig.level = 1.030365e-102
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

At \(p = 4.4774123\times 10^{-7}\), far above the cutoff \(1.030365\times 10^{-102}\), we are not able to reject the null.

Drill Down

With large scale data, even VERY small subpopulations can be studied.

Example (ongoing research): study only taxi rides in NYC that start and end at the same position.

An Overview of Statistical Schools of Thought

Basic division: Frequentism vs. Bayesianism

Probability is…
  • Frequentist: asymptotic proportion of successes in repeated trials
  • Bayesian: measure of synthesized belief

Parametric inference is…
  • Frequentist: estimating value of parameters from data
  • Bayesian: updating probability distributions on parameters from data

Frequentist Concepts

Core interest: estimating some value \(\theta\) related to some real probability distribution on \(X\) from some estimator \(\hat\theta = t(X)\).

  • Bias: \(\mathbb{E}[\hat\theta]-\theta\)
  • Variance: \(\mathbb{E}[(\hat\theta - \mathbb{E}\hat\theta)^2]\)

Bias - Variance trade-off: \[ \text{MSE} = \text{Bias}^2 + \text{Variance} \]

We use standard error to refer to \(\sqrt{\mathbb{V}(\hat\theta)}\) - the standard deviation of an estimator.
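The decomposition \(\text{MSE} = \text{Bias}^2 + \text{Variance}\) can be checked by simulation. The sketch below is an illustration under assumed settings (normal data with variance 4, samples of size 10, 20000 replications); it compares two estimators of the variance, dividing by \(n-1\) and by \(n\). All names and numbers are choices made for this example.

```r
set.seed(1)
sigma2 <- 4       # true variance (assumed for the simulation)
n <- 10           # sample size per replication
reps <- 20000     # number of simulated datasets

# Each row of the matrix is one simulated dataset.
samples <- matrix(rnorm(n * reps, sd = sqrt(sigma2)), nrow = reps)
est_unbiased <- apply(samples, 1, var)        # divides by n - 1
est_biased <- est_unbiased * (n - 1) / n      # divides by n

# Empirical bias, variance and MSE of an estimator of theta.
summarise_estimator <- function(est, theta) {
  bias <- mean(est) - theta
  variance <- mean((est - mean(est))^2)
  mse <- mean((est - theta)^2)
  c(bias = bias, variance = variance, mse = mse,
    bias2_plus_var = bias^2 + variance)
}

res_unbiased <- summarise_estimator(est_unbiased, sigma2)
res_biased <- summarise_estimator(est_biased, sigma2)
print(rbind(unbiased = res_unbiased, biased = res_biased))
```

In this setting the biased estimator trades a small bias for a larger reduction in variance, which is the trade-off in action.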

Frequentist Concepts

To derive quantities describing an estimator we can use:

Plug-in estimators

Given a formula relating a quantity to parameters, plug in an estimator directly.

Example: The sample mean \(\overline{X} = \sum X_i/n\) has standard error \[ \text{se}(\overline{X}) = \sqrt{\mathbb{V}(X)/n} \]

We can estimate \(\mathbb{V}(X)\) using the sample variance \(\hat{\mathbb{V}}(X) = \sum (x_i-\overline{x})^2/(n-1)\). This yields an estimated standard error \[ \widehat{\text{se}}(\overline{X}) = \sqrt{\sum(x_i-\overline x)^2/(n(n-1))} \]

Taylor expansions

More complicated statistics can be related back using linear approximations. For a function \(s(\hat\theta)\) we can Taylor expand around \(\theta=\mathbb{E}\hat\theta\): \[ s(\hat\theta) - s(\theta) \approx s'(\theta)(\hat\theta-\theta) \]

So \(\mathbb{V}[s(\hat\theta)] \approx \mathbb{E}[(s(\hat\theta)-s(\theta))^2] \approx |s'(\theta)|^2\,\mathbb{V}\hat\theta\).

Example: take \(\hat\theta = \overline{x}\) and \(s(\hat\theta) = \overline{x}^2\). Then \(s'(\overline{x}) = 2\overline{x}\). Plugging into the Taylor expansion we get

\[ \widehat{\text{se}}(\overline{x}^2) \approx 2|\overline{x}|\,\widehat{\text{se}}(\overline{x}) \]
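The Taylor approximation for the standard error of \(\overline{x}^2\) can be compared with a direct simulation. The settings below (normal data with mean 5, standard deviation 2, samples of size 50) are assumptions chosen for illustration only.

```r
set.seed(42)
mu <- 5; sigma <- 2   # assumed true mean and standard deviation
n <- 50               # sample size
reps <- 10000         # number of simulated samples

# Simulate the sampling distribution of the sample mean, then square it.
xbar <- replicate(reps, mean(rnorm(n, mean = mu, sd = sigma)))
se_simulated <- sd(xbar^2)

# Taylor (delta method) approximation: 2 * |mean| * se(mean)
se_delta <- 2 * abs(mu) * sigma / sqrt(n)

print(c(simulated = se_simulated, delta_method = se_delta))
```

The two numbers should be close; the approximation degrades when the mean is near zero relative to its standard error, because the linear term of the expansion then no longer dominates.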

Maximum Likelihood

We define the likelihood function as a function on parameter values:

\[ \mathcal{L}(\theta | x) = \mathbb{P}(x | \theta) \]

Neyman-Pearson Lemma

When constructing a statistical testing rule to pick between two distributions \(f_0\) and \(f_1\), the smallest errors are achieved by \[ t_c(x) = \begin{cases} 1 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) \geq c$} \\ 0 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) < c$} \end{cases} \] where \(\mathcal{L}_i\) is the likelihood of the data under \(f_i\), for \(c\) chosen to achieve the desired significance level.
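A minimal sketch of such a likelihood ratio rule, assuming the two candidate distributions are \(\mathcal{N}(0,1)\) and \(\mathcal{N}(1,1)\). The function names, the data and the cutoff are all hypothetical. For this pair of distributions the log likelihood ratio of a single observation \(x_i\) simplifies to \(x_i - 1/2\).

```r
# Log likelihood ratio log(L1 / L0) for two normal candidates (assumed setup).
log_lik_ratio <- function(x, mean0 = 0, mean1 = 1, sd = 1) {
  sum(dnorm(x, mean = mean1, sd = sd, log = TRUE) -
      dnorm(x, mean = mean0, sd = sd, log = TRUE))
}

# Decision rule t_c(x): 1 picks f1, 0 picks f0.
np_test <- function(x, cutoff, ...) {
  as.integer(log_lik_ratio(x, ...) >= cutoff)
}

x_obs <- c(0.9, 1.1, 1.3)            # hypothetical observations
llr <- log_lik_ratio(x_obs)          # equals sum(x_obs - 0.5) for this setup
decision <- np_test(x_obs, cutoff = 0)
print(c(log_lik_ratio = llr, decision = decision))
```

In practice the cutoff would be calibrated so that the probability of picking \(f_1\) when \(f_0\) is true matches the chosen level; the value 0 here is only a placeholder.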

The Maximum Likelihood Estimator tends to be nearly unbiased with close to the least possible variance, and even when it is not, it tends to work very well.

\[ \hat\theta_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta | x) \]
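As an illustration of maximizing a likelihood numerically, the sketch below fits the rate of an exponential distribution to a handful of made-up waiting times and compares the result with the known closed form \(1/\overline{x}\). The data and variable names are assumptions for this example.

```r
waits <- c(0.5, 1.2, 0.3, 2.1, 0.9)   # hypothetical waiting times

# Log-likelihood of an exponential model as a function of the rate parameter.
log_lik <- function(rate) sum(dexp(waits, rate = rate, log = TRUE))

# One-dimensional numerical maximization over a plausible interval.
fit <- optimize(log_lik, interval = c(0.01, 10), maximum = TRUE)
mle_numeric <- fit$maximum

# Closed-form MLE for the exponential rate: 1 / sample mean.
mle_closed <- 1 / mean(waits)

print(c(numeric = mle_numeric, closed_form = mle_closed))
```

Working with the log-likelihood rather than the likelihood itself avoids numerical underflow when many observations are multiplied together.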

Bootstrap and Simulations

Frequentism wants us to focus on repeated experiments. …so let’s repeat some experiments.

Use the dataset \(x\) as a probability distribution itself: draw \(B\) resamples \(x^{(1)}, \dots, x^{(B)}\) from \(x\), each of the same size as \(x\) and sampled with replacement. The empirical distribution approximates the true distribution, and we can use the observed distribution of \(t(x^{(k)})\) to study the statistical behavior of \(t(x)\).
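A minimal bootstrap sketch in R. The helper name `bootstrap_se`, the number of resamples and the seed are choices made for this example; the data are the cty values of the 18 Audi rows from the mpg table in this lecture, used only as a small sample to resample from.

```r
# Hypothetical helper: bootstrap standard error of an arbitrary statistic.
bootstrap_se <- function(x, statistic, B = 2000) {
  replicates <- replicate(
    B,
    statistic(sample(x, size = length(x), replace = TRUE))  # one resample
  )
  sd(replicates)   # spread of the statistic across resamples
}

set.seed(2019)
cty_sample <- c(18, 21, 20, 21, 16, 18, 18, 18, 16,
                20, 19, 15, 17, 17, 15, 15, 17, 16)

se_boot_mean <- bootstrap_se(cty_sample, mean)
se_plugin_mean <- sd(cty_sample) / sqrt(length(cty_sample))
se_boot_median <- bootstrap_se(cty_sample, median)

print(c(bootstrap_mean = se_boot_mean,
        plugin_mean = se_plugin_mean,
        bootstrap_median = se_boot_median))
```

For the mean, the bootstrap and plug-in standard errors should roughly agree. The advantage of the bootstrap shows for statistics such as the median, where no simple formula is available.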

(Small data) example: mpg dataset

manufacturer model displ year cyl trans drv cty hwy fl class
audi a4 1.8 1999 4 auto(l5) f 18 29 p compact
audi a4 1.8 1999 4 manual(m5) f 21 29 p compact
audi a4 2.0 2008 4 manual(m6) f 20 31 p compact
audi a4 2.0 2008 4 auto(av) f 21 30 p compact
audi a4 2.8 1999 6 auto(l5) f 16 26 p compact
audi a4 2.8 1999 6 manual(m5) f 18 26 p compact
audi a4 3.1 2008 6 auto(av) f 18 27 p compact
audi a4 quattro 1.8 1999 4 manual(m5) 4 18 26 p compact
audi a4 quattro 1.8 1999 4 auto(l5) 4 16 25 p compact
audi a4 quattro 2.0 2008 4 manual(m6) 4 20 28 p compact
audi a4 quattro 2.0 2008 4 auto(s6) 4 19 27 p compact
audi a4 quattro 2.8 1999 6 auto(l5) 4 15 25 p compact
audi a4 quattro 2.8 1999 6 manual(m5) 4 17 25 p compact
audi a4 quattro 3.1 2008 6 auto(s6) 4 17 25 p compact
audi a4 quattro 3.1 2008 6 manual(m6) 4 15 25 p compact
audi a6 quattro 2.8 1999 6 auto(l5) 4 15 24 p midsize
audi a6 quattro 3.1 2008 6 auto(s6) 4 17 25 p midsize
audi a6 quattro 4.2 2008 8 auto(s6) 4 16 23 p midsize
chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 11 15 e suv
chevrolet c1500 suburban 2wd 5.3 2008 8 auto(l4) r 14 20 r suv
chevrolet c1500 suburban 2wd 5.7 1999 8 auto(l4) r 13 17 r suv
chevrolet c1500 suburban 2wd 6.0 2008 8 auto(l4) r 12 17 r suv
chevrolet corvette 5.7 1999 8 manual(m6) r 16 26 p 2seater
chevrolet corvette 5.7 1999 8 auto(l4) r 15 23 p 2seater
chevrolet corvette 6.2 2008 8 manual(m6) r 16 26 p 2seater
chevrolet corvette 6.2 2008 8 auto(s6) r 15 25 p 2seater
chevrolet corvette 7.0 2008 8 manual(m6) r 15 24 p 2seater
chevrolet k1500 tahoe 4wd 5.3 2008 8 auto(l4) 4 14 19 r suv
chevrolet k1500 tahoe 4wd 5.3 2008 8 auto(l4) 4 11 14 e suv
chevrolet k1500 tahoe 4wd 5.7 1999 8 auto(l4) 4 11 15 r suv
chevrolet k1500 tahoe 4wd 6.5 1999 8 auto(l4) 4 14 17 d suv
chevrolet malibu 2.4 1999 4 auto(l4) f 19 27 r midsize
chevrolet malibu 2.4 2008 4 auto(l4) f 22 30 r midsize
chevrolet malibu 3.1 1999 6 auto(l4) f 18 26 r midsize
chevrolet malibu 3.5 2008 6 auto(l4) f 18 29 r midsize
chevrolet malibu 3.6 2008 6 auto(s6) f 17 26 r midsize
dodge caravan 2wd 2.4 1999 4 auto(l3) f 18 24 r minivan
dodge caravan 2wd 3.0 1999 6 auto(l4) f 17 24 r minivan
dodge caravan 2wd 3.3 1999 6 auto(l4) f 16 22 r minivan
dodge caravan 2wd 3.3 1999 6 auto(l4) f 16 22 r minivan
dodge caravan 2wd 3.3 2008 6 auto(l4) f 17 24 r minivan
dodge caravan 2wd 3.3 2008 6 auto(l4) f 17 24 r minivan
dodge caravan 2wd 3.3 2008 6 auto(l4) f 11 17 e minivan
dodge caravan 2wd 3.8 1999 6 auto(l4) f 15 22 r minivan
dodge caravan 2wd 3.8 1999 6 auto(l4) f 15 21 r minivan
dodge caravan 2wd 3.8 2008 6 auto(l6) f 16 23 r minivan
dodge caravan 2wd 4.0 2008 6 auto(l6) f 16 23 r minivan
dodge dakota pickup 4wd 3.7 2008 6 manual(m6) 4 15 19 r pickup
dodge dakota pickup 4wd 3.7 2008 6 auto(l4) 4 14 18 r pickup
dodge dakota pickup 4wd 3.9 1999 6 auto(l4) 4 13 17 r pickup
dodge dakota pickup 4wd 3.9 1999 6 manual(m5) 4 14 17 r pickup
dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 14 19 r pickup
dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 14 19 r pickup
dodge dakota pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup
dodge dakota pickup 4wd 5.2 1999 8 manual(m5) 4 11 17 r pickup
dodge dakota pickup 4wd 5.2 1999 8 auto(l4) 4 11 15 r pickup
dodge durango 4wd 3.9 1999 6 auto(l4) 4 13 17 r suv
dodge durango 4wd 4.7 2008 8 auto(l5) 4 13 17 r suv
dodge durango 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
dodge durango 4wd 4.7 2008 8 auto(l5) 4 13 17 r suv
dodge durango 4wd 5.2 1999 8 auto(l4) 4 11 16 r suv
dodge durango 4wd 5.7 2008 8 auto(l5) 4 13 18 r suv
dodge durango 4wd 5.9 1999 8 auto(l4) 4 11 15 r suv
dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 12 16 r pickup
dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 9 12 e pickup
dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 13 17 r pickup
dodge ram 1500 pickup 4wd 4.7 2008 8 auto(l5) 4 13 17 r pickup
dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 12 16 r pickup
dodge ram 1500 pickup 4wd 4.7 2008 8 manual(m6) 4 9 12 e pickup
dodge ram 1500 pickup 4wd 5.2 1999 8 auto(l4) 4 11 15 r pickup
dodge ram 1500 pickup 4wd 5.2 1999 8 manual(m5) 4 11 16 r pickup
dodge ram 1500 pickup 4wd 5.7 2008 8 auto(l5) 4 13 17 r pickup
dodge ram 1500 pickup 4wd 5.9 1999 8 auto(l4) 4 11 15 r pickup
ford expedition 2wd 4.6 1999 8 auto(l4) r 11 17 r suv
ford expedition 2wd 5.4 1999 8 auto(l4) r 11 17 r suv
ford expedition 2wd 5.4 2008 8 auto(l6) r 12 18 r suv
ford explorer 4wd 4.0 1999 6 auto(l5) 4 14 17 r suv
ford explorer 4wd 4.0 1999 6 manual(m5) 4 15 19 r suv
ford explorer 4wd 4.0 1999 6 auto(l5) 4 14 17 r suv
ford explorer 4wd 4.0 2008 6 auto(l5) 4 13 19 r suv
ford explorer 4wd 4.6 2008 8 auto(l6) 4 13 19 r suv
ford explorer 4wd 5.0 1999 8 auto(l4) 4 13 17 r suv
ford f150 pickup 4wd 4.2 1999 6 auto(l4) 4 14 17 r pickup
ford f150 pickup 4wd 4.2 1999 6 manual(m5) 4 14 17 r pickup
ford f150 pickup 4wd 4.6 1999 8 manual(m5) 4 13 16 r pickup
ford f150 pickup 4wd 4.6 1999 8 auto(l4) 4 13 16 r pickup
ford f150 pickup 4wd 4.6 2008 8 auto(l4) 4 13 17 r pickup
ford f150 pickup 4wd 5.4 1999 8 auto(l4) 4 11 15 r pickup
ford f150 pickup 4wd 5.4 2008 8 auto(l4) 4 13 17 r pickup
ford mustang 3.8 1999 6 manual(m5) r 18 26 r subcompact
ford mustang 3.8 1999 6 auto(l4) r 18 25 r subcompact
ford mustang 4.0 2008 6 manual(m5) r 17 26 r subcompact
ford mustang 4.0 2008 6 auto(l5) r 16 24 r subcompact
ford mustang 4.6 1999 8 auto(l4) r 15 21 r subcompact
ford mustang 4.6 1999 8 manual(m5) r 15 22 r subcompact
ford mustang 4.6 2008 8 manual(m5) r 15 23 r subcompact
ford mustang 4.6 2008 8 auto(l5) r 15 22 r subcompact
ford mustang 5.4 2008 8 manual(m6) r 14 20 p subcompact
honda civic 1.6 1999 4 manual(m5) f 28 33 r subcompact
honda civic 1.6 1999 4 auto(l4) f 24 32 r subcompact
honda civic 1.6 1999 4 manual(m5) f 25 32 r subcompact
honda civic 1.6 1999 4 manual(m5) f 23 29 p subcompact
honda civic 1.6 1999 4 auto(l4) f 24 32 r subcompact
honda civic 1.8 2008 4 manual(m5) f 26 34 r subcompact
honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact
honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact
honda civic 2.0 2008 4 manual(m6) f 21 29 p subcompact
hyundai sonata 2.4 1999 4 auto(l4) f 18 26 r midsize
hyundai sonata 2.4 1999 4 manual(m5) f 18 27 r midsize
hyundai sonata 2.4 2008 4 auto(l4) f 21 30 r midsize
hyundai sonata 2.4 2008 4 manual(m5) f 21 31 r midsize
hyundai sonata 2.5 1999 6 auto(l4) f 18 26 r midsize
hyundai sonata 2.5 1999 6 manual(m5) f 18 26 r midsize
hyundai sonata 3.3 2008 6 auto(l5) f 19 28 r midsize
hyundai tiburon 2.0 1999 4 auto(l4) f 19 26 r subcompact
hyundai tiburon 2.0 1999 4 manual(m5) f 19 29 r subcompact
hyundai tiburon 2.0 2008 4 manual(m5) f 20 28 r subcompact
hyundai tiburon 2.0 2008 4 auto(l4) f 20 27 r subcompact
hyundai tiburon 2.7 2008 6 auto(l4) f 17 24 r subcompact
hyundai tiburon 2.7 2008 6 manual(m6) f 16 24 r subcompact
hyundai tiburon 2.7 2008 6 manual(m5) f 17 24 r subcompact
jeep grand cherokee 4wd 3.0 2008 6 auto(l5) 4 17 22 d suv
jeep grand cherokee 4wd 3.7 2008 6 auto(l5) 4 15 19 r suv
jeep grand cherokee 4wd 4.0 1999 6 auto(l4) 4 15 20 r suv
jeep grand cherokee 4wd 4.7 1999 8 auto(l4) 4 14 17 r suv
jeep grand cherokee 4wd 4.7 2008 8 auto(l5) 4 9 12 e suv
jeep grand cherokee 4wd 4.7 2008 8 auto(l5) 4 14 19 r suv
jeep grand cherokee 4wd 5.7 2008 8 auto(l5) 4 13 18 r suv
jeep grand cherokee 4wd 6.1 2008 8 auto(l5) 4 11 14 p suv
land rover range rover 4.0 1999 8 auto(l4) 4 11 15 p suv
land rover range rover 4.2 2008 8 auto(s6) 4 12 18 r suv
land rover range rover 4.4 2008 8 auto(s6) 4 12 18 r suv
land rover range rover 4.6 1999 8 auto(l4) 4 11 15 p suv
lincoln navigator 2wd 5.4 1999 8 auto(l4) r 11 17 r suv
lincoln navigator 2wd 5.4 1999 8 auto(l4) r 11 16 p suv
lincoln navigator 2wd 5.4 2008 8 auto(l6) r 12 18 r suv
mercury mountaineer 4wd 4.0 1999 6 auto(l5) 4 14 17 r suv
mercury mountaineer 4wd 4.0 2008 6 auto(l5) 4 13 19 r suv
mercury mountaineer 4wd 4.6 2008 8 auto(l6) 4 13 19 r suv
mercury mountaineer 4wd 5.0 1999 8 auto(l4) 4 13 17 r suv
nissan altima 2.4 1999 4 manual(m5) f 21 29 r compact
nissan altima 2.4 1999 4 auto(l4) f 19 27 r compact
nissan altima 2.5 2008 4 auto(av) f 23 31 r midsize
nissan altima 2.5 2008 4 manual(m6) f 23 32 r midsize
nissan altima 3.5 2008 6 manual(m6) f 19 27 p midsize
nissan altima 3.5 2008 6 auto(av) f 19 26 p midsize
nissan maxima 3.0 1999 6 auto(l4) f 18 26 r midsize
nissan maxima 3.0 1999 6 manual(m5) f 19 25 r midsize
nissan maxima 3.5 2008 6 auto(av) f 19 25 p midsize
nissan pathfinder 4wd 3.3 1999 6 auto(l4) 4 14 17 r suv
nissan pathfinder 4wd 3.3 1999 6 manual(m5) 4 15 17 r suv
nissan pathfinder 4wd 4.0 2008 6 auto(l5) 4 14 20 p suv
nissan pathfinder 4wd 5.6 2008 8 auto(s5) 4 12 18 p suv
pontiac grand prix 3.1 1999 6 auto(l4) f 18 26 r midsize
pontiac grand prix 3.8 1999 6 auto(l4) f 16 26 p midsize
pontiac grand prix 3.8 1999 6 auto(l4) f 17 27 r midsize
pontiac grand prix 3.8 2008 6 auto(l4) f 18 28 r midsize
pontiac grand prix 5.3 2008 8 auto(s4) f 16 25 p midsize
subaru forester awd 2.5 1999 4 manual(m5) 4 18 25 r suv
subaru forester awd 2.5 1999 4 auto(l4) 4 18 24 r suv
subaru forester awd 2.5 2008 4 manual(m5) 4 20 27 r suv
subaru forester awd 2.5 2008 4 manual(m5) 4 19 25 p suv
subaru forester awd 2.5 2008 4 auto(l4) 4 20 26 r suv
subaru forester awd 2.5 2008 4 auto(l4) 4 18 23 p suv
subaru impreza awd 2.2 1999 4 auto(l4) 4 21 26 r subcompact
subaru impreza awd 2.2 1999 4 manual(m5) 4 19 26 r subcompact
subaru impreza awd 2.5 1999 4 manual(m5) 4 19 26 r subcompact
subaru impreza awd 2.5 1999 4 auto(l4) 4 19 26 r subcompact
subaru impreza awd 2.5 2008 4 auto(s4) 4 20 25 p compact
subaru impreza awd 2.5 2008 4 auto(s4) 4 20 27 r compact
subaru impreza awd 2.5 2008 4 manual(m5) 4 19 25 p compact
subaru impreza awd 2.5 2008 4 manual(m5) 4 20 27 r compact
toyota 4runner 4wd 2.7 1999 4 manual(m5) 4 15 20 r suv
toyota 4runner 4wd 2.7 1999 4 auto(l4) 4 16 20 r suv
toyota 4runner 4wd 3.4 1999 6 auto(l4) 4 15 19 r suv
toyota 4runner 4wd 3.4 1999 6 manual(m5) 4 15 17 r suv
toyota 4runner 4wd 4.0 2008 6 auto(l5) 4 16 20 r suv
toyota 4runner 4wd 4.7 2008 8 auto(l5) 4 14 17 r suv
toyota camry 2.2 1999 4 manual(m5) f 21 29 r midsize
toyota camry 2.2 1999 4 auto(l4) f 21 27 r midsize
toyota camry 2.4 2008 4 manual(m5) f 21 31 r midsize
toyota camry 2.4 2008 4 auto(l5) f 21 31 r midsize
toyota camry 3.0 1999 6 auto(l4) f 18 26 r midsize
toyota camry 3.0 1999 6 manual(m5) f 18 26 r midsize
toyota camry 3.5 2008 6 auto(s6) f 19 28 r midsize
toyota camry solara 2.2 1999 4 auto(l4) f 21 27 r compact
toyota camry solara 2.2 1999 4 manual(m5) f 21 29 r compact
toyota camry solara 2.4 2008 4 manual(m5) f 21 31 r compact
toyota camry solara 2.4 2008 4 auto(s5) f 22 31 r compact
toyota camry solara 3.0 1999 6 auto(l4) f 18 26 r compact
toyota camry solara 3.0 1999 6 manual(m5) f 18 26 r compact
toyota camry solara 3.3 2008 6 auto(s5) f 18 27 r compact
toyota corolla 1.8 1999 4 auto(l3) f 24 30 r compact
toyota corolla 1.8 1999 4 auto(l4) f 24 33 r compact
toyota corolla 1.8 1999 4 manual(m5) f 26 35 r compact
toyota corolla 1.8 2008 4 manual(m5) f 28 37 r compact
toyota corolla 1.8 2008 4 auto(l4) f 26 35 r compact
toyota land cruiser wagon 4wd 4.7 1999 8 auto(l4) 4 11 15 r suv
toyota land cruiser wagon 4wd 5.7 2008 8 auto(s6) 4 13 18 r suv
toyota toyota tacoma 4wd 2.7 1999 4 manual(m5) 4 15 20 r pickup
toyota toyota tacoma 4wd 2.7 1999 4 auto(l4) 4 16 20 r pickup
toyota toyota tacoma 4wd 2.7 2008 4 manual(m5) 4 17 22 r pickup
toyota toyota tacoma 4wd 3.4 1999 6 manual(m5) 4 15 17 r pickup
toyota toyota tacoma 4wd 3.4 1999 6 auto(l4) 4 15 19 r pickup
toyota toyota tacoma 4wd 4.0 2008 6 manual(m6) 4 15 18 r pickup
toyota toyota tacoma 4wd 4.0 2008 6 auto(l5) 4 16 20 r pickup
volkswagen gti 2.0 1999 4 manual(m5) f 21 29 r compact
volkswagen gti 2.0 1999 4 auto(l4) f 19 26 r compact
volkswagen gti 2.0 2008 4 manual(m6) f 21 29 p compact
volkswagen gti 2.0 2008 4 auto(s6) f 22 29 p compact
volkswagen gti 2.8 1999 6 manual(m5) f 17 24 r compact
volkswagen jetta 1.9 1999 4 manual(m5) f 33 44 d compact
volkswagen jetta 2.0 1999 4 manual(m5) f 21 29 r compact
volkswagen jetta 2.0 1999 4 auto(l4) f 19 26 r compact
volkswagen jetta 2.0 2008 4 auto(s6) f 22 29 p compact
volkswagen jetta 2.0 2008 4 manual(m6) f 21 29 p compact
volkswagen jetta 2.5 2008 5 auto(s6) f 21 29 r compact
volkswagen jetta 2.5 2008 5 manual(m5) f 21 29 r compact
volkswagen jetta 2.8 1999 6 auto(l4) f 16 23 r compact
volkswagen jetta 2.8 1999 6 manual(m5) f 17 24 r compact
volkswagen new beetle 1.9 1999 4 manual(m5) f 35 44 d subcompact
volkswagen new beetle 1.9 1999 4 auto(l4) f 29 41 d subcompact
volkswagen new beetle 2.0 1999 4 manual(m5) f 21 29 r subcompact
volkswagen new beetle 2.0 1999 4 auto(l4) f 19 26 r subcompact
volkswagen new beetle 2.5 2008 5 manual(m5) f 20 28 r subcompact
volkswagen new beetle 2.5 2008 5 auto(s6) f 20 29 r subcompact
volkswagen passat 1.8 1999 4 manual(m5) f 21 29 p midsize
volkswagen passat 1.8 1999 4 auto(l5) f 18 29 p midsize
volkswagen passat 2.0 2008 4 auto(s6) f 19 28 p midsize
volkswagen passat 2.0 2008 4 manual(m6) f 21 29 p midsize
volkswagen passat 2.8 1999 6 auto(l5) f 16 26 p midsize
volkswagen passat 2.8 1999 6 manual(m5) f 18 26 p midsize
volkswagen passat 3.6 2008 6 auto(s6) f 17 26 p midsize

(Small data) example: mpg dataset, cty variable

Mean: \(16.8589744\), standard deviation: \(4.2559457\)

Estimate of the mean: \(16.8589744\), standard error: \(0.2782199\)
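As a quick arithmetic check, the quoted standard error is consistent with the plug-in formula \(s/\sqrt{n}\), using the standard deviation quoted for cty and the 234 rows of the mpg table. This only checks the numbers; it does not show how the value on the slide was originally computed.

```r
s <- 4.2559457   # standard deviation of cty quoted in this lecture
n <- 234         # number of rows in the mpg table

se_plugin <- s / sqrt(n)   # plug-in standard error of the sample mean
print(se_plugin)
```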

Bayesian Inference

Fundamental building block is Bayes’ Theorem. Let

  1. \(f(x|\mu)\) be a family of probability densities
  2. \(g(\mu)\) be some probability distribution on possible parameters for \(f(x|\mu)\)
  3. \(f(x) = \int_\Omega f(x|\mu)g(\mu)d\mu\) the marginal density of \(x\) - the result of averaging over all possible values for \(\mu\)

Then \[ g(\mu | x) = \frac{f(x | \mu)g(\mu)}{f(x)} \]

If \(g\) measures our belief about the possible values of \(\mu\), then Bayes' rule provides a systematic update rule: it tells us how new information changes that belief.
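Bayes' theorem can be applied numerically on a grid of candidate values for \(\mu\). The sketch below is an illustration under assumptions: a flat prior on the interval from 85 to 100, a normal likelihood with known standard deviation 1, and the twelve voltmeter readings from the meter reader example later in this lecture as data.

```r
readings <- c(92.50, 91.12, 92.37, 92.17, 93.95, 92.16,
              90.43, 92.33, 92.45, 91.99, 92.05, 92.68)
sigma <- 1                           # known measurement error (from the example)

step <- 0.01
mu_grid <- seq(85, 100, by = step)   # candidate values of mu (assumed range)
prior <- rep(1, length(mu_grid))     # flat prior g(mu): an illustrative choice

# Log-likelihood log f(x | mu) at each grid point
log_lik <- sapply(mu_grid, function(mu) {
  sum(dnorm(readings, mean = mu, sd = sigma, log = TRUE))
})

# Numerator f(x | mu) g(mu); subtracting the max avoids numerical underflow
unnormalised <- exp(log_lik - max(log_lik)) * prior

# Dividing by the grid approximation of the marginal f(x) gives g(mu | x)
posterior <- unnormalised / sum(unnormalised * step)

posterior_mean <- sum(mu_grid * posterior * step)
print(posterior_mean)
```

With a flat prior the posterior mean lands on the sample mean; a more informative prior would pull it toward the prior's center.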

By changing our notation, Bayes rule can be rewritten using likelihoods as:

\[ g(\mu | x) = c_x\mathcal L(\mu|x)g(\mu) \]

where \(c_x\) is a constant ensuring \(\int_\Omega g(\mu|x)d\mu = 1\).

Likelihood ratio

When deciding between two specific points,

\[ \frac{g(\mu_1|x)}{g(\mu_2|x)} = \frac{g(\mu_1)}{g(\mu_2)}\cdot\frac{\mathcal L(\mu_1|x)}{\mathcal L(\mu_2|x)} \]

The posterior odds ratio is the prior odds ratio times the likelihood ratio
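The odds form of Bayes' rule can be checked on a toy example. Everything below is hypothetical: two candidate values for a coin's probability of heads, a prior that favors the second, and 3 heads observed in 10 flips.

```r
heads <- 3; flips <- 10            # hypothetical data
p_candidates <- c(0.5, 0.9)        # two specific parameter values to compare
prior <- c(0.2, 0.8)               # hypothetical prior probabilities

# Likelihood of the data under each candidate value
lik <- dbinom(heads, size = flips, prob = p_candidates)

prior_odds <- prior[1] / prior[2]
lik_ratio <- lik[1] / lik[2]
posterior_odds <- prior_odds * lik_ratio

# Direct route: normalise prior * likelihood, then take the ratio
posterior <- prior * lik / sum(prior * lik)
posterior_odds_direct <- posterior[1] / posterior[2]

print(c(prior_odds = prior_odds, likelihood_ratio = lik_ratio,
        posterior_odds = posterior_odds))
```

The normalising constant cancels in the ratio, which is why the odds form never needs the marginal density.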

In Bayesian Inference, instead of a single value \(\hat\theta\) for a parameter, an entire probability distribution is estimated and updated.

Everything starts with the prior; the distribution that gets updated. This prior should preferably encode everything we know about the situation going in.

Even with a badly chosen prior, sufficiently consistent results will often quickly adjust the distribution.

Bayesian Inference - Skew prior

Flip a coin to check if fair. Start with prior belief \(\mathbb{P}(H)\sim\text{Beta}(9,1)\) (mean \(\mathbb{P}(H)=0.9\))
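For a Beta prior and coin-flip data the update has a closed form: a \(\text{Beta}(a,b)\) prior becomes \(\text{Beta}(a+h, b+t)\) after observing \(h\) heads and \(t\) tails. The sketch below starts from the \(\text{Beta}(9,1)\) prior above; the helper names and the flip counts are hypothetical and simply assume a coin that lands heads half the time.

```r
# Conjugate update of a Beta prior with coin-flip counts.
update_beta <- function(prior, heads, tails) {
  c(a = unname(prior["a"]) + heads, b = unname(prior["b"]) + tails)
}

# Mean of a Beta(a, b) distribution.
beta_mean <- function(p) unname(p["a"] / (p["a"] + p["b"]))

prior <- c(a = 9, b = 1)   # skewed prior from the slide, mean 0.9

# Hypothetical outcomes: exactly half heads for increasing numbers of flips
flips <- c(0, 10, 100, 1000)
posterior_means <- sapply(flips, function(n) {
  beta_mean(update_beta(prior, heads = n / 2, tails = n / 2))
})

print(data.frame(flips = flips, posterior_mean = posterior_means))
```

The posterior mean moves from the prior's 0.9 toward 0.5 as evidence accumulates, illustrating how consistent data overrides a badly chosen prior.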

Uninformative Priors

If not enough information is present at the start, one way is to pick a prior designed to not encode assumptions.

  • Uniform prior \(g^U(\theta) = c\).
  • Triangular prior \(g^\Delta(\theta) = 1-|\theta|\).
  • Jeffreys prior, proportional to the square root of the Fisher information \[ g^{\text{Jeff}}(\theta) = \sqrt{ \mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log\mathcal L(\theta|x)\right)^2\right] } \approx \frac{1}{\sigma_{\text{MLE}}} \]
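As a worked example, using the standard definition of the Jeffreys prior as proportional to the square root of the Fisher information, the sketch below computes it for a single Bernoulli trial. The function name is hypothetical. For this model the result is known to be proportional to \(1/\sqrt{\theta(1-\theta)}\), the kernel of a \(\text{Beta}(1/2, 1/2)\) distribution.

```r
# Fisher information of one Bernoulli(theta) observation, computed as the
# expected squared score over the two possible outcomes x = 0 and x = 1.
fisher_info_bernoulli <- function(theta) {
  x <- c(0, 1)
  prob <- dbinom(x, size = 1, prob = theta)
  score <- x / theta - (1 - x) / (1 - theta)   # d/dtheta of log L(theta | x)
  sum(prob * score^2)
}

theta <- c(0.1, 0.5, 0.9)
jeffreys_unnormalised <- sqrt(sapply(theta, fisher_info_bernoulli))

print(data.frame(theta = theta,
                 jeffreys = jeffreys_unnormalised,
                 closed_form = 1 / sqrt(theta * (1 - theta))))
```

The prior puts more weight near 0 and 1, where a single observation is most informative about \(\theta\).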

Frequentist vs Bayesian

The meter reader

An engineer makes 12 measurements, using a calibrated volt meter with normally distributed error, sd = 1.

92.50 91.12 92.37 92.17 93.95 92.16
90.43 92.33 92.45 91.99 92.05 92.68

She calculates \(\overline{x} = 92.18\), an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.

Is the estimate unbiased?

Frequentist answer: NO - because the probability family has changed.

Bayesian answer: YES - because the update rule only depends on the actual data points, and none of the recorded readings reached 100.
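The frequentist concern can be quantified. If readings above a cap are reported as the cap, the expected reading is \(\mu + \sigma\,[d\,(1 - F_{\mathcal N}(d)) - \varphi(d)]\) with \(d = (\text{cap} - \mu)/\sigma\) and \(\varphi\) the standard normal density, so the sample mean is biased downward. The function name below is hypothetical, and the second true voltage (99.5) is an invented value chosen to show a visible effect.

```r
# Bias of a single reading (and hence of the sample mean) when values above
# `cap` are reported as `cap`, for normally distributed measurement error.
censoring_bias <- function(mu, sigma = 1, cap = 100) {
  d <- (cap - mu) / sigma
  sigma * (d * pnorm(d, lower.tail = FALSE) - dnorm(d))
}

bias_observed <- censoring_bias(92.18)   # true voltage near the observed mean
bias_near_cap <- censoring_bias(99.5)    # hypothetical voltage close to the cap

print(c(near_observed_mean = bias_observed, near_cap = bias_near_cap))
```

The bias is never exactly zero, which is the frequentist point, but it is negligible unless the true voltage is within a few standard deviations of the cap.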

Pervasive Trade-offs

Bias vs Variance

Total Error = Bias² + Variance + Irreducible Error

Memory vs Processing

Speed can be increased by using more memory.
Memory footprint can be decreased by using more time.
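A small illustration of trading memory for speed is memoization: storing results that were already computed instead of recomputing them. The Fibonacci example and all names below are illustrative choices, not taken from the lecture.

```r
# No extra memory: recomputes the same subproblems exponentially often.
fib_slow <- function(n) if (n < 2) n else fib_slow(n - 1) + fib_slow(n - 2)

# Extra memory: a cache of already computed values makes each value cost
# one addition. cache[k] stores fib(k) for k >= 1.
make_fib_memo <- function() {
  cache <- c(1, 1)
  fib <- function(n) {
    if (n < 1) return(0)
    if (n <= length(cache) && !is.na(cache[n])) return(cache[n])
    value <- fib(n - 1) + fib(n - 2)
    cache[n] <<- value        # store the result in the enclosing environment
    value
  }
  fib
}

fib_memo <- make_fib_memo()
print(fib_slow(20))
print(fib_memo(30))
```

The reverse trade is also possible: dropping the cache saves memory at the cost of recomputation time.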

Underfitting vs overfitting

More complex models adapt closer to training data.
More complex models may behave badly out-of-sample.
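The two statements above can be explored with polynomial regression of increasing degree. The simulated data (a sine curve plus noise), the sample sizes and the degrees are all assumptions made for this sketch.

```r
set.seed(7)

# Simulated data: a smooth signal plus noise (illustrative choice).
make_data <- function(n) {
  x <- runif(n, -1, 1)
  data.frame(x = x, y = sin(3 * x) + rnorm(n, sd = 0.3))
}

train <- make_data(30)      # small training set
test <- make_data(1000)     # large held-out set

degrees <- c(1, 3, 6, 12)   # model complexity: polynomial degree
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train = mean(residuals(fit)^2),
    test = mean((test$y - predict(fit, newdata = test))^2))
}))

print(mse)
```

Training error can only go down as the degree increases, because each model contains the previous one. The held-out error is the quantity to watch for overfitting.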