---
title: 'Big Data Analytics
Lecture 1'
author: "Mikael Vejdemo-Johansson
Ping Ji"
date: "2019-01-31"
output:
revealjs::revealjs_presentation:
transition: none
slideNumber: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(knitr)
library(tidyverse)
library(ggformula)
library(GGally)
library(mosaic)
```
## Welcome to Big Data Analytics
## Semester Overview
* Foundations of Big Data
* Statistical presentation, communication, inference, learning
* Different large scale data types, and how to use them
## Structure of the course
* Weekly lectures: Mikael Vejdemo-Johansson, Ping Ji
* Guest lecturers: Denis Khryashchev, Sara-Jayne Terp, Joshua Brown, Mario Gonzalez
* Lab / homework tasks: 2 programming tasks, using [Kaggle](http://kaggle.com)
* Term report:
Either a deep analysis of a dataset, or an in-depth explanation of additional techniques.
End of semester: in-class presentation
## Contact Information
* Mikael Vejdemo-Johansson
mvejdemojohansson@gc.cuny.edu
Office Hours, 4420, Thursdays 3pm to 4pm
* Ping Ji
PJi@gc.cuny.edu
Course information will come through Blackboard.
A detailed syllabus is available now.
## Course Literature
* Efron & Hastie: Computer Age Statistical Inference
* James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning
Free as an ebook
* Stephens-Davidowitz: Everybody Lies
Read **now**: Efron & Hastie, Part I.
## Big Data
## Big Data
**Big Data** is when the scale of your data forces you to think about how to handle it:
* ...how will you fit this in memory?
* ...how will you fit this on disk(s)?
* ...how will you compute summaries quickly enough?
* ...how will you find what features to focus on?
## Three Vs of big data
* **V**olume - large data sets (Google; Wikipedia; CERN; High-throughput screening, ...)
* **V**elocity - fast data, real-time processing (high speed trading; Twitter firehose; networks)
* **V**ariety - complex data (image data; graph data; video data; data fusion; shape of data)
## Four Vs of big data
* **V**olume - large data sets (Google; Wikipedia; CERN; High-throughput screening, ...)
* **V**elocity - fast data, real-time processing (high speed trading; Twitter firehose; networks)
* **V**ariety - complex data (image data; graph data; video data; data fusion; shape of data)
* **V**eracity - reliable data (biases; ethical data analysis; data cleaning)
## Five Vs of big data
* **V**olume - large data sets (Google; Wikipedia; CERN; High-throughput screening, ...)
* **V**elocity - fast data, real-time processing (high speed trading; Twitter firehose; networks)
* **V**ariety - complex data (image data; graph data; video data; data fusion; shape of data)
* **V**eracity - reliable data (biases; ethical data analysis; data cleaning)
* **V**alue - valid and relevant data (applicability; relevance; actionable; impactful)
## Data Analytics
## Data Analytics
* Draw from past data to predict future behavior
* Multidisciplinary - extensive use of computation, mathematics, statistics
* Connected to business / consumer needs
## Classical Statistics
Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.
### Mean of a variable
Estimate the mean $\mu_X$ based on repeated observations $x_1,\dots,x_n$:
1. Calculate $\overline{x}$ and $s$ from the data
2. Calculate the confidence interval
$$
\overline{x} - t_{1-\alpha/2}(n-1)\cdot s/\sqrt{n}
\leq \mu_X \leq \overline{x} + t_{1-\alpha/2}(n-1)\cdot s/\sqrt{n}
$$
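A minimal R sketch of this interval, computed by hand and checked against `t.test` (the sample is illustrative, and the chunk is not evaluated in the deck):
```{r echo=T, eval=F}
x <- rnorm(50, mean = 10, sd = 2)                 # illustrative sample
n <- length(x); alpha <- 0.05
margin <- qt(1 - alpha/2, df = n - 1) * sd(x) / sqrt(n)
mean(x) + c(-margin, margin)                      # interval computed by hand
t.test(x)$conf.int                                # the same interval from t.test
```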
## Classical Statistics
Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.
### Mean of a variable
Check whether a mean $\mu_X$ is significantly different from a hypothesized mean $\mu_0$, based on repeated observations $x_1,\dots,x_n$:
1. Calculate $\overline{x}$ and $s$ from the data
2. Calculate the test statistic
$$
T = \frac{\overline{x}-\mu_0}{s/\sqrt{n}} \sim T(n-1)
$$
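A matching sketch of the one-sample test (illustrative data; `t.test` reports the same statistic):
```{r echo=T, eval=F}
x <- rnorm(50, mean = 10.3, sd = 2)               # illustrative sample
mu0 <- 10                                         # hypothesized mean
(mean(x) - mu0) / (sd(x) / sqrt(length(x)))       # T computed by hand
t.test(x, mu = mu0)$statistic                     # same T from t.test
```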
## Classical Statistics
Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.
### Mean of a variable
Check whether two means $\mu_X$ and $\mu_Y$ are significantly different from each other, based on repeated observations $x_1,\dots,x_n$ and $y_1,\dots,y_m$:
1. Calculate $\overline{x}, \overline{y}, s_x, s_y$ from the data
2. Calculate the test statistic
$$
T = \frac{\overline{x}-\overline{y}}{\sqrt{s_x^2/n+s_y^2/m}} \sim T(\nu)
$$
where the degrees of freedom $\nu$ come from the Welch–Satterthwaite approximation.
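A sketch of the two-sample version, assuming two illustrative samples; R's `t.test` performs the Welch test by default:
```{r echo=T, eval=F}
x <- rnorm(40, mean = 5.0)                        # illustrative samples
y <- rnorm(50, mean = 5.3)
# Welch test statistic by hand
(mean(x) - mean(y)) / sqrt(var(x)/length(x) + var(y)/length(y))
t.test(x, y)$statistic                            # same T from t.test
```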
## Classical Statistics
Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.
### Mean of a variable
Fundamental issue: as sample sizes grow, the standard error $s/\sqrt{n}$ shrinks toward zero, so even tiny differences produce large test statistics.
Everything becomes statistically significant.
Standard suggestion from Stats 101: look at effect sizes! Look at domain-specific significance concepts!
## Power analysis, large sample sizes
Denote:
* $F_{\mathcal{N}}(z) = \mathbb{P}(Z \leq z)$ for $Z\sim\mathcal{N}(0,1)$
* $z_{1-\alpha/2} = F_{\mathcal{N}}^{-1}(1-\alpha/2)$
* $\Delta$ the standardized smallest difference between means to be detected
Then the **power** of the T-test is the probability of detecting an effect of size $\Delta$:
\[
1-\beta \approx F_\mathcal{N}(\Delta\sqrt{n}-z_{1-\alpha/2})
\]
With a large sample size, we can solve for $\alpha$ to find a significance cutoff value.
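A quick sketch checking this approximation against R's exact power calculation (the sample size and effect size are illustrative):
```{r echo=T, eval=F}
n <- 100; Delta <- 0.3; alpha <- 0.05
pnorm(Delta * sqrt(n) - qnorm(1 - alpha/2))       # approximation from the slide
power.t.test(n = n, delta = Delta, sd = 1, sig.level = alpha,
             type = "one.sample")$power           # exact one-sample power
```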
```{r echo=F}
set.seed(42)
pts = data.frame(x = rnorm(100000, mean=-1e-2), y = rnorm(100000, mean=1e-2))
test = t.test(pts$x, pts$y)
```
## Power analysis, large sample sizes
100k samples each from two normal distributions:
Quantity | Estimate
-|-
$\overline x$ | $`r mean(pts$x)`$
$\overline y$ | $`r mean(pts$y)`$
$s_X$ | $`r sd(pts$x)`$
$s_Y$ | $`r sd(pts$y)`$
t-test $p$ | $`r test$p.value`$
## Power analysis, large sample sizes
100k samples each from two normal distributions:
```{r echo=F, error=F, warning=F, message=F}
pts %>%
  gf_freqpoly(~x, color = ~"x", binwidth = 0.01) %>%
  gf_freqpoly(~y, color = ~"y", binwidth = 0.01)
```
## Power analysis, large sample sizes
100k samples each from two normal distributions:
```{r echo=F, error=F, warning=F, message=F}
pts %>%
mutate(x=sort(x), y=sort(y)) %>%
gf_point(y~x) %>%
gf_labs(title="QQ-plot") +
coord_equal()
```
## Power analysis, large sample sizes
100k samples each from two normal distributions:
We consider a difference to be significant if greater than $0.1$.
```{r echo=T, error=F, warning=F, message=F}
power.t.test(n=100000, delta=0.1, sd=1, power=0.8, sig.level=NULL)
```
At $p = `r test$p.value`$, we are not able to reject the null hypothesis.
## Drill Down
With large scale data, even **VERY** small subpopulations can be studied.
Example (ongoing research): study only taxi rides in NYC that start and end at the same position.
## An Overview of Statistical Schools of Thought
Basic division: Frequentism vs. Bayesianism
| Frequentist | Bayesian
-|-|-
Probability is... | ...asymptotic proportion of successes in repeated trials | ...measure of synthesized belief
Parametric inference is... | ...estimating value of parameters from data | ...updating probability distributions on parameters from data
## Frequentist Concepts
Core interest: estimating a value $\theta$ of a real probability distribution on $X$ using an estimator $\hat\theta = t(X)$.
* Bias: $\mathbb{E}[\hat\theta]-\theta$
* Variance: $\mathbb{E}[(\hat\theta - \mathbb{E}\hat\theta)^2]$
Bias - Variance trade-off:
\[
\text{MSE} = \text{Bias}^2 + \text{Variance}
\]
We use **standard error** to refer to $\sqrt{\mathbb{V}(\hat\theta)}$ - the standard deviation of an estimator.
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Plug-in estimators
Given a formula relating a quantity to parameters, plug in an estimator directly.
Example: The sample mean $\overline{X} = \sum X_i/n$ has standard error
\[
\text{se}(\overline{X}) = \sqrt{\mathbb{V}(X)/n}
\]
We can estimate $\mathbb{V}(X)$ using the sample variance $\hat{\mathbb{V}}(X) = \sum (x_i-\overline{x})^2/(n-1)$.
This yields an estimated standard error
\[
\widehat{\text{se}}(\overline{X}) = \sqrt{\sum(x_i-\overline x)^2/(n(n-1))}
\]
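A minimal sketch of this plug-in standard error in R (illustrative data); note that it is exactly `sd(x)/sqrt(n)`:
```{r echo=T, eval=F}
x <- rnorm(30, mean = 3, sd = 2)                  # illustrative sample
n <- length(x)
sqrt(sum((x - mean(x))^2) / (n * (n - 1)))        # plug-in estimated standard error
sd(x) / sqrt(n)                                   # the same value, via sd()
```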
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Taylor expansions
More complicated statistics can be related back to simpler ones using linear approximations.
For a function $s(\hat\theta)$ we can Taylor expand around $\theta=\mathbb{E}\hat\theta$:
\[
s(\hat\theta) - s(\theta) \approx s'(\theta)(\hat\theta-\theta)
\]
So $\mathbb{V}[s(\hat\theta)] = \mathbb{E}[(s(\hat\theta)-s(\theta))^2] \approx |s'(\theta)|^2\,\mathbb{V}[\hat\theta]$.
Example: let $\hat\theta = \overline{x}^2$. Then $d\hat\theta/d\overline{x} = 2\overline{x}$.
Plugging into the Taylor expansion, we get
\[
\widehat{\text{se}}(\overline{x}^2) \approx 2|\overline{x}|\cdot\widehat{\text{se}}(\overline{x})
\]
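A sketch comparing this delta-method standard error to a bootstrap estimate (illustrative data; the two should roughly agree):
```{r echo=T, eval=F}
x <- rnorm(200, mean = 5)                         # illustrative sample
2 * abs(mean(x)) * sd(x) / sqrt(length(x))        # delta-method se of mean(x)^2
boots <- replicate(2000, mean(sample(x, replace = TRUE))^2)
sd(boots)                                         # bootstrap se, for comparison
```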
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Maximum Likelihood
We define the **likelihood function** as a function on parameter values:
\[
\mathcal{L}(\theta | x) = \mathbb{P}(x | \theta)
\]
The **Neyman–Pearson Lemma**: when constructing a statistical testing rule to pick between two distributions $f_0$ and $f_1$, the smallest errors are achieved by
\[
t_c(x) = \begin{cases}
1 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) \geq c$} \\
0 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) < c$}
\end{cases}
\]
for $c$ chosen to achieve the desired significance level.
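A minimal sketch of this log-likelihood-ratio rule for two candidate normal distributions (data, distributions, and cutoff all illustrative):
```{r echo=T, eval=F}
x <- rnorm(25, mean = 0.4)                        # illustrative data
llr <- sum(dnorm(x, mean = 1, log = TRUE)) -      # f1 = N(1, 1)
       sum(dnorm(x, mean = 0, log = TRUE))        # f0 = N(0, 1)
c.cut <- 0                                        # illustrative threshold
as.numeric(llr >= c.cut)                          # 1: pick f1, 0: pick f0
```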
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Maximum Likelihood
We define the **likelihood function** as a function on parameter values:
\[
\mathcal{L}(\theta | x) = \mathbb{P}(x | \theta)
\]
The **Maximum Likelihood Estimator** tends to be *unbiased* and to have the *least possible variance* -- and even when not, it tends to work very well.
\[
\hat\theta_{MLE} = \arg\max_{\theta} \mathcal{L}(\theta | x)
\]
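A small sketch of numerical maximum likelihood in R, assuming an exponential model with unknown rate; the closed-form MLE `1/mean(x)` is shown for comparison:
```{r echo=T, eval=F}
x <- rexp(100, rate = 2)                          # illustrative sample
negloglik <- function(rate) -sum(dexp(x, rate = rate, log = TRUE))
optimize(negloglik, interval = c(0.01, 10))$minimum  # numerical MLE
1 / mean(x)                                       # closed-form MLE of the rate
```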
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Bootstrap and Simulations
Frequentism wants us to focus on repeated experiments.
...so let's repeat some experiments.
Use the dataset $x$ as a probability distribution itself; sample $x^{(1)}, \dots, x^{(B)}$ repeatedly from $x$ (with replacement).
This approximates the true distribution, and we can use the observed distributions of $t(x^{(k)})$ to study the statistical behavior of $t(x)$.
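The whole idea fits in a few lines of base R (a sketch with illustrative data; the next slides use `mosaic`'s `do()` for the same computation):
```{r echo=T, eval=F}
x <- rnorm(50)                                    # illustrative sample
t.boot <- replicate(1000, mean(sample(x, replace = TRUE)))
sd(t.boot)                                        # bootstrap standard error of the mean
```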
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Bootstrap and Simulations
(Small data) example: `mpg` dataset
```{r}
mpg %>% kable
```
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Bootstrap and Simulations
(Small data) example: `mpg` dataset, `cty` variable
Mean: $`r mean(mpg$cty)`$, standard deviation: $`r sd(mpg$cty)`$
```{r fig.height=2}
mpg %>% gf_histogram(~cty)
```
## Frequentist Concepts
To derive quantities describing an estimator we can use:
### Bootstrap and Simulations
(Small data) example: `mpg` dataset, `cty` variable
Mean: $`r mean(mpg$cty)`$, standard error: $`r sd(mpg$cty)/sqrt(nrow(mpg))`$
```{r fig.height=4}
mean.boot = do(1000)*mean((mpg %>% sample_frac(replace=TRUE))$cty)
mean.boot %>%
gf_qq(~mean, distribution = qnorm, dparams = list(mean = mean(mpg$cty),
                                                  sd = sd(mpg$cty)/sqrt(nrow(mpg)))) %>%
gf_labs(title="QQ-plot, bootstrapped mean distribution",
subtitle="vs. theoretical distribution",
x="theoretical", y="bootstrap") %>%
gf_abline(slope=1, color="blue") + coord_equal()
```
## Bayesian Inference
Fundamental building block is Bayes' Theorem.
Let
1. $f(x|\mu)$ be a family of probability densities
2. $g(\mu)$ be some probability distribution on possible parameters for $f(x|\mu)$
3. $f(x) = \int_\Omega f(x|\mu)g(\mu)\,d\mu$, the marginal density of $x$ - the result of averaging over all possible values of $\mu$
Then
\[
g(\mu | x) = \frac{f(x | \mu)g(\mu)}{f(x)}
\]
If $g$ measures our belief about possible values of $\mu$, then Bayes' rule provides a systematic update rule: how new information *changes* that belief.
## Bayesian Inference
By changing notation, Bayes' rule can be rewritten using likelihoods as:
\[
g(\mu | x) = c_x\mathcal L(\mu|x)g(\mu)
\]
where $c_x$ is a constant ensuring $\int_\Omega g(\mu|x)d\mu = 1$.
### Likelihood ratio
When deciding between two specific points,
\[
\frac{g(\mu_1|x)}{g(\mu_2|x)} = \frac{g(\mu_1)}{g(\mu_2)}\cdot\frac{\mathcal L(\mu_1|x)}{\mathcal L(\mu_2|x)}
\]
*The posterior odds ratio is the prior odds ratio times the likelihood ratio*
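A tiny sketch of this identity for two candidate means of a normal model (all numbers illustrative):
```{r echo=T, eval=F}
x <- rnorm(20, mean = 0.5)                        # illustrative data
lik <- function(mu) prod(dnorm(x, mean = mu))     # likelihood of a candidate mean
prior.odds <- 1                                   # equal prior belief in both
prior.odds * lik(0.5) / lik(0)                    # posterior odds, mu=0.5 vs mu=0
```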
## Bayesian Inference
In Bayesian Inference, instead of a single value $\hat\theta$ for a parameter, an entire probability distribution is estimated and updated.
Everything starts with the *prior*: the distribution that gets updated.
This prior should preferably encode everything we know about the situation going in.
Even with a badly chosen prior, sufficiently consistent data will often quickly correct the distribution.
## Bayesian Inference - Skew prior
Flip a coin to check whether it is fair. Start with the prior belief $\mathbb{P}(H)\sim\text{Beta}(9,1)$ (mean $\mathbb{P}(H)=0.9$).
```{r fig.height=5}
flips = runif(1000) > 0.5
p.h = list(c(9,1),
c(9+sum(flips[1:250]), 250-sum(flips[1:250])+1),
c(9+sum(flips[1:500]), 500-sum(flips[1:500])+1),
c(9+sum(flips[1:750]), 750-sum(flips[1:750])+1),
c(9+sum(flips), 1000-sum(flips)+1))
gf_dist("beta", shape1=p.h[[1]][1], shape2=p.h[[1]][2], color=~"prior") %>%
gf_dist("beta", shape1=p.h[[2]][1], shape2=p.h[[2]][2], color=~" 250 flips") %>%
gf_dist("beta", shape1=p.h[[3]][1], shape2=p.h[[3]][2], color=~" 500 flips") %>%
gf_dist("beta", shape1=p.h[[4]][1], shape2=p.h[[4]][2], color=~" 750 flips") %>%
gf_dist("beta", shape1=p.h[[5]][1], shape2=p.h[[5]][2], color=~"1000 flips") %>%
gf_labs(x="P(H)")
```
## Uninformative Priors
If not enough information is available at the start, one option is to pick a prior designed to encode as few assumptions as possible.
* Uniform prior $g^U(\theta) = c$.
* Triangular prior $g^\Delta(\theta) = 1-|\theta|$.
* Jeffreys prior
\[
g^{\text{Jeff}}(\theta) =
\sqrt{\mathbb{E}\left[\left(
\frac{\partial}{\partial\theta}\log\mathcal L(\theta|x)
\right)^2\right]} \approx
\frac{1}{\sigma_{\text{MLE}}}
\]
where $\sigma_{\text{MLE}}$ denotes the standard deviation of the MLE at $\theta$.
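For a Bernoulli (coin-flip) model, the Jeffreys prior works out to $\text{Beta}(1/2, 1/2)$; a quick sketch of its shape:
```{r echo=T, eval=F}
# Jeffreys prior for P(H) in a coin-flip model is Beta(1/2, 1/2)
gf_dist("beta", shape1 = 0.5, shape2 = 0.5) %>%
  gf_labs(x = "P(H)", title = "Jeffreys prior, Bernoulli model")
```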
## Frequentist vs Bayesian
### The meter reader
An engineer makes 12 measurements using a calibrated voltmeter with normally distributed error, sd = 1.
```{r}
meter.data = rnorm(12, 92, 1)
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```
She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.
## Frequentist vs Bayesian
### The meter reader
An engineer makes 12 measurements using a calibrated voltmeter with normally distributed error, sd = 1.
```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```
She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.
The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.
Is the estimate unbiased?
## Frequentist vs Bayesian
### The meter reader
An engineer makes 12 measurements using a calibrated voltmeter with normally distributed error, sd = 1.
```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```
She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.
The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.
Is the estimate unbiased?
Frequentist answer: **NO** - because the probability family has changed.
## Frequentist vs Bayesian
### The meter reader
An engineer makes 12 measurements using a calibrated voltmeter with normally distributed error, sd = 1.
```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```
She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.
The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.
Is the estimate unbiased?
Bayesian answer: **YES** - because the update rule only depends on the actual data points.
## Pervasive Trade-offs
### Bias vs Variance
Total Error = Bias$^2$ + Variance + Irreducible Error
### Memory vs Processing
Speed can be increased by using more memory.
Memory footprint can be decreased by using more time.
### Underfitting vs overfitting
More complex models adapt closer to training data.
More complex models may behave badly out-of-sample.
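A minimal sketch of this last trade-off, assuming polynomial fits to illustrative data: the high-degree model fits the training set closely but does worse out-of-sample:
```{r echo=T, eval=F}
set.seed(1)
train <- data.frame(x = runif(30, -2, 2)); train$y <- sin(train$x) + rnorm(30, sd = 0.3)
test  <- data.frame(x = runif(30, -2, 2)); test$y  <- sin(test$x)  + rnorm(30, sd = 0.3)
for (deg in c(1, 3, 15)) {                        # underfit, reasonable, overfit
  fit <- lm(y ~ poly(x, deg), data = train)
  cat("degree", deg,
      "- train MSE:", mean(residuals(fit)^2),
      "- test MSE:",  mean((test$y - predict(fit, test))^2), "\n")
}
```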