---
title: 'Big Data Analytics<br />Lecture 1'
author: "Mikael Vejdemo-Johansson<br />Ping Ji"
date: "2019-01-31"
output:
  revealjs::revealjs_presentation:
    transition: none
    slideNumber: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(knitr)
library(tidyverse)
library(ggformula)
library(GGally)
library(mosaic)
```


## Welcome to Big Data Analytics

## Semester Overview

* Foundations of Big Data
* Statistical presentation, communication, inference, learning
* Different large scale data types, and how to use them


## Structure of the course

* Weekly lectures: Mikael Vejdemo-Johansson, Ping Ji   
* Guest lecturers: Denis Khryashchev, Sara-Jayne Terp, Joshua Brown, Mario Gonzalez
* Lab / homework tasks: 2 programming tasks, using [Kaggle](http://kaggle.com)
* Term report:  
  Either a deep analysis of a dataset, or an in depth explanation of additional techniques.
  End of semester: in class presentation


## Contact Information

* Mikael Vejdemo-Johansson   
  mvejdemojohansson@gc.cuny.edu   
  Office Hours, 4420, Thursdays 3pm to 4pm
* Ping Ji   
  PJi@gc.cuny.edu

Course information will come through Blackboard.
A detailed syllabus is available now.


## Course Literature

* Efron & Hastie: Computer age statistical inference
* James, Witten, Hastie, Tibshirani: An introduction to statistical learning   
  Free as ebook
* Stephens-Davidowitz: Everybody Lies

Read **now**: Efron & Hastie, Part I.


## Big Data


## Big Data

**Big Data** is when you have to think about handling your data:

* ...how will you fit this in memory?
* ...how will you fit this on disk(s)?
* ...how will you compute summaries quickly enough?
* ...how will you find what features to focus on?


## Three Vs of big data

* **V**olume <span style="color:white">- large data sets (Google; Wikipedia; CERN; High-throughput screening, ...)</span>
* **V**elocity <span style="color:white">- fast data, real-time processing (high speed trading; Twitter firehose; networks)</fragment>
* **V**ariety <span style="color:white">- complex data (image data; graph data; video data; data fusion; shape of data)</fragment>

## Four Vs of big data

* **V**olume <span style="color:white">- large data sets (Google; Wikipedia; CERN; High-throughput screening, ...)</span>
* **V**elocity <span style="color:white">- fast data, real-time processing (high speed trading; Twitter firehose; networks)</fragment>
* **V**ariety <span style="color:white">- complex data (image data; graph data; video data; data fusion; shape of data)</fragment>
* **V**eracity <span style="color:white">- reliable data (biases; ethical data analysis; data cleaning)</fragment>


## Five Vs of big data

* **V**olume <span class="fragment">- large data sets (Google; Wikipedia; CERN; High-throughput screening, ...)</fragment>
* **V**elocity <span class="fragment">- fast data, real-time processing (high speed trading; Twitter firehose; networks)</fragment>
* **V**ariety <span class="fragment">- complex data (image data; graph data; video data; data fusion; shape of data)</fragment>
* **V**eracity <span class="fragment">- reliable data (biases; ethical data analysis; data cleaning)</fragment>
* **V**alue <span class="fragment">- valid and relevant data (applicability; relevance; actionable; impactful)</fragment>


## Data Analytics


## Data Analytics

* Draw from past data to predict future behavior
* Multidisciplinary - extensive use of computation, mathematics, statistics
* Connected to business / consumer needs


## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Estimate the mean $\mu_X$ based on repeated observations $x_1,\dots,x_n$:

1. Calculate $\overline{x}$ and $s$ from the data
2. Calculate the confidence interval
$$
\overline{x} - t_{1-\alpha/2}\cdot s/\sqrt{n}
\leq \mu_X \leq\overline{x} + t_{1-\alpha/2}\cdot s/\sqrt{n}
$$


## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Check whether a mean $\mu_X$ is significantly different from a hypothesized mean $\mu_0$, based on repeated observations $x_1,\dots,x_n$:

1. Calculate $\overline{x}$ and $s$ from the data
2. Calculate the test statistic
$$
T = \frac{\overline{x}-\mu_0}{s/\sqrt{n}} \sim T(n-1)
$$


## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Check whether two means $\mu_X$ and $\mu_Y$ are significantly different from each other, based on repeated observations $x_1,\dots,x_n$ and $y_1,\dots,y_m$:

1. Calculate $\overline{x}, \overline{y}, s_x, s_y$ from the data
2. Calculate the test statistic
$$
T = \frac{\overline{x}-\overline{y}}{s_p\sqrt{s_x^2/n+s_y^2/m}} \sim T(...)
$$


## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Fundamental issue: as data sizes grow, $1/\sqrt{n}$ will dominate everything else.

Everything is statistically significant.

Standard suggestion from Stats 101: look at effect sizes! look at domain specific significance concepts! 


## Power analysis, large sample sizes

Denote:

* $F_{\mathcal{N}} = \mathbb{P}(Z \leq z)$ for $Z\sim\mathcal{N}(0,1)$
* $z_{1-\alpha/2} = F_{\mathcal{N}}^{-1}(1-\alpha/2)$
* $\Delta$ the standardized smallest difference between means to be detected

Then the **power** of the T-test is the probability of detecting an effect of size $\Delta$:

\[
1-\beta \approx F_\mathcal{N}(\Delta\sqrt{n}-z_{1-\alpha/2})
\]

With a large sample size, we can solve for $\alpha$ to find a significance cutoff value.


```{r echo=F}
set.seed(42)
pts = data.frame(x = rnorm(100000, mean=-1e-2), y = rnorm(100000, mean=1e-2))
test = t.test(pts$x, pts$y)
```

## Power analysis, large sample sizes

100k samples each from two normal distributions:

Quantity | Estimate
-|-
$\overline x$ | $`r mean(pts$x)`$
$\overline y$ | $`r mean(pts$y)`$
$s_X$ | $`r sd(pts$x)`$
$s_Y$ | $`r sd(pts$y)`$
t-test $p$ | $`r test$p.value`$

## Power analysis, large sample sizes

100k samples each from two normal distributions:

```{r echo=F, error=F, warning=F, message=F}
pts %>% 
  gf_freqpoly(~x, color=~"x", binsize=0.01) %>% 
  gf_freqpoly(~y, color=~"y", binsize=0.01)
```

## Power analysis, large sample sizes

100k samples each from two normal distributions:

```{r echo=F, error=F, warning=F, message=F}
pts %>%
  mutate(x=sort(x), y=sort(y)) %>%
  gf_point(y~x) %>%
  gf_labs(title="QQ-plot") +
  coord_equal()
```

## Power analysis, large sample sizes

100k samples each from two normal distributions:

We consider a difference to be significant if greater than $0.1$.

```{r echo=T, error=F, warning=F, message=F}
power.t.test(n=100000, delta=0.1, sd=1, power=0.8, sig.level=NULL)
```

At $p = `r test$p.value`$ we are not able to reject the null.


## Drill Down

With large scale data, even **VERY** small subpopulations can be studied.

Example (ongoing research): study only taxi rides in NYC that start and end at the same position.


## An Overview of Statistical Schools of Thought

Basic division: Frequentism vs. Bayesianism

&nbsp; | Frequentist | Bayesian
-|-|-
Probability is... | ...asymptotic proportion of successes in repeated trials | ...measure of synthesized belief
Parametric inference is... | ...estimating value of parameters from data | ...updating probability distributions on parameters from data


## Frequentist Concepts

Core interest: estimating some value $\theta$ related to some real probability distribution on $X$ from some estimator $\hat\theta = t(X)$.

* Bias: $\mathbb{E}[\hat\theta]-\theta$
* Variance: $\mathbb{E}[(\hat\theta - \mathbb{E}\hat\theta)^2]$

Bias - Variance trade-off: 
\[
\text{MSE} = \text{Bias}^2 + \text{Variance}
\]

We use **standard error** to refer to $\sqrt{\mathbb{V}(\hat\theta)}$ - the standard deviation of an estimator.

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Plug-in estimators

Given a formula relating a quantity to parameters, plug in an estimator directly.

Example: The sample mean $\overline{X} = \sum X_i/n$ has standard error
\[
\text{se}(\overline{X}) = \sqrt{\mathbb{V}(X)/n}
\]

We can estimate $\mathbb{V}(X)$ using the sample variance $\hat{\mathbb{V}}(X) = \sum (x_i-\overline{x})^2/(n-1)$.
This yields an estimated standard error 
\[
\widehat{\text{se}}(\overline{X}) = \sqrt{\sum(x_i-\overline x)^2/(n(n-1))}
\]


## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Taylor expansions

More complicated statistics can be related back using linear approximations.
For a function $s(\hat\theta)$ we can Taylor expand around $\theta=\mathbb{E}\hat\theta$: 
\[
s(\hat\theta) - s(\theta) \approx s'(\theta)(\hat\theta-\theta)
\]

So $\mathbb{V}[s(\hat\theta)] = \mathbb{E}[(s(\hat\theta)-s(\theta))^2] \approx |s'(\theta)|^2\mathbb{V}\theta$.

$\hat\theta = \overline{x}^2$. Then $d\hat\theta/d\overline{x} = 2\overline{x}$.
Plugin into the Taylor expansion we get

\[
\text{se}(\overline{x}^2) = 2|\overline{x}|\widehat{\text{se}}(\overline{x})
\]

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Maximum Likelihood

We define the **likelihood function** as a function on parameter values:

\[
\mathcal{L}(\theta | x) = \mathbb{P}(x | \theta)
\]

**Neyman-Pearson's Lemma** When constructing a statistical testing rule to pick between two distributions $f_0$ and $f_1$, the smallest errors are achieved by 
\[
t_c(x) = \begin{cases}
1 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) \geq c$} \\
0 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) < c$}
\end{cases}
\]
for $c$ chosen to achieve the desired confidence level


## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Maximum Likelihood

We define the **likelihood function** as a function on parameter values:

\[
\mathcal{L}(\theta | x) = \mathbb{P}(x | \theta)
\]

The **Maximum Likelihood Estimator** tends to be *unbiased* and with *least possible variance* -- and even when not, tends to work very well.

\[
\hat\theta_{MLE} = \max_{\hat\theta} \mathcal{L}(\hat\theta | x)
\]


## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

Frequentism wants us to focus on repeated experiments.
...so let's repeat some experiments.

Use the dataset $x$ as a probability distribution itself; sample $x^{(1)}, \dots, x^{(B)}$ repeatedly from $x$ (with replacement).
This approximates the true distribution, and we can use the observed distributions of $t(x^{(k)})$ to study the statistical behavior of $t(x)$.


## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

(Small data) example: `mpg` dataset

```{r}
mpg %>% kable
```

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

(Small data) example: `mpg` dataset, `cty` variable

Mean: $`r mean(mpg$cty)`$, standard deviation: $`r sd(mpg$cty)`$

```{r fig.height=2}
mpg %>% gf_histogram(~cty)
```

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

(Small data) example: `mpg` dataset, `cty` variable

Mean: $`r mean(mpg$cty)`$, standard error: $`r sd(mpg$cty)/sqrt(nrow(mpg))`$

```{r fig.height=4}
mean.boot = do(1000)*mean((mpg %>% sample_frac(replace=TRUE))$cty)
mean.boot %>% 
  gf_qq(~mean, "norm", dparams=list(mean=mean(mpg$cty),
                                    sd=sd(mpg$cty)/sqrt(nrow(mpg)))) %>%
  gf_labs(title="QQ-plot, bootstrapped mean distribution",
          subtitle="vs. theoretical distribution",
          x="theoretical", y="bootstrap") %>%
  gf_abline(slope=1, color="blue") + coord_equal()
```


## Bayesian Inference

Fundamental building block is Bayes' Theorem.
Let

1. $f(x|\mu)$ be a family of probability densities
2. $g(\mu)$ be some probability distribution on possible parameters for $f(x|\mu)$
3. $f(x) = \int_\Omega f_\mu(x)g(\mu)d\mu$ the marginal density of $x$ - the result of averaging over all possible values for $\mu$

Then
\[
g(\mu | x) = \frac{f(x | \mu)g(\mu)}{f(x)}
\]

If $g$ measures our belief of possible distributions for $\mu$, then Bayes rule provides a systematic update rule: how does new information *change* that belief.


## Bayesian Inference

By changing our notation, Bayes rule can be rewritten using likelihoods as:

\[
g(\mu | x) = c_x\mathcal L(\mu|x)g(\mu)
\]

where $c_x$ is a constant ensuring $\int_\Omega g(\mu|x)d\mu = 1$.

### Likelihood ratio

When deciding between two specific points, 

\[
\frac{g(\mu_1|x)}{g(\mu_2|x)} = \frac{g(\mu_1)}{g(\mu_2)}\cdot\frac{\mathcal L(\mu_1|x)}{\mathcal L(\mu_2|x)}
\]

*The posterior odds ratio is the prior odds ratio times the likelihood ratio*


## Bayesian Inference

In Bayesian Inference, instead of a single value $\hat\theta$ for a parameter, an entire probability distribution is estimated and updated.

Everything starts with the *prior*; the distribution that gets updated.
This prior should preferably encode everything we know about the situation going in.

Even with a badly chosen prior, sufficiently consistent results will often quickly adjust the distribution.

## Bayesian Inference - Skew prior

Flip a coin to check if fair. Start with prior belief $\mathbb{P}(H)\sim\text{Beta}(9,1)$ (mean $\mathbb{P}(H)=0.9$)

```{r fig.height=5}
flips = runif(1000) > 0.5
p.h = list(c(9,1), 
           c(9+sum(flips[1:250]), 250-sum(flips[1:250])+1),
           c(9+sum(flips[1:500]), 500-sum(flips[1:500])+1),
           c(9+sum(flips[1:750]), 750-sum(flips[1:750])+1),
           c(9+sum(flips), 1000-sum(flips)+1))
gf_dist("beta", shape1=p.h[[1]][1], shape2=p.h[[1]][2], color=~"prior") %>%
  gf_dist("beta", shape1=p.h[[2]][1], shape2=p.h[[2]][2], color=~" 250 flips") %>%
  gf_dist("beta", shape1=p.h[[3]][1], shape2=p.h[[3]][2], color=~" 500 flips") %>%
  gf_dist("beta", shape1=p.h[[4]][1], shape2=p.h[[4]][2], color=~" 750 flips") %>%
  gf_dist("beta", shape1=p.h[[5]][1], shape2=p.h[[5]][2], color=~"1000 flips") %>%
  gf_labs(x="P(H)")

```


## Uninformative Priors

If not enough information is present at the start, one way is to pick a prior designed to not encode assumptions.

* Uniform prior $g^U(\theta) = c$.
* Triangular prior $g^\Delta(\theta) = 1-|\theta|$.
* Jeffrey's prior
\[
g^{\text{Jeff}}(\theta) = 
\sqrt{\frac{1}{
\mathbb{E}\frac{\partial}{\partial\theta}\log\mathcal L(\theta|x)
}} \approx
\frac{1}{\sigma_{\text{MLE}}}
\]

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated volt meter with normally distributed error, sd = 1.

```{r}
meter.data = rnorm(12, 92, 1)
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated volt meter with normally distributed error, sd = 1.

```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.

Is the estimate unbiased?

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated volt meter with normally distributed error, sd = 1.

```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.

Is the estimate unbiased?   
Frequentist answer: **NO** - because the probability family has changed.


## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated volt meter with normally distributed error, sd = 1.

```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100.

Is the estimate unbiased?   
Bayesian answer: **YES** - because the update rule only depends on the actual data points


## Pervasive Trade-offs

### Bias vs Variance

Total Error = Bias + Variance + Irreducible Error

### Memory vs Processing

Speed can be increased by using more memory.   
Memory footprint can be decreased by using more time.

### Underfitting vs overfitting

More complex models adapt closer to training data.   
More complex models may behave badly out-of-sample.