---
title: "Big Data Analytics, Lecture 1"
author: "Mikael Vejdemo-Johansson, Ping Ji"
Ping Ji" date: "2019-01-31" output: revealjs::revealjs_presentation: transition: none slideNumber: true --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = FALSE) library(knitr) library(tidyverse) library(ggformula) library(GGally) library(mosaic) ``` ## Welcome to Big Data Analytics ## Semester Overview * Foundations of Big Data * Statistical presentation, communication, inference, learning * Different large scale data types, and how to use them ## Structure of the course * Weekly lectures: Mikael Vejdemo-Johansson, Ping Ji * Guest lecturers: Denis Khryashchev, Sara-Jayne Terp, Joshua Brown, Mario Gonzalez * Lab / homework tasks: 2 programming tasks, using [Kaggle](http://kaggle.com) * Term report: Either a deep analysis of a dataset, or an in depth explanation of additional techniques. End of semester: in class presentation ## Contact Information * Mikael Vejdemo-Johansson mvejdemojohansson@gc.cuny.edu Office Hours, 4420, Thursdays 3pm to 4pm * Ping Ji PJi@gc.cuny.edu Course information will come through Blackboard. A detailed syllabus is available now. ## Course Literature * Efron & Hastie: Computer age statistical inference * James, Witten, Hastie, Tibshirani: An introduction to statistical learning Free as ebook * Stephens-Davidowitz: Everybody Lies Read **now**: Efron & Hastie, Part I. ## Big Data ## Big Data **Big Data** is when you have to think about handling your data: * ...how will you fit this in memory? * ...how will you fit this on disk(s)? * ...how will you compute summaries quickly enough? * ...how will you find what features to focus on? ## Three Vs of big data * **V**olume - large data sets (Google; Wikipedia; CERN; High-throughput screening, ...) * **V**elocity - fast data, real-time processing (high speed trading; Twitter firehose; networks) * **V**ariety - complex data (image data; graph data; video data; data fusion; shape of data) ## Four Vs of big data * **V**olume - large data sets (Google; Wikipedia; CERN; High-throughput screening, ...) * **V**elocity - fast data, real-time processing (high speed trading; Twitter firehose; networks) * **V**ariety - complex data (image data; graph data; video data; data fusion; shape of data) * **V**eracity - reliable data (biases; ethical data analysis; data cleaning) ## Five Vs of big data * **V**olume - large data sets (Google; Wikipedia; CERN; High-throughput screening, ...) * **V**elocity - fast data, real-time processing (high speed trading; Twitter firehose; networks) * **V**ariety - complex data (image data; graph data; video data; data fusion; shape of data) * **V**eracity - reliable data (biases; ethical data analysis; data cleaning) * **V**alue - valid and relevant data (applicability; relevance; actionable; impactful) ## Data Analytics ## Data Analytics * Draw from past data to predict future behavior * Multidisciplinary - extensive use of computation, mathematics, statistics * Connected to business / consumer needs ## Classical Statistics Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*. ### Mean of a variable Estimate the mean $\mu_X$ based on repeated observations $x_1,\dots,x_n$: 1. Calculate $\overline{x}$ and $s$ from the data 2. Calculate the confidence interval $$ \overline{x} - t_{1-\alpha/2}\cdot s/\sqrt{n} \leq \mu_X \leq\overline{x} + t_{1-\alpha/2}\cdot s/\sqrt{n} $$ ## Classical Statistics Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*. 
## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Check whether a mean $\mu_X$ is significantly different from a hypothesized mean $\mu_0$, based on repeated observations $x_1,\dots,x_n$:

1. Calculate $\overline{x}$ and $s$ from the data
2. Calculate the test statistic
$$ T = \frac{\overline{x}-\mu_0}{s/\sqrt{n}} \sim T(n-1) $$

## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Check whether two means $\mu_X$ and $\mu_Y$ are significantly different from each other, based on repeated observations $x_1,\dots,x_n$ and $y_1,\dots,y_m$:

1. Calculate $\overline{x}, \overline{y}, s_x, s_y$ from the data
2. Calculate the test statistic (Welch's version; $\nu$ from the Welch-Satterthwaite approximation)
$$ T = \frac{\overline{x}-\overline{y}}{\sqrt{s_x^2/n+s_y^2/m}} \sim T(\nu) $$

## Classical Statistics

Based on observations $x_i$ from a random variable $X$, describe $X$ sufficiently well to enable *inferences* and *predictions*.

### Mean of a variable

Fundamental issue: as data sizes grow, $1/\sqrt{n}$ will dominate everything else. Everything becomes statistically significant.

Standard suggestion from Stats 101: look at effect sizes! Look at domain-specific significance concepts!

## Power analysis, large sample sizes

Denote:

* $F_{\mathcal{N}}(z) = \mathbb{P}(Z \leq z)$ for $Z\sim\mathcal{N}(0,1)$
* $z_{1-\alpha/2} = F_{\mathcal{N}}^{-1}(1-\alpha/2)$
* $\Delta$ the standardized smallest difference between means to be detected

Then the **power** of the t-test is the probability of detecting an effect of size $\Delta$:
\[ 1-\beta \approx F_\mathcal{N}(\Delta\sqrt{n}-z_{1-\alpha/2}) \]

With a large sample size, we can solve for $\alpha$ to find a significance cutoff value.

```{r echo=F}
set.seed(42)
pts = data.frame(x = rnorm(100000, mean=-1e-2), y = rnorm(100000, mean=1e-2))
test = t.test(pts$x, pts$y)
```

## Power analysis, large sample sizes

100k samples each from two normal distributions:

Quantity | Estimate
-|-
$\overline x$ | $`r mean(pts$x)`$
$\overline y$ | $`r mean(pts$y)`$
$s_X$ | $`r sd(pts$x)`$
$s_Y$ | $`r sd(pts$y)`$
t-test $p$ | $`r test$p.value`$

## Power analysis, large sample sizes

100k samples each from two normal distributions:

```{r echo=F, error=F, warning=F, message=F}
pts %>%
  gf_freqpoly(~x, color=~"x", binwidth=0.01) %>%
  gf_freqpoly(~y, color=~"y", binwidth=0.01)
```

## Power analysis, large sample sizes

100k samples each from two normal distributions:

```{r echo=F, error=F, warning=F, message=F}
pts %>%
  mutate(x=sort(x), y=sort(y)) %>%
  gf_point(y~x) %>%
  gf_labs(title="QQ-plot") +
  coord_equal()
```

## Power analysis, large sample sizes

100k samples each from two normal distributions:

We consider a difference to be practically significant only if it is greater than $0.1$.

```{r echo=T, error=F, warning=F, message=F}
power.t.test(n=100000, delta=0.1, sd=1, power=0.8, sig.level=NULL)
```

At $p = `r test$p.value`$ we are not able to reject the null.

## Drill Down

With large-scale data, even **VERY** small subpopulations can be studied.

Example (ongoing research): study only taxi rides in NYC that start and end at the same position. (A sketch follows on the next slide.)
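## Drill Down

A hypothetical sketch of such a drill-down with `dplyr`; the `taxi` data frame and its column names are made up for illustration, not taken from the actual research data:

```{r drilldown-sketch, echo=TRUE, eval=FALSE}
# Not run: assumes a hypothetical `taxi` table with pickup/dropoff coordinates.
# Keep only rides that start and end at (almost) the same position.
round.trips <- taxi %>%
  filter(abs(pickup_longitude - dropoff_longitude) < 1e-4,
         abs(pickup_latitude  - dropoff_latitude)  < 1e-4)
```

Even a filter this restrictive can leave enough rides to analyze when the full table is large enough.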
## An Overview of Statistical Schools of Thought

Basic division: Frequentism vs. Bayesianism

  | Frequentist | Bayesian
-|-|-
Probability is... | ...asymptotic proportion of successes in repeated trials | ...measure of synthesized belief
Parametric inference is... | ...estimating value of parameters from data | ...updating probability distributions on parameters from data

## Frequentist Concepts

Core interest: estimating some value $\theta$ related to some real probability distribution on $X$ from some estimator $\hat\theta = t(X)$.

* Bias: $\mathbb{E}[\hat\theta]-\theta$
* Variance: $\mathbb{E}[(\hat\theta - \mathbb{E}\hat\theta)^2]$

Bias-variance trade-off:
\[ \text{MSE} = \text{Bias}^2 + \text{Variance} \]

We use **standard error** to refer to $\sqrt{\mathbb{V}(\hat\theta)}$ - the standard deviation of an estimator.

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Plug-in estimators

Given a formula relating a quantity to parameters, plug in an estimator directly.

Example: the sample mean $\overline{X} = \sum X_i/n$ has standard error
\[ \text{se}(\overline{X}) = \sqrt{\mathbb{V}(X)/n} \]

We can estimate $\mathbb{V}(X)$ using the sample variance $\hat{\mathbb{V}}(X) = \sum (x_i-\overline{x})^2/(n-1)$. This yields an estimated standard error
\[ \widehat{\text{se}}(\overline{X}) = \sqrt{\sum(x_i-\overline x)^2/(n(n-1))} \]

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Taylor expansions

More complicated statistics can be related back to simpler ones using linear approximations. For a function $s(\hat\theta)$ we can Taylor expand around $\theta=\mathbb{E}\hat\theta$:
\[ s(\hat\theta) - s(\theta) \approx s'(\theta)(\hat\theta-\theta) \]

So $\mathbb{V}[s(\hat\theta)] = \mathbb{E}[(s(\hat\theta)-s(\theta))^2] \approx |s'(\theta)|^2\mathbb{V}[\hat\theta]$.

Example: take $\hat\theta = \overline{x}^2$. Then $d\hat\theta/d\overline{x} = 2\overline{x}$. Plugging this into the Taylor expansion, we get
\[ \text{se}(\overline{x}^2) \approx 2|\overline{x}|\widehat{\text{se}}(\overline{x}) \]

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Maximum Likelihood

We define the **likelihood function** as a function on parameter values:
\[ \mathcal{L}(\theta | x) = \mathbb{P}(x | \theta) \]

**Neyman-Pearson's Lemma**: when constructing a statistical testing rule to pick between two distributions $f_0$ and $f_1$, the smallest errors are achieved by
\[ t_c(x) = \begin{cases} 1 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) \geq c$} \\ 0 & \text{if $\log(\mathcal{L}_1/\mathcal L_0) < c$} \end{cases} \]
for $c$ chosen to achieve the desired significance level.

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Maximum Likelihood

We define the **likelihood function** as a function on parameter values:
\[ \mathcal{L}(\theta | x) = \mathbb{P}(x | \theta) \]

The **maximum likelihood estimator** tends to be *unbiased* and of *least possible variance* -- and even when it is not, it tends to work very well.
\[ \hat\theta_{MLE} = \operatorname{arg\,max}_{\hat\theta} \mathcal{L}(\hat\theta | x) \]

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

Frequentism wants us to focus on repeated experiments.

...so let's repeat some experiments.

Use the dataset $x$ as a probability distribution itself; sample $x^{(1)}, \dots, x^{(B)}$ repeatedly from $x$ (with replacement). This approximates the true distribution, and we can use the observed distributions of $t(x^{(k)})$ to study the statistical behavior of $t(x)$.
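## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

A minimal sketch of the resampling loop in base R, on simulated stand-in data (the next slides use the `mosaic` package's `do()` for the same idea):

```{r boot-sketch, echo=TRUE}
set.seed(1)
x <- rnorm(30)                          # the observed dataset x

t.boot <- replicate(1000, {             # B = 1000 bootstrap samples
  x.star <- sample(x, replace = TRUE)   # x^(k): resample x with replacement
  median(x.star)                        # the statistic t(x^(k))
})

sd(t.boot)                              # bootstrap estimate of se(t(x))
```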
## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

(Small data) example: `mpg` dataset

```{r}
mpg %>% kable
```

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

(Small data) example: `mpg` dataset, `cty` variable

Mean: $`r mean(mpg$cty)`$, standard deviation: $`r sd(mpg$cty)`$

```{r fig.height=2}
mpg %>% gf_histogram(~cty)
```

## Frequentist Concepts

To derive quantities describing an estimator we can use:

### Bootstrap and Simulations

(Small data) example: `mpg` dataset, `cty` variable

Mean: $`r mean(mpg$cty)`$, standard error: $`r sd(mpg$cty)/sqrt(nrow(mpg))`$

```{r fig.height=4}
mean.boot = do(1000) * mean((mpg %>% sample_frac(replace=TRUE))$cty)
mean.boot %>%
  gf_qq(~mean, distribution=qnorm,
        dparams=list(mean=mean(mpg$cty), sd=sd(mpg$cty)/sqrt(nrow(mpg)))) %>%
  gf_labs(title="QQ-plot, bootstrapped mean distribution",
          subtitle="vs. theoretical distribution",
          x="theoretical", y="bootstrap") %>%
  gf_abline(slope=1, intercept=0, color="blue") +
  coord_equal()
```

## Bayesian Inference

The fundamental building block is Bayes' theorem. Let

1. $f(x|\mu)$ be a family of probability densities
2. $g(\mu)$ be some probability distribution on possible parameters for $f(x|\mu)$
3. $f(x) = \int_\Omega f(x|\mu)g(\mu)d\mu$ be the marginal density of $x$ - the result of averaging over all possible values of $\mu$

Then
\[ g(\mu | x) = \frac{f(x | \mu)g(\mu)}{f(x)} \]

If $g$ measures our belief in possible distributions for $\mu$, then Bayes' rule provides a systematic update rule: how does new information *change* that belief?

## Bayesian Inference

By changing our notation, Bayes' rule can be rewritten using likelihoods as:
\[ g(\mu | x) = c_x\mathcal L(\mu|x)g(\mu) \]
where $c_x$ is a constant ensuring $\int_\Omega g(\mu|x)d\mu = 1$.

### Likelihood ratio

When deciding between two specific points,
\[ \frac{g(\mu_1|x)}{g(\mu_2|x)} = \frac{g(\mu_1)}{g(\mu_2)}\cdot\frac{\mathcal L(\mu_1|x)}{\mathcal L(\mu_2|x)} \]

*The posterior odds ratio is the prior odds ratio times the likelihood ratio.*

## Bayesian Inference

In Bayesian inference, instead of a single value $\hat\theta$ for a parameter, an entire probability distribution is estimated and updated.

Everything starts with the *prior*: the distribution that gets updated. This prior should preferably encode everything we know about the situation going in.

Even with a badly chosen prior, sufficiently consistent results will often quickly adjust the distribution.

## Bayesian Inference - Skew prior

Flip a coin to check if it is fair. Start with the prior belief $\mathbb{P}(H)\sim\text{Beta}(9,1)$ (mean $\mathbb{P}(H)=0.9$).

```{r fig.height=5}
flips = runif(1000) > 0.5
p.h = list(c(9, 1),
           c(9+sum(flips[1:250]), 250-sum(flips[1:250])+1),
           c(9+sum(flips[1:500]), 500-sum(flips[1:500])+1),
           c(9+sum(flips[1:750]), 750-sum(flips[1:750])+1),
           c(9+sum(flips), 1000-sum(flips)+1))
gf_dist("beta", shape1=p.h[[1]][1], shape2=p.h[[1]][2], color=~"prior") %>%
  gf_dist("beta", shape1=p.h[[2]][1], shape2=p.h[[2]][2], color=~" 250 flips") %>%
  gf_dist("beta", shape1=p.h[[3]][1], shape2=p.h[[3]][2], color=~" 500 flips") %>%
  gf_dist("beta", shape1=p.h[[4]][1], shape2=p.h[[4]][2], color=~" 750 flips") %>%
  gf_dist("beta", shape1=p.h[[5]][1], shape2=p.h[[5]][2], color=~"1000 flips") %>%
  gf_labs(x="P(H)")
```
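## Bayesian Inference - Skew prior

The update behind the plot is one line of arithmetic: a $\text{Beta}(a,b)$ prior with $h$ heads out of $n$ flips gives a $\text{Beta}(a+h,\,b+n-h)$ posterior. A sketch, reusing the simulated `flips` from the previous chunk:

```{r beta-update-sketch, echo=TRUE}
a <- 9; b <- 1                        # prior Beta(9, 1), as on the previous slide
h <- sum(flips); n <- length(flips)   # heads and total flips in the simulation

c(shape1 = a + h, shape2 = b + (n - h))  # posterior parameters
(a + h) / (a + b + n)                    # posterior mean of P(H)
```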
## Uninformative Priors

If not enough information is present at the start, one way out is to pick a prior designed to not encode assumptions.

* Uniform prior $g^U(\theta) = c$.
* Triangular prior $g^\Delta(\theta) = 1-|\theta|$.
* Jeffreys prior
\[ g^{\text{Jeff}}(\theta) = \sqrt{\mathbb{E}\left[\left(\frac{\partial}{\partial\theta}\log\mathcal L(\theta|x)\right)^2\right]} \approx \frac{1}{\sigma_{\text{MLE}}} \]

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated voltmeter with normally distributed error, sd = 1.

```{r}
meter.data = rnorm(12, 92, 1)
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated voltmeter with normally distributed error, sd = 1.

```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100. Is the estimate unbiased?

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated voltmeter with normally distributed error, sd = 1.

```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100. Is the estimate unbiased?

Frequentist answer: **NO** - because the probability family has changed.

## Frequentist vs Bayesian

### The meter reader

An engineer makes 12 measurements, using a calibrated voltmeter with normally distributed error, sd = 1.

```{r}
meter.data %>% matrix(nrow=2) %>% kable(digits=2)
```

She calculates $\overline{x} = `r mean(meter.data) %>% round(2)`$, an unbiased estimate of the true voltage.

The next day, she discovers the voltmeter truncates measurements at 100 - anything larger is reported as 100. Is the estimate unbiased?

Bayesian answer: **YES** - because the update rule only depends on the actual data points.

## Pervasive Trade-offs

### Bias vs Variance

Total error = Bias$^2$ + Variance + Irreducible error

### Memory vs Processing

Speed can be increased by using more memory. Memory footprint can be decreased by using more time.

### Underfitting vs overfitting

More complex models adapt more closely to training data. More complex models may behave badly out-of-sample. (Sketch on the next slide.)
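## Pervasive Trade-offs

### Underfitting vs overfitting, illustrated

A small simulated sketch (illustrative data only, not a course dataset): raising the polynomial degree keeps lowering the training error, while the out-of-sample error eventually grows again.

```{r overfit-sketch, echo=TRUE}
set.seed(1)
train <- data.frame(x = runif(50, -2, 2))
train$y <- sin(train$x) + rnorm(50, sd = 0.3)
test <- data.frame(x = runif(50, -2, 2))
test$y <- sin(test$x) + rnorm(50, sd = 0.3)

# Root mean squared error of a fitted model on a data frame
rmse <- function(model, d) sqrt(mean((d$y - predict(model, d))^2))

# Fit polynomials of increasing degree; compare in- and out-of-sample error
sapply(c(1, 3, 10, 20), function(deg) {
  model <- lm(y ~ poly(x, deg), data = train)
  c(degree = deg, train = rmse(model, train), test = rmse(model, test))
})
```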