---
title: 'Big Data Analytics
  Lecture 4'
author: "Mikael Vejdemo-Johansson"
date: "21 February 2019"
output:
  revealjs::revealjs_presentation:
    transition: none
    slideNumber: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(GGally)
```

## Small/medium data

This is most likely the majority of tasks you will meet.

* `Python` (with `PyLab`, `scikit-learn`, `seaborn`, `pandas`)
* `R` (with `tidyverse`, `caret`)
* `SPSS` (especially for social sciences / psychology / etc)
* `MySQL` / `MSSQL` / `Postgres`

## As Volume grows

* Read and process in chunks
* Parallelize
    + Dask
    + Spark
* Column store
    + Amazon Redshift
    + Apache Cassandra
    + Apache Parquet
* At very large scales: everything looks like SQL.
    + Hive
    + BigQuery

## Chunking Data

Supported out of the box by many methods in both the `Python` stack and the `R` stack.

1. Read data into memory in manageable chunks.
2. Process each chunk separately.
3. Aggregate results - or write transformed data out again.

## Dask

Builds on the `Pandas` / `PyLab` stack, adding transparent cluster deployment and distributed data frames.

> If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB.
>
> [Dask documentation]

## Hadoop

Apache Hadoop is a general-purpose distributed storage and computation platform, based on MapReduce.

Hadoop provides

* Hardware failure resistance by design
* A distributed file system
* A MapReduce implementation
* A job scheduler and cluster resource manager
* A Java-based platform

Many projects extend Hadoop's functionality.

## Spark

Apache Spark is a Hadoop-based platform for distributed data flow and computation.

Spark includes Spark SQL, which provides data set functionality much like Dask extends the `pandas` / `numpy` data models.

Available on top of Spark are:

* Spark Streaming - high-speed streaming analytics
* MLlib - Machine Learning with serializable models (difficult in Python) and intrinsic parallelization
* GraphX - Graph processing

Spark can be used from Java, Scala, Python, and R.

## TensorFlow

Google's TensorFlow is a tensor computation platform with a compute graph abstraction, effective parallelization, and GPU delegation.

The compute graph abstraction makes it easy to build custom backends for more clever optimization strategies.

In its latest versions, most TensorFlow programming is very similar to classic Python programming.

Recently absorbed into TensorFlow, the `Keras` project makes construction of neural network models very smooth and easy.

## Column Stores

Classical data storage paradigms store row by row:

ID | Last | First | Salary
-|-|-|-
1 | Smith | Joe | 40000
2 | Jones | Mary | 50000
3 | Johnson | Cathy | 44000
4 | Smith | Marsha | 55000
... | ... | ... | ...

Finding all information relevant to a single record is easy and localized in memory.

Scanning a single column, or combining columns, requires a scan of the full database, jumping through memory in strides and likely triggering expensive page lookups or file seeks.

## Column Stores

An alternative paradigm is to store the data column by column instead:

1 | 2 | 3 | 4 | ...
-|-|-|-|-
Smith | Jones | Johnson | Smith | ...
Joe | Mary | Cathy | Marsha | ...
40000 | 50000 | 44000 | 55000 | ...

* Column-by-column selection is easy
* Seeking through a column is memory-local
* Data compresses well: through sparsity, through value-index-list storage, through compression algorithms

Implementation for big data applications: Apache Parquet.
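## Column Stores

As a small-scale illustration (not from the original lecture), here is a minimal sketch of column-oriented storage with `pandas` and Apache Parquet. The file name `employees.parquet` and the `pyarrow` dependency are assumptions.

```
import pandas

df = pandas.DataFrame({
    "Last":   ["Smith", "Jones", "Johnson", "Smith"],
    "First":  ["Joe", "Mary", "Cathy", "Marsha"],
    "Salary": [40000, 50000, 44000, 55000],
})

# Write column-by-column to a Parquet file (requires pyarrow or fastparquet)
df.to_parquet("employees.parquet")

# Read back only the Salary column: the reader fetches that column's
# chunks without scanning the Last/First data at all
salaries = pandas.read_parquet("employees.parquet", columns=["Salary"])
print(salaries.mean())
```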
## Hive

Apache Hive is a large-scale SQL-like data warehouse built on Hadoop.

Queries are converted to MapReduce or Spark jobs, with parallelization strategies built in.

## BigQuery

BigQuery is the Google Cloud platform for massive dataset storage, querying and processing.

Accessible through a web UI, a command line tool, or a REST API. Client libraries in Java, .NET and Python.

Built-in support for GIS and for Machine Learning.

## As Velocity grows

* Real-time processing.
* Stream Architectures
    - Apache Storm
    - Apache Hadoop

## As Variety grows

* Data lakes
    - Apache Hadoop
    - Azure Data Lake
    - Amazon S3
* Extract Transform Load (ETL)
    - don't process data before storing it: transform it when accessing it

## ETL

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir=LR
  DS [label="Data Source", shape="rect"]
  DL [label="Data Lake", shape="rect", penwidth=4.0]
  Proc [label="Processing", shape="rect"]
  DS -> DL [label="Store"]
  DL -> Proc [label="Transform"]
}
', height=200)
```

Data inside the Data Lake remains unchanged from the data source. Transformations are executed during transport out to a processing unit.

## Paradigms

## Backend Optimizable Paradigms

Two computational paradigms have emerged for computation on large data sets:

### Tensor Computation

Examples: TensorFlow, Theano

Everything is a **Tensor**: a generalized matrix.

### Map Reduce

All computations are split into a **Map** phase and a **Reduce** phase:

* **Map**: each computational unit applies some transformation.
* **Reduce**: the transformed data units are combined.

## Compute Graphs

TensorFlow uses **compute graphs** as its fundamental optimizing paradigm:

Each code block gets translated into a data flow graph. The graph is compiled and optimized, and data then flows through the optimized graph.

By analyzing the graph, many optimization steps can be applied automatically.
## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x"]; y [label="y"]; z [label="z"]
  x2 [label="x²"]; z2 [label="z²"]
  x2y [label="x²y"]; x2z [label="x²z"]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]; sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²"]; z2 [label="z²"]
  x2y [label="x²y"]; x2z [label="x²z"]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect, style=filled, fillcolor="#beaed4"]
  sqrz [label="*²", shape=rect, style=filled, fillcolor="#fdc086"]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]; sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²", penwidth=4]; z2 [label="z²", penwidth=4]
  x2y [label="x²y"]; x2z [label="x²z"]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect, style=filled, fillcolor="#beaed4"]
  mulx2z [label="*", shape=rect, style=filled, fillcolor="#fdc086"]
  sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²", penwidth=4]; z2 [label="z²", penwidth=4]
  x2y [label="x²y", penwidth=4]; x2z [label="x²z", penwidth=4]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]
  sum [label="+", shape=rect, style=filled, fillcolor="#beaed4"]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²", penwidth=4]; z2 [label="z²", penwidth=4]
  x2y [label="x²y", penwidth=4]; x2z [label="x²z", penwidth=4]; f [label="f(x,y,z)", penwidth=4]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]; sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```
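## Compute Graphs

To connect the diagrams to code: a minimal sketch (in TensorFlow 2 style, not from the original lecture) of how the same $f(x,y,z)$ can be traced into a compute graph with `tf.function`, and then analyzed automatically, here for gradients.

```
import tensorflow as tf

@tf.function              # traces the Python body into a reusable compute graph
def f(x, y, z):
    x2 = tf.square(x)     # the shared subexpression x² is computed once
    return x2 * y + x2 * z + tf.square(z)

# Running the traced graph: f(2,3,4) = 4*3 + 4*4 + 16 = 44
print(f(tf.constant(2.0), tf.constant(3.0), tf.constant(4.0)))

# Graph analysis gives automatic differentiation: df/dx = 2xy + 2xz = 28
x = tf.constant(2.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    value = f(x, tf.constant(3.0), tf.constant(4.0))
print(tape.gradient(value, x))
```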
## Map Reduce

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  node [shape="box"]
  hdfs [label="Input"]; out [label="Output"]
  mapper0 [label="Mapper 0"]; mapper1 [label="Mapper 1"]; mapper2 [label="Mapper 2"]
  combiner0 [label="Combiner 0"]; combiner1 [label="Combiner 1"]; combiner2 [label="Combiner 2"]
  reducer0 [label="Reducer 0"]; reducer1 [label="Reducer 1"]; reducer2 [label="Reducer 2"]
  hdfs -> mapper0; hdfs -> mapper1; hdfs -> mapper2
  mapper0 -> combiner0; mapper1 -> combiner1; mapper2 -> combiner2
  combiner0 -> reducer0; combiner0 -> reducer1; combiner0 -> reducer2
  combiner1 -> reducer0 [style=invis]; combiner1 -> reducer1 [style=invis]; combiner1 -> reducer2 [style=invis]
  combiner2 -> reducer0 [style=invis]; combiner2 -> reducer1 [style=invis]; combiner2 -> reducer2 [style=invis]
  reducer0 -> out; reducer1 -> out; reducer2 -> out
}
', height=200)
```

* Each Mapper has a simple processing task. Ex: tokenize a string, emit pairs `(word, 1)`
* Each Combiner collates its Mapper's output. Ex: sum up pairs with the same word
* Each Reducer collates a subset of the Combiner outputs. Ex: sum up pairs with the same word

Using, for instance, hash functions on the keys, we can ensure load balance and non-overlapping key assignment in the reduction step. The combiner serves to decrease network communication load. (A single-machine sketch of this word count appears among the code samples below.)

## Map Reduce

Hadoop: Java-based platform, with **many** extensions.

Some major companies are moving away from this paradigm.

## Data Driven Architecture

For sufficiently large computational needs, geography becomes important.

* Volume: moving large amounts of data is difficult - put processing units near or in data storage to avoid moving data
* Velocity: moving data introduces latency - put real-time processing units near the data stream, diverting data into the analytics
* Variety: transforming data locks down transformation choices - put data munging at the border between data store and processing

## Data Driven Architecture

On the scale of Google or Amazon data centers, sheer size drives VERY many design decisions.

* Light speed vs. flash memory access speed: dictates where to place storage vs. CPUs
* Component reliability: even very high component reliability produces constant hardware failures at large enough scales. Code must be built to be failure resistant.
* Build power plants next to data centers / data centers next to power plants
* Hamina, Finland: Google's data center pumps sea water through hollow walls for cooling

## Code samples: `pylab` / `pandas` / `scikit-learn`

```
%pylab inline   # to make Jupyter notebooks pretty
import pandas
from sklearn import linear_model

data = pandas.read_csv("datafile.csv")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
```

## Code samples: `dask`

```
%pylab inline   # to make Jupyter notebooks pretty
import dask.dataframe as dd
from dask_ml import linear_model

data = dd.read_parquet("datafile.parquet")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
```

## Code samples: `pyspark`

```
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("logreg").getOrCreate()

data = spark.read.csv("datafile.csv")
logreg = LogisticRegression()
model = logreg.fit(data)      # fit() returns the fitted model
model.transform(newdata)
```

Some data munging may be needed for `data` to be arranged in a format that fits `LogisticRegression`.
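## Code samples: MapReduce word count (sketch)

Not an actual Hadoop job: a single-process Python sketch of the Map/Combine/Reduce word count described on the Map Reduce slide. In a real deployment each phase runs on separate nodes, with the framework handling the shuffle between combiners and reducers.

```
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# Map: tokenize each line, emit (word, 1) pairs
mapped = [[(word, 1) for word in line.split()] for line in lines]

# Combine: collate each mapper's output locally (sum counts per word)
combined = [Counter(word for word, _ in pairs) for pairs in mapped]

# Reduce: merge the combiner outputs into global counts per word
counts = reduce(lambda a, b: a + b, combined, Counter())

print(counts.most_common(3))
```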
## Code samples: `tensorflow` (old style)

```
import pandas
import tensorflow as tf

data = pandas.read_csv("datafile.csv")

# one input column per feature (all columns except the target)
X = tf.placeholder(tf.float32, [None, data.shape[1]-1])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([data.shape[1]-1, 1]))
b = tf.Variable(tf.zeros([1]))

# binary logistic regression: sigmoid output, binary cross-entropy loss
pred = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_mean(-y*tf.log(pred) - (1-y)*tf.log(1-pred))
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(N):
        for i in range(batches):
            _, c = sess.run([optimizer, cost],
                            feed_dict={X: batch_x, y: batch_y})
    y_pred = sess.run(pred, feed_dict={X: newdata})
```

## Code samples: `tensorflow` (Keras)

```
import pandas
from tensorflow.keras import models, layers

data = pandas.read_csv("datafile.csv")

inputs = layers.Input(shape=(data.shape[1]-1,))
outputs = layers.Dense(1, activation="sigmoid")(inputs)
model = models.Model(inputs=inputs, outputs=outputs)

model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
```

## Code samples: `BigQuery`

```
CREATE MODEL `mydataset.mymodel`
OPTIONS (
  model_type="logistic_reg",
  input_label_cols=["target"]
) AS
SELECT * FROM `mydataset.mytable`
```
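## Code samples: `BigQuery` (prediction)

A sketch (not from the original lecture) of how the model created above might be used from Python via the `google-cloud-bigquery` client library and an `ML.PREDICT` query. The table name `mydataset.newdata` is hypothetical, and the snippet assumes credentials are already configured.

```
from google.cloud import bigquery

client = bigquery.Client()   # uses application default credentials

query = """
SELECT *
FROM ML.PREDICT(MODEL `mydataset.mymodel`,
                TABLE `mydataset.newdata`)
"""

# Run the query and print the predicted rows
for row in client.query(query).result():
    print(row)
```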