---
title: 'Big Data Analytics
  Lecture 4'
author: "Mikael Vejdemo-Johansson"
date: "21 February 2019"
output:
  revealjs::revealjs_presentation:
    transition: none
    slideNumber: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
library(tidyverse)
library(GGally)
```

## Small/medium data

This is most likely the majority of tasks you will meet.

* `Python` (with `PyLab`, `scikit-learn`, `seaborn`, `pandas`)
* `R` (with `tidyverse`, `caret`)
* `SPSS` (especially for social sciences / psychology / etc)
* `MySQL` / `MSSQL` / `Postgres`

## As Volume grows

* Read and process in chunks
* Parallelize
    + Dask
    + Spark
* Column store
    + Amazon Redshift
    + Apache Cassandra
    + Apache Parquet
* At very large scales: everything looks like SQL.
    + Hive
    + BigQuery

## Chunking Data

Supported out of the box by many methods in both the `Python` stack and the `R` stack.

1. Read data into memory in manageable chunks.
2. Process each chunk separately.
3. Aggregate results - or write transformed data out again.

## Dask

Builds on the `Pandas` / `PyLab` stack, adding transparent cluster deployment and distributed data frames.

> If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB.
>
> [Dask documentation]

## Hadoop

Apache Hadoop is a general-purpose distributed storage and computation platform, based on MapReduce.

Hadoop provides

* Hardware failure resistance by design
* A distributed file system
* A MapReduce implementation
* A job scheduler and cluster resource manager
* A Java-based platform

Many projects extend Hadoop's functionality.

## Spark

Apache Spark is a Hadoop-based platform for distributed data flow and computation.

Spark includes Spark SQL, which provides data set functionality much like Dask extends the `pandas` / `numpy` data models.

Available on top of Spark are:

* Spark Streaming - high-speed streaming analytics
* MLlib - Machine Learning with serializable models (difficult in Python) and intrinsic parallelization
* GraphX - Graph processing

Spark can be used from Java, Scala, Python, and R.

## TensorFlow

Google's TensorFlow is a tensor computation platform with a compute graph abstraction, effective parallelization, and GPU delegation.

The compute graph abstraction makes it easy to build custom backends for more clever optimization strategies.

In its latest versions, most TensorFlow programming is very similar to classic Python programming.

Recently absorbed into TensorFlow, the `Keras` project makes construction of neural network models very smooth and easy.

## Column Stores

Classical data storage paradigms store row by row:

ID | Last | First | Salary
-|-|-|-
1 | Smith | Joe | 40000
2 | Jones | Mary | 50000
3 | Johnson | Cathy | 44000
4 | Smith | Marsha | 55000
... | ... | ... | ...

Finding all information relevant to a single record is easy and localized in memory.

Scanning a single column, or combining columns, requires a scan of the full database, jumping through memory in strides and likely triggering expensive page lookups or file seeks.

## Column Stores

An alternative paradigm is to store the data column by column instead:

1 | 2 | 3 | 4 | ...
-|-|-|-|-
Smith | Jones | Johnson | Smith | ...
Joe | Mary | Cathy | Marsha | ...
40000 | 50000 | 44000 | 55000 | ...

* Column-by-column selection is easy
* Seeking through a column is memory-local
* Data compresses well: through sparsity, through value-index-list storage, through compression algorithms

Implementation for big data applications: Apache Parquet.
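## Column Stores

As a small-scale illustration (not from the original lecture), here is a minimal sketch of column-oriented storage with `pandas` and Apache Parquet. The file name `employees.parquet` and the `pyarrow` dependency are assumptions.

```
import pandas

df = pandas.DataFrame({
    "Last":   ["Smith", "Jones", "Johnson", "Smith"],
    "First":  ["Joe", "Mary", "Cathy", "Marsha"],
    "Salary": [40000, 50000, 44000, 55000],
})

# Write column-by-column to a Parquet file (requires pyarrow or fastparquet)
df.to_parquet("employees.parquet")

# Read back only the Salary column: the reader fetches that column's
# chunks without scanning the Last/First data at all
salaries = pandas.read_parquet("employees.parquet", columns=["Salary"])
print(salaries.mean())
```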
## Hive

Apache Hive is a large-scale SQL-like data warehouse built on Hadoop.

Queries are converted to MapReduce or Spark jobs, with parallelization strategies built in.

## BigQuery

BigQuery is the Google Cloud platform for massive dataset storage, querying and processing.

Accessible through a web UI, a command line tool, or a REST API. Client libraries in Java, .NET and Python.

Built-in support for GIS and for Machine Learning.

## As Velocity grows

* Real-time processing.
* Stream Architectures
    - Apache Storm
    - Apache Hadoop

## As Variety grows

* Data lakes
    - Apache Hadoop
    - Azure Data Lake
    - Amazon S3
* Extract Transform Load (ETL)
    - don't process data before storing it: transform it when accessing it

## ETL

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir=LR
  DS [label="Data Source", shape="rect"]
  DL [label="Data Lake", shape="rect", penwidth=4.0]
  Proc [label="Processing", shape="rect"]
  DS -> DL [label="Store"]
  DL -> Proc [label="Transform"]
}
', height=200)
```

Data inside the Data Lake remains unchanged from the data source. Transformations are executed during transport out to a processing unit.

## Paradigms

## Backend Optimizable Paradigms

Two computational paradigms have emerged for computation on large data sets:

### Tensor Computation

Examples: TensorFlow, Theano

Everything is a **Tensor**: a generalized matrix.

### Map Reduce

All computations are split into a **Map** phase and a **Reduce** phase:

* **Map**: each computational unit applies some transformation.
* **Reduce**: the transformed data units are combined.

## Compute Graphs

TensorFlow uses **compute graphs** as its fundamental optimizing paradigm:

Each code block gets translated into a data flow graph. The graph is compiled and optimized, and data then flows through the optimized graph.

By analyzing the graph, many optimization steps can be applied automatically.
## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x"]; y [label="y"]; z [label="z"]
  x2 [label="x²"]; z2 [label="z²"]
  x2y [label="x²y"]; x2z [label="x²z"]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]; sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²"]; z2 [label="z²"]
  x2y [label="x²y"]; x2z [label="x²z"]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect, style=filled, fillcolor="#beaed4"]
  sqrz [label="*²", shape=rect, style=filled, fillcolor="#fdc086"]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]; sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²", penwidth=4]; z2 [label="z²", penwidth=4]
  x2y [label="x²y"]; x2z [label="x²z"]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect, style=filled, fillcolor="#beaed4"]
  mulx2z [label="*", shape=rect, style=filled, fillcolor="#fdc086"]
  sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²", penwidth=4]; z2 [label="z²", penwidth=4]
  x2y [label="x²y", penwidth=4]; x2z [label="x²z", penwidth=4]; f [label="f(x,y,z)"]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]
  sum [label="+", shape=rect, style=filled, fillcolor="#beaed4"]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```

## Compute Graphs

$$ f(x,y,z) = x^2y + x^2z + z^2 $$

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  x [label="x", penwidth=4]; y [label="y", penwidth=4]; z [label="z", penwidth=4]
  x2 [label="x²", penwidth=4]; z2 [label="z²", penwidth=4]
  x2y [label="x²y", penwidth=4]; x2z [label="x²z", penwidth=4]; f [label="f(x,y,z)", penwidth=4]
  sqrx [label="*²", shape=rect]; sqrz [label="*²", shape=rect]
  mulx2y [label="*", shape=rect]; mulx2z [label="*", shape=rect]; sum [label="+", shape=rect]
  x -> sqrx; sqrx -> x2
  z -> sqrz; sqrz -> z2
  x2 -> mulx2y; y -> mulx2y; mulx2y -> x2y
  x2 -> mulx2z; z -> mulx2z; mulx2z -> x2z
  x2y -> sum; x2z -> sum; z2 -> sum; sum -> f
}
', height=200)
```
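## Compute Graphs

To connect the diagrams to code: a minimal sketch (in TensorFlow 2 style, not from the original lecture) of how the same $f(x,y,z)$ can be traced into a compute graph with `tf.function`, and then analyzed automatically, here for gradients.

```
import tensorflow as tf

@tf.function              # traces the Python body into a reusable compute graph
def f(x, y, z):
    x2 = tf.square(x)     # the shared subexpression x² is computed once
    return x2 * y + x2 * z + tf.square(z)

# Running the traced graph: f(2,3,4) = 4*3 + 4*4 + 16 = 44
print(f(tf.constant(2.0), tf.constant(3.0), tf.constant(4.0)))

# Graph analysis gives automatic differentiation: df/dx = 2xy + 2xz = 28
x = tf.constant(2.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    value = f(x, tf.constant(3.0), tf.constant(4.0))
print(tape.gradient(value, x))
```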
## Map Reduce

```{r}
DiagrammeR::grViz('
digraph rmarkdown {
  rankdir = LR
  node [shape="box"]
  hdfs [label="Input"]; out [label="Output"]
  mapper0 [label="Mapper 0"]; mapper1 [label="Mapper 1"]; mapper2 [label="Mapper 2"]
  combiner0 [label="Combiner 0"]; combiner1 [label="Combiner 1"]; combiner2 [label="Combiner 2"]
  reducer0 [label="Reducer 0"]; reducer1 [label="Reducer 1"]; reducer2 [label="Reducer 2"]
  hdfs -> mapper0; hdfs -> mapper1; hdfs -> mapper2
  mapper0 -> combiner0; mapper1 -> combiner1; mapper2 -> combiner2
  combiner0 -> reducer0; combiner0 -> reducer1; combiner0 -> reducer2
  combiner1 -> reducer0 [style=invis]; combiner1 -> reducer1 [style=invis]; combiner1 -> reducer2 [style=invis]
  combiner2 -> reducer0 [style=invis]; combiner2 -> reducer1 [style=invis]; combiner2 -> reducer2 [style=invis]
  reducer0 -> out; reducer1 -> out; reducer2 -> out
}
', height=200)
```

* Each Mapper has a simple processing task. Ex: tokenize a string, emit pairs `(word, 1)`
* Each Combiner collates its Mapper's output. Ex: sum up pairs with the same word
* Each Reducer collates a subset of the Combiner outputs. Ex: sum up pairs with the same word

Using, for instance, hash functions on the keys, we can ensure load balance and non-overlapping key assignment in the reduction step. The combiner serves to decrease network communication load. (A single-machine sketch of this word count appears among the code samples below.)

## Map Reduce

Hadoop: Java-based platform, with **many** extensions.

Some major companies are moving away from this paradigm.

## Data Driven Architecture

For sufficiently large computational needs, geography becomes important.

* Volume: moving large amounts of data is difficult - put processing units near or in data storage to avoid moving data
* Velocity: moving data introduces latency - put real-time processing units near the data stream, diverting data into the analytics
* Variety: transforming data locks down transformation choices - put data munging at the border between data store and processing

## Data Driven Architecture

On the scale of Google or Amazon data centers, sheer size drives VERY many design decisions.

* Light speed vs. flash memory access speed: dictates where to place storage vs. CPUs
* Component reliability: even very high component reliability produces constant hardware failures at large enough scales. Code must be built to be failure resistant.
* Build power plants next to data centers / data centers next to power plants
* Hamina, Finland: Google's data center pumps sea water through hollow walls for cooling

## Code samples: `pylab` / `pandas` / `scikit-learn`

```
%pylab inline   # to make Jupyter notebooks pretty
import pandas
from sklearn import linear_model

data = pandas.read_csv("datafile.csv")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
```

## Code samples: `dask`

```
%pylab inline   # to make Jupyter notebooks pretty
import dask.dataframe as dd
from dask_ml import linear_model

data = dd.read_parquet("datafile.parquet")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
```

## Code samples: `pyspark`

```
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("logreg").getOrCreate()

data = spark.read.csv("datafile.csv")
logreg = LogisticRegression()
model = logreg.fit(data)      # fit() returns the fitted model
model.transform(newdata)
```

Some data munging may be needed for `data` to be arranged in a format that fits `LogisticRegression`.
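## Code samples: MapReduce word count (sketch)

Not an actual Hadoop job: a single-process Python sketch of the Map/Combine/Reduce word count described on the Map Reduce slide. In a real deployment each phase runs on separate nodes, with the framework handling the shuffle between combiners and reducers.

```
from collections import Counter
from functools import reduce

lines = ["to be or not to be", "to do is to be"]

# Map: tokenize each line, emit (word, 1) pairs
mapped = [[(word, 1) for word in line.split()] for line in lines]

# Combine: collate each mapper's output locally (sum counts per word)
combined = [Counter(word for word, _ in pairs) for pairs in mapped]

# Reduce: merge the combiner outputs into global counts per word
counts = reduce(lambda a, b: a + b, combined, Counter())

print(counts.most_common(3))
```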
## Code samples: `tensorflow` (old style)

```
import pandas
import tensorflow as tf

data = pandas.read_csv("datafile.csv")

# one input column per feature (all columns except the target)
X = tf.placeholder(tf.float32, [None, data.shape[1]-1])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([data.shape[1]-1, 1]))
b = tf.Variable(tf.zeros([1]))

# binary logistic regression: sigmoid output, binary cross-entropy loss
pred = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_mean(-y*tf.log(pred) - (1-y)*tf.log(1-pred))
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for epoch in range(N):
        for i in range(batches):
            _, c = sess.run([optimizer, cost],
                            feed_dict={X: batch_x, y: batch_y})
    y_pred = sess.run(pred, feed_dict={X: newdata})
```

## Code samples: `tensorflow` (Keras)

```
import pandas
from tensorflow.keras import models, layers

data = pandas.read_csv("datafile.csv")

inputs = layers.Input(shape=(data.shape[1]-1,))
outputs = layers.Dense(1, activation="sigmoid")(inputs)
model = models.Model(inputs=inputs, outputs=outputs)

model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
```

## Code samples: `BigQuery`

```
CREATE MODEL `mydataset.mymodel`
OPTIONS (
  model_type="logistic_reg",
  input_label_cols=["target"]
) AS
SELECT * FROM `mydataset.mytable`
```
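## Code samples: `BigQuery` (prediction)

A sketch (not from the original lecture) of how the model created above might be used from Python via the `google-cloud-bigquery` client library and an `ML.PREDICT` query. The table name `mydataset.newdata` is hypothetical, and the snippet assumes credentials are already configured.

```
from google.cloud import bigquery

client = bigquery.Client()   # uses application default credentials

query = """
SELECT *
FROM ML.PREDICT(MODEL `mydataset.mymodel`,
                TABLE `mydataset.newdata`)
"""

# Run the query and print the predicted rows
for row in client.query(query).result():
    print(row)
```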