Big Data Analytics
Lecture 4

Mikael Vejdemo-Johansson

21 February 2019

Small/medium data

This is most likely the kind of task you will meet most often.

  • Python (with PyLab, scikit-learn, seaborn, pandas)
  • R (with tidyverse, caret)
  • SPSS (especially for social sciences / psychology / etc)
  • MySQL / MSSQL / Postgres

As Volume grows

  • Read and process in chunks
  • Parallelize
    • Dask
    • Spark
  • Column store
    • Amazon Redshift
    • Apache Cassandra
    • Apache Parquet
  • At very large scales: everything looks like SQL.
    • Hive
    • BigQuery

Chunking Data

Supported out of the box by many methods in both the Python stack and the R stack.

  1. Read data into memory in manageable chunks.
  2. Process each chunk separately.
  3. Aggregate results - or write transformed data out again.
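
A minimal sketch of this pattern in pandas (the file and column names are placeholders):

import pandas
# 1. Read the data in manageable chunks instead of all at once.
chunk_sums = []
for chunk in pandas.read_csv("datafile.csv", chunksize=100000):
    # 2. Process each chunk separately.
    chunk_sums.append(chunk.groupby("target")["value"].sum())
# 3. Aggregate the per-chunk results.
result = pandas.concat(chunk_sums).groupby(level=0).sum()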

Dask

Builds on the Pandas / PyLab stack, adds transparent cluster deployment and distributed data frames.

If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB. [Dask documentation]
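
A minimal sketch of the same pattern with Dask (file and column names are placeholders); the dataframe looks and behaves like a pandas dataframe, but is split into partitions and computed in parallel:

import dask.dataframe as dd
# Lazily read the CSV as a collection of pandas partitions.
data = dd.read_csv("datafile.csv")
# The pandas-style call builds a task graph; compute() runs it on the cluster or local cores.
result = data.groupby("target")["value"].sum().compute()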

Hadoop

Apache Hadoop is a general-purpose distributed storage and computation platform, based on MapReduce. Hadoop provides

  • Hardware failure resistance by design
  • Distributed file system
  • MapReduce implementation
  • Job scheduler and cluster resource manager
  • Java-based platform

Many projects extend Hadoop’s functionality.

Spark

Apache Spark is a Hadoop-based platform for distributed data flow and computation. Spark includes Spark SQL, which provides data set functionality in much the same way that Dask extends the pandas / numpy data models. On top of Spark, the following are available:

  • Spark Streaming - high-speed streaming analytics
  • MLlib - Machine Learning with serializable models (difficult in Python) and intrinsic parallelization
  • GraphX - Graph processing

Spark can be used from Java, Scala, Python, R.

Tensorflow

Google’s Tensorflow is a tensor computation platform built around a compute graph abstraction, with effective parallelization and GPU delegation. The compute graph abstraction makes it easy to build custom backends for more clever optimization strategies.

In its latest versions, most Tensorflow programming is very similar to classic Python programming.

Recently absorbed by Tensorflow, the Keras project makes construction of neural network models very smooth and easy.

Column Stores

Classical data storage paradigms store row by row:

ID  Last     First   Salary
1   Smith    Joe     40000
2   Jones    Mary    50000
3   Johnson  Cathy   44000
4   Smith    Marsha  55000
…   …        …       …

Finding all information relevant for a single record is easy and localized in memory.

Scanning a single column, or combining columns, requires a scan of the full database, jumping through memory in strides and likely triggering expensive page lookups or file seeks.

Column Stores

An alternative paradigm is to store the data column by column instead:

ID:     1      2      3        4       …
Last:   Smith  Jones  Johnson  Smith   …
First:  Joe    Mary   Cathy    Marsha  …
Salary: 40000  50000  44000    55000   …
  • Column-by-column selection is easy
  • Seeking through a column is memory-local
  • Data compresses well: through sparsity, through value/index-list storage, and through compression algorithms

Implementation for big data applications: Apache Parquet.
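
A minimal sketch of columnar access from pandas, using Parquet as the on-disk format (assumes the pyarrow engine is installed; file name is a placeholder):

import pandas
df = pandas.DataFrame({
    "ID": [1, 2, 3, 4],
    "Last": ["Smith", "Jones", "Johnson", "Smith"],
    "First": ["Joe", "Mary", "Cathy", "Marsha"],
    "Salary": [40000, 50000, 44000, 55000],
})
# Write the dataframe column by column.
df.to_parquet("salaries.parquet")
# Read back only the Salary column: the other columns are never read from disk.
salaries = pandas.read_parquet("salaries.parquet", columns=["Salary"])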

Hive

Apache Hive is a large-scale SQL-like data warehouse built on Hadoop. Queries are converted to MapReduce or Spark jobs, with parallelization strategies built in.

BigQuery

BigQuery is the Google Cloud platform for massive dataset storage, querying, and processing. It is accessible through a web UI, a command line tool, or a REST API, with client libraries for Java, .NET, and Python.

Built-in support for GIS and for Machine Learning.

As Velocity grows

  • Real-time processing.
  • Stream Architectures
    • Apache Storm
    • Apache Hadoop

As Variety grows

  • Data lakes
    • Apache Hadoop
    • Azure Data Lake
    • Amazon S3
  • Extract Transform Load (ETL)
    • don’t process data before storing: transform it when accessing

ETL

Data inside the Data Lake remains unchanged from the data source. Transformations are executed during transport out to a processing unit.
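
A minimal sketch of the idea (the lake path and field names are hypothetical): the raw dump stays untouched in the lake, and each consumer applies its own transformation on the way out.

import pandas
# The raw dump in the data lake is stored exactly as it arrived from the source.
raw = pandas.read_json("lake/events/2019-02-21.json", lines=True)
# Transformations happen at access time, chosen by this particular consumer.
events = raw.rename(columns=str.lower)
events["timestamp"] = pandas.to_datetime(events["timestamp"])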

Paradigms

Backend Optimizable Paradigms

Two computational paradigms have emerged for computation on large data sets:

Tensor Computation

Examples: Tensorflow, Theano

Everything is a Tensor: a generalized matrix.
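
For illustration, in numpy terms a tensor is simply an n-dimensional array:

import numpy as np
scalar = np.float32(3.0)              # rank 0 tensor
vector = np.zeros(5)                  # rank 1 tensor
matrix = np.zeros((5, 5))             # rank 2 tensor
images = np.zeros((32, 28, 28, 3))    # rank 4 tensor: a batch of RGB images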

Map Reduce

All computations are split into a Map phase and a Reduce phase:

  • Map: each computational unit applies some transformation.
  • Reduce: the transformed data units are combined.

Compute Graphs

Tensorflow uses compute graphs as its fundamental optimizing paradigm:

Each code block gets translated into a data flow graph. This graph is compiled and optimized. Data flows through the optimized graph.

By analyzing the graph, many optimization steps can be applied automatically.

Compute Graphs

\[ f(x,y,z) = x^2y + x^2z + z^2 \]

[Figure: the compute graph for f, built up node by node over a sequence of slides.]
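
A minimal sketch in old-style Tensorflow of building and running the compute graph for f(x,y,z) = x^2 y + x^2 z + z^2 above; note that the shared subexpression x^2 becomes a single node:

import tensorflow as tf
# Build the graph: nothing is computed yet.
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
z = tf.placeholder(tf.float32)
x2 = x * x                        # x^2 is one node, shared by two terms
f = x2 * y + x2 * z + z * z
# Data flows through the (optimized) graph only when the session runs it.
with tf.Session() as sess:
    print(sess.run(f, feed_dict={x: 2.0, y: 3.0, z: 1.0}))   # 4*3 + 4*1 + 1 = 17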

Map Reduce

  • Each Mapper has a simple processing task.
    Ex: tokenize a string, emit pairs (word, 1)
  • Each Combiner collates its Mapper output.
    Ex: sum up pairs with same word
  • Each Reducer collates a subset of Combiner outputs.
    Ex: sum up pairs with same word

Using, for instance, hash functions on the keys, we can ensure load balancing and non-overlapping assignments for the reduction step. The combiner serves to decrease network communication load.
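
A minimal pure-Python sketch of this word count example - no cluster, just the map / combine / reduce structure:

from collections import Counter
documents = ["to be or not to be", "to do is to be"]
# Map: tokenize each string, emit (word, 1) pairs.
mapped = [[(word, 1) for word in doc.split()] for doc in documents]
# Combine: each mapper sums its own pairs locally, cutting network traffic.
combined = [Counter(word for word, _ in pairs) for pairs in mapped]
# Reduce: in a cluster, a hash of the word routes all pairs for one word to the same reducer;
# here a single summation stands in for that step.
counts = sum(combined, Counter())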

Map Reduce

Hadoop: Java-based platform, with many extensions.

Some major companies are moving away from this paradigm.

Data Driven Architecture

For sufficiently large computational needs, geography becomes important.

  • Volume: moving large amounts of data is difficult - put processing units near or in data storage to avoid moving data
  • Velocity: moving data introduces latency - put real-time processing units near the data stream, diverting data into the analytics
  • Variety: transforming data locks down transformation choices - put data munging at the border between data store and processing

Data Driven Architecture

On the scale of Google or Amazon data centers, scale drives VERY many design decisions.

  • Light speed vs. flash memory access speed: dictates where to place storage vs. CPUs
  • Component reliability: even very high component reliability produces constant hardware failures at large enough scales. Code must be built to tolerate failure.
  • Build power plants next to data centers / data centers next to power plants
  • Hamina, Finland: Google’s data centre pumps sea water through hollow walls for cooling

Code samples: pylab / pandas / scikit-learn

%pylab inline # to make Jupyter notebooks pretty
import pandas
from sklearn import linear_model
data = pandas.read_csv("datafile.csv")                     # load the full data set into memory
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])   # features, labels
model.predict(newdata)

Code samples: dask

%pylab inline # to make Jupyter notebooks pretty
import dask.dataframe as dd
from dask_ml import linear_model
data = dd.read_parquet("datafile.parquet")                 # lazy, partitioned dataframe
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])   # features, labels
model.predict(newdata)

Code samples: pyspark

from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("logreg").getOrCreate()
data = spark.read.csv("datafile.csv", header=True, inferSchema=True)
lr = LogisticRegression()          # expects a "features" vector column and a "label" column
model = lr.fit(data)               # fit returns a fitted model
model.transform(newdata)           # adds prediction columns to newdata

Some data munging may be needed to arrange the data in a format that fits LogisticRegression.

Code samples: tensorflow (old style)

import pandas
import tensorflow as tf
data = pandas.read_csv("datafile.csv")
n_features = data.shape[1]-1                              # all columns except the target
X = tf.placeholder(tf.float32, [None, n_features])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([n_features, 1]))
b = tf.Variable(tf.zeros([1]))
pred = tf.nn.sigmoid(tf.matmul(X, W) + b)                 # binary logistic regression
cost = tf.reduce_mean(-(y*tf.log(pred) + (1-y)*tf.log(1-pred)))
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sess:
  sess.run(init)
  for epoch in range(N):
    for i in range(batches):
      _, c = sess.run([optimizer, cost], feed_dict={X: batch_x, y: batch_y})
  y_pred = sess.run(pred, feed_dict={X: newdata})

Code samples: tensorflow (Keras)

import pandas
from tensorflow.keras import models, layers
data = pandas.read_csv("datafile.csv")
inputs = layers.Input(shape=(data.shape[1]-1,))           # all columns except the target
outputs = layers.Dense(1, activation="sigmoid")(inputs)   # binary logistic regression
model = models.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)

Code samples: BigQuery

CREATE MODEL 
  `mydataset.mymodel`
OPTIONS
 ( model_type='logistic_reg',
   input_label_cols=['target'] ) AS
SELECT 
 * 
FROM
 `mydataset.mytable`