This is most likely the majority of tasks you will meet.
- Python (with PyLab, scikit-learn, seaborn, pandas)
- R (with tidyverse, caret)
- SPSS (especially for social sciences / psychology / etc.)
- MySQL / MSSQL / Postgres
Supported out of the box by many methods in both the Python stack and the R stack.
Builds on the Pandas / PyLab stack, adds transparent cluster deployment and distributed data frames.
If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB. [Dask documentation]
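That advice is easy to follow from the Python stack. As a minimal sketch (the connection string, table name, and file name are illustrative; a Postgres driver such as psycopg2 must be installed), pandas can push a CSV into Postgres through SQLAlchemy and query it back with plain SQL:

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; adjust user, password, host and database to your setup.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.read_csv("datafile.csv")
# Write the frame into a Postgres table and let SQL do the heavy lifting.
df.to_sql("mytable", engine, if_exists="replace", index=False)

subset = pd.read_sql("SELECT * FROM mytable LIMIT 10", engine)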
Apache Hadoop is a general-purpose distributed storage and computation platform, based on MapReduce. Hadoop provides, among other components, the HDFS distributed file system, the YARN resource manager, and the MapReduce execution engine.
Many projects extend Hadoop’s functionality.
Apache Spark is a Hadoop-based platform for distributed data flow and computation. Spark includes Spark SQL, which provides data set functionality much like Dask extends the pandas / numpy data models. On top of Spark, libraries such as MLlib (machine learning), GraphX (graph processing), and Spark Streaming are available.
Spark can be used from Java, Scala, Python, R.
Google’s Tensorflow is a tensor computation platform with a compute graph abstraction and effective parallelization and GPU delegation. The compute graph abstraction makes it easy to build custom backends for more clever optimization strategies.
In its latest versions, most Tensorflow programming is very similar to classic Python programming.
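As a minimal sketch of what that looks like in recent (2.x) TensorFlow, a plain Python function can be traced into a compute graph with the tf.function decorator (the function and values here are purely illustrative):

import tensorflow as tf

# @tf.function traces the Python function into a compute graph that
# TensorFlow can then optimize and dispatch to CPU or GPU.
@tf.function
def f(x, y, z):
    return x**2 * y + x**2 * z + z**2

print(f(tf.constant(2.0), tf.constant(3.0), tf.constant(4.0)))  # tf.Tensor(44.0, ...)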
Recently absorbed by Tensorflow, the Keras project makes construction of neural network models very smooth and easy.
Classical data storage paradigms store row by row:
ID | Last | First | Salary |
---|---|---|---|
1 | Smith | Joe | 40000 |
2 | Jones | Mary | 50000 |
3 | Johnson | Cathy | 44000 |
4 | Smith | Marsha | 55000 |
… | … | … | … |
Finding all information relevant for a single record is easy and localized in memory.
Scanning a single column, or combining columns, requires a scan of the full database, jumping through memory in strides and likely triggering expensive page lookups or file seeks.
An alternative paradigm is to store column by column instead:
1 | 2 | 3 | 4 | … |
---|---|---|---|---|
Smith | Jones | Johnson | Smith | … |
Joe | Mary | Cathy | Marsha | … |
40000 | 50000 | 44000 | 55000 | … |
Implementation for big data applications: Apache Parquet.
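As a small sketch (the file name and columns are illustrative; pandas needs pyarrow or fastparquet installed), writing a data frame to Parquet makes single-column reads cheap:

import pandas as pd

df = pd.DataFrame({
    "last":   ["Smith", "Jones", "Johnson", "Smith"],
    "first":  ["Joe", "Mary", "Cathy", "Marsha"],
    "salary": [40000, 50000, 44000, 55000],
})
# Each column is stored contiguously in the Parquet file.
df.to_parquet("salaries.parquet")

# Reading a single column only touches that column's data on disk.
salaries = pd.read_parquet("salaries.parquet", columns=["salary"])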
Apache Hive is a large scale SQL-like data warehouse built on Hadoop. Queries are converted to MapReduce or Spark jobs, with parallelization strategies built in.
BigQuery is the Google Cloud platform for massive dataset storage, querying and processing. Accessible through web UI, command line tool, or a REST API. Client libraries in Java, .NET and Python.
Built-in support for GIS and for Machine Learning.
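A minimal sketch using the Python client library (assuming Google Cloud credentials are configured; the dataset, table, and column names are illustrative):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT last, AVG(salary) AS avg_salary
    FROM `mydataset.mytable`
    GROUP BY last
"""
# The query runs inside BigQuery; only the (small) result set is downloaded.
for row in client.query(query).result():
    print(row.last, row.avg_salary)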
Data inside the Data Lake remains unchanged from the data source. Transformations are executed during transport out to a processing unit.
Two computational paradigms have emerged for computation on large data sets:

- Tensor computation (examples: Tensorflow, Theano): everything is a Tensor, a generalized matrix.
- MapReduce: all computations are split into a Map phase and a Reduce phase.
Tensorflow uses compute graphs as its fundamental optimizing paradigm:
Each code block gets translated into a data flow graph. This graph is compiled and optimized. Data flows through the optimized graph.
Through analyzing the graph, many automatic optimization steps can be produced.
\[ f(x,y,z) = x^2y + x^2z + z^2 \]

For example, the repeated subexpression \(x^2\) needs to be computed only once:

\[ f(x,y,z) = x^2(y+z) + z^2 \]
In the classic word-count example, the Map phase emits a (word, 1) pair for each word, and the Reduce phase sums the counts for each word. Through, for instance, hash functions, we can ensure load balance and non-overlap for the reduction step. The combiner serves to decrease network communication load.
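A single-machine sketch of the map and reduce phases for word counting (the real frameworks distribute the map calls and partition the shuffle, for instance by hashing the key):

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}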
Hadoop: Java-based platform, with many extensions.
Some major companies are moving away from this paradigm.
For sufficiently large computational needs, geography becomes important.
On the scale of Google or Amazon data centers, scale drives VERY many design decisions.
pylab / pandas / scikit-learn
%pylab inline # to make Jupyter notebooks pretty
import pandas
from sklearn import linear_model
data = pandas.read_csv("datafile.csv")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
dask
%pylab inline # to make Jupyter notebooks pretty
import dask.dataframe as dd
from dask_ml import linear_model
data = dd.read_parquet("datafile.parquet")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
pyspark
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("logreg").getOrCreate()
data = spark.read.csv("datafile.csv", header=True, inferSchema=True)
lr = LogisticRegression()
model = lr.fit(data)
model.transform(newdata)
Some data munging may be needed to arrange data into a format that fits LogisticRegression (Spark ML expects a single features vector column and a numeric label column), as sketched below.
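A minimal sketch of that preparation step, assuming the data and newdata frames from the example above and a numeric target column (column names are illustrative):

from pyspark.ml.feature import VectorAssembler

# Collect every non-target column into a single "features" vector column,
# which is the layout Spark ML estimators expect.
feature_cols = [c for c in data.columns if c != "target"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(data)

lr = LogisticRegression(featuresCol="features", labelCol="target")
model = lr.fit(data)
model.transform(assembler.transform(newdata))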
tensorflow (old style)
import tensorflow as tf
import pandas
data = pandas.read_csv("datafile.csv")
X = tf.placeholder(tf.float32, [None, data.shape[1]-1])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([data.shape[1]-1, 1]))
b = tf.Variable(tf.zeros([1]))
pred = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_mean(-(y*tf.log(pred) + (1-y)*tf.log(1-pred)))
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(N):            # N training epochs
        for i in range(batches):      # batch_x, batch_y come from your own batching code
            _, c = sess.run([optimizer, cost], feed_dict={X: batch_x, y: batch_y})
    y_pred = sess.run(pred, feed_dict={X: newdata})
tensorflow (Keras)
import tensorflow as tf
import pandas
from tensorflow.keras import models, layers
data = pandas.read_csv("datafile.csv")
inputs = layers.Input(shape=(data.shape[1]-1,))
outputs = layers.Dense(1, activation="sigmoid")(inputs)
model = models.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
BigQuery
CREATE MODEL
`mydataset.mymodel`
OPTIONS
( model_type="logistic_reg",
  input_label_cols=["target"] ) AS
SELECT
*
FROM
`mydataset.mytable`