This is most likely the majority of tasks you will meet.
- Python (with PyLab, scikit-learn, seaborn, pandas)
- R (with tidyverse, caret)
- SPSS (especially for social sciences / psychology / etc.)
- MySQL / MSSQL / Postgres
Supported out of the box by many methods in both the Python stack and the R stack.
Builds on the Pandas / PyLab stack, adds transparent cluster deployment and distributed data frames.
If you are looking to manage a terabyte or less of tabular CSV or JSON data, then you should forget both Spark and Dask and use Postgres or MongoDB. [Dask documentation]
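That advice is easy to follow from the Python stack. As a minimal sketch (the connection string, table name, and file name are illustrative; a Postgres driver such as psycopg2 must be installed), pandas can push a CSV into Postgres through SQLAlchemy and query it back with plain SQL:

import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; adjust user, password, host and database to your setup.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

df = pd.read_csv("datafile.csv")
# Write the frame into a Postgres table and let SQL do the heavy lifting.
df.to_sql("mytable", engine, if_exists="replace", index=False)

subset = pd.read_sql("SELECT * FROM mytable LIMIT 10", engine)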
Apache Hadoop is a general-purpose distributed storage and computation platform, based on MapReduce. Hadoop provides, among other components, the HDFS distributed file system, the YARN resource manager, and the MapReduce execution engine.
Many projects extend Hadoop’s functionality.
Apache Spark is a Hadoop-based platform for distributed data flow and computation. Spark includes Spark SQL, which provides data set functionality much like Dask extends the pandas / numpy data models. On top of Spark, libraries such as MLlib (machine learning), GraphX (graph processing), and Spark Streaming are available.
Spark can be used from Java, Scala, Python, R.
Google’s Tensorflow is a tensor computation platform with a compute graph abstraction and effective parallelization and GPU delegation. The compute graph abstraction makes it easy to build custom backends for more clever optimization strategies.
In its latest versions, most Tensorflow programming is very similar to classic Python programming.
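As a minimal sketch of what that looks like in recent (2.x) TensorFlow, a plain Python function can be traced into a compute graph with the tf.function decorator (the function and values here are purely illustrative):

import tensorflow as tf

# @tf.function traces the Python function into a compute graph that
# TensorFlow can then optimize and dispatch to CPU or GPU.
@tf.function
def f(x, y, z):
    return x**2 * y + x**2 * z + z**2

print(f(tf.constant(2.0), tf.constant(3.0), tf.constant(4.0)))  # tf.Tensor(44.0, ...)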
Recently absorbed by Tensorflow, the Keras project makes construction of neural network models very smooth and easy.
Classical data storage paradigms store row by row:
ID | Last | First | Salary |
---|---|---|---|
1 | Smith | Joe | 40000 |
2 | Jones | Mary | 50000 |
3 | Johnson | Cathy | 44000 |
4 | Smith | Marsha | 55000 |
… | … | … | … |
Finding all information relevant for a single record is easy and localized in memory.
Scanning a single column, or combining columns, requires a scan of the full database, jumping through memory in strides and likely triggering expensive page lookups or file seeks.
An alternative paradigm is to store column by column instead:
1 | 2 | 3 | 4 | … |
---|---|---|---|---|
Smith | Jones | Johnson | Smith | … |
Joe | Mary | Cathy | Marsha | … |
40000 | 50000 | 44000 | 55000 | … |
Implementation for big data applications: Apache Parquet.
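As a small sketch (the file name and columns are illustrative; pandas needs pyarrow or fastparquet installed), writing a data frame to Parquet makes single-column reads cheap:

import pandas as pd

df = pd.DataFrame({
    "last":   ["Smith", "Jones", "Johnson", "Smith"],
    "first":  ["Joe", "Mary", "Cathy", "Marsha"],
    "salary": [40000, 50000, 44000, 55000],
})
# Each column is stored contiguously in the Parquet file.
df.to_parquet("salaries.parquet")

# Reading a single column only touches that column's data on disk.
salaries = pd.read_parquet("salaries.parquet", columns=["salary"])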
Apache Hive is a large scale SQL-like data warehouse built on Hadoop. Queries are converted to MapReduce or Spark jobs, with parallelization strategies built in.
BigQuery is the Google Cloud platform for massive dataset storage, querying and processing. Accessible through web UI, command line tool, or a REST API. Client libraries in Java, .NET and Python.
Built-in support for GIS and for Machine Learning.
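A minimal sketch using the Python client library (assuming Google Cloud credentials are configured; the dataset, table, and column names are illustrative):

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT last, AVG(salary) AS avg_salary
    FROM `mydataset.mytable`
    GROUP BY last
"""
# The query runs inside BigQuery; only the (small) result set is downloaded.
for row in client.query(query).result():
    print(row.last, row.avg_salary)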
Data inside the Data Lake remains unchanged from the data source. Transformations are executed during transport out to a processing unit.
Two computational paradigms have emerged for computation on large data sets:

- Tensor computation (examples: Tensorflow, Theano): everything is a Tensor, a generalized matrix.
- MapReduce: all computations are split into a Map phase and a Reduce phase.
Tensorflow uses compute graphs as its fundamental optimizing paradigm:
Each code block gets translated into a data flow graph. This graph is compiled and optimized. Data flows through the optimized graph.
Through analyzing the graph, many automatic optimization steps can be produced.
\[ f(x,y,z) = x^2y + x^2z + z^2 \]

For example, the repeated subexpression \(x^2\) needs to be computed only once:

\[ f(x,y,z) = x^2(y+z) + z^2 \]
In the classic word-count example, the Map phase emits a (word, 1) pair for each word, and the Reduce phase sums the counts for each word. Through, for instance, hash functions, we can ensure load balance and non-overlap for the reduction step. The combiner serves to decrease network communication load.
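A single-machine sketch of the map and reduce phases for word counting (the real frameworks distribute the map calls and partition the shuffle, for instance by hashing the key):

from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # Reduce: group the pairs by key and sum the counts per word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'the': 2, 'quick': 1, 'brown': 1, ...}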
Hadoop: Java-based platform, with many extensions.
Some major companies are moving away from this paradigm.
For sufficiently large computational needs, geography becomes important.
On the scale of Google or Amazon data centers, scale drives VERY many design decisions.
pylab / pandas / scikit-learn
%pylab inline # to make Jupyter notebooks pretty
import pandas
from sklearn import linear_model
data = pandas.read_csv("datafile.csv")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
dask
%pylab inline # to make Jupyter notebooks pretty
import dask.dataframe as dd
from dask_ml import linear_model
data = dd.read_parquet("datafile.parquet")
model = linear_model.LogisticRegression()
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
pyspark
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
spark = SparkSession.builder.appName("logreg").getOrCreate()
data = spark.read.csv("datafile.csv", header=True, inferSchema=True)
lr = LogisticRegression()
model = lr.fit(data)
model.transform(newdata)
Some data munging may be needed to arrange data into a format that fits LogisticRegression (Spark ML expects a single features vector column and a numeric label column), as sketched below.
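A minimal sketch of that preparation step, assuming the data and newdata frames from the example above and a numeric target column (column names are illustrative):

from pyspark.ml.feature import VectorAssembler

# Collect every non-target column into a single "features" vector column,
# which is the layout Spark ML estimators expect.
feature_cols = [c for c in data.columns if c != "target"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
data = assembler.transform(data)

lr = LogisticRegression(featuresCol="features", labelCol="target")
model = lr.fit(data)
model.transform(assembler.transform(newdata))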
tensorflow (old style)
import tensorflow as tf
import pandas
data = pandas.read_csv("datafile.csv")
X = tf.placeholder(tf.float32, [None, data.shape[1]-1])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([data.shape[1]-1, 1]))
b = tf.Variable(tf.zeros([1]))
pred = tf.nn.sigmoid(tf.matmul(X, W) + b)
cost = tf.reduce_mean(-(y*tf.log(pred) + (1-y)*tf.log(1-pred)))
optimizer = tf.train.GradientDescentOptimizer(0.01).minimize(cost)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for epoch in range(N):            # N training epochs
        for i in range(batches):      # batch_x, batch_y come from your own batching code
            _, c = sess.run([optimizer, cost], feed_dict={X: batch_x, y: batch_y})
    y_pred = sess.run(pred, feed_dict={X: newdata})
tensorflow (Keras)
import tensorflow as tf
import pandas
from tensorflow.keras import models, layers
data = pandas.read_csv("datafile.csv")
inputs = layers.Input(shape=(data.shape[1]-1,))
outputs = layers.Dense(1, activation="sigmoid")(inputs)
model = models.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(data.drop(["target"], axis=1), data["target"])
model.predict(newdata)
BigQuery
CREATE MODEL
`mydataset.mymodel`
OPTIONS
( model_type="logistic_reg",
  input_label_cols=["target"] ) AS
SELECT
*
FROM
`mydataset.mytable`