Intro to Neural Networks¶

Our next competition will be on deep learning for computer vision. We start looking at the underlying theory early, so that you will be ready to hit the ground running when we switch competitions.

When dealing with deep learning in particular, we will switch libraries - away from scikit-learn and into Tensorflow and Keras. Tensorflow is Google's platform for massively parallel machine learning, and Keras is a particularly convenient interface for specifying neural network models. In the end, we get model objects that mostly behave like scikit-learn models.

Stacking revisited¶

Recall stacking: transform using models $C_1,\dots,C_k$, then classify using a meta-model $M$.

$$ \hat{y} = M(C_1(x),\dots,C_k(x)) $$
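In scikit-learn terms this is StackingClassifier (a minimal sketch; the base models and meta-model chosen here are arbitrary examples):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# C_1, C_2: base models whose predictions become features for the meta-model M
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),
)
# stack.fit(X_train, y_train); stack.predict(X_test)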

Linear Regressions¶

What happens if we stack linear regressions?

$$ \hat{y} = W_M(Wx+b) + b_M = W_MWx + (W_Mb + b_M) $$

Stacked linear regressions are linear regressions - because composing linear functions produces a linear function.
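A quick numerical check of the collapse (a sketch with arbitrary shapes and random values): the composed map agrees with a single affine map with weights $W_MW$ and bias $W_Mb + b_M$.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W, b = rng.normal(size=(3, 5)), rng.normal(size=3)        # first regression
W_M, b_M = rng.normal(size=(2, 3)), rng.normal(size=2)    # meta-regression

stacked = W_M @ (W @ x + b) + b_M
collapsed = (W_M @ W) @ x + (W_M @ b + b_M)
print(np.allclose(stacked, collapsed))                    # True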

Stacking Logistic Regressions¶

When stacking logistic regressions $\hat{y} = \sigma(Wx+b)$, the logistic (sigmoid) function $\sigma$ is non-linear - so the collapse into a single linear map does not happen.

When stacking logistic regressions, the earlier ones tend to pick up structures in the data that the later ones then leverage to classify.

The meta-classifier works on a higher level of abstraction.
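A concrete sketch (hand-picked weights, not learned): XOR is not linearly separable, so no single logistic regression fits it, but stacking a second logistic regression on top of two sigmoid units does.

import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# first layer of logistic regressions: roughly OR and AND of the two inputs
W1 = np.array([[20, 20], [20, 20]])
b1 = np.array([-10, -30])
H = sigmoid(X @ W1.T + b1)

# meta logistic regression on the intermediate features: OR and not AND = XOR
w2 = np.array([20, -20])
y_hat = sigmoid(H @ w2 - 10)
print(np.round(y_hat, 3))   # approximately [0, 1, 1, 0]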

Artificial Neural Networks¶

  • Bain (1873), James (1890): the brain works by firing sets of neurons transferring electrical currents, strengthening connections to form memory.
  • Sherrington (1898): neurons habituate to electrical currents
  • McCulloch & Pitts (1943): threshold logic - a computational process to describe networks of neurons
  • Hebb (1949): Hebbian learning - learning using neural plasticity; early unsupervised learning
  • Farley & Clark (1954): simulated Hebbian networks on computing machines
  • Rosenblatt (1958): created the perceptron - generalizing logistic regression
  • Minsky & Papert (1969): single-layer neural networks cannot calculate XOR. Research in the field slowed down.
  • Werbos (1975): The back-propagation algorithm
  • Dechter (1986): the term "deep learning"
  • LeCun (1989): trained an 8-layer network to recognize hand-written ZIP codes. Training took 3 days.
  • Hochreiter & Schmidhuber (1997): Long short-term memory (LSTM)
  • Heck (1998): deep neural networks won a speaker recognition competition
  • Hinton et al (2006): pre-train deep networks one layer at a time

By the late 2000s and early 2010s, compute power and data availability enabled the deep learning revolution.

Universal Approximation Theorem¶

Let $\phi:\mathbb{R}\to\mathbb{R}$ be non-constant, bounded and continuous. Let $f$ be a continuous function $[0,1]^m\to\mathbb{R}$.

For every $\varepsilon > 0$, there is a one-hidden-layer neural network

$$ F(x) = V\cdot \phi(Wx + b) $$

such that $|F(x) - f(x)| < \varepsilon$ for all $x \in [0,1]^m$; that is, $F$ approximates $f$ to any desired accuracy.

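As a small illustration (a sketch, not the theorem's constructive proof): with $\phi$ the sigmoid, a fixed random hidden layer, and only the output weights $V$ fitted by least squares, $f(x) = \sin(2\pi x)$ is typically approximated well on $[0,1]$.

import numpy as np

rng = np.random.default_rng(0)
phi = lambda z: 1 / (1 + np.exp(-z))     # non-constant, bounded, continuous

f = lambda x: np.sin(2 * np.pi * x)      # target function on [0, 1]
x = np.linspace(0, 1, 200)[:, None]

W = rng.normal(scale=10, size=(1, 50))   # random hidden weights
b = rng.normal(scale=10, size=50)
H = phi(x @ W + b)                       # hidden activations, shape (200, 50)

V, *_ = np.linalg.lstsq(H, f(x), rcond=None)   # fit output weights only
print(np.max(np.abs(H @ V - f(x))))            # maximum approximation error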

Layers¶

A Neural Network (in the Machine Learning sense) is built out of layers.

Each layer is a vector.

Each transition is a linear transformation followed by a non-linear function.

  • Input (yellow) - data input
  • Hidden (green) - intermediate results
  • Output (red) - prediction
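As a minimal sketch of one transition (all sizes and values arbitrary): a 4-dimensional layer is mapped to a 3-dimensional layer by a linear transformation followed by a non-linearity.

import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)   # an example non-linear function

x = rng.normal(size=4)              # current layer: a vector
W = rng.normal(size=(3, 4))         # linear transformation
b = rng.normal(size=3)

h = relu(W @ x + b)                 # next layer: another vector
print(h.shape)                      # (3,)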

Non-linear functions¶

[figure: plots of common non-linear activation functions]

Vectorwise Non-linear functions¶

  • Softmax $\hat{y}_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$
  • Maxout $\hat{y}_i = \max_j x_j$
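The softmax formula above can be computed directly (a small sketch; subtracting the maximum is a standard numerical-stability trick that does not change the result):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # shift by the max for numerical stability
    return z / z.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # non-negative, sums to 1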

Software¶

My recommendation is to use Keras and Tensorflow.

Tensorflow¶

  • Developed by Google
  • Heavily optimized tensor-computing platform with a Python interface
  • Transparent GPU use; fine-grained control for cluster computing
  • Open source since 2015

Keras¶

  • Developed by François Chollet
  • Frontend for coding neural networks over a tensor computing platform
  • Included in Tensorflow since 2017

Keras code example¶

784-dimensional input (28×28 Fashion-MNIST images, flattened); hidden layers with 1024, 2048 and 4096 units; 10 outputs. ReLU activations on the hidden layers, softmax on the output.

In [5]:
from tensorflow.keras import layers, models, datasets, utils

inp = layers.Input((28,28))                       # 28x28 Fashion-MNIST images
x = layers.Flatten()(inp)                         # flatten to a 784-dimensional vector
x = layers.Dense(1024, activation="relu")(x)      # hidden layers with ReLU activations
x = layers.Dense(2048, activation="relu")(x)
x = layers.Dense(4096, activation="relu")(x)
oup = layers.Dense(10, activation="softmax")(x)   # 10 class probabilities

model = models.Model(inputs=inp, outputs=oup)
model.compile("sgd", loss="categorical_crossentropy", metrics=["accuracy"])
2022-03-16 16:24:21.790061: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
In [6]:
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 28, 28)]          0         
                                                                 
 flatten (Flatten)           (None, 784)               0         
                                                                 
 dense (Dense)               (None, 1024)              803840    
                                                                 
 dense_1 (Dense)             (None, 2048)              2099200   
                                                                 
 dense_2 (Dense)             (None, 4096)              8392704   
                                                                 
 dense_3 (Dense)             (None, 10)                40970     
                                                                 
=================================================================
Total params: 11,336,714
Trainable params: 11,336,714
Non-trainable params: 0
_________________________________________________________________
In [7]:
(X_train, y_train), (X_test, y_test) = datasets.fashion_mnist.load_data()
y_train = utils.to_categorical(y_train)   # one-hot encode the 10 class labels
y_test = utils.to_categorical(y_test)

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2)
Epoch 1/2
1875/1875 [==============================] - 333s 177ms/step - loss: nan - accuracy: 0.1000 - val_loss: nan - val_accuracy: 0.1000
Epoch 2/2
1875/1875 [==============================] - 291s 155ms/step - loss: nan - accuracy: 0.1000 - val_loss: nan - val_accuracy: 0.1000
Out[7]:
<keras.callbacks.History at 0x143c62950>
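The NaN loss above most likely comes from feeding the raw 0-255 pixel values straight into the network with plain SGD, which makes the activations and gradients blow up. A common remedy (an assumption, not part of the original run) is to rescale the inputs and rebuild the model before fitting again:

# rescale pixel values from [0, 255] to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0

# rebuild and recompile the model as above so the weights (now NaN after the
# failed run) are re-initialized, then fit again:
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2)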