Our next competition will be on deep learning for computer vision. We are starting to look through the underlying theory early, so that you will be ready to hit the ground running when we switch competitions.
When dealing with deep learning in particular, we will switch libraries - away from scikit-learn and into TensorFlow and Keras. TensorFlow is Google's platform for massively parallelizable machine learning code, and Keras is a particularly nice interface for specifying neural network models. In the end, we get model objects that mostly behave similarly to scikit-learn models.
Recall stacking: transform using models $C_1,\dots,C_k$, then classify using a meta-model $M$.
$$ \hat{y} = M(C_1(x),\dots,C_k(x)) $$
What happens if we stack linear regressions?
$$ \hat{y} = W_M(Wx+b) + b_M = W_MWx + (W_Mb + b_M) $$
Stacked linear regressions are linear regressions - because composing linear functions produces a linear function.
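A quick numerical sanity check of this collapse (the shapes here - a 3-unit inner model feeding a 1-unit meta-model - are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 4)), rng.normal(size=3)      # inner linear model: R^4 -> R^3
W_M, b_M = rng.normal(size=(1, 3)), rng.normal(size=1)  # meta linear model: R^3 -> R^1
x = rng.normal(size=4)

stacked = W_M @ (W @ x + b) + b_M            # apply the two models in sequence
collapsed = (W_M @ W) @ x + (W_M @ b + b_M)  # the single equivalent linear model
assert np.allclose(stacked, collapsed)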
When stacking the logistic regression $\hat{y} = \sigma(Wx+b)$, the logistic (sigmoid) function $\sigma$ is non-linear - so the collapse into a single linear map does not happen.
When stacking logistic regressions, the earlier logistic regressions tend to fit structures in the data that the later logistic regressions leverage to classify.
The meta-classifier works on a higher level of abstraction.
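As a concrete sketch of stacked logistic regressions in scikit-learn (the dataset and the two feature views here are just illustrative choices):

from sklearn.datasets import make_moons
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Two base logistic regressions on different feature views; a third
# logistic regression classifies from their predicted probabilities.
stack = StackingClassifier(
    estimators=[
        ("raw", LogisticRegression()),
        ("poly", make_pipeline(PolynomialFeatures(2), LogisticRegression())),
    ],
    final_estimator=LogisticRegression(),
)
print(stack.fit(X, y).score(X, y))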
By the late 2000s and early 2010s, compute power and data availability enabled the deep learning revolution.
Let $\phi:\mathbb{R}\to\mathbb{R}$ be non-constant, bounded and continuous. Let $f$ be a continuous function $[0,1]^m\to\mathbb{R}$.
There is a one-layer neural network
$$ F(x) = V\cdot \phi(Wx + b) $$
that approximates $f$: for every $\varepsilon > 0$ there are $V$, $W$, $b$ such that $|F(x) - f(x)| < \varepsilon$ for all $x \in [0,1]^m$. This is the universal approximation theorem.
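To see the theorem in action, here is a minimal sketch - the target $f(x) = \sin(2\pi x)$, the width 64, and the training settings are all arbitrary choices:

import numpy as np
from tensorflow.keras import layers, models

x = np.linspace(0, 1, 1000).reshape(-1, 1)   # inputs in [0,1]
y = np.sin(2 * np.pi * x)                    # a continuous target f

F = models.Sequential([
    layers.Dense(64, activation="sigmoid", input_shape=(1,)),  # phi(Wx + b), sigmoid is bounded and non-constant
    layers.Dense(1),                                           # V . phi(Wx + b)
])
F.compile("adam", loss="mse")
F.fit(x, y, epochs=500, verbose=0)
print(F.evaluate(x, y, verbose=0))  # small MSE: F approximates f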
A Neural Network (in the Machine Learning sense) is built out of layers.
Each layer is a vector.
Each transition is a linear transformation followed by a non-linear function.
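In code, one transition per layer (a minimal numpy sketch with made-up dimensions):

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # linear transformation: R^3 -> R^5
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # linear transformation: R^5 -> R^2

x = rng.normal(size=3)            # input layer: a vector
h = np.maximum(W1 @ x + b1, 0.0)  # hidden layer: linear transformation, then ReLU
out = W2 @ h + b2                 # output layer: another linear transformation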
My recommendation is to use Keras and TensorFlow.
A 784-dimensional input (flattened 28×28 Fashion-MNIST images); hidden layers with 1024, 2048, and 4096 units; 10 outputs. ReLU activations for the hidden layers, softmax for the output.
from tensorflow.keras import layers, models, datasets, utils

inp = layers.Input((28, 28))                     # 28x28 grayscale images
x = layers.Flatten()(inp)                        # flatten to a 784-dimensional vector
x = layers.Dense(1024, activation="relu")(x)     # hidden layer 1
x = layers.Dense(2048, activation="relu")(x)     # hidden layer 2
x = layers.Dense(4096, activation="relu")(x)     # hidden layer 3
oup = layers.Dense(10, activation="softmax")(x)  # 10 class probabilities
model = models.Model(inputs=inp, outputs=oup)
model.compile("sgd", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_1 (InputLayer)        [(None, 28, 28)]          0
 flatten (Flatten)           (None, 784)               0
 dense (Dense)               (None, 1024)              803840
 dense_1 (Dense)             (None, 2048)              2099200
 dense_2 (Dense)             (None, 4096)              8392704
 dense_3 (Dense)             (None, 10)                40970
=================================================================
Total params: 11,336,714
Trainable params: 11,336,714
Non-trainable params: 0
_________________________________________________________________
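The parameter counts check out: a Dense layer with $n$ inputs and $m$ units has $(n+1)\cdot m$ parameters (an $m\times n$ weight matrix plus $m$ biases):
$$ 785\cdot 1024 + 1025\cdot 2048 + 2049\cdot 4096 + 4097\cdot 10 = 11{,}336{,}714 $$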
(X_train, y_train), (X_test, y_test) = datasets.fashion_mnist.load_data()
# One-hot encode the integer labels (e.g. 3 -> [0,0,0,1,0,0,0,0,0,0])
# to match the 10-way softmax output and the categorical crossentropy loss.
y_train = utils.to_categorical(y_train)
y_test = utils.to_categorical(y_test)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2)
Epoch 1/2
1875/1875 [==============================] - 333s 177ms/step - loss: nan - accuracy: 0.1000 - val_loss: nan - val_accuracy: 0.1000
Epoch 2/2
1875/1875 [==============================] - 291s 155ms/step - loss: nan - accuracy: 0.1000 - val_loss: nan - val_accuracy: 0.1000
<keras.callbacks.History at 0x143c62950>
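The NaN loss means the optimization diverged rather than learned (10% accuracy is chance level on 10 classes). The most likely culprit - an educated guess, but the standard first suspect - is the unscaled input: raw pixel values in 0-255 make the SGD steps blow up. Rescaling to $[0,1]$ before fitting typically fixes it:

X_train = X_train.astype("float32") / 255.0  # rescale pixels from [0, 255] to [0, 1]
X_test = X_test.astype("float32") / 255.0
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2)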