Reminder: Choose your semester projects by today.
In Supervised Learning, data comes as matched pairs $(x_i,y_i)$, and the learning task is to construct a function $f:X\to Y$ that generalizes best from the observed pairs to unseen inputs.
Unsupervised Learning treats unpaired data $x_i$, and tries to generate structure and insight for $X$ itself.
Regression is supervised learning where the target $Y$ is continuous.
Classification is supervised learning where $Y$ is discrete.
Most will have seen this at some point.
When the predictors are not linearly independent, coefficients can grow without bound. One way around this is to add a penalty term to the optimization task that penalizes large coefficients: $$ RSS + \lambda\sum_j|\beta_j|^p $$
With an $\ell_1$ penalty, a closed-form estimate of $\hat\beta$ is no longer available (ridge regression, the $\ell_2$ case, still has one); instead we use numerical optimization techniques.
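As an illustration (not from the lecture; data and parameters are made up), scikit-learn's `Lasso` solves the $\ell_1$-penalized problem by coordinate descent:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = X[:, 3]                    # duplicate column: OLS coefficients not unique
y = X @ np.array([1.0, 0.0, -2.0, 0.5, 0.5]) + 0.1 * rng.normal(size=100)

lasso = Lasso(alpha=0.1).fit(X, y)   # minimizes RSS/(2n) + alpha * ||beta||_1
print(lasso.coef_)                   # coefficients stay bounded despite collinearity
```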
Logistic Regression chains a linear regression with the logistic function: $$ f(x) = \sigma(\beta_0 + \beta^Tx), \qquad \sigma(t) = \frac{1}{1+e^{-t}} $$
The resulting output can be interpreted as a probability measuring confidence in class membership.
Logistic regression produces a linear classifier, i.e. one with a linear decision boundary.
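A minimal scikit-learn sketch of this probabilistic output (toy data, illustrative parameters):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[1.5]]))   # probabilities of membership in each class
```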
$k$-nearest neighbors is a method for both regression and classification. To predict $\hat{f}(x^*)$, find the $k$ training points nearest to $x^*$ and aggregate their responses:
For regression, the mean or the median of the neighbors is often used.
For classification, majority or plurality vote is commonly used.
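A minimal numpy sketch of this procedure (the function name and defaults are illustrative, not from the lecture):

```python
import numpy as np

def knn_predict(X_train, y_train, x_star, k=5, classify=True):
    """k-NN prediction: plurality vote for classification, mean for regression."""
    dists = np.linalg.norm(X_train - x_star, axis=1)   # distances to all training points
    nn = np.argsort(dists)[:k]                         # indices of the k nearest neighbors
    if classify:
        values, counts = np.unique(y_train[nn], return_counts=True)
        return values[np.argmax(counts)]               # plurality vote
    return y_train[nn].mean()                          # mean of neighbor responses
```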
Figure: data from a degree-4 curve, with cubic and degree-4 polynomial regression fits and a $k$-nearest-neighbor regression fit.
```python
import matplotlib.pyplot as plt

plt.plot(xi, yi, '*b')         # training data
plt.plot(xs, y_1, 'r')         # cubic polynomial fit
plt.plot(xs, y_2, 'g')         # degree-4 polynomial fit
plt.plot(xs, y_3, 'orange')    # k-nearest-neighbor fit
plt.show()
```
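The fitted curves themselves could be produced along these lines (a sketch; `xi`, `yi` are assumed to be the training data from an earlier cell):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

xs = np.linspace(xi.min(), xi.max(), 200)
y_1 = np.polyval(np.polyfit(xi, yi, 3), xs)   # cubic least-squares fit
y_2 = np.polyval(np.polyfit(xi, yi, 4), xs)   # degree-4 least-squares fit
y_3 = KNeighborsRegressor(n_neighbors=5).fit(xi[:, None], yi).predict(xs[:, None])
```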
Figure: two interleaved half-moons, classified by logistic regression vs. a 5-nearest-neighbor majority vote.
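Such a comparison could be reproduced along these lines (a sketch using scikit-learn's `make_moons`; parameters are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
print(LogisticRegression().fit(X, y).score(X, y))                 # linear boundary
print(KNeighborsClassifier(n_neighbors=5).fit(X, y).score(X, y))  # flexible boundary
```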
Just like $k$-NN, decision trees can be used both for (piecewise-constant) regression and for classification.
Unlike $k$-NN, a tree does not require that we store and search the entire training set.
Trees produce interpretable prediction processes.
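That interpretability can be seen directly (a sketch with scikit-learn; dataset and depth are illustrative):

```python
from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(export_text(tree))   # the fitted model reads as a sequence of if/else splits
```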
Residual sum of squares (RSS) emphasizes model precision over model simplicity. To compensate for model complexity, use an adjusted error measure.
With $\mathcal{L}$ the maximum likelihood using all data, $p$ free parameters, and $n$ observations: $$ \mathrm{AIC} = 2p - 2\ln\mathcal{L}, \qquad \mathrm{BIC} = p\ln(n) - 2\ln\mathcal{L} $$
AIC is derived from Maximum Likelihood. BIC is derived from Bayesian theory.
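For Gaussian linear regression both criteria reduce, up to a model-independent constant, to functions of the RSS; a sketch (the helper name is illustrative):

```python
import numpy as np

def aic_bic(y, y_hat, p):
    """AIC and BIC for a Gaussian model, up to a model-independent constant."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    log_l = -n / 2 * np.log(rss / n)   # maximized Gaussian log-likelihood + constant
    return 2 * p - 2 * log_l, p * np.log(n) - 2 * log_l
```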
Linear regression minimizes $$ RSS = \sum_i\left(y_i-\beta_0-\sum_j\beta_jx_{ij}\right)^2 $$
Adding a penalty to the size of the coefficients reduces an infinitude of solutions to (hopefully) a unique solution. New minimization target: $$ RSS + \lambda\sum_j|\beta_j|^p $$
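For $p=2$ (ridge regression) the penalized problem still has the closed form $\hat\beta = (X^TX + \lambda I)^{-1}X^Ty$; a sketch (intercept handling omitted):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate; lam > 0 makes the system well-conditioned."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```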
Find a (linear) projection $\pi$ to a lower-dimensional space. Try to keep distortion low.
Try to decorrelate the data.
Suppose $P$ is a transformation that decorrelates $X$, i.e. $PX$ has a diagonal covariance matrix. Assuming the data are centered,
$$ \text{Cov}(PX) = \mathbb{E}[PX(PX)^T] = \mathbb{E}[PXX^TP^T] = P\,\mathbb{E}[XX^T]\,P^T = P\,\text{Cov}(X)\,P^T $$
so a matrix that decorrelates $X$ is exactly a matrix that diagonalizes $\text{Cov}(X)$.
Since $\text{Cov}(X)$ is symmetric, it is diagonalized by an orthogonal matrix with its eigenvectors as columns. This leads us to define:
Definition The principal components of a dataset $X$ are the eigenvectors of $\text{Cov}(X)$.
Their eigenvalues measure the share of total variability explained by the corresponding component.
Principal components produce an orthonormal basis such that the basis element $p_i$ determines the direction of largest variance in the orthogonal complement of $\langle p_1,\dots,p_{i-1}\rangle$.
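A numpy sketch of this computation (rows of `X` are observations; the function name is illustrative):

```python
import numpy as np

def principal_components(X):
    """Eigenvectors of Cov(X), sorted by decreasing explained variance."""
    Xc = X - X.mean(axis=0)               # center the data
    C = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigendecomposition of a symmetric matrix
    order = np.argsort(eigvals)[::-1]     # largest variance first
    return eigvecs[:, order], eigvals[order]
```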
Johnson-Lindenstrauss Lemma Given $n$ points $X\subset\mathbb{R}^p$ and $0<\epsilon<1$, for any $m>8\log(n)/\epsilon^2$ there is a linear map $f:\mathbb{R}^p\to\mathbb{R}^m$ such that for all $u,v\in X$: $$ (1-\epsilon)\|u-v\|^2 \leq \|f(u)-f(v)\|^2 \leq (1+\epsilon)\|u-v\|^2 $$
Suitable maps $f$ are abundant enough that one can sample a random orthonormal matrix, check the distortion bound, and re-sample on failure; with high probability this produces a good enough map after only a few attempts.
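A sketch of such a random projection (Gaussian entries, scaled so squared norms are preserved in expectation):

```python
import numpy as np

def random_projection(X, m, seed=0):
    """Map the rows of X from R^p to R^m with a random Gaussian matrix."""
    p = X.shape[1]
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(p, m)) / np.sqrt(m)   # entries N(0, 1/m): E||xR||^2 = ||x||^2
    return X @ R
```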