class: center, middle, inverse, title-slide

# Lecture 1

---

## Welcome to MTH 513

Competition-based introduction to Machine Learning.

Course information is available on:

* Blackboard (syllabus, announcements)
* http://www.math.csi.cuny.edu/~mvj/MTH513 (lecture slides, extra information)

---

## Course Design

We will divide into teams of 3. These teams will work together for the entire semester.

Most of class time will be spent working on Kaggle competition problems. Whenever the need for additional theory comes up, we will break for a theory lecture.

For this design to work, you need to help. Let me know ASAP when...

* ...you find yourself struggling with anything
* ...you see terminology or concepts you don't understand

---

## Contact and Office Hours

Mikael.VejdemoJohansson@csi.cuny.edu

1S-208

Office hours: Monday and Wednesday, 12:45-14:15.

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

---

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

### Attendance

You will be learning from your peers; your peers will be learning from you. For this reason, attendance is mandatory for this course.

You have an allowance of 5 unexcused absences. Each additional absence will drop your final score by 0.5 percentage points.

---

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

### Final exam

There will be a final exam evaluating your comprehension of core machine learning concepts.

---

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

### Written report

The majority of your grade comes from a written report, due at the end of the semester. This report will describe your team's solution to one of the competitions.

You are not allowed to write your report on the same solution as your teammates.

---

## Written report

The report should explain how your final solution was built and what intermediate models you tried, and it should track the relevant metrics and evaluations along the way as you worked on that particular competition.

It will be graded on readability and on the completeness of your description of your work.

---

## Course literature

Our primary reference will be [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/).

It is available as a free PDF file at http://www-bcf.usc.edu/~gareth/ISL/

If you want a printed copy, the SpringerLink version (available through the college library website) can be printed for $25.

---

## What I will not teach

This is not a course in...

* ...programming
* ...linear algebra
* ...introductory statistics
* ...multivariate calculus

To the extent that these are unfamiliar, you are expected to learn everything you need on your own.

---

# Questions?

---

## What is Machine Learning?

[Wikipedia]

> Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.

We distinguish between:

* Supervised and unsupervised learning
* Regression and classification

---

## (Un)Supervised learning

A mathematical perspective is that Machine Learning is about learning some function

$$ f: X \to Y $$

based on observations that are believed to be similar to the function itself.

**Supervised learning** learns from paired data `\((x_i, y_i)\)`, and minimizes `\(\mathbb{E}[d(f(x), y)]\)` for some distance `\(d\)`.

**Unsupervised learning** only uses the `\(x_i\)`, and constructs an appropriate `\(Y\)` based on the structures in `\(X\)`.
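---

## (Un)Supervised learning in code

To make the distinction concrete, here is a minimal sketch using built-in R functionality on the `iris` data. The choice of data set, and of `kmeans` as the unsupervised example, is illustrative only, not part of the course material.

```r
# Supervised: we observe pairs (x_i, y_i) and fit f to predict y from x
supervised = lm(Petal.Width ~ Petal.Length, data = iris)

# Unsupervised: we only observe the x_i, and construct a Y
# (here: three cluster labels) from the structure in X
unsupervised = kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)
```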
---

## Regression vs Classification

A mathematical perspective is that Machine Learning is about learning some function

$$ f: X \to Y $$

**Classification** is when `\(Y\)` is a finite set.

**Regression** is when `\(Y\)` is a continuous range.

---

## A first example

```r
library(tidyverse)  # for the mpg data set and %>%
library(knitr)      # for kable()

head(mpg) %>% kable("html")
```

| manufacturer | model | displ | year | cyl | trans      | drv | cty | hwy | fl | class   |
|:-------------|:------|------:|-----:|----:|:-----------|:----|----:|----:|:---|:--------|
| audi         | a4    |   1.8 | 1999 |   4 | auto(l5)   | f   |  18 |  29 | p  | compact |
| audi         | a4    |   1.8 | 1999 |   4 | manual(m5) | f   |  21 |  29 | p  | compact |
| audi         | a4    |   2.0 | 2008 |   4 | manual(m6) | f   |  20 |  31 | p  | compact |
| audi         | a4    |   2.0 | 2008 |   4 | auto(av)   | f   |  21 |  30 | p  | compact |
| audi         | a4    |   2.8 | 1999 |   6 | auto(l5)   | f   |  16 |  26 | p  | compact |
| audi         | a4    |   2.8 | 1999 |   6 | manual(m5) | f   |  18 |  26 | p  | compact |
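---

Before fitting a model, it is worth looking at the data. A minimal sketch of one such plot, using the `ggformula` package that also appears later in these slides:

```r
library(ggformula)

# City mileage against highway mileage: a strong linear relationship,
# which suggests that a linear model is a reasonable first attempt
gf_point(cty ~ hwy, data = mpg)
```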
---

```r
model = lm(cty ~ hwy + cyl + displ, data=mpg)
summary(model)
```

```
## 
## Call:
## lm(formula = cty ~ hwy + cyl + displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0347 -0.6012 -0.0229  0.7397  5.2573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.08786    0.83226   7.315 4.25e-12 ***
## hwy          0.58092    0.02010  28.900  < 2e-16 ***
## cyl         -0.44827    0.13010  -3.446 0.000677 ***
## displ       -0.05935    0.16351  -0.363 0.716971    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.148 on 230 degrees of freedom
## Multiple R-squared:  0.9281, Adjusted R-squared:  0.9272 
## F-statistic: 989.9 on 3 and 230 DF,  p-value: < 2.2e-16
```

---

```r
predict(model, data.frame(hwy=c(20,20,30,30),
                          cyl=c(4,6,4,6),
                          displ=c(2.5,2.5,2.5,2.5)))
```

```
##        1        2        3        4 
## 15.76490 14.86836 21.57413 20.67760
```

---

```r
library(broom)      # for augment()
library(ggformula)  # for gf_point() and gf_hline()

augment(model) %>%
  gf_point(.resid ~ .fitted) %>%
  gf_hline(yintercept=0)
```

*(plot: residuals against fitted values, with a horizontal reference line at 0)*

---

## What about classification?

2-label case: logistic regression

```r
mpg$automatic = substring(mpg$trans, 1, 1) == "a"
model = glm(automatic ~ displ + cyl + cty + hwy + drv + class,
            data=mpg, family=binomial())
model
```

```
## 
## Call:  glm(formula = automatic ~ displ + cyl + cty + hwy + drv + class, 
##     family = binomial(), data = mpg)
## 
## Coefficients:
##     (Intercept)            displ              cyl              cty  
##        -3.69992          0.86964         -0.09162         -0.15972  
##             hwy             drvf             drvr     classcompact  
##         0.09291          0.33727         -1.18974          2.48043  
##    classmidsize     classminivan      classpickup  classsubcompact  
##         2.75468         19.01374          1.48480          2.39596  
##        classsuv  
##         3.45598  
## 
## Degrees of Freedom: 233 Total (i.e. Null);  221 Residual
## Null Deviance:     296.5 
## Residual Deviance: 243    AIC: 269
```

---

## How can this fail?

Over- and under-fitting

*(figure: examples of over- and under-fitted models)*

---

## How do I know it works? - Measures of quality

`\(R^2\)`, together with a residual QQ-plot and a residuals-vs-fitted-values plot, will tell you whether a linear regression worked, and how well.

---

## How do I know it works? - Measures of quality

For classification tasks, the **Confusion Matrix**:

|                 | Actually True | Actually False |
|-----------------|---------------|----------------|
| Predicted True  | TP            | FP             |
| Predicted False | FN            | TN             |

* **sensitivity** or **true positive rate**: `\(TP/(TP+FN)\)`
* **specificity** or **true negative rate**: `\(TN/(TN+FP)\)`
* **precision**: `\(TP/(TP+FP)\)`
* **accuracy**: `\((TP+TN)/N\)`, where `\(N\)` is the total number of observations

---

## How do I know it works? - Train and Test split

Ideally, we want to estimate and minimize

$$ \mathbb{P}(\text{model is wrong}) \qquad \mathbb{E}(\text{error measure}) $$

If any information used to estimate these goes into creating the model itself, we run a severe risk of **overfitting**. Since we cannot see into the future, we need to *simulate* clairvoyance:

1. Split the data into a **training set** and a **test set**. Train the model on the training set; evaluate its performance on the test set.
2. If you need to pick between different **hyperparameters** or different **candidate models**, split the data into **training**, **validation** and **test** sets. Train the models on the training set. Pick a model based on its performance on the validation set. Evaluate the chosen model on the test set (see the sketch on the next slide).
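---

Item 2 on the previous slide, sketched in base R. The 60/20/20 proportions are an illustrative choice, not a prescription.

```r
set.seed(42)  # make the random split reproducible

# Assign each row of mpg to one of three roles at random
n = nrow(mpg)
role = sample(c("train", "validation", "test"), n,
              replace=TRUE, prob=c(0.6, 0.2, 0.2))

train.df      = mpg[role == "train", ]
validation.df = mpg[role == "validation", ]
test.df       = mpg[role == "test", ]
```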
---

## In R: use `caret`

```r
library(caret)

trainingIx = createDataPartition(mpg$drv, p=0.75, list=F)
train.df = mpg[trainingIx,]
test.df = mpg[-trainingIx,]
c(train.nrow = nrow(train.df), test.nrow = nrow(test.df))
```

```
## train.nrow  test.nrow 
##        177         57
```

```r
# train on the training set only; test.df stays held out for evaluation
model = train(drv ~ ., data=train.df, method="knn")
```

---

```r
model
```

```
## k-Nearest Neighbors 
## 
## 177 samples
##  11 predictor
##   3 classes: '4', 'f', 'r' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 177, 177, 177, 177, 177, 177, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8151386  0.6787380
##   7  0.8099359  0.6664383
##   9  0.7979361  0.6419197
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
```

---

```r
y.pred = predict(model, test.df)
confusionMatrix(y.pred, test.df$drv)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  4  f  r
##          4 25  0  1
##          f  0 25  2
##          r  0  1  3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9298          
##                  95% CI : (0.83, 0.9805)  
##     No Information Rate : 0.4561          
##     P-Value [Acc > NIR] : 3.149e-14       
##                                           
##                   Kappa : 0.8783          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 4 Class: f Class: r
## Sensitivity            1.0000   0.9615  0.50000
## Specificity            0.9688   0.9355  0.98039
## Pos Pred Value         0.9615   0.9259  0.75000
## Neg Pred Value         1.0000   0.9667  0.94340
## Prevalence             0.4386   0.4561  0.10526
## Detection Rate         0.4386   0.4386  0.05263
## Detection Prevalence   0.4561   0.4737  0.07018
## Balanced Accuracy      0.9844   0.9485  0.74020
```

---

## In Python: scikit-learn

There are plenty of very good guides on the scikit-learn web pages.

Python code is not as easy to include in these lecture slides as R code is.

---

# Time to start

1. Team up.
   * 3 to a team
   * 2 is acceptable if necessary; 4 is not acceptable
   * auditors team up with auditors
2. Create accounts on http://kaggle.com
3. Go to the Titanic competition: https://www.kaggle.com/c/titanic
4. Either try this yourself, or wait for Prof. Vejdemo-Johansson to walk us through a first attempt (a minimal sketch follows on the next slide)

### Read An Introduction to Statistical Learning: Chapter 2 (pp 15-57)
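---

## A first attempt, sketched

What a minimal first Titanic submission might look like, assuming `train.csv` and `test.csv` have been downloaded from the competition page into the working directory. The choice of predictors here is illustrative only.

```r
train = read.csv("train.csv")
test = read.csv("test.csv")

# A simple logistic regression on two readily available predictors
model = glm(Survived ~ Sex + Pclass, data=train, family=binomial())

# Turn predicted survival probabilities into 0/1 labels
p = predict(model, test, type="response")
submission = data.frame(PassengerId = test$PassengerId,
                        Survived = as.integer(p > 0.5))

# Write the file to upload to Kaggle
write.csv(submission, "submission.csv", row.names=FALSE)
```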