class: center, middle, inverse, title-slide

# Lecture 1

---

## Welcome to MTH 513

Competition-based introduction to Machine Learning.

Course information is available on:

* Blackboard (syllabus, announcements)
* http://www.math.csi.cuny.edu/~mvj/MTH513 (lecture slides, extra information)

---

## Course Design

We will divide into teams of 3. These teams will work together for the entire semester.

Most of class time will be spent working on Kaggle competition problems. Whenever the need for additional theory comes up, we will break for a theory lecture.

For this design to work, you need to help. Let me know ASAP when...

* ...you find yourself struggling with anything
* ...you see terminology or concepts you don't understand

---

## Contact and Office Hours

Mikael.VejdemoJohansson@csi.cuny.edu

1S-208

Office hours: Monday and Wednesday, 12:45-14:15.

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

---

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

### Attendance

You will be learning from your peers; your peers will be learning from you. For this reason, attendance is mandatory for this course.

You have an allowance of 5 unexcused absences. Each additional absence will drop your final score by 0.5 percentage points.

---

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

### Final exam

There will be a final exam evaluating your comprehension of core machine learning concepts.

---

## Grading

Your course grade will be determined by:

* 10% Attendance
* 20% Final exam
* 70% Written report

### Written report

The majority of your grade comes from a written report, due at the end of the semester. This report will describe your team's solution to one of the competitions.

You are not allowed to write your report on the same solution as your teammates.

---

## Written report

The report should explain how your final solution was built and what intermediate models you tried, and it should track the relevant metrics and evaluations along the way as you worked on that particular competition.

It will be graded on readability and on the completeness of your description of your work.

---

## Course literature

Our primary reference will be [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/).

It is available as a free PDF file at http://www-bcf.usc.edu/~gareth/ISL/

If you want a printed copy, the SpringerLink version (available through the college library website) can be printed for $25.

---

## What I will not teach

This is not a course in...

* ...programming
* ...linear algebra
* ...introductory statistics
* ...multivariate calculus

To the extent that these are unfamiliar, you are expected to learn everything you need on your own.

---

# Questions?

---

## What is Machine Learning?

[Wikipedia]

> Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.

We distinguish between:

* Supervised and unsupervised learning
* Regression and classification

---

## (Un)Supervised learning

A mathematical perspective is that Machine Learning is about learning some function

$$ f: X \to Y $$

based on observations that are believed to be similar to the function itself.

**Supervised learning** learns from paired data `\((x_i, y_i)\)`, and minimizes `\(\mathbb{E}[d(f(x), y)]\)` for some distance `\(d\)`.

**Unsupervised learning** only uses the `\(x_i\)`, and constructs an appropriate `\(Y\)` based on the structures in `\(X\)`.
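---

## (Un)Supervised learning in code

To make the distinction concrete, here is a minimal sketch using built-in R functionality on the `iris` data. The choice of data set, and of `kmeans` as the unsupervised example, is illustrative only, not part of the course material.

```r
# Supervised: we observe pairs (x_i, y_i) and fit f to predict y from x
supervised = lm(Petal.Width ~ Petal.Length, data = iris)

# Unsupervised: we only observe the x_i, and construct a Y
# (here: three cluster labels) from the structure in X
unsupervised = kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3)
```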
---

## Regression vs Classification

A mathematical perspective is that Machine Learning is about learning some function

$$ f: X \to Y $$

**Classification** is when `\(Y\)` is a finite set.

**Regression** is when `\(Y\)` is a continuous range.

---

## A first example

```r
library(tidyverse)  # for the mpg data set and %>%
library(knitr)      # for kable()

head(mpg) %>% kable("html")
```

| manufacturer | model | displ | year | cyl | trans      | drv | cty | hwy | fl | class   |
|:-------------|:------|------:|-----:|----:|:-----------|:----|----:|----:|:---|:--------|
| audi         | a4    |   1.8 | 1999 |   4 | auto(l5)   | f   |  18 |  29 | p  | compact |
| audi         | a4    |   1.8 | 1999 |   4 | manual(m5) | f   |  21 |  29 | p  | compact |
| audi         | a4    |   2.0 | 2008 |   4 | manual(m6) | f   |  20 |  31 | p  | compact |
| audi         | a4    |   2.0 | 2008 |   4 | auto(av)   | f   |  21 |  30 | p  | compact |
| audi         | a4    |   2.8 | 1999 |   6 | auto(l5)   | f   |  16 |  26 | p  | compact |
| audi         | a4    |   2.8 | 1999 |   6 | manual(m5) | f   |  18 |  26 | p  | compact |
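---

Before fitting a model, it is worth looking at the data. A minimal sketch of one such plot, using the `ggformula` package that also appears later in these slides:

```r
library(ggformula)

# City mileage against highway mileage: a strong linear relationship,
# which suggests that a linear model is a reasonable first attempt
gf_point(cty ~ hwy, data = mpg)
```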
---

```r
model = lm(cty ~ hwy + cyl + displ, data=mpg)
summary(model)
```

```
## 
## Call:
## lm(formula = cty ~ hwy + cyl + displ, data = mpg)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0347 -0.6012 -0.0229  0.7397  5.2573 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.08786    0.83226   7.315 4.25e-12 ***
## hwy          0.58092    0.02010  28.900  < 2e-16 ***
## cyl         -0.44827    0.13010  -3.446 0.000677 ***
## displ       -0.05935    0.16351  -0.363 0.716971    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.148 on 230 degrees of freedom
## Multiple R-squared:  0.9281, Adjusted R-squared:  0.9272 
## F-statistic: 989.9 on 3 and 230 DF,  p-value: < 2.2e-16
```

---

```r
predict(model, data.frame(hwy=c(20,20,30,30),
                          cyl=c(4,6,4,6),
                          displ=c(2.5,2.5,2.5,2.5)))
```

```
##        1        2        3        4 
## 15.76490 14.86836 21.57413 20.67760
```

---

```r
library(broom)      # for augment()
library(ggformula)  # for gf_point() and gf_hline()

augment(model) %>%
  gf_point(.resid ~ .fitted) %>%
  gf_hline(yintercept=0)
```

*(plot: residuals against fitted values, with a horizontal reference line at 0)*

---

## What about classification?

2-label case: logistic regression

```r
mpg$automatic = substring(mpg$trans, 1, 1) == "a"
model = glm(automatic ~ displ + cyl + cty + hwy + drv + class,
            data=mpg, family=binomial())
model
```

```
## 
## Call:  glm(formula = automatic ~ displ + cyl + cty + hwy + drv + class, 
##     family = binomial(), data = mpg)
## 
## Coefficients:
##     (Intercept)            displ              cyl              cty  
##        -3.69992          0.86964         -0.09162         -0.15972  
##             hwy             drvf             drvr     classcompact  
##         0.09291          0.33727         -1.18974          2.48043  
##    classmidsize     classminivan      classpickup  classsubcompact  
##         2.75468         19.01374          1.48480          2.39596  
##        classsuv  
##         3.45598  
## 
## Degrees of Freedom: 233 Total (i.e. Null);  221 Residual
## Null Deviance:     296.5 
## Residual Deviance: 243    AIC: 269
```

---

## How can this fail?

Over- and under-fitting

*(figure: examples of over- and under-fitted models)*

---

## How do I know it works? - Measures of quality

`\(R^2\)`, together with a residual QQ-plot and a residuals-vs-fitted-values plot, will tell you whether a linear regression worked, and how well.

---

## How do I know it works? - Measures of quality

For classification tasks, the **Confusion Matrix**:

|                 | Actually True | Actually False |
|-----------------|---------------|----------------|
| Predicted True  | TP            | FP             |
| Predicted False | FN            | TN             |

* **sensitivity** or **true positive rate**: `\(TP/(TP+FN)\)`
* **specificity** or **true negative rate**: `\(TN/(TN+FP)\)`
* **precision**: `\(TP/(TP+FP)\)`
* **accuracy**: `\((TP+TN)/N\)`, where `\(N\)` is the total number of observations

---

## How do I know it works? - Train and Test split

Ideally, we want to estimate and minimize

$$ \mathbb{P}(\text{model is wrong}) \qquad \mathbb{E}(\text{error measure}) $$

If any information used to estimate these goes into creating the model itself, we run a severe risk of **overfitting**. Since we cannot see into the future, we need to *simulate* clairvoyance:

1. Split the data into a **training set** and a **test set**. Train the model on the training set; evaluate its performance on the test set.
2. If you need to pick between different **hyperparameters** or different **candidate models**, split the data into **training**, **validation** and **test** sets. Train the models on the training set. Pick a model based on its performance on the validation set. Evaluate the chosen model on the test set (see the sketch on the next slide).
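---

Item 2 on the previous slide, sketched in base R. The 60/20/20 proportions are an illustrative choice, not a prescription.

```r
set.seed(42)  # make the random split reproducible

# Assign each row of mpg to one of three roles at random
n = nrow(mpg)
role = sample(c("train", "validation", "test"), n,
              replace=TRUE, prob=c(0.6, 0.2, 0.2))

train.df      = mpg[role == "train", ]
validation.df = mpg[role == "validation", ]
test.df       = mpg[role == "test", ]
```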
---

## In R: use `caret`

```r
library(caret)

trainingIx = createDataPartition(mpg$drv, p=0.75, list=F)
train.df = mpg[trainingIx,]
test.df = mpg[-trainingIx,]
c(train.nrow = nrow(train.df), test.nrow = nrow(test.df))
```

```
## train.nrow  test.nrow 
##        177         57
```

```r
# train on the training set only; test.df stays held out for evaluation
model = train(drv ~ ., data=train.df, method="knn")
```

---

```r
model
```

```
## k-Nearest Neighbors 
## 
## 177 samples
##  11 predictor
##   3 classes: '4', 'f', 'r' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 177, 177, 177, 177, 177, 177, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8151386  0.6787380
##   7  0.8099359  0.6664383
##   9  0.7979361  0.6419197
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
```

---

```r
y.pred = predict(model, test.df)
confusionMatrix(y.pred, test.df$drv)
```

```
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  4  f  r
##          4 25  0  1
##          f  0 25  2
##          r  0  1  3
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9298          
##                  95% CI : (0.83, 0.9805)  
##     No Information Rate : 0.4561          
##     P-Value [Acc > NIR] : 3.149e-14       
##                                           
##                   Kappa : 0.8783          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 4 Class: f Class: r
## Sensitivity            1.0000   0.9615  0.50000
## Specificity            0.9688   0.9355  0.98039
## Pos Pred Value         0.9615   0.9259  0.75000
## Neg Pred Value         1.0000   0.9667  0.94340
## Prevalence             0.4386   0.4561  0.10526
## Detection Rate         0.4386   0.4386  0.05263
## Detection Prevalence   0.4561   0.4737  0.07018
## Balanced Accuracy      0.9844   0.9485  0.74020
```

---

## In Python: scikit-learn

There are plenty of very good guides on the scikit-learn web pages.

Python code is not as easy to include in these lecture slides as R code is.

---

# Time to start

1. Team up.
   * 3 to a team
   * 2 is acceptable if necessary; 4 is not acceptable
   * auditors team up with auditors
2. Create accounts on http://kaggle.com
3. Go to the Titanic competition: https://www.kaggle.com/c/titanic
4. Either try this yourself, or wait for Prof. Vejdemo-Johansson to walk us through a first attempt (a minimal sketch follows on the next slide)

### Read An Introduction to Statistical Learning: Chapter 2 (pp 15-57)
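---

## A first attempt, sketched

What a minimal first Titanic submission might look like, assuming `train.csv` and `test.csv` have been downloaded from the competition page into the working directory. The choice of predictors here is illustrative only.

```r
train = read.csv("train.csv")
test = read.csv("test.csv")

# A simple logistic regression on two readily available predictors
model = glm(Survived ~ Sex + Pclass, data=train, family=binomial())

# Turn predicted survival probabilities into 0/1 labels
p = predict(model, test, type="response")
submission = data.frame(PassengerId = test$PassengerId,
                        Survived = as.integer(p > 0.5))

# Write the file to upload to Kaggle
write.csv(submission, "submission.csv", row.names=FALSE)
```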