30 January, 2018

Course webpage: http://www.math.csi.cuny.edu/~mvj/GC-DataMining/

  • All course details: syllabus, grading scheme, report requirements, schedule, …
  • All additional course content: lecture slides, homework, …
  • Linked from Blackboard

The course will be graded on a written report and a midterm exam:

  • Data mining analysis of a dataset of your own

Data Mining

…unsupervised learning

…finding intrinsic structure in data without instructions

(Subtly) different from

Machine Learning …because it doesn't include supervised learning
Pattern Recognition …because it puts less focus on computer vision
Knowledge Discovery …which covers more of the data handling pipeline

…though all these fields overlap significantly and blend together.

Data pipeline

Data flows through a sequence of steps:

  1. Collection
  2. Encoding
  3. Cleaning
  4. Exploration
  5. Modeling
  6. Communication

Data Mining lives primarily in steps 4 and 5: exploring data and producing models generated from the data itself (rather than from hypotheses and application needs).

Main components of Data Mining

…or at least of this course

  1. Exploratory Data Analysis
  2. Visualization
  3. Association Rules
  4. Clustering
  5. Dimension reduction
  6. Classification

Exploratory Data Analysis and Visualization

First contact with data

1985 population survey

wage  educ  race  sex  hispanic  south  married  exper  union  age  sector
9.0   10    W     M    NH        NS     Married  27     Not    43   const
5.5   12    W     M    NH        NS     Married  20     Not    38   sales
3.8   12    W     F    NH        NS     Single   4      Not    22   sales
10.5  12    W     F    NH        NS     Married  29     Not    47   clerical
15.0  12    W     M    NH        NS     Married  40     Union  58   const
9.0   16    W     F    NH        NS     Married  27     Not    49   clerical

First contact with data

1985 population survey

name      class   levels  n    missing  distribution
race      factor  2       534  0        W (87.5%), NW (12.5%)
sex       factor  2       534  0        M (54.1%), F (45.9%)
hispanic  factor  2       534  0        NH (94.9%), Hisp (5.1%)
south     factor  2       534  0        NS (70.8%), S (29.2%)
married   factor  2       534  0        Married (65.5%), Single (34.5%)
union     factor  2       534  0        Not (82%), Union (18%)
sector    factor  8       534  0        prof (19.7%), clerical (18.2%) …

First contact with data

1985 population survey

name   class    min  Q1     median  Q3     max   mean   sd     n    missing
wage   numeric  1    5.25   7.78    11.25  44.5  9.02   5.14   534  0
educ   integer  2    12.00  12.00   15.00  18.0  13.02  2.62   534  0
exper  integer  0    8.00   15.00   26.00  55.0  17.82  12.38  534  0
age    integer  18   28.00  35.00   44.00  64.0  36.83  11.73  534  0

Software packages & Visualization

Packages in Python and R

Some of the packages I will be referring to include

Python               R
numpy                mosaic
scipy                tidyverse (includes: ggplot2, dplyr, broom, lubridate, readxl, xml2, and many more useful packages)
matplotlib           ggformula
seaborn              GGally
scikit-learn         caret and many, many specific libraries
IPython and Jupyter  knitr and RMarkdown

Functions in Python and R

There are several choices in each environment for most of these tasks. I'll list my own current favorites here.

What?               Why?               Python                      R
Import Data                            pandas, csv, numpy.loadtxt  read.table, read.csv
Save Data                              pandas, csv, numpy.savetxt  write.table, write.csv
Summary statistics  Numeric summaries  pandas.DataFrame.describe   summary, inspect
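
As a rough illustration of the import / save / summary row of the table, here is a minimal pandas sketch. It assumes the data is whitespace-separated text and uses only the six sample rows shown on the "First contact with data" slide, not the full 534-row survey; the variable names are made up for the example.

```python
import io

import pandas as pd

# The six sample rows from the "First contact with data" slide
# (an excerpt, not the full survey).
RAW = """wage educ race sex hispanic south married exper union age sector
9.0 10 W M NH NS Married 27 Not 43 const
5.5 12 W M NH NS Married 20 Not 38 sales
3.8 12 W F NH NS Single 4 Not 22 sales
10.5 12 W F NH NS Married 29 Not 47 clerical
15.0 12 W M NH NS Married 40 Union 58 const
9.0 16 W F NH NS Married 27 Not 49 clerical
"""

# Import: parse whitespace-separated text into a DataFrame.
df = pd.read_csv(io.StringIO(RAW), sep=r"\s+")

# Summary statistics: count, mean, std and quartiles of the numeric columns.
numeric_summary = df.describe()
print(numeric_summary)

# Save: with no path given, to_csv returns the CSV text;
# pass a filename instead to write a file.
csv_text = df.to_csv(index=False)
```

With a real file, the same calls would take a path in place of the in-memory buffer.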

Functions in Python and R

What?                      Why?                          Python                             R
Histogram                  Distribution of numeric data  matplotlib.pyplot.hist             gf_histogram, geom_histogram
Frequency Curve            Distribution of numeric data  matplotlib.pyplot.hist             gf_freqpoly
Average Shifted Histogram  Distribution of numeric data                                     gf_ash
Density plot               Distribution of numeric data  seaborn.distplot, seaborn.kdeplot  gf_density

Examples
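
A possible Python sketch of a histogram with a frequency curve drawn on top. It uses only the six wage values from the sample rows shown earlier, so the picture is a toy; the bin count and range are arbitrary choices for this example, and the Agg backend is selected only so the code runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line for interactive use
import matplotlib.pyplot as plt

# Wage column of the six sample rows (not the full survey).
wage = np.array([9.0, 5.5, 3.8, 10.5, 15.0, 9.0])

fig, ax = plt.subplots()

# Histogram: hist returns the bin counts and the bin edges it used.
counts, edges, _ = ax.hist(wage, bins=4, range=(0, 16), alpha=0.5)

# Frequency curve: connect the bin counts at the bin midpoints.
midpoints = (edges[:-1] + edges[1:]) / 2
ax.plot(midpoints, counts, marker="o")

ax.set_xlabel("wage")
ax.set_ylabel("count")
```

Calling fig.savefig with a filename, or plt.show() in an interactive session, displays the result.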

Functions in Python and R

What?        Why?                              Python                                    R
Bar Diagram  Distribution of categorical data  matplotlib.pyplot.bar, seaborn.countplot  gf_bar, geom_bar
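
A bar diagram needs category counts first. The sketch below counts the sector column of the six sample rows with pandas and draws the bars with matplotlib; seaborn.countplot would do the counting and plotting in one call, but plain matplotlib is used here to keep the dependencies small.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Sector column of the six sample rows (not the full survey).
sector = pd.Series(["const", "sales", "sales", "clerical", "const", "clerical"])

# Count each category, sorted by category name for a stable bar order.
sector_counts = sector.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(sector_counts.index, sector_counts.values)
ax.set_xlabel("sector")
ax.set_ylabel("count")
```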

Functions in Python and R

What?                Why?                                     Python                               R
2-way-table          categorical vs categorical distribution  pandas.crosstab                      table, tally
2-way-table plot     categorical vs categorical distribution  pandas.crosstab and seaborn.heatmap  table and as.data.frame and geom_point
Boxplot              categorical vs numerical distribution    seaborn.boxplot                      gf_boxplot
Violinplot           categorical vs numerical distribution    seaborn.violinplot                   gf_violin
Point estimate plot  categorical vs numerical distribution    seaborn.pointplot                    gf_pointrange

Examples

         Not  Union
Married  278  72
Single   160  24
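
A table like this one can be produced with pandas.crosstab. The sketch below first cross-tabulates the six sample rows (so its counts are much smaller than the full-survey table), then re-enters the full-survey counts by hand to compute row shares; the variable names are illustrative.

```python
import pandas as pd

# married and union columns of the six sample rows (not the full survey).
married = pd.Series(
    ["Married", "Married", "Single", "Married", "Married", "Married"],
    name="married",
)
union = pd.Series(["Not", "Not", "Not", "Not", "Union", "Not"], name="union")

# 2-way table of counts; combinations that never occur are filled with 0.
sample_table = pd.crosstab(married, union)
print(sample_table)

# The full-survey counts from the slide, typed in by hand.
slide_table = pd.DataFrame(
    [[278, 72], [160, 24]],
    index=["Married", "Single"],
    columns=["Not", "Union"],
)

# Row shares: what fraction of each marital group is in a union.
row_shares = slide_table.div(slide_table.sum(axis=1), axis=0)
print(row_shares.round(3))
```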

Examples
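
For a categorical vs numerical comparison, a boxplot per group is one option. This sketch splits wage by sex for the six sample rows and draws the boxes with matplotlib's own boxplot rather than the seaborn.boxplot listed in the table; with three values per group the boxes are only a toy illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# sex and wage columns of the six sample rows (not the full survey).
df = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "M", "F"],
    "wage": [9.0, 5.5, 3.8, 10.5, 15.0, 9.0],
})

# One array of wages per category, in sorted category order.
groups = {name: g["wage"].to_numpy() for name, g in df.groupby("sex")}

# Numeric companion to the plot: median and mean wage per group.
group_summary = df.groupby("sex")["wage"].agg(["median", "mean"])
print(group_summary)

fig, ax = plt.subplots()
ax.boxplot(list(groups.values()))  # boxes are drawn at x = 1, 2, ...
ax.set_xticks(range(1, len(groups) + 1))
ax.set_xticklabels(list(groups.keys()))
ax.set_ylabel("wage")
```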

Functions in Python and R

What?            Why?                                 Python                     R
Correlation      numerical vs numerical summary       numpy.corrcoef             cor
Scatter plot     numerical vs numerical distribution  matplotlib.pyplot.scatter  gf_point
2d density plot  numerical vs numerical distribution  matplotlib.pyplot.hist2d   gf_density2d, geom_bin2d

Examples
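
A small sketch of the numerical vs numerical tools: numpy.corrcoef for the correlation and a matplotlib scatter plot. It uses the exper and age columns of the six sample rows only, so the resulting correlation describes that excerpt and not the full survey.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; drop this line for interactive use
import matplotlib.pyplot as plt

# exper and age columns of the six sample rows (not the full survey).
exper = np.array([27, 20, 4, 29, 40, 27])
age = np.array([43, 38, 22, 47, 58, 49])

# corrcoef returns the full 2x2 correlation matrix;
# the off-diagonal entry is the correlation between the two variables.
corr_matrix = np.corrcoef(exper, age)
r = corr_matrix[0, 1]
print(f"correlation(exper, age) = {r:.3f}")

fig, ax = plt.subplots()
ax.scatter(exper, age)
ax.set_xlabel("exper")
ax.set_ylabel("age")
```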