Welcome to CSC 84050 - Data Mining

Welcome to Data Mining

As access to data and the size of the datasets we can work with grow, so does our capacity for using data to drive discovery, analysis and decisions. Data at the scale now available supports a variety of techniques: statistical modeling, drilling down to highly specific subpopulations, learning responses, and selecting explanatory features to make predictive systems transparent. We expect the data we meet to be potentially very large: large enough to strain storage or processing, with vast numbers of observations and variables. The populations under study may contain many and varied subpopulations, possibly with different sets and structures of dependencies between variables. Data mining provides tools and techniques for discovering and identifying structures that enable us to formulate hypotheses and make valid predictions even where the assumptions of classical statistical methods fail.

Course description

Data mining is the name given to a variety of new analytical and statistical techniques that are already widely used in business and are starting to spread into social science research. Other closely related terms are ‘machine learning,’ ‘pattern recognition’ and ‘predictive analytics.’
Data mining methods can be applied to visual and textual data, but this class concentrates on symbolic or numerical data, and on data analysis and prediction rather than on the process of cleaning and processing data. In this area, data mining offers interesting alternatives to conventional statistical modeling methods such as regression and its offshoots. The primary focus is on unsupervised methods, where the structures emerge from the data instead of being trained from an explicit pairing of explanatory and response observations.

Each student will undertake a data mining analysis project as a final paper, typically analyzing a dataset chosen by the student.

Contact

mvejdemojohansson@gc.cuny.edu
Office 4425

Literature

We will be using Hastie, Tibshirani and Friedman’s book The Elements of Statistical Learning (available inter alia from Tibshirani’s own website), as well as supplementary material.

Learning goals and examination

After you finish this course, you will:

  • understand the mathematical and statistical foundations of data mining methodology and algorithms
  • become proficient with data mining software such as WEKA, Python and R
  • given a dataset, be able to discover and communicate patterns and relationships in the data that may be used for descriptive modeling or to make valid predictions