Complex data

When your data is very complex - high embedding dimension, intricate internal structure, and so on - even small datasets become difficult to handle.

Knowing the internal structure beforehand helps immensely: we have methods that make image or sound analysis feasible in spite of a high embedding dimension.

Far more difficult is the case of unknown internal structure. One area where this happens a lot is genomics.

Intrinsic and Ambient Dimension

The problem with most classical analysis methods is that their complexity tends to depend on the ambient dimension - the number of features in the dataset.

In most cases of interest, the data does not fill out the entire ambient space: it concentrates near a lower-dimensional subset, and so has a relatively high codimension.

The lower intrinsic dimension of data is what drives successful modeling.
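
To make this concrete, here is a small sketch (not from the original notes; it assumes numpy and scikit-learn): points of intrinsic dimension 2 are pushed into ambient dimension 50 by a random linear map, and PCA's explained-variance spectrum recovers the intrinsic dimension.

In [ ]:
# Sketch: intrinsic vs ambient dimension (assumes numpy and scikit-learn).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 1000 points of intrinsic dimension 2 ...
latent = rng.normal(size=(1000, 2))
# ... pushed into ambient dimension 50 by a random linear map.
X = latent @ rng.normal(size=(2, 50))

pca = PCA().fit(X)
# Essentially all variance sits in the first two components; the other
# 48 directions (the codimension) carry numerically nothing.
print(np.round(pca.explained_variance_ratio_[:5], 4))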

From Ambient to Intrinsic

One approach is to use dimensionality reduction methods to lower the dimension:

  • PCA: find a linear map that concentrates variance in as few coordinates as possible
  • Manifold learning: find a non-linear map under a manifold assumption.
  • Graph / Complex representations: build a combinatorial model of the data.

When the data is close enough to linear, PCA can perform very well. The non-linear approaches excel when the internal structure of the data is more complicated.

Why Non-linear?

[fig1 and fig2: illustrative figures, not reproduced in this export]
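
In place of the figures, a minimal sketch of the usual motivating example (assuming scikit-learn and matplotlib): PCA's best linear 2-D view of a curved sheet folds distant parts of the manifold onto each other, while a non-linear method unrolls the sheet.

In [ ]:
# Sketch: linear vs non-linear dimensionality reduction on a curved sheet.
import matplotlib.pyplot as plt
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, color = make_s_curve(n_samples=1000, random_state=0)  # 3-D S-shaped sheet

# Best linear 2-D view: the folds of the S get projected onto each other.
X_pca = PCA(n_components=2).fit_transform(X)
# Geodesic (along-the-sheet) distances: the sheet unrolls flat.
X_iso = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_pca[:, 0], X_pca[:, 1], c=color, s=5)
axes[0].set_title("PCA (linear)")
axes[1].scatter(X_iso[:, 0], X_iso[:, 1], c=color, s=5)
axes[1].set_title("Isomap (non-linear)")
plt.show()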

Manifold Learning Methods

  • MDS - Multidimensional Scaling: optimize an embedding to minimize the discrepancy between the original distances and the embedding distances
  • Isomap: use nearest neighbors to build a distance graph, take shortest-path distances in the graph as approximate geodesic distances between points, then apply MDS
  • Locally Linear Embedding (LLE): find the nearest neighbors of each data point, reconstruct each point linearly from its small neighborhood, then glue these local models into a global embedding
  • Spectral Embedding: use eigenvectors of the graph Laplacian as new coordinates
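
All four methods share scikit-learn's estimator API; a minimal sketch on a swiss-roll point cloud (parameter values here are illustrative, not tuned):

In [ ]:
# Sketch: the four methods above via scikit-learn (illustrative parameters).
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import MDS, Isomap, LocallyLinearEmbedding, SpectralEmbedding

X, _ = make_swiss_roll(n_samples=500, random_state=0)

methods = {
    # Stress minimization over pairwise distances.
    "MDS": MDS(n_components=2, random_state=0),
    # Shortest-path distances on a k-NN graph, then MDS.
    "Isomap": Isomap(n_neighbors=10, n_components=2),
    # Local linear reconstructions glued into a global embedding.
    "LLE": LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0),
    # Eigenvectors of the graph Laplacian as coordinates.
    "Spectral": SpectralEmbedding(n_components=2, n_neighbors=10, random_state=0),
}

for name, estimator in methods.items():
    Y = estimator.fit_transform(X)   # each yields a (500, 2) embedding
    print(f"{name:>8}: embedded to shape {Y.shape}")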

Topological Data Analysis

The field of Topological Data Analysis (TDA) starts with two observations and builds a new paradigm of data analysis techniques designed to deal with highly complex datasets.

  • Data Has Shape
  • Shape Matters

Once the shape of data is identified as important, several features of topology emerge as useful:

  • Coordinate Independence
  • Deformation Resilience
  • Compressed Representation

There are two main current directions in TDA:

  • Mapper - constructing intrinsic models of data
  • Persistent Homology - extending clustering to topological invariants that measure internal structures of clusters
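
As a taste of the persistent homology direction, here is a sketch (the notes do not prescribe a library; this assumes the ripser package): sample a noisy circle, compute its persistence diagrams, and look for one long-lived degree-1 class - the loop.

In [ ]:
# Sketch: persistent homology of a noisy circle (assumes the ripser package).
import numpy as np
from ripser import ripser

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)])   # points on a circle
X += rng.normal(scale=0.05, size=X.shape)             # plus noise

dgms = ripser(X)["dgms"]    # dgms[0]: H0 (components), dgms[1]: H1 (loops)

# One H1 class should persist far longer than the rest: topologically,
# the data has a single essential loop.
persistence = dgms[1][:, 1] - dgms[1][:, 0]
print("longest H1 persistence:", persistence.max())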

Mapper: graph / simplicial complex representations

Given: an arbitrary dataset $X$ and a filter function $f\colon X\to\mathbb{R}^d$.

The role of the filter function is to determine what differences in the data matter to the user: what distinctions need to be preserved in the new representation.

Mapper then:

  1. Bins $X$ into overlapping bins according to the filter function values
  2. Clusters each bin separately
  3. Connects overlapping clusters from adjacent bins to create a graph / simplicial complex representing the original data

The Nerve lemma from algebraic topology tells us that if we are sufficiently lucky - roughly, if the overlapping pieces and their intersections are topologically simple - the result is structurally equivalent to the original data.
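
A bare-bones sketch of the three steps for a one-dimensional filter (a hypothetical helper written for illustration, assuming numpy, scikit-learn and networkx; dedicated implementations such as KeplerMapper handle the cover and clustering far more carefully):

In [ ]:
# Sketch: a minimal 1-D Mapper (hypothetical helper, for illustration only).
from itertools import combinations

import networkx as nx
import numpy as np
from sklearn.cluster import DBSCAN


def mapper_graph(X, filter_values, n_bins=10, overlap=0.5, eps=0.5):
    """Build a Mapper graph from data X and filter values f(X) in R."""
    lo, hi = filter_values.min(), filter_values.max()
    width = (hi - lo) / n_bins
    step = width * (1 - overlap)          # assumes 0 <= overlap < 1

    nodes = []                            # each node = a set of point indices
    start = lo
    while start < hi:
        # Step 1: bin points by filter value, with overlapping intervals.
        in_bin = np.flatnonzero(
            (filter_values >= start) & (filter_values <= start + width)
        )
        if in_bin.size:
            # Step 2: cluster each bin separately.
            labels = DBSCAN(eps=eps).fit_predict(X[in_bin])
            for lab in set(labels) - {-1}:    # -1 is DBSCAN's noise label
                nodes.append(set(in_bin[labels == lab]))
        start += step

    # Step 3: connect clusters from overlapping bins that share points.
    G = nx.Graph()
    G.add_nodes_from(range(len(nodes)))
    for i, j in combinations(range(len(nodes)), 2):
        if nodes[i] & nodes[j]:
            G.add_edge(i, j)
    return G, nodes

Applied to a noisy circle with, say, the first coordinate as the filter, the resulting graph should come out as a cycle: a compressed representation with the same shape as the data.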