When your data is very complex - high embedding dimension, internal structure, ... - even small datasets become difficult to handle.
If you know the internal structure beforehand, this helps immensely; we have methods that make image or sound analysis feasible in spite of a high embedding dimension.
Far more difficult is the case of unknown internal structure. One area where this happens a lot is genomics.
The problem with most classical analysis methods is that their complexity tends to depend on ambient dimension - on the number of features in the data set.
In most cases of interest, the data does not fill out the entire ambient space: it has a relatively high codimension.
It is this lower intrinsic dimension, not the ambient dimension, that drives successful modeling.
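As a small illustration (a sketch in Python with NumPy; nothing here is prescribed by the text), points sampled from a 2-dimensional subspace and then mapped into a 50-dimensional ambient space still form a matrix of numerical rank 2: the 50 ambient features wildly overstate the intrinsic dimension.

```python
# Sketch: intrinsic dimension vs. ambient dimension (illustrative, not from the text).
import numpy as np

rng = np.random.default_rng(0)

intrinsic = rng.normal(size=(1000, 2))    # the data really lives in 2 dimensions
embedding = rng.normal(size=(2, 50))      # a random linear map into R^50
ambient = intrinsic @ embedding           # what we actually observe: 1000 x 50

print(ambient.shape)                      # (1000, 50) -> 50 features
print(np.linalg.matrix_rank(ambient))     # 2 -> intrinsic dimension; codimension 48
```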
One approach is to use dimensionality reduction methods to bring the representation closer to the intrinsic dimension:
When the data is linear enough, PCA can perform very well; nonlinear methods excel when the internal structure of the data is more complicated.
(fig1, fig2)
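As a concrete example (a hedged sketch assuming scikit-learn, which the text does not name), PCA gives a faithful two-dimensional view when the data is close to linear, while a manifold method such as Isomap does better on a curved dataset like the swiss roll:

```python
# Sketch: linear vs. nonlinear dimensionality reduction (assumes scikit-learn).
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1500, random_state=0)   # 3-D points on a curved 2-D sheet

# PCA finds the best *linear* 2-D view; it projects through the roll and
# mixes up points that are far apart along the sheet.
X_pca = PCA(n_components=2).fit_transform(X)

# Isomap approximates distances measured *along* the manifold, so it tends
# to unroll the sheet instead of flattening it.
X_iso = Isomap(n_components=2, n_neighbors=10).fit_transform(X)

print(X_pca.shape, X_iso.shape)   # (1500, 2) (1500, 2)
```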
The field of Topological Data Analysis (TDA) starts with a few observations - above all, that data has shape and that shape matters - and builds a new paradigm of data analysis techniques designed to deal with highly complex datasets.
Once the shape of data is identified as important, several features of topology emerge as useful: topological properties do not depend on a choice of coordinates, they are stable under continuous deformations of the data, and they yield compressed, combinatorial representations of shape.
There are two main current directions of TDA: persistent homology, which measures the shape of data across scales, and Mapper, which builds a compressed network summary of the data. The Mapper construction works as follows.
Given: an arbitrary dataset $X$ and a filter function $f: X\to\mathbb{R}^d$.
The role of the filter function is to determine what differences in the data matter to the user: what distinctions need to be preserved in the new representation.
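For concreteness, here are a few common choices of filter function (the particular examples and names below are my illustration, assuming NumPy and SciPy; the text does not prescribe them), each assigning a real number to every data point:

```python
# Illustrative filter functions for a point cloud X of shape (n_samples, n_features).
import numpy as np
from scipy.spatial.distance import cdist

def filter_first_coordinate(X):
    """Project each point onto its first coordinate."""
    return X[:, 0]

def filter_eccentricity(X):
    """Mean distance from each point to all others; highlights outlying points."""
    return cdist(X, X).mean(axis=1)

def filter_density(X, bandwidth=1.0):
    """Unnormalized Gaussian kernel density estimate; highlights dense cores."""
    sq_dists = cdist(X, X, "sqeuclidean")
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2)).mean(axis=1)
```

Projections emphasize differences along a chosen axis; eccentricity and density emphasize how central or crowded a point is.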
Mapper then covers the image of the filter with overlapping bins, pulls each bin back to a subset of $X$, clusters each of these subsets, and builds a network with one node per cluster and an edge whenever two clusters share data points.
The Nerve lemma from algebraic topology tells us that if we are sufficiently lucky - technically, if the resulting cover of the data is a good cover - the result is structurally (homotopy) equivalent to the original data.
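A minimal, simplified Mapper sketch (my own toy implementation, assuming NumPy, scikit-learn, and networkx; the function name `mapper_graph` and its parameters are hypothetical), using a one-dimensional filter for readability:

```python
# Toy Mapper: cover the filter range, pull back, cluster, build the nerve.
import numpy as np
import networkx as nx
from sklearn.cluster import DBSCAN

def mapper_graph(X, filter_values, n_intervals=8, overlap=0.5, eps=0.3, min_samples=5):
    """Build the Mapper graph of X under a real-valued filter."""
    lo, hi = float(filter_values.min()), float(filter_values.max())
    # Interval length chosen so n_intervals intervals with the given fractional
    # overlap exactly cover [lo, hi].
    length = (hi - lo) / (1.0 + (n_intervals - 1) * (1.0 - overlap))
    step = length * (1.0 - overlap)

    graph = nx.Graph()
    clusters = []                                   # (node id, set of point indices)

    for i in range(n_intervals):
        start = lo + i * step
        end = hi if i == n_intervals - 1 else start + length
        # Pull the interval back to the data: points whose filter value lands in it.
        members = np.where((filter_values >= start) & (filter_values <= end))[0]
        if len(members) < min_samples:
            continue
        # Cluster the preimage; every cluster becomes one node of the Mapper graph.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[members])
        for label in set(labels) - {-1}:            # -1 marks DBSCAN noise
            node = f"{i}-{label}"
            idx = set(members[labels == label].tolist())
            graph.add_node(node, size=len(idx))
            clusters.append((node, idx))

    # Nerve step: connect clusters that share data points.
    for a, (node_a, idx_a) in enumerate(clusters):
        for node_b, idx_b in clusters[a + 1:]:
            if idx_a & idx_b:
                graph.add_edge(node_a, node_b)
    return graph

# Usage: a noisy circle, filtered by its first coordinate.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(500, 2))
G = mapper_graph(X, X[:, 0])
print(G.number_of_nodes(), G.number_of_edges())     # typically a cycle-shaped graph
```

On the noisy circle, the resulting network tends to be a cycle itself, recovering the circular shape of the data.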