Far more subtle details exist, and are important.
As we start with a text data point, we have a sequence of bytes (or Unicode codepoints...). From these, we need to extract a useful representation of the content.
A tokenizer takes a byte stream and divides it up into distinct component. Something like:
Hold infinity in the palm of your hand.
transforms into:
Hold | [space] | infinity | [space] | in | [space] |
---|---|---|---|---|---|
the | [space] | palm | [space] | of | [space] |
your | [space] | hand | [period] |
As we start with a text data point, we have a sequence of bytes (or Unicode codepoints...). From these, we need to extract a useful representation of the content.
In some cases, the tokens are smaller units than space-or-punctuation-separated-character-sequences. How many tokens should be generated for either of:
25¢ | O'Malley | 5th | 250mi |
---|---|---|---|
Professor | Professors | Professor's | Professors' |
When venturing outside English, far more gnarly features show up in different languages.
As we start with a text data point, we have a sequence of bytes (or Unicode codepoints...). From these, we need to extract a useful representation of the content.
Tokens are largely a syntactic subdivision of the text. For more semantic content, one common step is to tag the tokens. By tagging, each token is equipped with information about its role in a sentence or its meaning.
Especially when used in translation or natural language controlled systems, the overall structure of the text needs to be recovered. This is the role of parsing - create a data structure that captures subdivisions into phrases, contexts and dependencies.
Most machine learning and data analytics systems expect, sooner or later, tidy data: some sort of numeric matrix with observations for each row and features for each column.
Somehow we need to take a sequence of (tagged) tokens and produce a fixed-size vector to process.
A standard first step in this process is to extract a list of all occurring words, and order them. The feature vectors will then use the position in this dictionary to describe a token.
Most machine learning and data analytics systems expect, sooner or later, tidy data: some sort of numeric matrix with observations for each row and features for each column.
Somehow we need to take a sequence of (tagged) tokens and produce a fixed-size vector to process.
Some words are so very common that they will overwhelm any system. For instance, each slide up until now has featured the words
a of or the
It is common practice to filter out these words.
The issue with the professor
, professors
, professor's
, professors'
example can be dealt with either by splitting into morpheme tokens, or by stemming - removing grammatical markers from words.
Most machine learning and data analytics systems expect, sooner or later, tidy data: some sort of numeric matrix with observations for each row and features for each column.
Somehow we need to take a sequence of (tagged) tokens and produce a fixed-size vector to process.
A first step is to use an incidence vector: for each document, put a 1 in the feature vector if that word occurs in the document; a 0 if it does not.
More flexible is to use a count of words for each document. This weights a word stronger if it is used often than if it shows up sporadically.
The word counts and bag of words fit in a generic scheme for measuring term frequency weighting:
Name | Weighting |
---|---|
binary | 0/1 |
raw count | number of occurrences |
term frequency | raw count / document length |
log normalized | log(1 + raw count) |
K normalized | K + (1-K)(raw count / highest raw count) |
Inverse document frequency measures how much information the word provides - if it is rare or common across all documents. Given a corpus with $N$ documents, and writing $n_t$ for the number of documents that contain the term $t$, some common measures are
Name | Weighting |
---|---|
unary | 1 (ie no weighting at all) |
idf | $\log(N/n_t) = -\log(n_t/N)$ |
smooth idf | $\log(N/(1+n_t))$ |
max idf | $\log[(\max_{t'} n_{t'}) / (1+n_t)]$ |
probabilistic idf | $\log (N-n_t)/n_t$ |
Term frequency - inverse document frequency is an adjustment of the counting strategy that emphasizes high term frequency, low document frequency terms as more semantically meaningful: a term that is very common when it occurs, but rarely does occur.
$$ tfidf(t, d, D) = tf(t,d) \cdot idf(t,D) $$For many of the term frequency measures, the values will increase if the size of the document increases. One particularly common metric to use is cosine similarity - a normalized dot product:
$$ CS(X,Y) = \frac{X\cdot Y}{\|X\|\cdot\|Y\|} $$This measures the (high dimensional) angle between the two vectors - disregarding the magnitude of the vector.
Word vectors are very, very sparse in most applications. Several transformations exist that make each dimension in the vector more meaningful.
Most dimensionality reduction methods could work - but one method is particularly popular lately:
word2vec
¶One word embedding scheme with impressive additive properties (adding / subtracting word vectors maps to semantically meaningful relationships) is word2vec
.
The fundamental idea with word2vec
is to train a neural network with one hidden layer to predict a word from its context (Continuous Bag of Words / CBOW) or a context from a word (Skip-grams) - and then use the hidden layer activations as a vector representative of the word.
Trained and ready to use word2vec
networks are available for download.
A discriminative model is one that provides conditional probabilities (or non-probabilistic statements) - and thus allows us to tell the difference between objects, but not to generate new examples of objects.
A generative model is one that produces a joint probability distribution - from which we can sample to generate new objects.
One particular type of generative model that I have seen very often in machine learning contexts are the Probabilistic Graphical Models (PGMs).
Given a directed graph $\Gamma=(V,E)$, a PGM on that graph is a probabilistic model that encodes the joint probability as a product of probabilities of the nodes conditioned on their directed parents:
$$ p(x) = \prod_{v\in V} p(x_v | \{x_s : \forall (x_s, x_v)\in E\}) $$Software supporting these include JAGS, OpenBUGS, Stan.
Graphical Models can end up very messy to draw out. Significant simplification is available through plate notation:
Each random variable is a circle. Observable random variables are filled in, latent variables are clear.
Each graphical relation is a directed edge.
Indexed collections of related entities are enclosed in a square (the plate) which is annotated by its indexing set.
pgm.render()
<matplotlib.axes._axes.Axes at 0x122ee8278>
With the right choices, this models a bayesian linear regression. To read this:
The complete graph encodes a joint probability of all the $\epsilon$, $\beta$, $y_i$ where each arrow is a conditional probability:
$$ \mathbb{P}(y_{1:N}) = \mathbb{P}(\epsilon) \mathbb{P}(\beta|\mu, \sigma^2) \prod_{i\in1:N}\mathbb{P}(y_i | \epsilon, \beta, x_i) $$Given
Latent Dirichlet Allocation is a model:
lda.render()
<matplotlib.axes._axes.Axes at 0x1166a22b0>
lda.render()
<matplotlib.axes._axes.Axes at 0x1166a22b0>
Fitting an LDA model uses variational calculus or Gibbs sampling to estimate the parameters.
Transforming with an LDA model produces document topic distributions given a document/word matrix.
Msny social media platforms offer REST APIs for access, often with libraries provided for major programming languages.
Words get their meaning from their use and their contexts. Meaning comes from use, not etymology, not prejudice.
Corpora provide text samples, often tagged with additional linguistic information.
Some corpora (COCA, BNC) are balanced: contain text samples from different sources - spoken; fiction; non-fiction; news; …
Text analysis can not provide negative evidence. Large text samples always have patterns - enter with an articulated research question.
"We TOLD you it was hard." "Yeah, but now that I'VE tried, we KNOW it's hard."