Intro to Language Modeling
"A statistical language model is a probability distribution over sequences of words."
Source: https://en.wikipedia.org/wiki/Language_model
What does it mean and what is it good for?
Consider a word as the value of a variable. A sequence of n words is then a series of n variables, each assigned a word from the sequence. This n-dimensional vector is an observation, e.g. a sentence from a book.
A language model is then a joint probability distribution over the variable values (i.e. the words), which allows us to answer questions about the relationships between these variables. One may be interested in ...
- the most likely words for a subset of the variables, given the other observed words, or
- statistical regularities in the distribution of words, which may serve as features for some form of classification, e.g. (dis-)similarity
Instead of word, one often speaks of term or token. We will use these synonymously.
Objective and Application for BMCr
Our objective is to discover and evaluate relevant text structures in Canvas models. Technically, we are interested in probability distributions of word sequences, which ...
- ... are typical for canvas boxes such as problem formulation, value proposition, etc.
- ... are often used in conjunction with sequences from other boxes or canvases
Applications
- Search & Find of similar canvas models (Query and information retrieval)
- Support for classification (Document classification)
- Enable recommendation systems
Discrete models: n-gram Models
These kinds of models simply count the occurrences of words or word sequences. The relative frequencies serve as probabilities.
Unigram model (n=1)
In this model, the probability of each word depends only on the word itself.
P(t1, t2, t3) = P(t1) P(t2) P(t3)
The joint probability is the product of the individual word probabilities, i.e. the occurrence of each word is independent of its context.
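The unigram factorization above can be sketched in a few lines of Python. The toy corpus is made up for illustration; probabilities are just relative word frequencies.

```python
# Unigram model sketch: P(word) = relative frequency in the corpus,
# and a sequence probability is the product of the word probabilities.
from collections import Counter

corpus = "the red house saw the red door".split()
counts = Counter(corpus)
total = len(corpus)

def p_unigram(word):
    return counts[word] / total

def p_sequence(words):
    # independence assumption: joint = product of unigram probabilities
    p = 1.0
    for w in words:
        p *= p_unigram(w)
    return p
```

Here `p_unigram("red")` is 2/7, and `p_sequence(["the", "red"])` is simply (2/7) * (2/7), regardless of word order or context.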
n-gram Model
It is assumed that the probability of observing the i-th word w_i in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words (n-th order Markov property).
Ex.: bigram, n=2, word / token depends only on previous word.
P(I, saw, the, red, house) = P(I | &lt;s&gt;) P(saw | I) P(the | saw) P(red | the) P(house | red) P(&lt;/s&gt; | house)
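A minimal bigram estimator can be sketched as follows, using &lt;s&gt; and &lt;/s&gt; as sentence boundary markers. The two training sentences are made up for illustration.

```python
# Bigram model sketch: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
from collections import Counter

sentences = [["I", "saw", "the", "red", "house"],
             ["I", "saw", "the", "door"]]

bigrams = Counter()
unigrams = Counter()
for s in sentences:
    tokens = ["<s>"] + s + ["</s>"]
    unigrams.update(tokens[:-1])              # history counts
    bigrams.update(zip(tokens, tokens[1:]))   # pair counts

def p_bigram(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(s):
    # product of the conditional bigram probabilities
    tokens = ["<s>"] + s + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= p_bigram(word, prev)
    return p
```

With this data, "the" is followed by "red" in one of its two occurrences, so `p_bigram("red", "the")` is 0.5 and the whole first sentence gets probability 0.5.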
App. Document Classification
- count the different words (n=1) or n-grams in a document --> term frequencies
- for feature generation, calculate various measures that characterize the text, e.g. term frequency–inverse document frequency
Example: https://en.wikipedia.org/wiki/Bag-of-words_model#Example_implementation
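The counting step can be sketched as a small bag-of-words implementation; the two documents below are invented for illustration.

```python
# Bag-of-words sketch: each document becomes a vector of term frequencies
# over a shared, sorted vocabulary.
docs = ["the value proposition targets small firms",
        "the problem formulation targets large firms"]

vocab = sorted({w for d in docs for w in d.split()})

def tf_vector(doc):
    words = doc.split()
    return [words.count(t) for t in vocab]

vectors = [tf_vector(d) for d in docs]
```

Each document is now a fixed-length count vector, which is the input format used by most classifiers.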
Ex. term frequency–inverse document frequency (tf–idf)
"... numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus."
Source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
tf-idf is often used as a similarity measure to evaluate the relevance of terms in documents within a document collection. One perspective is to look for terms or sequences which two documents have in common, but occur rarely within the whole document collection.
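One common tf–idf variant (raw term frequency times log of inverse document frequency) can be sketched as below; real libraries such as scikit-learn use smoothed variants, so exact numbers differ. The toy documents are made up.

```python
# Minimal tf-idf sketch: tf(t, d) * log(N / df(t)).
# Terms occurring in many documents get a low idf and thus a low weight.
import math

docs = [["red", "house", "red"],
        ["red", "door"],
        ["blue", "door"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    df = sum(1 for d in docs if term in d)  # document frequency
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)
```

"red" occurs in two of three documents and is therefore down-weighted; "house" occurs in only one, so in the first document it scores higher than "red" despite appearing less often.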
Data sparsity problem
For large texts there are exponentially many possible word sequences, but only comparatively few are observed: the space of possible sequences is much larger than the set of sequences seen in training. As a consequence, statistics for most sequences are missing, although sequence statistics are needed to properly estimate probabilities.
Source: https://en.wikipedia.org/wiki/Language_model
The data sparsity problem is a direct consequence of discrete models. It motivates the use of vector space models.
Vector Space Models
Remember that words are the values of variables in a vector of length n (i.e. the word sequence length). We are interested in the transition probabilities between words, that is, the likelihood that they co-occur.
We therefore create real-valued vectors from word sequences in such a way that vectors of similar words are grouped together in the vector space. These vectors are distributed numerical representations of word features, such as the context of individual words. Similarity can be expressed as distance between vectors in this space.
The numerical representation is no longer a sequence of word frequencies, but real values in a vector space under some grouping / similarity measure. The words are thus embedded in the vector space, and the procedure for finding these vectors is therefore called word embedding.
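Distance in this space is typically measured as cosine similarity; the three word vectors below are invented for illustration (real embeddings come from training).

```python
# Cosine similarity sketch: similar words should have a cosine close to 1,
# unrelated words a cosine close to 0.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# hypothetical 3-dim embeddings
house = [0.9, 0.1, 0.3]
building = [0.8, 0.2, 0.35]   # assumed to be close to "house"
door = [0.1, 0.9, 0.2]
```

With these toy vectors, `cosine(house, building)` is much larger than `cosine(house, door)`, which is exactly the grouping property the text describes.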
Embedding as Expression of Probability
Idea: neural net language models produce actual probabilities of word occurrence given some context
Approach
Neural net language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution
P(w_t | context) \forall t \in V
context might be a fixed-size window of previous words
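The classifier view can be sketched as a linear layer plus softmax: given a context vector, it yields P(w_t | context) for every word t in the vocabulary. The weights here are random, so the resulting distribution is meaningless except in shape.

```python
# Softmax classifier sketch over a toy vocabulary: scores are dot products
# of a (hypothetical) per-word weight vector with the context vector.
import math
import random

vocab = ["I", "saw", "the", "red", "house"]
dim = 4
random.seed(0)

W = {w: [random.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def softmax_probs(context_vec):
    scores = {w: sum(a * b for a, b in zip(W[w], context_vec)) for w in vocab}
    m = max(scores.values())                       # for numerical stability
    exps = {w: math.exp(s - m) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

probs = softmax_probs([0.2, -0.1, 0.5, 0.3])
```

Training would adjust W so that the probability of the actually observed next word is maximized; the sketch only shows the forward pass.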
Technologies
- Neural autoregressive models (NADE), http://www.dmi.usherb.ca/~larocheh/projects_nade.html
Options
- use "future" words as well as "past" words as features,
- make a neural network learn the context given a word; one then maximizes the log-probability of observing the context words
Word Embedding as Compressed Representation
Idea: a word / sequence of words gets a compressed / efficient representation in an n-dimensional vector space, i.e. the word is encoded as an n-dimensional vector
Approach
- word is mapped onto an n-dimensional real vector
- the embedding is the activation of the neural network's last hidden layer, i.e. n is the number of neurons in that hidden layer
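The mapping can be sketched without any framework: a word embedding is a row of the weight matrix feeding the hidden layer, and multiplying a one-hot input by that matrix selects the row. The matrix values below are made up; in practice they are learned during training.

```python
# Embedding lookup sketch: one_hot(word) @ E = hidden layer activation,
# i.e. the word's n-dimensional embedding vector.
vocab = ["problem", "value", "proposition"]
n = 2  # hidden layer size = embedding dimension

E = [[0.1, 0.7],   # "problem"
     [0.6, 0.2],   # "value"
     [0.5, 0.3]]   # "proposition"

def embed(word):
    one_hot = [1.0 if w == word else 0.0 for w in vocab]
    # linear hidden layer: picks out the matching row of E
    return [sum(one_hot[i] * E[i][j] for i in range(len(vocab)))
            for j in range(n)]
```

This is why the embedding dimension equals the number of neurons in the hidden layer: the embedding *is* that layer's activation for a one-hot input.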
Technologies
- Autoencoders, https://en.wikipedia.org/wiki/Autoencoder
- Tutorial on autoencoders with Keras, https://blog.keras.io/building-autoencoders-in-keras.html
The amazing power of word vectors
Blog post on April 21, 2016 by Adrian Colyer containing impressive examples.
Link: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/
Generative models
The autoencoder concept has become more widely used for learning generative models of data.
Source: https://en.wikipedia.org/wiki/Autoencoder
Tools
- Document analysis and classification: gensim
- Keras
- General Architecture for Text Engineering (GATE) in Java
Other Resources
- Sebastian Mantsch, Information Retrieval: Vector Space Model, Seminar Paper, HFT Stuttgart, June 10, 2014.