
Intro to Language Modeling

"A statistical language model is a probability distribution over sequences of words."

Source: https://en.wikipedia.org/wiki/Language_model

What does it mean and what is it good for?

Consider a word as the value of a variable. A sequence of words of length n is then a series of n variables, where each variable is assigned a word from the sequence. This n-dimensional vector is an observation, e.g. a sentence from a book.

A language model is then a joint probability distribution over the values of these variables (i.e. the words), which allows us to answer questions about the relationships between them. One may be interested in ...

  • the most likely words of a subset of the observations given the other words, or
  • the distribution of words, whose statistical regularities may serve as features for some form of classification, e.g. (dis-)similarity

Instead of word, one often talks about term or token. We will use them synonymously.

Objective and Application for BMCr

Our objective is to discover and evaluate relevant text structures in Canvas models. Technically, we are interested in probability distributions of word sequences, which ...

  • ... are typical for canvas boxes such as problem formulation, value proposition, etc.
  • ... are often used in conjunction with sequences from other boxes or canvases

Applications

  • Search & Find of similar canvas models (Query and information retrieval)
  • Support for classification (Document classification)
  • Enable recommendation systems

Discrete models: n-gram Models

These kinds of models simply count the occurrences of words or word sequences. The relative frequencies serve as probability estimates.

Unigram model (n=1)

In this model, the probability of each word depends only on the word itself.

P(t1, t2, t3) = P(t1) P(t2) P(t3)

The joint probability is the product of the individual word probabilities, i.e. the occurrence of each word is assumed to be independent of its context.
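
As a minimal sketch, a unigram model can be estimated by plain counting. The toy corpus below is hypothetical, and there is no smoothing for unseen words:

```python
from collections import Counter

# hypothetical toy corpus -- not BMCr data
corpus = "the customer pays a monthly fee for the service".split()

counts = Counter(corpus)
total = sum(counts.values())

def p_unigram(word):
    """Relative frequency of a single word (0 for unseen words)."""
    return counts[word] / total

def p_sequence(words):
    """Joint probability under the independence assumption:
    P(t1, ..., tn) = P(t1) * ... * P(tn)."""
    p = 1.0
    for w in words:
        p *= p_unigram(w)
    return p

print(p_sequence(["the", "service"]))
```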

n-gram Model

It is assumed that the probability of observing the i-th word w_i in the context history of the preceding i − 1 words can be approximated by the probability of observing it in the shortened context history of the preceding n − 1 words (n-th order Markov property).

Ex.: bigram, n = 2, each word / token depends only on the previous word.

P(I, saw, the, red, house) = P(I | <s>) P(saw | I) P(the | saw) P(red | the) P(house | red) P(</s> | house)
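
A minimal bigram sketch along the same lines, again with a hypothetical toy corpus, maximum-likelihood estimates and no smoothing:

```python
from collections import Counter

# hypothetical toy corpus with sentence markers <s> and </s>
sentences = [["<s>", "i", "saw", "the", "red", "house", "</s>"],
             ["<s>", "i", "saw", "the", "dog", "</s>"]]

bigrams = Counter()
unigrams = Counter()
for sent in sentences:
    unigrams.update(sent[:-1])           # history counts (</s> is never a history)
    bigrams.update(zip(sent, sent[1:]))  # adjacent word pairs

def p_bigram(word, prev):
    """P(word | prev) as a maximum-likelihood estimate."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def p_sentence(words):
    """P(w1, ..., wn) = product over i of P(w_i | w_{i-1}), incl. <s> and </s>."""
    tokens = ["<s>"] + words + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= p_bigram(word, prev)
    return p

print(p_sentence(["i", "saw", "the", "red", "house"]))
```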

App. Document Classification

Ex. Bag-of-words model

  • count the different words (n = 1) or n-grams in a document → term frequencies
  • tool for feature generation: calculate various measures that characterize the text, e.g. term frequency–inverse document frequency

Example: https://en.wikipedia.org/wiki/Bag-of-words_model#Example_implementation
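
Complementing the linked example, a minimal bag-of-words sketch in plain Python (the two documents are made up; in practice one would rather use a library such as gensim, see Tools below):

```python
from collections import Counter

# hypothetical canvas-like snippets
doc1 = "customers pay a monthly fee for the service"
doc2 = "the service offers customers a free trial"

bow1 = Counter(doc1.lower().split())
bow2 = Counter(doc2.lower().split())

# shared vocabulary -> term-frequency vectors of equal length
vocab = sorted(set(bow1) | set(bow2))
vec1 = [bow1[t] for t in vocab]
vec2 = [bow2[t] for t in vocab]

print(vocab)
print(vec1)
print(vec2)
```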

Ex. term frequency–inverse document frequency (tf–idf)

"... numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus."

Source: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

tf-idf is often used as a relevance / similarity measure to evaluate the importance of terms in documents within a document collection. One perspective is to look for terms or sequences which two documents have in common but which occur rarely within the whole document collection.
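
A minimal tf-idf sketch over a hypothetical three-document collection, using a common log-scaled idf variant (a real project would rather use a library implementation such as gensim's TfidfModel):

```python
import math
from collections import Counter

# hypothetical document collection
docs = ["customers pay a monthly fee",
        "customers get a free trial",
        "the platform charges a fee per transaction"]
tokenized = [d.lower().split() for d in docs]

def tf(term, tokens):
    """Relative term frequency within one document."""
    return tokens.count(term) / len(tokens)

def idf(term):
    """Log-scaled inverse document frequency over the collection."""
    n_docs_with_term = sum(1 for tokens in tokenized if term in tokens)
    return math.log(len(tokenized) / n_docs_with_term)

def tf_idf(term, tokens):
    return tf(term, tokens) * idf(term)

# 'fee' occurs in two documents, 'trial' only in one -> 'trial' gets more weight
print(tf_idf("fee", tokenized[0]))
print(tf_idf("trial", tokenized[1]))
```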

Data sparsity problem

"for large texts there are exponentially many word sequences and fewer statistics available. Space of sequences is larger than the sequences seen previously. As a consequence, statistics missing. However, sequence statistics is needed to properly estimate probabilities."

Source: https://en.wikipedia.org/wiki/Language_model

The data sparsity problem is a consequence of discrete models and motivates the use of vector space models.

Vector Space Models

Remember that words are values of variables in a vector of length n (i.e. the word sequence length). We are interested in the transitional probabilities between words, that is, the likelihood that they co-occur.

We therefore create real-valued vectors from word sequences in such a way that vectors of similar words are grouped together in the vector space. These vectors are distributed numerical representations of word features, such as the context of individual words. Similarity can then be expressed as distance between vectors in this space.

The numerical representation is no longer a sequence of word frequencies, but a real-valued vector in a vector space under some grouping / similarity measure. Words are thus embedded in the vector space using this similarity measure; the procedure to find these vectors is therefore called word embedding.
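
A minimal sketch of the similarity idea with NumPy; the vectors below are made up purely for illustration, real embeddings would come from a trained model:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# made-up 3-dimensional "embeddings" for illustration only
vectors = {
    "customer": np.array([0.9, 0.1, 0.2]),
    "client":   np.array([0.8, 0.2, 0.3]),
    "invoice":  np.array([0.1, 0.9, 0.4]),
}

print(cosine_similarity(vectors["customer"], vectors["client"]))   # high
print(cosine_similarity(vectors["customer"], vectors["invoice"]))  # lower
```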

Embedding as Expression of Probability

Idea: neural net language models produce actual probabilities of word occurrence given some context

Approach

Neural net language models are constructed and trained as probabilistic classifiers that learn to predict a probability distribution

P(w_t | context) ∀ t ∈ V, where V is the vocabulary

The context might be, for example, a fixed-size window of previous words.
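
A minimal Keras sketch of such a classifier over a fixed context window; vocabulary size, embedding dimension and the random toy data are assumptions made only to keep the sketch runnable:

```python
import numpy as np
import tensorflow as tf

vocab_size = 1000   # assumed vocabulary size
context_len = 4     # fixed-size window of previous words
embed_dim = 32      # assumed embedding dimension

# P(w_t | context): classify the next word from a window of previous word ids
model = tf.keras.Sequential([
    tf.keras.Input(shape=(context_len,)),
    tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=embed_dim),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # distribution over V
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# random toy data standing in for (context window, next word) pairs
contexts = np.random.randint(0, vocab_size, size=(64, context_len))
next_words = np.random.randint(0, vocab_size, size=(64,))
model.fit(contexts, next_words, epochs=1, verbose=0)

probs = model.predict(contexts[:1])   # shape (1, vocab_size), rows sum to 1
```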

Technologies

Options

  • use "future" words as well as "past" words as features,
  • make a neural network learn the context, given a word. One then maximizes the log-probability
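
A minimal gensim sketch of the second option, trained on hypothetical tokenized texts:

```python
from gensim.models import Word2Vec

# hypothetical tokenized canvas texts
sentences = [["customers", "pay", "a", "monthly", "fee"],
             ["customers", "get", "a", "free", "trial"],
             ["the", "platform", "charges", "a", "fee"]]

# sg=1 selects the skip-gram objective, sg=0 the CBOW objective;
# in gensim >= 4.0 the dimension parameter is 'vector_size' (older versions: 'size')
model = Word2Vec(sentences, vector_size=20, window=2, min_count=1, sg=1)

print(model.wv.most_similar("fee", topn=3))
```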

Word Embedding as Compressed Representation

Idea: a word / sequence of words is given a compressed / efficient representation in an n-dimensional vector space, i.e. the word is encoded as an n-dimensional vector

Approach

  • each word is mapped onto an n-dimensional real vector
  • the vector corresponds to the neural network's last "hidden" layer, i.e. n is the size (number of neurons) of that hidden layer
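
Continuing the Keras sketch from above, the learned word vectors can be read out of the network's embedding layer; layer name and sizes are assumptions:

```python
import tensorflow as tf

vocab_size, embed_dim = 1000, 32   # assumed sizes, as in the sketch above

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Embedding(vocab_size, embed_dim, name="word_vectors"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(vocab_size, activation="softmax"),
])

# after training, each row of this matrix is the n-dimensional code of one word
embedding_matrix = model.get_layer("word_vectors").get_weights()[0]
print(embedding_matrix.shape)   # (vocab_size, embed_dim), i.e. one vector per word
```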

Technologies

The amazing power of word vectors

Blog post on April 21, 2016 by Adrian Colyer containing impressive examples.

Link: https://blog.acolyer.org/2016/04/21/the-amazing-power-of-word-vectors/

Generative models

The autoencoder concept has become more widely used for learning generative models of data.

Source: https://en.wikipedia.org/wiki/Autoencoder
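
A minimal Keras autoencoder sketch: the small middle layer is the compressed code, analogous to the embedding idea above (all sizes and data are made up):

```python
import numpy as np
import tensorflow as tf

input_dim, code_dim = 100, 10   # assumed sizes, e.g. tf-idf vectors -> 10-dim code

inputs = tf.keras.Input(shape=(input_dim,))
code = tf.keras.layers.Dense(code_dim, activation="relu", name="code")(inputs)
outputs = tf.keras.layers.Dense(input_dim, activation="sigmoid")(code)

autoencoder = tf.keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# random toy data standing in for document vectors
x = np.random.rand(256, input_dim)
autoencoder.fit(x, x, epochs=1, verbose=0)    # learn to reconstruct the input itself

encoder = tf.keras.Model(inputs, code)        # extracts the compressed representation
print(encoder.predict(x[:1]).shape)           # (1, code_dim)
```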

Tools

  • Document analysis, classification: gensim
  • Keras
  • General Architecture for Text Engineering (GATE) in Java

Other Resources
