Goal

This is documentation for MixModel, a finite, parametric mixture model with \(K\) components.

This document only addresses allocation, which means we only describe the notation and update equations that pertain to assigning data items to clusters/components. Anything to do with the likelihood function is covered in the appropriate obsmodel documentation.

Notation

Parameters

Let vector \(w\) denote the probability of selecting each of the \(K\) components.

\begin{equation*} w = [w_1 ~ w_2 ~ \ldots w_K ] \end{equation*}

This vector satisfies two constraints:

  • each entry is positive: \(w_k > 0\)
  • the vector sums to one: \(\sum_{k=1}^K w_k = 1\).

Prior parameters

When enabled, we place a symmetric Dirichlet prior on \(w\), with parameter \(\alpha_0\).

\begin{equation*} p( w \mid \alpha_0) = \mbox{Dirichlet}( \alpha_0, \alpha_0, \ldots \alpha_0) \end{equation*}

Here, \(\alpha_0\) is a positive scalar: \(\alpha_0 > 0\).
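As a quick illustration of this prior (not part of bnpy's actual interface), the snippet below draws one weight vector from a symmetric Dirichlet using numpy; the names `K` and `alpha0` and their values are arbitrary choices local to this sketch.

```python
import numpy as np

K = 5          # number of components (arbitrary choice for this sketch)
alpha0 = 1.0   # symmetric Dirichlet concentration parameter

# Draw one weight vector w ~ Dirichlet(alpha0, ..., alpha0).
# By construction, every entry is positive and the entries sum to one.
w = np.random.dirichlet(alpha0 * np.ones(K))
print(w, w.sum())   # w.sum() equals 1.0 up to floating-point rounding
```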

Local parameters: discrete assignments

We let \(z_n\) denote the assignment of data item \(n\) to one of the \(K\) clusters.

Here, we'll use one-hot notation, in which \(z_n\) is a binary vector of length \(K\).

\begin{equation*} z_n = [ z_{n1} z_{n2} \ldots z_{nK} ] \end{equation*}

This vector is constrained so only one entry of \(z_n\) is one, and the rest are zeros.

Example

Suppose there are 5 clusters. We assign item 1 to cluster 3, and item 2 to cluster 4.

Then we have

\begin{align*} z_1 = [0 0 1 0 0] \\ z_2 = [0 0 0 1 0] \end{align*}

Local posterior

The goal of learning is to estimate the posterior \(p(z_n \mid x_n, w, \phi)\), where \(\phi\) are generic parameters for the likelihood function (aka obsmodel). Because this posterior is a discrete distribution over \(K\) choices, we'll parameterize it with a \(K\)-length vector \(r_n\).

\begin{equation*} p( z_n \mid x_n, w, \phi) = \mbox{Discrete}( z_n \mid r_{n1}, r_{n2}, \ldots r_{nK} ) \end{equation*}

We know that the vector \(r_n\) must obey two constraints

  • each entry is positive: \(r_{nk} > 0\)
  • the vector sums to one: \(\sum_{k=1}^K r_{nk} = 1\).

We can intuitively interpret \(r_{nk}\) as the responsibility that cluster \(k\) has for item \(n\) under the posterior. Here, we think of responsibility in a "fractional" sense, where there is one whole unit divided up among all \(K\) components. When item \(n\) is explained entirely by one component \(k\), then \(r_{nk} \approx 1\) and \(r_{n\ell} \approx 0\) for \(\ell \neq k\).

Local posterior updates (E-step)

We assume we have a current estimate of \(w\) and a precomputed log-likelihood matrix \(L\) of size \(N \times K\). The matrix \(L\) has a row for every data item \(n\) and a column for every component \(k\).

\begin{equation*} L_{nk} = \log p( x_n \mid z_{nk}=1, \phi) \end{equation*}

Given both \(w\) and \(L\), we can update \(r_{nk}\) in two easy steps.

First, we compute the unnormalized log posterior (which is accurate up to an additive constant).

\(\begin{align} \log \tilde{r}_{nk} &= \log p( z_{nk} = 1 \mid x_n, w, \phi) + \mbox{const} \\ &= \log p( z_{nk}=1 \mid w) + \log p( x_n \mid z_{nk}=1, \phi) \\ &= \log w_k + L_{nk} \end{align}\)

Second, we exponentiate and normalize

\(\begin{align} r_{nk} = \frac{ \exp[ \log \tilde{r}_{nk} ] }{\sum_{\ell=1}^K \exp[ \log \tilde{r}_{n\ell} ] } \end{align}\)

Note that in implementations we must be careful to avoid numerical underflow when exponentiating, since the unnormalized log probabilities can all be very negative. The standard fix is to subtract the maximum entry of each row before exponentiating; this leaves the normalized result unchanged.
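To make these two steps and the underflow concern concrete, here is a minimal numpy sketch of the E-step. It assumes `w` is a length-\(K\) weight vector and `L` is the \(N \times K\) log-likelihood matrix defined above; the function name `calc_resp` is illustrative, not bnpy's actual routine.

```python
import numpy as np

def calc_resp(w, L):
    """Sketch of the E-step: responsibilities r (N x K) from weights w and log-lik matrix L.

    w : length-K probability vector
    L : N x K array with L[n, k] = log p(x_n | z_nk = 1, phi)
    """
    # Step 1: unnormalized log posterior, log rtilde_{nk} = log w_k + L_{nk}
    logr = np.log(w)[np.newaxis, :] + L              # shape (N, K)

    # Step 2: exponentiate and normalize each row.
    # Subtracting the per-row maximum first (the log-sum-exp trick) keeps exp()
    # from underflowing to an all-zero row when every log probability is very negative.
    logr -= logr.max(axis=1, keepdims=True)
    r = np.exp(logr)
    r /= r.sum(axis=1, keepdims=True)
    return r
```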

Global parameter updates (M-step)

Given local responsibility parameters \(r\), we compute new estimates of global weights \(w\). Our update must maximize the expected complete-data log-likelihood.

\(\begin{align} w &\gets \mbox{argmax}_w \mbox{E}_{p(z_n \mid r_n)} \Big[ \log p(z \mid w) + \log p( x \mid z, \phi) \Big] & \mbox{subject to}~ w_k > 0, \sum_{k=1}^K w_k = 1 \end{align}\)

We can simplify the objective by dropping the likelihood term, which does not depend on \(w\):

\(\begin{align} &= \mbox{argmax}_w \mbox{E}_{p(z_n \mid r_n)} \Big[ \log p(z \mid w) \Big] \\ &= \mbox{argmax}_w \sum_{k=1}^K \sum_{n=1}^N \mbox{E}[z_{nk}] \log w_k \\ &= \mbox{argmax}_w \sum_{k=1}^K \Big( \sum_{n=1}^N r_{nk} \Big) \log w_k \end{align}\)

Define a new sufficient statistic \(N_k\): \(N_k = \sum_{n=1}^N r_{nk}\).

Here, \(N_k\) can be interpreted as the effective number of items assigned to cluster \(k\).

Now, we have the following optimization problem

\(\begin{align} &~ \mbox{argmax}_w \sum_{k=1}^K N_k \log w_k \\ & \mbox{subject to}~ w_k > 0, \sum_{k=1}^K w_k = 1 \end{align}\)

which, using Lagrange multipliers, has a simple closed-form solution

\begin{equation*} w = [ \frac{N_1}{N} ~ \frac{N_2}{N} ~ \ldots \frac{N_K}{N} ] \end{equation*}

where \(N = \sum_{k=1}^K N_k\) is the total number of data items.
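Continuing the sketch above, the M-step is a two-line computation once the responsibility matrix `r` is available; as before, the function name is illustrative rather than bnpy's exact interface.

```python
import numpy as np

def update_w(r):
    """Sketch of the M-step: maximum-likelihood weights w from the N x K responsibility matrix r."""
    Nk = r.sum(axis=0)       # sufficient statistics N_k = sum_n r_{nk}, shape (K,)
    return Nk / Nk.sum()     # w_k = N_k / N, where N = sum_k N_k = total item count
```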
