bnpy-dev / docs / allocmodel / MixModel
Goal
This is documentation for MixModel, a finite, parametric mixture model with \(K\) components.
This document addresses only allocation: we describe the notation and update equations that pertain to assigning data items to clusters/components. Anything to do with the likelihood function is covered in the appropriate obsmodel documentation.
Notation
Parameters
Let vector \(w\) denote the probability of selecting each of the \(K\) components.
This vector satisfies two constraints:
- each entry is positive: \(w_k > 0\)
- the vector sums to one: \(\sum_{k=1}^K w_k = 1\).
Prior parameters
When enabled, we place a symmetric Dirichlet prior on \(w\), with parameter \(\alpha_0\).
Here, \(\alpha_0\) is a positive scalar: \(\alpha_0 > 0\).
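As a quick illustration (a NumPy sketch, not bnpy's own code; the values of \(\alpha_0\) and \(K\) below are arbitrary), we can draw weight vectors from this symmetric Dirichlet prior and check that they satisfy the constraints above:

```python
import numpy as np

alpha0 = 0.5  # illustrative value, not a bnpy default
K = 5

rng = np.random.default_rng(0)
# Symmetric Dirichlet: every component shares the same concentration alpha0.
w = rng.dirichlet(alpha0 * np.ones(K))
# Each draw is a valid weight vector: positive entries summing to one.
```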
Local parameters: discrete assignments
We let \(z_n\) denote the assignment of data item \(n\) to one of the \(K\) clusters.
Here, we'll use one-hot notation, in which \(z_n\) is a binary vector of length \(K\).
This vector is constrained so only one entry of \(z_n\) is one, and the rest are zeros.
Example
Suppose there are 5 clusters. We assign item 1 to cluster 3, and item 2 to cluster 4.
Then we have
\(z_1 = [0~0~1~0~0], \qquad z_2 = [0~0~0~1~0]\)
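In code, the one-hot vectors for this example look like the following (a NumPy sketch; note that array indices are 0-based, so cluster 3 corresponds to index 2):

```python
import numpy as np

K = 5

# Item 1 assigned to cluster 3, item 2 assigned to cluster 4 (1-based labels).
z1 = np.zeros(K)
z1[2] = 1.0
z2 = np.zeros(K)
z2[3] = 1.0
# Each vector is binary with exactly one nonzero entry.
```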
Local posterior
The goal of learning is to estimate the posterior \(p(z_n \mid x_n, w, \phi)\), where \(\phi\) are generic parameters for the likelihood function (aka obsmodel). Because this distribution must be discrete over \(K\) choices, we'll parameterize it with a \(K\)-length vector \(r_n\).
We know that the vector \(r_n\) must obey two constraints
- each entry is positive: \(r_{nk} > 0\)
- the vector sums to one: \(\sum_{k=1}^K r_{nk} = 1\).
We can intuitively interpret \(r_{nk}\) as the responsibility that cluster \(k\) has for item \(n\) under the posterior. Here, we think of responsibility in a "fractional" sense, where there is one whole unit divided up among all \(K\) components. When item \(n\) is explained entirely by one component \(k\), then \(r_{nk} \approx 1\) and \(r_{n\ell} \approx 0\) for \(\ell \neq k\).
Local posterior updates (E-step)
We assume we have a current estimated \(w\) and a precomputed log-likelihood matrix \(L\) of size \(N \times K\). The matrix \(L\) has a row for every data item \(n\) and a column for every component \(k\).
Given both \(w\) and \(L\), we can update \(r_{nk}\) in two easy steps.
First, we compute the unnormalized log posterior, which agrees with the true log posterior up to an additive constant (the dropped term \(\log p(x_n \mid w, \phi)\) does not depend on \(k\)).
\(\begin{align} \log \tilde{r}_{nk} &= \log p( z_{nk} = 1 \mid x_n, w, \phi) + \mbox{const} \\ &= \log p( z_{nk}=1 \mid w) + \log p( x_n \mid z_{nk}=1, \phi) \\ &= \log w_k + L_{nk} \end{align}\)
Second, we exponentiate and normalize
\(\begin{align} r_{nk} = \frac{ \exp[ \log \tilde{r}_{nk} ] }{\sum_{\ell=1}^K \exp[ \log \tilde{r}_{n\ell} ] } \end{align}\)
Note that in implementations we must be careful to avoid numerical underflow when exponentiating, since the log probabilities may all be very negative. A standard remedy is to subtract the maximum entry of each row of \(\log \tilde{r}_n\) before exponentiating; this leaves the normalized result unchanged.
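The two-step update above can be sketched as follows (a minimal NumPy implementation, not bnpy's actual code; the function name `calc_resp` is our own):

```python
import numpy as np

def calc_resp(logw, L):
    """Compute the N x K responsibility matrix r from log weights
    logw (length K) and the log-likelihood matrix L (N x K)."""
    # Step 1: unnormalized log posterior, log w_k + L_nk.
    logr = logw[np.newaxis, :] + L
    # Shift each row so its max is 0, avoiding underflow in exp.
    logr -= logr.max(axis=1, keepdims=True)
    # Step 2: exponentiate and normalize each row.
    r = np.exp(logr)
    r /= r.sum(axis=1, keepdims=True)
    return r

# Even with extremely small log probabilities, the result stays stable.
logw = np.log(np.full(3, 1.0 / 3))
L = np.array([[-1000.0, -1001.0, -1002.0]])
r = calc_resp(logw, L)
```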
Global parameter updates (M-step)
Given local responsibility parameters \(r\), we compute new estimates of global weights \(w\). Our update must maximize the expected complete-data log-likelihood.
\(\begin{align} w &\gets \mbox{argmax}_w \mbox{E}_{p(z_n \mid r_n)} \Big[ \log p(z \mid w) + \log p( x \mid z, \phi) \Big] & \mbox{subject to}~ w_k > 0, \sum_{k=1}^K w_k = 1 \end{align}\)
We can simplify the objective as follows
\(\begin{align} &= \mbox{argmax}_w \mbox{E}_{p(z_n \mid r_n)} \Big[ \log p(z \mid w) \Big] \\ &= \mbox{argmax}_w \sum_{k=1}^K \sum_{n=1}^N \mbox{E}[z_{nk}] \log w_k \\ &= \mbox{argmax}_w \sum_{k=1}^K \Big( \sum_{n=1}^N r_{nk} \Big) \log w_k \end{align}\)
Define a new sufficient statistic \(N_k\): \(N_k = \sum_{n=1}^N r_{nk}\).
Here, \(N_k\) can be interpreted as the effective number of items assigned to cluster \(k\).
Now, we have the following optimization problem
\(\begin{align} &~ \mbox{argmax}_w \sum_{k=1}^K N_k \log w_k \\ & \mbox{subject to}~ w_k > 0, \sum_{k=1}^K w_k = 1 \end{align}\)
which, using Lagrange multipliers, has the simple closed-form solution
\(\begin{align} w_k &\gets \frac{N_k}{\sum_{\ell=1}^K N_\ell} \end{align}\)
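The M-step is then a one-line computation from the responsibilities (again a NumPy sketch of the maximum-likelihood update derived above, not bnpy's internal routine):

```python
import numpy as np

def update_w(r):
    """M-step: update mixture weights w from the N x K
    responsibility matrix r."""
    Nk = r.sum(axis=0)   # sufficient statistic: effective count per cluster
    return Nk / Nk.sum() # w_k = N_k / sum_l N_l

# Example: 3 items, 2 clusters.
r = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.25, 0.75]])
w = update_w(r)
```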