How can I extract local parameters from a mixture model?

Issue #37 closed
Debasish Das created an issue

Hi, thanks for making this great package available to everyone. I was wondering if there is a way to access the local parameters of a model. For example, I need the following variables:

1) After applying a DP Gaussian mixture model on a training dataset, I want to know the clusters/components each training point belongs to (or the cluster membership variables). How can I get those?

2) Now, let's assume I have an unseen test dataset that I want to cluster using the above model. Each of the test points will belong to one of the clusters found in step 1 (or to one or more completely new clusters, if we use a DPM). How will I get those membership variables?

I would really appreciate your input.

Comments (7)

  1. Mike Hughes repo owner

    Great question.

    1) Given a dataset Data and a trained model hmodel, you can do

    LP = hmodel.calc_local_params(Data)   # local step: per-data-point parameters
    Z = LP['resp'].argmax(axis=1)         # hard assignment: most responsible cluster
    

    Here, LP is a dictionary that holds all the local parameters. One of them is named resp, for the cluster posterior responsibilities. This is an N (# data points) by K (# clusters) 2D array, where each row sums to one.

    Using the code above, you get the best estimate of the hard cluster assignments Z: a 1D vector of size N, where entry n gives the cluster assignment (an integer in 0, 1, ..., K-1) for data point n.
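
    For a fuller picture, here is a minimal end-to-end sketch. The toy data and the bnpy.run keyword values are illustrative assumptions, not prescriptions:

    import numpy as np
    import bnpy

    # Toy training data: N=500 points in D=2 dimensions (illustration only)
    X = np.random.randn(500, 2)
    Data = bnpy.data.XData(X)

    # Train a DP Gaussian mixture with memoized variational inference
    hmodel, RInfo = bnpy.run(
        Data, 'DPMixtureModel', 'Gauss', 'memoVB',
        nLap=50, K=10)  # K: initial number of clusters

    # Local step + hard assignments, as above
    LP = hmodel.calc_local_params(Data)
    Z = LP['resp'].argmax(axis=1)  # Z[n] = cluster index for point n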

  2. Mike Hughes repo owner

    If you have a test dataset, you can apply the same command (calc_local_params) as above. Because of the truncation assumption of our variational approximation, this step will only use the K clusters found during training, not any "previously unseen" clusters.

    If you really want to know whether your test data contains new clusters not seen in training, you can hold the training set fixed and keep doing joint updates over the combined set (training and test), including birth moves. But this is a bit more complicated.
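
    As a concrete illustration, reusing hmodel from the sketch above (Xtest is a hypothetical M x D array of held-out points):

    # Wrap the test points the same way as the training data
    DataTest = bnpy.data.XData(Xtest)

    # Local step only: global (cluster) parameters are not updated
    LPtest = hmodel.calc_local_params(DataTest)
    Ztest = LPtest['resp'].argmax(axis=1)  # assignments into the K training clusters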

  3. tiger lee

    Hi, thanks for the information. I still have the following question: after applying an HDP with moVB on a training corpus, I want to know the cluster each document (not the tokens of the doc) belongs to. How can I get these?

    I would really appreciate your reply.

  4. Mike Hughes repo owner

    Suppose hmodel is an HDPTopicModel. The following code will give you counts for how many tokens in document d have been assigned to each topic:

    LP = hmodel.calc_local_params(Data)
    DocTopicCount_d = LP['DocTopicCount'][d, :]   # token counts per topic for doc d
    

    Here, the 'DocTopicCount' field of the LP dictionary is a 2D array of size nDoc x K (# topics). Each row corresponds to one document and has K entries, one per topic; entry k gives the number of tokens in that document assigned to topic k.
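
    For instance, to inspect the few topics that dominate document d (a hypothetical snippet building on the code above):

    import numpy as np
    top_topics = np.argsort(-DocTopicCount_d)[:5]  # indices of the 5 largest counts
    print(top_topics)                    # most-used topic ids in document d
    print(DocTopicCount_d[top_topics])   # their token counts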

    Hope that helps!

  5. tiger lee

    Thanks for your quick reply. This is very helpful. 'DocTopicCount' tells me how many tokens in document d have been assigned to each topic, but how can I figure out which topic document d most likely belongs to? Is there a field in LP for doc-topic probabilities?

    Thanks a lot!

  6. Mike Hughes repo owner

    Under the HDP topic model, documents are not assigned to topics. Only tokens are. So none of the learned parameters will tell you how to cluster or assign documents.

    If you really want to pick one topic that "best" explains a given document, you could just pick the topic with the highest count in DocTopicCount_d.

    Alternatively, if you want the document's distribution over topics, that is given by the 'theta' field of LP. The expected probability of topic k in document d is given by:

    LP['theta'][d,k] / sum(LP['theta'][d,:])
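
    Putting both options together, a minimal sketch (assuming LP comes from calc_local_params as above):

    import numpy as np

    # Option 1: single "best" topic per document, by token count
    best_topic = LP['DocTopicCount'].argmax(axis=1)  # 1D array of size nDoc

    # Option 2: expected topic distribution per document
    theta = LP['theta']  # nDoc x K array of per-document variational parameters
    E_pi = theta / theta.sum(axis=1, keepdims=True)  # each row sums to one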

  7. Mike Hughes repo owner

    I think I've answered all the questions here. If future problems arise, please open a new Issue in the tracker. Thanks!
