bnpy-dev / FAQ

Data Questions

How do I use bnpy with my own dataset?

bnpy allows you to write your own dataset script to load a new dataset. See the Data Format documentation for details and examples. Make sure that the environment variable $BNPYDATADIR points to the directory where the dataset script lives.
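
As a quick sanity check of the setup above, the following sketch verifies that $BNPYDATADIR is set and that a dataset script can be found inside it. The helper find_dataset_script and the dataset name MyDataset are hypothetical (they are not part of bnpy). It also assumes the script is a file named after the dataset with a .py extension, placed directly in that directory; the Data Format documentation is the authoritative description of the layout.

```python
import os


def find_dataset_script(dataname, env=None):
    """Return the path where a dataset script named <dataname>.py is expected.

    Hypothetical helper, not part of bnpy. Assumes the script sits
    directly inside the directory named by $BNPYDATADIR.
    """
    env = os.environ if env is None else env
    datadir = env.get("BNPYDATADIR")
    if not datadir:
        raise EnvironmentError("BNPYDATADIR is not set")
    path = os.path.join(datadir, dataname + ".py")
    if not os.path.isfile(path):
        raise IOError("no dataset script at " + path)
    return path
```

Calling find_dataset_script("MyDataset") before launching a run gives an early, readable error if the environment variable or the file name is wrong.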

Model Questions

How do I get the cluster assignments for a specific dataset?

Let's assume you have a dataset of interest named Data, and a trained model named model. Then, you can calculate local assignments (and other necessary parameters) via:

LP = model.calc_local_params(Data)

Here, LP is a dictionary. It always has a field called resp, a 2D array with N rows (one per data item) and K columns (one per cluster). Entry LP['resp'][n,k] holds the probability that data item n belongs to cluster k.

For example, to get the hard assignment of each data item to its most likely cluster, you can do:

Z = LP['resp'].argmax(axis=1)

Here, Z is a 1D array of size N. Entry Z[n] will be an integer in the range {0, 1, 2, ..., K-1}.
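
To make the shapes concrete, here is a small sketch that uses a hand-written toy array in place of LP['resp'] (the numbers are made up for illustration and do not come from any bnpy run). It shows the hard assignments, plus hard and soft per-cluster counts.

```python
import numpy as np

# Toy responsibilities for N=4 data items and K=3 clusters.
# In practice this array would come from LP['resp'].
resp = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
    [0.90, 0.05, 0.05],
])
N, K = resp.shape

# Each row is a probability distribution over the K clusters.
assert np.allclose(resp.sum(axis=1), 1.0)

# Hard assignment: index of the most probable cluster for each item.
Z = resp.argmax(axis=1)

# Number of items hard-assigned to each cluster.
counts = np.bincount(Z, minlength=K)

# Expected (soft) number of items per cluster.
soft_counts = resp.sum(axis=0)

print(Z)       # [0 2 1 0]
print(counts)  # [2 1 1]
```

Passing minlength=K to np.bincount keeps one entry per cluster even when some clusters receive no hard assignments.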

What about document-level cluster assignments?

For topic models, we can access document-level counts via the DocTopicCount field of the LP local parameter dictionary. LP['DocTopicCount'] is a 2D array with D rows (one per document) and K columns (one per cluster). Entry LP['DocTopicCount'][d,k] is a non-negative number that says how many atoms in document d were assigned to cluster k.
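
As an illustration, the sketch below uses a made-up toy array in place of LP['DocTopicCount'] and derives two common summaries: the fraction of each document assigned to each cluster, and each document's most-used cluster. The variable names are illustrative only.

```python
import numpy as np

# Toy counts for D=3 documents and K=2 clusters.
# In practice this array would come from LP['DocTopicCount'].
DocTopicCount = np.array([
    [8.0, 2.0],
    [0.0, 5.0],
    [3.0, 9.0],
])

# Total number of atoms in each document.
atoms_per_doc = DocTopicCount.sum(axis=1)

# Fraction of each document's atoms assigned to each cluster.
doc_topic_frac = DocTopicCount / atoms_per_doc[:, np.newaxis]

# Most-used cluster in each document.
top_topic = DocTopicCount.argmax(axis=1)
```

The normalization assumes every document contains at least one atom; an empty document would produce a division by zero and needs separate handling.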

Where are the model parameters saved?

When running a particular trial (aka task), bnpy saves snapshots of its estimated parameters to disk at checkpoints controlled by the --saveEvery keyword argument passed to the run function.

You can always find the "latest and greatest" estimates in MAT files with the "Best" prefix:

  • $BNPYOUTDIR/dataname/jobname/taskid/BestAllocModel.mat
  • $BNPYOUTDIR/dataname/jobname/taskid/BestObsModel.mat

You can load these as a fully functional model object in bnpy with the Python code:

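>>> import os
>>> import bnpy
>>> taskpath = os.path.expandvars("$BNPYOUTDIR/dataname/jobname/taskid/") # Python does not expand $VARS in strings on its own
>>> hmodel = bnpy.load_model(taskpath)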

If, for example, you are interested in the probabilities of the K active components, you can do:

>>> import numpy as np
>>> beta = hmodel.allocModel.get_active_comp_probs() # 1D array, size K
>>> assert beta.min() > 0 # all entries are positive
>>> assert np.sum(beta) < 1.0 # sum is slightly less than 1.0, since the "inactive" components have small posterior mass
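
One way to see why the sum falls just short of 1.0 is a truncated stick-breaking construction, which is common for Dirichlet process models. The sketch below is an illustration under that assumption, not a description of bnpy's internals; the fractions v are made-up numbers.

```python
import numpy as np

# Hypothetical stick-breaking fractions v_k in (0, 1) for K=4 active components.
v = np.array([0.5, 0.4, 0.6, 0.9])

# Mass left on the stick before each break: 1, (1-v_1), (1-v_1)(1-v_2), ...
remaining_before = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))

# Probability of each active component.
beta = v * remaining_before

# Mass left over for all inactive components after K breaks.
leftover = np.prod(1.0 - v)

assert beta.min() > 0
assert beta.sum() < 1.0
assert np.isclose(beta.sum() + leftover, 1.0)
```

Because every fraction is strictly below 1, a little mass always remains after the K-th break, which is the mass attributed to the inactive components.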

If you want to inspect the learned cluster means:

>>> mu0 = hmodel.obsModel.get_mean_for_comp(0) # get mean vector for first comp
>>> # Alternatively, get the variational parameter for the mean of q(\mu_0) directly
>>> m0 = hmodel.obsModel.Post.m[0]

Of course, you can also use this loaded model to investigate other datasets in the usual ways.

Running from Python

You can also get the best model easily across multiple tasks by running everything within Python (not the command line).

>>> bestModel, bestInfo = bnpy.run("MyDataset", "DPMixtureModel", "Gauss", "VB", nTask=5, K=10, nLap=100)

The returned model bestModel will be the best of 5 runs (the number of runs is set by the nTask parameter).

Note that this will also save the parameters from each run to disk, at these locations:

  • $BNPYOUTDIR/MyDataset/defaultjob/1/
  • $BNPYOUTDIR/MyDataset/defaultjob/2/ ...
  • $BNPYOUTDIR/MyDataset/defaultjob/5/
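
If you want to visit each of these directories programmatically, the paths can be built as in the sketch below. The helper task_output_paths is hypothetical (not part of bnpy) and assumes task ids run from 1 to nTask under a job named defaultjob, as in the listing above.

```python
import os


def task_output_paths(dataname, jobname="defaultjob", nTask=5, outdir=None):
    """List the per-task output directories for one job.

    Hypothetical helper, not part of bnpy. Assumes task ids run
    from 1 to nTask, as in the listing above.
    """
    if outdir is None:
        outdir = os.environ["BNPYOUTDIR"]
    return [
        os.path.join(outdir, dataname, jobname, str(taskid))
        for taskid in range(1, nTask + 1)
    ]
```

Each returned path could then be passed to bnpy.load_model, as shown earlier, to inspect the individual runs rather than only the best one.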

Visualization questions

How do I plot the most likely segmentation of my dataset?

To plot the segmentation estimated by a Markov model (like the HDP-HMM), you can use the plotSingleJob function in SequenceViz.py:

bnpy.viz.SequenceViz.plotSingleJob(dataName, jobpath, taskids=1, lap=0)

For example:

bnpy.viz.SequenceViz.plotSingleJob(
    "DDToyHMM",
    "nipsexperiment-alg=bnpyHDPHMMmemo-....",
    taskids=1)

This will plot the segmentation associated with the final lap of the first task (first initialization) of the experiment named "nipsexperiment...." for the DDToyHMM dataset.

Other Questions

How do I profile the runtime cost of my experiments in bnpy?

bnpy has a built-in profiler tool, which can only be run from the command line. See Profiler.
