bnpy-dev / FAQ
Data Questions
How do I use bnpy with my own dataset?
bnpy allows you to write your own dataset script to load a new dataset. See the Data Format documentation for details and examples. Make sure that the environment variable $BNPYDATADIR points to the directory where the dataset script lives.
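If you work from Python rather than a shell, one way to set the variable is sketched below. The directory name `my_bnpy_datasets` is a hypothetical placeholder. This page does not say when bnpy reads the variable, so setting it before `import bnpy` is the cautious choice.

```python
import os
import tempfile

# Hypothetical location for your dataset scripts; substitute your own directory.
data_dir = os.path.join(tempfile.gettempdir(), "my_bnpy_datasets")
os.makedirs(data_dir, exist_ok=True)

# Set the variable for this Python process (and any child processes it starts).
os.environ["BNPYDATADIR"] = data_dir

# Sanity check: the variable is set and points at an existing directory.
resolved = os.environ.get("BNPYDATADIR", "")
is_valid = os.path.isdir(resolved)
print("BNPYDATADIR =", resolved, "| exists:", is_valid)
```

A value set this way lasts only for the current process. Export the variable in your shell profile if you want it to persist.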
Model Questions
How do I get the cluster assignments for a specific dataset?
Let's assume you have a dataset of interest named Data, and a trained model named model. Then you can calculate local assignments (and other necessary parameters) via:
LP = model.calc_local_params(Data)
LP is a dictionary object. It always has a field called resp, which holds the cluster membership probabilities for each data item in a 2D array with N rows (one per data item) and K columns (one per cluster).
For example, to get the hard assignment of each data item to its most likely cluster, you can do:
Z = LP['resp'].argmax(axis=1)
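To make the shape conventions concrete, here is a small sketch that uses a made-up `resp` array in place of `LP['resp']`. The numbers are illustrative only, and the check that rows sum to one is an assumption that follows from reading the entries as membership probabilities.

```python
import numpy as np

# Hypothetical stand-in for LP['resp']: N=4 data items, K=3 clusters.
resp = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.30, 0.30, 0.40],
    [0.05, 0.05, 0.90],
])

# Each row is a probability distribution over the K clusters.
assert np.allclose(resp.sum(axis=1), 1.0)

# Hard assignment: index of the most likely cluster for each data item.
Z = resp.argmax(axis=1)

# Probability of the chosen cluster, useful for spotting ambiguous items.
confidence = resp.max(axis=1)

# Number of data items hard-assigned to each cluster.
cluster_sizes = np.bincount(Z, minlength=resp.shape[1])
```

Items with low `confidence` (such as the third row here) sit between clusters, so their hard assignment carries less information than the full row of `resp`.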
What about document-level cluster assignments?
For topic models, we can access document-level counts via the DocTopicCount field of the LP local parameter dictionary. LP['DocTopicCount'] is a 2D array with D rows (one per document) and K columns (one per cluster). Entry LP['DocTopicCount'][d,k] is a non-negative number that says how many atoms in document d were assigned to cluster k.
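As a sketch of how such a count matrix might be used, the snippet below normalizes a made-up array standing in for `LP['DocTopicCount']` into per-document proportions and picks each document's dominant cluster. The values are invented for illustration. Real entries need not be whole numbers when assignments are soft.

```python
import numpy as np

# Hypothetical stand-in for LP['DocTopicCount']: D=2 documents, K=3 clusters.
doc_topic_count = np.array([
    [12.0, 3.0, 0.0],
    [1.0, 1.0, 8.0],
])

# Total number of atoms assigned in each document (column vector, shape D x 1).
doc_lengths = doc_topic_count.sum(axis=1, keepdims=True)

# Fraction of each document's atoms assigned to each cluster.
doc_topic_frac = doc_topic_count / doc_lengths

# Cluster that accounts for the most atoms in each document.
dominant_topic = doc_topic_count.argmax(axis=1)
```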
Where are the model parameters saved?
When running a particular trial (aka task), bnpy will save snapshots of its estimated parameters to disk at various checkpoints specified by the --saveEvery kwarg provided when calling the run function.
You can always find the "latest and greatest" estimates in MAT files with the "Best" prefix:
- $BNPYOUTDIR/dataname/jobname/taskid/BestAllocModel.mat
- $BNPYOUTDIR/dataname/jobname/taskid/BestObsModel.mat
You can load these as a model object in bnpy with the following Python code:
>>> import bnpy
>>> hmodel = bnpy.load_model("$BNPYOUTDIR/dataname/jobname/taskid/")
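Python itself does not expand `$BNPYOUTDIR` inside a string literal, and this page does not say whether `bnpy.load_model` does so. A cautious option is to build the path explicitly before loading. The sketch below uses a hypothetical helper, `resolve_task_path`, and a throwaway directory with empty placeholder files (not real MAT files) so that it runs without bnpy installed.

```python
import glob
import os
import tempfile

def resolve_task_path(dataname, jobname, taskid):
    """Hypothetical helper: expand $BNPYOUTDIR and join the path pieces."""
    root = os.path.expandvars("$BNPYOUTDIR")
    return os.path.join(root, dataname, jobname, str(taskid))

# Demo setup: a temporary directory stands in for a real bnpy output folder.
os.environ["BNPYOUTDIR"] = tempfile.mkdtemp()
task_path = resolve_task_path("dataname", "jobname", 1)
os.makedirs(task_path)
for name in ("BestAllocModel.mat", "BestObsModel.mat"):
    # Empty placeholder files, only so the listing below has something to find.
    open(os.path.join(task_path, name), "w").close()

# List the "Best"-prefixed snapshot files in the task directory.
best_files = sorted(
    os.path.basename(p) for p in glob.glob(os.path.join(task_path, "Best*.mat"))
)
print(best_files)
```

With a real output directory, the resolved `task_path` can then be passed to `bnpy.load_model`.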
If, for example, you are interested in the probabilities of the K active components, you can do:
>>> import numpy as np
>>> beta = hmodel.allocModel.get_active_comp_probs()  # 1D array, size K
>>> assert beta.min() > 0  # all entries are positive
>>> assert np.sum(beta) < 1.0  # sum is slightly less than 1.0, since the "inactive" components have small posterior mass
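One way to see why the active probabilities can sum to less than one is a truncated stick-breaking construction, the usual representation for Dirichlet process mixtures. The sketch below illustrates that general idea with made-up stick fractions `v`. It is not bnpy's internal code, and the function name is hypothetical.

```python
import numpy as np

def stick_breaking_probs(v):
    """Convert stick fractions v[k] in (0, 1) into component probabilities.

    beta[k] = v[k] * prod_{j < k} (1 - v[j])
    """
    v = np.asarray(v, dtype=float)
    # Length of stick remaining before each break: 1, (1-v0), (1-v0)(1-v1), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining

# Made-up stick fractions for K=4 active components.
v = np.array([0.5, 0.5, 0.5, 0.5])
beta = stick_breaking_probs(v)

# Mass left over for the (infinitely many) inactive components.
leftover = 1.0 - beta.sum()
```

Here `beta` is `[0.5, 0.25, 0.125, 0.0625]` and `leftover` is `0.0625`. Every active entry is positive, yet the total stays strictly below one.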
If you want to inspect the learned cluster means:
>>> mu0 = hmodel.obsModel.get_mean_for_comp(0)  # get mean vector for first comp
>>> # Alternatively, get the variational parameter for the mean of q(\mu_0) directly
>>> m0 = hmodel.obsModel.Post.m[0]
Running from Python
You can also easily get the best model across multiple tasks by running everything within Python rather than from the command line.
>>> bestModel, bestInfo = bnpy.run(
...     "MyDataset", "DPMixtureModel", "Gauss", "VB",
...     nTask=5, K=10, nLap=100)
The returned model bestModel will be the best of 5 runs (set by the nTask parameter).
Note that this will save parameters from each run to disk, at these locations:
- $BNPYOUTDIR/MyDataset/defaultjob/1/
- $BNPYOUTDIR/MyDataset/defaultjob/2/
- ...
- $BNPYOUTDIR/MyDataset/defaultjob/5/
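The directory pattern above can be enumerated with a few lines of Python, for example to check which task folders exist after a run. The helper name `task_dirs_for` and the `/tmp/bnpy_out` root are hypothetical. The layout itself (dataset name, job name, then task ids counting from 1) is taken from the list above.

```python
import os

def task_dirs_for(out_dir, dataname, jobname, n_task):
    """Hypothetical helper: one output directory per task id, counting from 1."""
    return [
        os.path.join(out_dir, dataname, jobname, str(task_id))
        for task_id in range(1, n_task + 1)
    ]

# Illustrative root; in practice use the value of $BNPYOUTDIR.
task_dirs = task_dirs_for("/tmp/bnpy_out", "MyDataset", "defaultjob", n_task=5)

# Report which of the expected task directories are present on disk.
for path in task_dirs:
    print(path, "| exists:", os.path.isdir(path))
```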
Visualization Questions
How do I plot the most likely segmentation of my dataset?
To plot the segmentation estimated by a Markov model (like the HDP-HMM), you can use the code in SequenceViz.py:
bnpy.viz.SequenceViz.plotSingleJob(dataName, jobpath, taskids=1, lap=0)
bnpy.viz.SequenceViz.plotSingleJob("DDToyHMM", "nipsexperiment-alg=bnpyHDPHMMmemo-....", taskids=1)
This will plot the segmentation associated with the final lap of the first task (first initialization) of the experiment named "nipsexperiment...." for the DDToyHMM dataset.
Other Questions
How do I profile the runtime cost of my experiments in bnpy?
bnpy has a built-in profiler tool, which can only be run from the command line. See Profiler.