bnpy-dev / QuickStart / EasyGMMOnYourOwnData

Goal

This doc shows how to apply bnpy to your own dataset. We'll specifically discuss fitting a Gaussian Mixture Model with EM, but the workflow here generalizes to any model and learning algorithm.

Our focus here is on a workflow you can implement directly in Python scripts. It closely mirrors the flow used in most of our other demos, which use the command-line tools instead.

Representing your Data

bnpy supports several data formats, depending on what type of observations you have (real vectors, word counts, binary vectors, etc.). See the DataFormat doc for key details.

For this tutorial, we'll focus on modeling observed vectors of real numbers. A Gaussian mixture model makes sense for this data type.

All you need to do is represent your data as a Numpy array, where each row is one observed vector. Here we'll just fill up a random matrix so you get the idea. Once the Numpy array is defined, we create an instance of bnpy's built-in data type "XData," which is just a thin wrapper around the X array that enables bnpy to do its thing.

import numpy as np
import bnpy

# Fill a matrix X with your data
X = np.random.randn(100, 2)

# Convert it into a bnpy data object
Data = bnpy.data.XData(X)

# Apply any supported model + learning algorithm,
# using syntax exactly like calling Run from the command line
kwargs = dict(K=5, nLap=50, printEvery=10, initname='randexamples')
hmodel, LP, Info = bnpy.Run.run(Data, 'MixModel', 'Gauss', 'EM', **kwargs)

That's it. We can call run just like we do from the command line, providing exactly the same keyword options.

The "kwargs" variable is a dictionary of keyword arguments. You can specify all kinds of options, such as how many components to fit (K), the number of laps through the data (nLap), how often to print parameters (printEvery), the initialization procedure (initname), and more. These arguments have exactly the same names and behavior as the command-line options used when running Learn.py as a script.
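If the `**kwargs` pattern is unfamiliar: it simply unpacks a dictionary into keyword arguments, so the call above is equivalent to writing each option inline. A minimal standalone sketch, using a toy stand-in function rather than bnpy itself:

```python
def toy_run(K=1, nLap=10, printEvery=5, initname='random'):
    # Stand-in for bnpy.Run.run that just echoes the options it received
    return dict(K=K, nLap=nLap, printEvery=printEvery, initname=initname)

kwargs = dict(K=5, nLap=50, printEvery=10, initname='randexamples')

# These two calls are exactly equivalent:
a = toy_run(**kwargs)
b = toy_run(K=5, nLap=50, printEvery=10, initname='randexamples')
assert a == b
```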

Return Values

run returns three objects:

  • hmodel : bnpy HModel whose parameters were fit to the Data
  • LP : local parameters dict, containing learned hidden variables specific to the Data
  • Info : dictionary of properties about the run, including
      • evBound : scalar value of the log evidence of the data (ELBO) under the final model
      • evTrace : vector of evBound values from each recorded step of the run
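A common sanity check on a run is that evTrace is non-decreasing, since an EM lap should never lower the objective. A hedged sketch, using made-up trace values in place of a real run:

```python
import numpy as np

# Stand-in for Info['evTrace'] from a real run (hypothetical values)
evTrace = np.array([-4.21, -3.87, -3.65, -3.64, -3.64])

# EM is guaranteed not to decrease its objective, so the trace
# should be non-decreasing up to numerical tolerance
assert np.all(np.diff(evTrace) >= -1e-8)

# The final recorded entry corresponds to the scalar evBound
evBound = evTrace[-1]
```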

More about the model

See TODO.

More about local parameters

For the Gaussian mixture model, the LP dictionary contains the posterior "responsibilities" for each data item.

Within LP, the 'resp' field contains an N-by-K matrix. Row n gives the posterior responsibilities for data example n, and each row sums to one.

resp[n,k] = p(z_n = k | x_n )
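To make the definition concrete, here is a small numpy-only sketch that computes responsibilities for a toy 1-D mixture of two Gaussians via Bayes' rule. The mixture parameters here are made up for illustration, not learned by bnpy:

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Density of a univariate Gaussian N(x | mu, sigma^2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical 2-component mixture parameters
weights = np.array([0.6, 0.4])
means = np.array([-2.0, 3.0])
sigmas = np.array([1.0, 1.5])

x = np.array([-2.5, 0.0, 3.1])  # N=3 toy observations

# Unnormalized joint: pi_k * N(x_n | mu_k, sigma_k^2), shape N-by-K
joint = weights * gauss_pdf(x[:, None], means, sigmas)

# Normalize each row so that resp[n, k] = p(z_n = k | x_n)
resp = joint / joint.sum(axis=1, keepdims=True)

# Each row sums to one, just like LP['resp']
assert np.allclose(resp.sum(axis=1), 1.0)
```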

More about an experiment's Info dict

See TODO.
