Wiki

Clone wiki

bnpy-dev / Code / StepByStepWalkThru

Goal

This doc explains a "step-by-step" view of using the bnpy codebase to solve machine learning problems. We emphasize the procedural "how" questions, like "how does data get loaded?", "when do the model's global parameters get initialized?" and "which bit of code saves progress to disk?". For more on the "why", see the Big Picture doc.

We'll trace the end-to-end progress of the Run.py script (the main command-line executable for training bnpy from data). At each step, we'll identify the key function in the codebase and provide relevant commentary.

To make this concrete, we'll use the same model as in the GMM-VB demo. This trains a Gaussian mixture model using the VB algorithm on the Asterisk toy data.

Sample experiment execution

We execute the following from the terminal

python -m bnpy.Run AsteriskK8 FiniteMixtureModel Gauss VB --K 8 --nLap 50 

In general, the syntax is

python -m bnpy.Run <dataName> <allocModelName> <obsModelName> <algName> [--keywordarg argval]

1) Processing command-line input

Internally, Run.py calls BNPYArgParser.py to parse the command line input.

First, the 4 required arguments are parsed. this defines what data will be used, what model will be applied, and which learning algorithm.

Next, the specific options for each piece are loaded from the relevant config files. These are just plain-text files, broken up into sections where each section defines key-value pairs. For example, learnalg.conf contains these lines

[VB]
nLap=10

[soVB]
rhoexp=0.5
rhodelay=1.0
Only the VB options are loaded when VB is specified as the learning algorithm. Likewise, only the soVB options are loaded when soVB is the learning algorithm.

Any arguments provided by the user are given priority over default values in the config file. For example, when the user says --nLap 50, there will be 50 passes through the data, not the default (10) defined in learnalg.conf.

###Output The result of the call to BNPYArgParser is a multi-level dictionary called ArgDict. We can access the final inference options via ArgDict['VB'], and the final initalization options via ArgDict['Initialization'].

2) Loading data

Internally, bnpy represents all data as proper Data objects, as oultined in the doc. To load arbitrary datasets into this format and allow for flexibility to diverse formats of saving on disk, we have the user encode the instructions for loading the data as proper python code, in a script written for the particular dataset in question. This python script must define functions adhering a common interface, described below.

Full-dataset (offline) dataset script

When performing offline learning, we expect the dataset script to have a get_data function that returns a bnpy.data.DataObj object.

def get_data(seed=1234, **kwargs):
      return bnpy.data.DataObj [or subclass]

For example, in the AsteriskK8.py dataset script, we find the following (with some non-essential parts removed for readability).

def get_data(seed=8675309, nObsTotal=25000, **kwargs):
  # Create real-matrix X of observed toy data
  X = get_X(seed, nObsTotal)
  # Construct a bnpy.data.XData object
  Data = XData(X=X)
  return Data

Online / Streaming dataset script

When performing online learning, instead of a single static dataset, we need to iterate over batches of data. Any dataset script wishing to support a get_minibatch_iterator function returning a bnpy.data.MinibatchIterator objection.

def get_minibatch_iterator(seed=123, nBatch=10, nLap=1, **kwargs):
    return bnpy.data.MinibatchIterator [or subclass]

For example, in the AsteriskK8.py dataset script, we find the following (with some non-essential parts removed for readability).

def get_minibatch_iterator(seed=8675309, nBatch=10, nObsBatch=None, nObsTotal=25000, nLap=1):
  X = get_X(seed, nObsTotal)
  Data = XData(X=X)
  # Build an iterator over the given data object
  DataIterator = MinibatchIterator(Data, nBatch=nBatch, nObsBatch=nObsBatch, nLap=nLap, dataseed=seed)
  return DataIterator

Dataset Script Location

bnpy expects to find the dataset script in the directory defined by $BNPYDATADIR environment variable. By default, bnpy sets this variable to the demodata/ directory that comes pre-packaged.

To set this variable, at any terminal just type

export BNPYDATADIR=$BNPYROOT/demodata/

Users may wish to add this line to their .bashrc (Linux) or .profile (Mac) file, to avoid reentering it repeatedly.

Putting it all together

All dataset construction and loading happens in Run.py's loadData function. This function uses the preprocessed ArgDict from Step 1 as input. Internally, it accesses the desired dataset script (via the "dataName" provided by user) and loads either a full dataset or an iterator.

loadData always returns two values:

  • Data, the actual dataset or iterator on which training will occur
  • InitData, a bnpy.data.DataObj object, used to (a) help specify the observation model's output dimensions, and (b) initialize global model parameters before the learning algorithm is executed.
def loadData(ArgDict, dataseed=0):
  ''' Load DataObj specified by the user, using particular random seed.
      Returns
      --------
        Data, InitData
      or
        DataIterator, InitData
  '''
  sys.path.append(os.environ['BNPYDATADIR'])
  datamod = __import__(ArgDict['dataName'],fromlist=[])
  algName = ArgDict['algName']
  if algName in FullDataAlgSet:
    Data = datamod.get_data(seed=dataseed)
    return Data, Data
  elif algName in OnlineDataAlgSet:
    ArgDict[algName]['nLap'] = ArgDict['OnlineDataPrefs']['nLap']
    InitData = datamod.get_data(seed=dataseed)
    DataIterator = datamod.get_minibatch_iterator(seed=dataseed, **ArgDict['OnlineDataPrefs'])
    return DataIterator, InitData

3) Constructing the model

After loading the data, we construct an HModel object to represent the Bayesian hierarchical model the user requested. You can read more about HModel in the doc.

Quick intro to HModel construction

The crucial functionality we'll explain here is that these models can be constructed in one line of user-facing code. That is, all necessary prior distributions and both AllocModel and ObsModel components are created in one-call under-the-hood.

Here's an abstract line of Python code that builds an HModel.

hmodel = HModel.CreateEntireModel(inferType, allocModelName, obsModelName, allocPriorDict, obsPriorDict, Data):

For example, to create a simple GMM for VB learning, as in GMM-VB demo.

allocPriorDict = dict(alpha0=1.0)
obsPriorDict = dict(smatname='eye', sF=1.0, kappa=1e-5, dF=1)
hmodel = HModel.CreateEntireModel('VB', 'FiniteMixtureModel', 'Gauss', allocPriorDict, obsPriorDict, Data):

Here, the allocPriorDict and obsPriorDict need to contain values for all the keywords specified in config/allocmodel.conf's FiniteMixtureModel section, and config/obsmodel.conf's Gauss section. Again, the details of this process can be found in the doc.

After executing the above code, we'll have a working hmodel object but its parameters will not be initialized. That is, the hmodel object above has a proper prior distribution, but it won't be a valid mixture model because it has no components (and not even a $K$ value indicating the number of components). This happens in the next step, "Initializing the model".

Putting it all together

Constructing the model within Run.py is handled by the createModel function. It takes all the specifications for the model defined in the ArgDict, constructions an HModel, and returns it.

def createModel(Data, ArgDict):
  '''
      Returns
      --------
      hmodel : constructed (by not initialized) model 
  '''
  algName = ArgDict['algName']
  aName = ArgDict['allocModelName']
  oName = ArgDict['obsModelName']
  aPriorDict = ArgDict[aName]
  oPriorDict = ArgDict[oName]
  hmodel = bnpy.HModel.CreateEntireModel(algName, aName, oName, aPriorDict, oPriorDict, Data)
  return hmodel  

4) Initializing the model

As discussed above, after construction an HModel is a "blank slate" in that it doesn't have any global parameter values. All these fields are empty, and so need to be initialized.

What does it mean to initialize global parameters?

For this discussion, initialized means that all global parameters in the model (including number of components) are specified. A few examples are given below.

Initializing a Gaussian Mixture Model for EM requires

  • AllocationModel has mixture weights w (K-length vector).
  • ObservationModel with K GaussDistr objects specified (each with a mean and a covariance matrix).

Initializing a Gaussian Mixture Model for VB requires

  • AllocationModel with Dirichlet prior parameter alpha0 (K-length vector).
  • ObservationModel with K GaussWishDistr objects specified (deg of freedom, scale matrix, means, precisions).

Initialization in practice

All initialization is handled by taking an hmodel and executing init_global_params.

Within Run.py, initialization happens with this function call:

    hmodel.init_global_params(InitData, seed=algseed, \
                              **ArgDict['Initialization'])

Quick intro to HModel's init_global_params

The HModel object's init_global_params method allows custom initialization of all global parameters. Broadly, several categories of initialization routines are supported:

  • Initialization from scratch

  • This means parameters are created fresh, using a defined subroutine appropriate for the observation model. For example, we may divide the data randomly into $K$ blocks and initialize parameters from each block.

  • See bnpy.init.FromScratchGauss for subroutines for Gauss observation models.

  • Initialization from previously-saved model.

  • bnpy allows smart reuse of previous computation. We can run a GMM model with EM, and then do a GMM-VB run that starts by leveraging the final state of the EM run.

  • See bnpy.init.FromSaved for details.

  • Initialization from "ground-truth" parameters or labels provided by the DataObj

  • These initializations are good for diagnosing correctness and local optima issues.

  • Implementations coming soon!

The code for init_global_params is given below (non-essentials removed for readability).

  def init_global_params(self, Data, **initArgs):
    ''' Initialize (in-place) global parameters
    '''
    initname = initArgs['initname']
    if initname.count('truth') > 0:
      init.FromTruth.init_global_params(self, Data, **initArgs)
    elif initname.count(os.path.sep) > 0:
      init.FromSaved.init_global_params(self, Data, **initArgs)
    elif str(type(self.obsModel)).count('Gauss') > 0:
      init.FromScratchGauss.init_global_params(self, Data, **initArgs)

The initname argument defines what type of initialization is performed. If the initname is a filepath to a saved bnpy model, we attempt to do a FromSaved initialization. Otherwise, we execute a FromScratch initialization appropriate for the observation model.

Note that functions like init.FromScratchGauss.init_global_params perform in-place initialization of the model parameters. They don't return anything. Instead, their first argument (the hmodel to be updated) has relevant parameters added to its internal structure.

Walk-through of init_global_params for a 'randexamples' initialization

Let's take a look at how global parameters get initialized under the default ('randexamples') initialization, which is FromScratch.

Within bnpy.init.FromScratchGauss, the relevant code is:

def init_global_params(hmodel, Data, initname='randexamples', seed=0, K=0, **kwargs):
  PRNG = np.random.RandomState(seed)
  X = Data.X
  if initname == 'randexamples':
    ''' Choose K items uniformly at random from the Data
        then component params by M-step given those single items
    '''
    resp = np.zeros((Data.nObs, K))
    permIDs = PRNG.permutation(Data.nObs).tolist()
    for k in xrange(K):
      resp[permIDs[k],k] = 1.0
  LP = dict(resp=resp)
  SS = hmodel.get_global_suff_stats(Data, LP)
  hmodel.update_global_params(SS)

We start initialization by creating a pseudo-random-number generator (PRNG) with the given seed (seed is determined by the jobname and taskid of the current run). This allows repeatable initialization for debugging and reproducing experimental results.

Then, the 'randexamples' code simply picks out $K$ observations from the dataset (uniformly at random), and selects these to be the singleton members for each of the $K$ components. The model parameters are then generated by the normal inference steps (calculate sufficient statistics, update global parameters), as if these singleton data items were the only data assigned to each component.

Using existing subroutines like update_global_params lets us avoid writing custom initialization code for each possible model and learning algorithm, instead reusing a routine we already need to do learning.

Note on parameter creation

We rely on code within the AllocModel and ObsModel base classes to quickly instantiate $K$ components if they do not already exist, so that these updates makes sense. For example, here's the relevant code from the ObsModel base class.

  def update_global_params( self, SS, rho=None, Krange=None):
    ''' M-step update
    '''
    self.K = SS.K
    if len(self.comp) != self.K:
      self.comp = [copy.deepcopy(self.obsPrior) for k in xrange(self.K)]
    [... other content here ...]

5) Running the learning algorithm

Once we have an initialized HModel, we're ready to actually execute a learning algorithm.

Constructing the LearnAlg object

The relevant code (from Run.py) to create a learning algorithm object is:

def createLearnAlg(Data, model, ArgDict, algseed=0, savepath=None):
  '''
    Creates a learning algorithm object, preparing a directory to save the data (savepath) and setting appropriate random seeds.
    Returns
    -------
    learnAlg : bnpy.learn.LearnAlg [or subclass] object
               type defined by ArgDict['algName'], one of {EM, VB, soVB, moVB}
  '''
  algName = ArgDict['algName']
  algP = ArgDict[algName]
  outputP = ArgDict['OutputPrefs']
  if algName == 'EM' or algName == 'VB':
    learnAlg = bnpy.learn.VBLearnAlg(savedir=savepath, seed=algseed, \
                                      algParams=algP, outputParams=outputP)
  elif algName == 'soVB':
    learnAlg = bnpy.learn.StochasticOnlineVBLearnAlg(savedir=savepath, seed=algseed, \
                                      algParams=algP, outputParams=outputP)
  elif algName == 'moVB':
    learnAlg = bnpy.learn.MemoizedOnlineVBLearnAlg(savedir=savepath, seed=algseed, \
                                      algParams=algP, outputParams=outputP)
  else:
    raise NotImplementedError("Unknown learning algorithm " + algName)
  return learnAlg

Executing the LearnAlg object

Every learning algorithm implements a fit method. To execute the algorithm, the relevant line from Run.py is:

    learnAlg.fit(hmodel, Data)

where the Data can either be an Iterator (for online methods) or a DataObj object (for offline methods).

Internally, this fit method handles iteration over Data and execution of the various learning subroutines of hmodel, like calculating local parameters, calculating sufficient statistics, and updating global parameters.

Here's the relevant code for the VBLearnAlg, with some non-essential lines related to printing and saving state removed for readability.

  def fit( self, hmodel, Data ):
    prevBound = -np.inf
    LP = None
    for iterid in range(self.algParams['nLap'] + 1):
      # M step
      if iterid > 0:
        hmodel.update_global_params(SS) 

      # E step 
      LP = hmodel.calc_local_params(Data, LP)
      SS = hmodel.get_global_suff_stats(Data, LP)

      # ELBO calculation
      evBound = hmodel.calc_evidence(Data, SS, LP)

      # Check for Convergence!
      #  report warning if bound isn't increasing monotonically
      isConverged = self.verify_evidence( evBound, prevBound )

      if isConverged:
        break
      prevBound = evBound

For more on how the three key steps of learning algorithms, see the BigPicture and the documentation related to each step.

6) Saving progress to file

bnpy tries to support saving learning algorithm progress to disk at regular intervals, so that (1) results can be accessed later, and (2) if a long-running experiment is terminated early, not all information is lost.

File format

The chosen format (for now) is to package up the parameters into arrays that are saved in MAT file format. This is the standard file format for Matlab.

Python support is provided via the scipy.io.savemat and scipy.io.loadmat functions. For a detailed reference, see the scipy documentation. For our purposes, the important facts are that we can package arrays into Python dictionaries, and these scipy.io routines transparently allow us to read and write these dictionaries to file, as shown below

import numpy as np
import scipy.io
A = np.eye(3)
B = np.diag([1, 2, 3, 4])
myDict = dict(A=A, B=B)
scipy.io.savemat( 'myfile.mat', myDict)

Converting model parameters to dictionaries

All bnpy models support reading and writing the global parameters of the model to file.

All AllocModels, ObsModels, and basic Distr objects support a to_dict method. This packages up relevant parameters into a dictionary for file IO. For example, FiniteMixtureModel.py has the following:

  def to_dict(self): 
    if self.inferType.count('VB') >0:
      return dict( alpha=self.alpha)
    elif self.inferType == 'EM':
      return dict( w=self.w)    

where we recall that EM directly estimates the mixture weights vector w, while VB estimates the prior parameter vector alpha.

Saving models to file

The process of saving models to and from file are handled by bnpy.ioutil.ModelWriter. bnpy.ioutil.ModelReader handles the corresponding reading from file.

For example, to save a model to file, we can execute

    bnpy.ioutil.ModelWriter(hmodel, \
                            '/path/to/desired/dir/', \
                            'Prefix')

Which will create the following files in /path/to/desired/dir/

  • PrefixAllocModel.mat
  • PrefixObsModel.mat

The prefix argument allows us to save multiple versions of a model during training. For example, learning algorithms might use prefixes like "Lap0001", "Lap0002" to checkpoint progress after each lap (pass through the data).

Where saving happens in a learning algorithm

Saving to disk occurs in the save_state function of the LearnAlg object. The basic template is:

def save_state(hmodel, iterid, lap, evBound, **kwargs):
    return None

This function saves the hmodel to file (via the bnpy.io.ModelWriter), and also saves trace variables (iterations, laps, evidence, etc.) get written to disk as plain-text files.

Laps vs iterations

To clarify, an iterid is an integer specifying the total number of M-steps (parameter updates) the algorithm has taken. The lap is a real positive number of how many times we've passed through the data.

The command-line argument saveEvery defines how often the model is saved to file (in terms of laps through the data). For example, setting saveEvery=5 for EM will save the progress every 5 laps through the data. Settin saveEvery=0.5 for an online algorithm like soVB will save progress every time half of all batches are processed.

Saving the best version of a model

It would be annoying to have to remember which iteration is the latest and greatest in order to load the best model. bnpy LearnAlgs solve this problem by saving a version of the model with the "Best" prefix after each step. This way, you can always load in the final, best result of the latest run by calling ModelReader's load_model, like so

    hmodel = bnpy.ioutil.ModelReader.load_model('/path/to/desired/results/',                                  prefix='Best')

Which will load in the model defined by these files in /path/to/desired/dir/

  • BestAllocModel.mat
  • BestObsModel.mat

Note that these Best files are symbolic links in the file system, so that extra space is not taken up.

7) Displaying progress to stdout

bnpy tries to create a useful "transcript" that summarizes in plain-text the progress of the learning algorithm run. We utilize Python's built-in Logging module to handle this seamlessly, so that we can always save this transcript to file (for later analysis) while also printing it to stdout if requested.

Configuring the Log

Logging is configured within Run.py in the configLoggingToConsoleAndFile function. It's not very readable or useful for end-users, but you should know where it is just in case.

Printing progress to the Log

Instead of print statements, we use log statements. To report that we're at iteration 1 of 10, we might write.

Log.info('at iter %d/%d' % (1,10)

Printing progress for a learning algorithm

All LearnAlg objects have a print_state function, whose basic template is

def print_state(hmodel, iterid, lap, evBound, **kwargs):
    return None

Within this method, we control how often we print with the argument --printEvery. This is like --saveEvery, but for controling log messages. For example, with online methods --printEvery=0.25 will print a message every time we complete a quarter-lap through the data.

The --saveEvery, --printEvery attributes of the learning algorithm are specified in its construction, as part of the required outputParams argument. See the default values in output.conf.

Updated