BigPicture

Goal

This doc explains the "big picture" view of using bnpy to solve machine learning problems. We outline the major conceptual components and explain how they interact in bnpy. The main focus is the "why" behind bnpy. For the "how", see the step-by-step walkthrough.

By design, bnpy separates each conceptual component (data, model, learning algorithm) into self-contained, modular code. The goal is that any learning algorithm can be applied to any model on any applicable dataset. We hope this extremely modular organization will be

  • instructive to new students trying to understand these concepts.

  • useful for practitioners interested in comparing models and algorithms cleanly and fairly.

  • efficient for researchers hoping to investigate new models or algorithms without "reinventing the wheel".

Data objects

Machine learning gives rise to many types of data, from plain old real-valued matrix data to sparse word-count data to sequential data to network data. In the age of big data, we also need to consider whether the data fits entirely in memory or whether it must be processed online (streamed).

Our goal in bnpy is to have each data type be a subclass of a generic Data object, which defines a common interface that all subtypes extend appropriately. We hope that learning algorithms can be written entirely against the common interface, and that any specialized interaction occurs deep within model-specific functions that apply only to the type of data at hand.

For example, the VB and soVB algorithms should never need to access the "X" field of the real-valued matrix type XData, which holds the matrix of observed data. Such a field may not exist for word-count data. Instead, the only code that needs "X" explicitly is code related to Gaussian likelihoods (which can't be applied to word count data).
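To make this concrete, here is a minimal sketch of what such a common interface might look like. The class and attribute names below (Data, XData, WordCountData, get_size) are illustrative assumptions based on this description, not the exact bnpy class hierarchy.

import numpy as np

class Data(object):
    # Hypothetical common interface shared by all data types (illustrative sketch).
    def get_size(self):
        # Number of data items; every subclass must report this.
        raise NotImplementedError

class XData(Data):
    # Dense real-valued matrix data: one row per observation.
    def __init__(self, X):
        self.X = np.asarray(X, dtype=np.float64)  # only Gaussian-likelihood code touches X
    def get_size(self):
        return self.X.shape[0]

class WordCountData(Data):
    # Sparse bag-of-words data stored as (document, word_id, count) triples; has no X field.
    def __init__(self, doc_ids, word_ids, counts):
        self.doc_ids = np.asarray(doc_ids)
        self.word_ids = np.asarray(word_ids)
        self.counts = np.asarray(counts)
    def get_size(self):
        return int(self.doc_ids.max()) + 1  # number of documents

A learning algorithm only ever calls methods like get_size, so it never needs to know whether the object underneath holds a dense matrix or sparse counts.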

Model objects

The focus for bnpy is on hierarchical Bayesian graphical models. Many of the workhorses of unsupervised machine learning fall into this category, including

  • Gaussian mixture models (for real data)
  • Hidden Markov models (for sequential data)
  • Topic models like Latent Dirichlet Allocation (for word-count data)
  • Stochastic block models (for network data)

Other toolboxes for these models are usually extremely narrow in scope. Code for fitting GMMs is typically so tightly coupled that turning it into a "Bernoulli mixture model" is often easier done by rewriting the whole thing than by reusing the "mixture model" part and swapping "Gaussian" for "Bernoulli". It is usually even harder to take the "Gaussian" bits of the GMM code and make them work on sequential data where observations are real-valued.

bnpy attempts to make composability a reality by breaking each of the models listed above into two components:

  • allocation model, which generates discrete latent structure

  • observation model, which generates data given the discrete latent structure

Examples of allocation models include: mixture models, Dirichlet process mixture models, HMMs, admixture models (LDA), etc.

Examples of observation models include: Gaussian, Bernoulli, Multinomial (word counts), etc.

The basic idea is that bnpy will enable maximum code reuse, to the point where the Gaussian likelihood model is defined just once and then reused to build Gaussian mixture models, HMMs with Gaussian emissions, and even LDA-style models over real-valued data instead of word counts.
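For instance, the same 'Gauss' observation model could be paired with a finite mixture or a Dirichlet process mixture allocation model without touching the Gaussian code. The sketch below illustrates this pairing; the bnpy.run signature, the XData constructor, and the exact model names are assumptions made for illustration, not a guaranteed API.

import numpy as np
import bnpy

# Toy dense real-valued dataset (constructor name assumed for illustration).
X = np.random.randn(500, 2)
Data = bnpy.data.XData(X)

# Same 'Gauss' observation model, two different allocation models.
gmm_model, gmm_info = bnpy.run(Data, 'FiniteMixtureModel', 'Gauss', 'VB', K=5, nLap=50)
dp_model, dp_info = bnpy.run(Data, 'DPMixtureModel', 'Gauss', 'VB', K=5, nLap=50)

Only the allocation model name changes between the two calls; the Gaussian likelihood code is reused as-is.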

Global vs. local parameters

Much of the organization of the HModel code comes from conceptually separating local and global parameters.

Local parameters are attached to particular data items, and contain information only about that data item, not any global structure (hence the term local). They are sometimes called "hidden variables". Examples include the cluster assignments (one per item) in mixture models or the topic assignments (one per observed word token) in LDA-style topic models.

Global parameters are unattached to specific data items, instead representing some overarching structure shared by many items. Examples include the means of Gaussian mixture components, the topic-word distributions over the vocabulary of LDA, or the interconnection probabilities of each community in a stochastic block model.

One easy litmus test for deciding whether a parameter is local or global is to imagine first training a model on one dataset, then observing a new dataset with 100 new items. If you need to infer brand-new values of the parameter to apply the model to the new dataset, it is local. If you can transfer the existing values to the new dataset unchanged, it is global.
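As a toy illustration for a K-component Gaussian mixture, the arrays below show the shapes one would expect; the variable names are ours, not bnpy's.

import numpy as np

N, K, D = 1000, 5, 2   # data items, clusters, feature dimension

# Global parameters: independent of N, shared by every item.
mixture_weights = np.full(K, 1.0 / K)        # shape (K,)
cluster_means = np.random.randn(K, D)        # shape (K, D)

# Local parameters: one row per data item (here, soft cluster assignments).
# A new batch of 100 items needs 100 brand-new rows, but the globals carry over.
resp = np.random.dirichlet(np.ones(K), size=N)   # shape (N, K)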

Learning algorithm objects

Learning algorithms are fundamentally responsible for iterating over data and changing the model parameters to better fit that data. bnpy defines learning algorithm objects as the "outer loops" that control how data is iterated over and when model parameters get updated. All subroutines that require model-specific knowledge (for example, computing conditional probabilities) are handled by the HModel object and its subcomponents.

Variational learning

bnpy focuses on algorithms that optimize a variational-bound objective function. Algorithms in this family include

  • expectation maximization (EM)
  • variational inference (VB)
  • stochastic variational inference (soVB)

To understand the basic concepts and mathematical theory underlying these algorithms, see Variational Methods.
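All of these methods can be viewed as maximizing a variational lower bound (the ELBO) on the marginal likelihood of the data. Writing x for the observed data, z for the hidden variables, and q for the variational approximation, the generic form of the bound is

\log p(x) \geq \mathbb{E}_{q(z)}[\log p(x, z)] - \mathbb{E}_{q(z)}[\log q(z)] = \mathcal{L}(q)

The algorithms above differ in how they parameterize q (point estimates for EM, full variational posteriors for VB) and in how they schedule its updates (offline versus stochastic/online).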

Key subroutines of variational learning

Regardless of which algorithm is used (we support offline methods like EM and VB as well as online methods like soVB), all bnpy learning algorithms iterate over the following three steps:

1) calculate local parameters (LP) of each data item given the global model

2) calculate sufficient statistics (SS) across all data items given local parameters

3) update the global parameters of the model given sufficient statistics

Local parameters are unobserved (hidden) variables in the model. We call them local because they are attached to a particular data instance.

Sufficient statistics are fixed-dimensional quantities that summarize the observed data and any local parameters. Given the sufficient statistics, we have all the information we need to update parameters.

Global parameter updates can be either closed-form symbolic calculations (common with conjugate posteriors), or more complex "gradient descent"-style numerical procedures. Either way, the global parameters of the HModel change their values to more accurately model the data at hand.
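To make steps (2) and (3) concrete, here is a toy sketch for a Gaussian mixture with fixed noise, using our own function and variable names rather than bnpy's actual sufficient-statistic machinery. The per-cluster sufficient statistics are just an expected count and a weighted data sum, and the global mean update is a closed-form conjugate-style formula.

import numpy as np

def get_suff_stats(X, resp):
    # X: (N, D) data matrix; resp: (N, K) soft assignments from the local step.
    Nk = resp.sum(axis=0)     # expected count per cluster, shape (K,)
    Sx = resp.T.dot(X)        # per-cluster weighted data sums, shape (K, D)
    return Nk, Sx

def update_means(Nk, Sx, prior_mean, prior_count=1.0):
    # Closed-form conjugate-style posterior mean for each cluster (illustrative only):
    #   mu_k = (prior_count * prior_mean + Sx_k) / (prior_count + Nk_k)
    # prior_mean has shape (D,); the result has shape (K, D).
    return (prior_count * prior_mean + Sx) / (prior_count + Nk)[:, None]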

HModel API

We use this common three-step structure to modularize the learning process. The HModel object is responsible for providing methods that execute the three basic steps. LearnAlg objects are responsible for determining how these steps are composed into a bigger process.

The three fundamental steps for variational inference are encoded as these three methods of the HModel object. Every learning algorithm invokes each of these steps in a certain order to achieve its objective.

get_local_params(DataObj, LP)
get_global_suff_stats(DataObj, LP)
update_global_params(SS)
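For instance, a batch algorithm such as VB could compose these methods into a simple outer loop like the sketch below. Here hmodel and Data stand for an HModel and a dataset object built beforehand, and the loop structure (a fixed number of laps, no convergence check) is illustrative rather than the actual LearnAlg code.

# Illustrative outer loop for a batch algorithm like VB.
n_laps = 100
LP = None
for lap in range(n_laps):
    LP = hmodel.get_local_params(Data, LP)        # step 1: local parameters
    SS = hmodel.get_global_suff_stats(Data, LP)   # step 2: sufficient statistics
    hmodel.update_global_params(SS)               # step 3: global parameter update

An online algorithm like soVB would follow the same three steps, but apply them to one small batch of data at a time.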
