bnpy-dev / Code / Data / DataFormat

Goal

Here we describe the bnpy data formats and give examples for how to make bnpy load custom datasets.

As a first step, to process a new dataset with bnpy, you must do one of the following:

  • define a custom Python script for loading your dataset, or
  • within Python code, call bnpy.run with a valid bnpy.data.DataObj object

The first option is recommended, because it allows running experiments from the command line.

Loading data via Dataset Scripts

Suppose you want your dataset to be called "MyCustomData". After defining a script called MyCustomData.py, you can load this data and run experiments with any valid model and learning algorithm:

python -m bnpy.Run MyCustomData [allocModel] [obsModel] [algName] ...

But first, you need to create a Python script called MyCustomData.py, following the template below.

''' MyCustomData.py
    A template for real-valued datasets.
'''
from bnpy.data import XData

def get_data(**kwargs):
    ''' Returns simple dataset of a single observation: number 0
    '''
    return XData(X=[0])

The only requirement of this script is that it implements the get_data interface. This means it defines a function named "get_data" with the following properties:

  • INPUT: keyword arguments
  • OUTPUT: any subclass of bnpy.data.DataObj
    • such as XData, etc.
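
Before launching a long experiment, it can help to confirm that a script actually exposes this interface. The sketch below is not part of bnpy: has_get_data is a hypothetical helper that loads a script by file path using only the Python 3 standard library and checks that it defines a callable get_data with a **kwargs catch-all. It never calls get_data, but it does execute the script's top-level code, so the script's own imports must succeed.

```python
import importlib.util
import inspect
import os
import tempfile

def has_get_data(script_path):
    """Return True if the script defines get_data(**kwargs).

    Hypothetical helper for illustration; not part of bnpy.
    """
    spec = importlib.util.spec_from_file_location("dataset_script", script_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # runs the script's top-level code
    func = getattr(module, "get_data", None)
    if not callable(func):
        return False
    # The interface passes keyword arguments, so require a **kwargs catch-all.
    params = inspect.signature(func).parameters.values()
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params)

# Demo with a throwaway script that avoids importing bnpy.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "MyCustomData.py")
    with open(path, "w") as f:
        f.write("def get_data(seed=0, **kwargs):\n    return None\n")
    demo_ok = has_get_data(path)

print(demo_ok)
```

This check covers only the INPUT half of the interface. Confirming that the return value is a bnpy.data.DataObj subclass requires calling get_data with bnpy installed.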

Toy example with XData

Let's consider a simple dataset of 2D Gaussian observations, generated by two well-separated components with means [-4,0] and [4,0]. We'll define this dataset in a new script CircleK2.py.

''' CircleK2.py
    Simple dataset of 2D points from two "spherical" components.
        Comp A has location [-4, 0]
        Comp B has location [4, 0].
'''
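from bnpy.data import XData
import numpy as np

def get_data(seed=123, **kwargs):
    ''' Returns 400 points: 200 drawn around each component location.
    '''
    PRNG = np.random.RandomState(seed)  # fixed seed for reproducibility
    XA = PRNG.randn(200, 2) + [-4, 0]   # Comp A
    XB = PRNG.randn(200, 2) + [4, 0]    # Comp B
    X = np.vstack([XA, XB])             # stack into a 400-by-2 array
    return XData(X=X)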
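
As a quick sanity check before saving the script, the same sampling code can be run with numpy alone (no bnpy required) to confirm that the array has the expected shape and component locations. The helper name make_circle_k2 is hypothetical; it simply repeats the body of get_data without the XData wrapper.

```python
import numpy as np

def make_circle_k2(seed=123):
    """Hypothetical helper: same sampling as get_data, returning the raw array."""
    PRNG = np.random.RandomState(seed)
    XA = PRNG.randn(200, 2) + [-4, 0]   # 200 points around [-4, 0]
    XB = PRNG.randn(200, 2) + [4, 0]    # 200 points around [4, 0]
    return np.vstack([XA, XB])

X = make_circle_k2()
print(X.shape)               # (400, 2): 400 observations, 2 dimensions
print(X[:200].mean(axis=0))  # sample mean of Comp A, close to [-4, 0]
print(X[200:].mean(axis=0))  # sample mean of Comp B, close to [4, 0]
```

Because the seed is fixed, repeated calls return identical arrays, which keeps experiments on this toy dataset reproducible.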

Save that file into $BNPYDATADIR, then open a terminal and type:

python -m bnpy.Run CircleK2 MixModel Gauss EM --nLap 50
python -m bnpy.viz.PlotComps CircleK2 MixModel Gauss EM --doPlotData 

Passing data directly into bnpy.run

In general, you can create a Data object directly in Python code and pass it to bnpy.run.

import bnpy

# Placeholder: build any bnpy.data.DataObj subclass (such as XData) from your input
Data = bnpy.data.Data("your custom input")

hmodel, LP, Info = bnpy.run(Data, 'allocModelName', 'obsModelName', 'algName', **kwargs)

Toy data example

Here's an equivalent example for creating the CircleK2 data and passing it directly into bnpy.run.

import numpy as np
import bnpy

PRNG = np.random.RandomState(123)
XA = PRNG.randn(200,2) + [-4, 0]
XB = PRNG.randn(200,2) + [4, 0]
X = np.vstack([XA,XB])

kwargs = dict(nLap=50)
hmodel, LP, Info = bnpy.run(bnpy.data.XData(X), 'MixModel', 'Gauss', 'EM', **kwargs)

XData object: Real Data (analyze with a Gaussian obsModel)

Consider a dataset where the $n$-th observation is a real vector $x_n$ of length $D$. For example, imagine $x_n$ gives the 2D (latitude, longitude) coordinates of the location where a digital photo was taken, so that $D = 2$.

$$ x_n = [x_{n1} \; x_{n2} \; \ldots \; x_{nD}] $$

Next, let our dataset $X$ contain $N_{obs}$ such vectors. In our example, there are $N_{obs}$ distinct photos.

$$ X = \{ x_1, x_2, \ldots, x_{N_{obs}} \} $$

To represent this data, we use the bnpy.data.XData object. An XData object is essentially a thin wrapper around a 2D array $X$ with $N_{obs}$ rows and $D$ columns, where each row is a distinct observation:

$$ X = \begin{bmatrix} - & x_{1} & - \\ - & x_{2} & - \\ & \vdots & \\ - & x_{N_{obs}} & - \end{bmatrix} $$

Representing arrays in numpy

Consider observing 3 distinct vectors, each a latitude and longitude (2D).

(45,91), (50,100), and (53, 120)

We'd represent this as a 2D numpy array as

>>> import numpy as np
>>> X = np.asarray([[45, 91], [50, 100], [53, 120]])
>>> print(X)
[[ 45  91]
 [ 50 100]
 [ 53 120]]

Numpy's indexing allows us to select each row (observation) via the bracket notation.

>>> print(X[0])  # get the first row (observation)
[45 91]
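
Beyond selecting rows, the same array reports the number of observations and dimensions through its shape, and returns a whole column through slice notation. A short sketch using only numpy:

```python
import numpy as np

X = np.asarray([[45, 91], [50, 100], [53, 120]])

n_obs, dim = X.shape      # rows are observations, columns are dimensions
latitudes = X[:, 0]       # first column: every latitude
longitudes = X[:, 1]      # second column: every longitude

print(n_obs, dim)         # 3 2
print(latitudes)          # [45 50 53]
print(longitudes)         # [ 91 100 120]
```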

Creating XData

We can turn a numpy array $X$ into a proper XData object with a single constructor call:

>>> from bnpy.data import XData
>>> myXData = XData(X=X)
>>> print(myXData)
[[ 45  91]
 [ 50 100]
 [ 53 120]]

The constructor needs only one argument: the $N_{obs}$-by-$D$ matrix $X$.
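
Because the constructor expects a two-dimensional matrix, data that starts out one-dimensional (for example, $N_{obs}$ scalar measurements) may need reshaping first. The sketch below uses only numpy and assumes each scalar should become its own row with $D = 1$. Whether XData performs a similar conversion internally is not covered here, so reshaping explicitly is the conservative choice.

```python
import numpy as np

scalars = np.asarray([0.5, 1.5, 2.5])   # shape (3,): three scalar observations
X = scalars.reshape(-1, 1)              # shape (3, 1): N_obs = 3 rows, D = 1 column

print(X.shape)   # (3, 1)
print(X.ndim)    # 2
```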

Manipulating XData

In general, XData objects are accessed and manipulated by bnpy's own code rather than by user code. When you do need to work with the data directly, the underlying array is available as the attribute X.

>>> myXData.X[0] = [-3, -3]  # overwrite the first observation in place
>>> print(myXData)
[[ -3  -3]
 [ 50 100]
 [ 53 120]]
