Goal
Here we describe the bnpy data formats and give examples for how to make bnpy load custom datasets.
As a first step, to process a new dataset with bnpy, you must either
- define a custom Python script for loading your dataset, or
- within Python code, call bnpy.run with a valid bnpy.data.DataObj object
The first option is recommended, because it allows running experiments from the command-line.
Loading data via Dataset Scripts
Suppose you want your dataset to be called "MyCustomData". After defining a script called MyCustomData.py, you can load this data and run experiments with any valid model or learning algorithm.
python -m bnpy.Run MyCustomData [allocModel] [obsModel] [algName] ...
But first, you need to create a Python script called MyCustomData.py, following the template below.
'''
MyCustomData.py

A template for real-valued datasets.
'''
import bnpy.data.XData as XData

def get_data(**kwargs):
    ''' Returns simple dataset of a single observation: number 0
    '''
    return XData(X=[0])
The only requirement of this script is that it implements the get_data interface. This means it defines a function named "get_data" with the following properties:
- INPUT: keyword arguments
- OUTPUT: an instance of any subclass of bnpy.data.DataObj, such as XData
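As a rough illustration of that contract, the sketch below checks whether a module exposes a compatible get_data function. The helper name check_get_data_interface and the in-memory stand-in modules are hypothetical and not part of bnpy. The check only inspects the function signature; it does not verify the return type.

```python
import inspect
import types


def check_get_data_interface(module):
    """Return True if `module` defines a callable get_data accepting **kwargs.

    Hypothetical helper for illustration; not part of bnpy.
    """
    func = getattr(module, "get_data", None)
    if not callable(func):
        return False
    params = inspect.signature(func).parameters.values()
    # VAR_KEYWORD marks a **kwargs-style parameter.
    return any(p.kind == inspect.Parameter.VAR_KEYWORD for p in params)


# Build stand-in modules in memory rather than importing real dataset scripts.
fake_script = types.ModuleType("MyCustomData")
exec("def get_data(seed=123, **kwargs):\n    return [seed]", fake_script.__dict__)

empty_script = types.ModuleType("NoLoader")

print(check_get_data_interface(fake_script))   # True
print(check_get_data_interface(empty_script))  # False
```

A check like this can catch a misspelled function name or a missing **kwargs before an experiment is launched from the command line.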
Toy example with XData
Let's consider a simple dataset of 2D Gaussian observations, generated by two well-separated components with means [-4,0] and [4,0]. We'll define this dataset in a new script, CircleK2.py.
'''
CircleK2.py

Simple dataset of 2D points from two "spherical" components.
Comp A has location [-4, 0]
Comp B has location [4, 0].
'''
import bnpy.data.XData as XData
import numpy as np

def get_data(seed=123, **kwargs):
    PRNG = np.random.RandomState(seed)
    XA = PRNG.randn(200,2) + [-4, 0]
    XB = PRNG.randn(200,2) + [4, 0]
    X = np.vstack([XA,XB])
    return XData(X=X)
Save that file into $BNPYDATADIR, open a terminal, and type
python -m bnpy.Run CircleK2 MixModel Gauss EM --nLap 50
python -m bnpy.viz.PlotComps CircleK2 MixModel Gauss EM --doPlotData
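Before launching a run, it can help to sanity-check what the dataset script generates. The sketch below repeats the CircleK2 sampling with plain numpy (no bnpy import, so the XData wrapper is left out) and inspects the array shape and the per-component means. The helper name make_circle_k2 and its n_per_comp parameter are assumptions made for illustration.

```python
import numpy as np


def make_circle_k2(seed=123, n_per_comp=200):
    """Sample 2D points from two unit-variance components at [-4, 0] and [4, 0]."""
    prng = np.random.RandomState(seed)
    XA = prng.randn(n_per_comp, 2) + [-4, 0]
    XB = prng.randn(n_per_comp, 2) + [4, 0]
    return np.vstack([XA, XB])


X = make_circle_k2()
print(X.shape)  # (400, 2): 400 observations, 2 dimensions

# Rows 0..199 come from component A, rows 200..399 from component B.
mean_A = X[:200].mean(axis=0)
mean_B = X[200:].mean(axis=0)
print(mean_A, mean_B)  # should land close to [-4, 0] and [4, 0]
```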
Passing data directly into bnpy.run
In general, you can create a Data object directly in Python code and hand it off to bnpy.run.
import bnpy

# Placeholder: build any valid bnpy.data.DataObj here (for example, XData)
Data = bnpy.data.Data("your custom input")
hmodel, LP, Info = bnpy.run(Data, 'allocModelName', 'obsModelName', 'algName', **kwargs)
Toy data example
Here's an equivalent example for creating the CircleK2 data and passing it directly into bnpy.run.
import numpy as np
import bnpy

PRNG = np.random.RandomState(123)
XA = PRNG.randn(200,2) + [-4, 0]
XB = PRNG.randn(200,2) + [4, 0]
X = np.vstack([XA,XB])

kwargs = dict(nLap=50)
hmodel, LP, Info = bnpy.run(bnpy.data.XData(X), 'MixModel', 'Gauss', 'EM', **kwargs)
XData object : Real Data (analyze with Gaussian obsmodel)
Consider a dataset where the $n$-th observation is a real vector $x_n$ of length $D$. For example, imagine $x_n$ gives the 2D latitude, longitude coordinates marking the geo-location where a digital photo was taken.
$$ x_n = [x_{n1} \; x_{n2} \; \ldots \; x_{nD}] $$
Next, let our dataset $X$ contain $N_{obs}$ such vectors. In our example, there are $N_{obs}$ distinct photos.
$$ X = \{ x_1, x_2, \ldots, x_{N_{obs}} \} $$
To represent this data, we use the bnpy.data.XData object. An XData object is essentially a thin wrapper around a 2D array $X$, where the rows are distinct observations:
$$ X = \begin{bmatrix} -- \; x_{1} \; -- \\ -- \; x_{2} \; -- \\ \vdots \\ -- \; x_{N_{obs}} \; -- \end{bmatrix} $$
Representing arrays in numpy
Consider observing 3 distinct vectors, each a latitude and longitude (2D):
(45, 91), (50, 100), and (53, 120)
We'd represent this as a 2D numpy array as
>>> import numpy as np
>>> X = np.asarray([[45, 91], [50, 100], [53, 120]])
>>> print(X)
[[ 45  91]
 [ 50 100]
 [ 53 120]]
Numpy's indexing allows us to select each row (observation) via the bracket notation.
>>> print(X[0])  # get the first entry
[45 91]
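The array's shape ties back to the notation used earlier: under the row-per-observation convention, X.shape is ($N_{obs}$, $D$). A short numpy-only sketch of reading off both quantities and slicing by row or by column:

```python
import numpy as np

X = np.asarray([[45, 91], [50, 100], [53, 120]])

n_obs, D = X.shape      # rows are observations, columns are dimensions
print(n_obs, D)         # 3 2

first_obs = X[0]        # one observation: an array of length D
latitudes = X[:, 0]     # first coordinate of every observation
print(first_obs.tolist())   # [45, 91]
print(latitudes.tolist())   # [45, 50, 53]
```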
Creating XData
We can turn a numpy array $X$ into a proper XData object very simply
>>> import bnpy.data.XData as XData
>>> myXData = XData(X=X)
>>> print(myXData)
[[ 45  91]
 [ 50 100]
 [ 53 120]]
The constructor needs only one argument: the $N_{obs}$-by-$D$ matrix $X$.
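Because the constructor expects a 2D matrix, it can be worth validating an array before wrapping it. The helper below, as_obs_matrix, is a hypothetical name and not part of bnpy. It assumes you want an explicit error for anything that is not already $N_{obs}$-by-$D$, and it makes no claim about what XData itself does with such input.

```python
import numpy as np


def as_obs_matrix(X):
    """Convert X to a float 2D array of shape (N_obs, D), or raise ValueError.

    Hypothetical helper for illustration; not part of bnpy.
    """
    X = np.asarray(X, dtype=np.float64)
    if X.ndim != 2:
        raise ValueError("expected a 2D array, got ndim=%d" % X.ndim)
    return X


X = as_obs_matrix([[45, 91], [50, 100], [53, 120]])
print(X.shape)  # (3, 2)

try:
    as_obs_matrix([45, 91])  # a flat vector is ambiguous: 1-by-2 or 2-by-1?
except ValueError as err:
    print(err)  # expected a 2D array, got ndim=1
```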
Manipulating XData
In general, most access and manipulation of XData objects should be done by bnpy code rather than user code. However, accessing and manipulating the data directly is quite simple.
>>> myXData.X[0] = [-3, -3]
>>> print(myXData)
[[ -3  -3]
 [ 50 100]
 [ 53 120]]
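One thing to keep in mind when editing arrays in place is numpy's aliasing behavior: np.asarray returns the same object when it is given an existing array of matching dtype, so in-place writes show up under every name that refers to it. Whether XData copies its input is not stated here, so the sketch below uses plain numpy only and makes no claim about bnpy internals.

```python
import numpy as np

X = np.asarray([[45, 91], [50, 100], [53, 120]])

alias = np.asarray(X)   # no copy: alias and X are the same array object
snapshot = X.copy()     # independent copy, unaffected by later writes

alias[0] = [-3, -3]     # in-place edit of the first observation

print(alias is X)            # True
print(X[0].tolist())         # [-3, -3]
print(snapshot[0].tolist())  # [45, 91]
```

Taking a .copy() before modifying is a simple way to keep the original observations available for comparison.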