Class for Graph-Based Data

Issue #32 new
Will Stephenson created an issue

There's currently no support for generic graph-based data, in which the underlying data would be stored as an N x N x D matrix. I suggest making a GraphXData.py class to support this data.

Data Format

The data would just be stored as an N x N x D matrix. The changes to the ObsModel in issue #31 would make the obsmodels oblivious to this difference by setting numLinearDims = 2.
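
As a rough illustration of this storage (a sketch only; Xflat and the shapes here are just for exposition):

    import numpy as np

    N, D = 10, 3
    X = np.zeros((N, N, D))  # X[i, j, :] holds the D-dim observation on edge (i, j)

    # With numLinearDims = 2 (per issue #31), an obsmodel can treat the two
    # leading axes as one flat collection of N^2 observations:
    Xflat = X.reshape(N * N, D)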

TrueZ Format

This will be an N x N matrix, with entries in whatever format is needed. For the case of the MMSB, each entry of the matrix will be a tuple (r_ij, s_ij) (although most real datasets won't have this information).

Online Algorithms

For online algorithms, how complicated things need to be depends a lot on what we want the batches to look like. The minimally complicated but still probably reasonable strategy would be to make a batch some subset of the rows of the data matrix. Methods such as select_subset_by_mask and get_random_sample would then refer to selecting rows of the data matrix. The fact that a GraphXData object is a subset of a graph would be indicated by nObs != nObsTotal = N.

How we would want add_data to behave isn't totally clear; it seems like this would depend on the details of how deletes might work in the graph setting (does it need to add edges to existing nodes, or only add new nodes to the graph?).

In any case, online algorithms will probably require a way of keeping track of which person the data in each batch came from (e.g. "which person does this row correspond to?"). For example, the MMSB needs this information to decide which \pi_i to update. In the case of row-based batches, this information can be stored as a vector of node IDs, as in the sketch below.
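
A sketch of how row-based batches plus this bookkeeping could fit together (class layout, nodeIDs, and the constructor signature here are all illustrative, not a settled API):

    import numpy as np

    class GraphXData(object):
        ''' Sketch: rows of an N x N x D graph, with node bookkeeping. '''

        def __init__(self, X, nodeIDs=None, nObsTotal=None):
            self.X = X  # nObs x N x D slab of rows
            self.nObs = X.shape[0]
            self.nObsTotal = X.shape[0] if nObsTotal is None else nObsTotal
            # nodeIDs[r] answers "which person does row r correspond to?"
            self.nodeIDs = np.arange(self.nObs) if nodeIDs is None else nodeIDs

        def select_subset_by_mask(self, mask):
            # A batch is a subset of rows; nObs != nObsTotal marks it as a subset.
            return GraphXData(self.X[mask], self.nodeIDs[mask], self.nObsTotal)

get_random_sample would then just be select_subset_by_mask with a random mask.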

Comments (5)

  1. Mike Hughes repo owner

    Thanks Will, this is a great start.

    Overall, I'm not certain whether a dense matrix of size N x N x D is the right idea here for long-term progress.

    Seems like in lots of the datasets, we might have very sparse observations of edges with lots of missing data. In these cases, we probably want something that looks more like E x D, where E is number of observed edges.

    So the required attributes of NetworkData could instead be something like:

    • src_node_id : 1D array, size E, type int32
    • dest_node_id : 1D array, size E, type int32
    • data : 2D array, size E x D, type <data type>

    It's much more obvious that this will work efficiently with minibatches, since a minibatch is just a subset of edges.

    It's also slightly more clear how to add some new observations (edges), and how to select a subset (of edges).
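
    A quick sketch of this edge-list layout and how an edge-based minibatch would fall out of it (array contents here are random placeholders):

        import numpy as np

        E, D = 1000, 2
        src_node_id = np.random.randint(0, 50, size=E).astype(np.int32)
        dest_node_id = np.random.randint(0, 50, size=E).astype(np.int32)
        data = np.random.randn(E, D)  # one D-dim observation per observed edge

        # A minibatch is just a subset of edges:
        batch_ids = np.random.choice(E, size=100, replace=False)
        batch = (src_node_id[batch_ids], dest_node_id[batch_ids], data[batch_ids])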

    Of course, we still need to ask how this format would fit into the required obsmodel computations. For example: do we encode the responsibilities as a 3D array (E x K x K) or a 2D array (E x J), where J = K^2? The first option seems simpler, but I can see arguments for the second option too.
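
    For what it's worth, the two encodings are only a reshape apart, so whichever we pick, converting is cheap:

        import numpy as np

        E, K = 100, 5
        resp3D = np.random.rand(E, K, K)
        resp3D /= resp3D.sum(axis=(1, 2), keepdims=True)  # one K x K simplex per edge

        resp2D = resp3D.reshape(E, K * K)  # the E x J view, J = K^2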

  2. Will Stephenson reporter

    Attached is a PDF outlining some more issues and proposed solutions related to this, one of the main points being avoiding the creation of an N^2 x K x K E_log_soft_ev matrix by obsmodels.

    These changes will still involve allocating an N^2 x K x K 'resp' matrix in allocmodel.calcLocalParams. I'm pretty sure this can be fixed too within the framework of these changes: you still have to do O(N^2) computation at this step, but you can build sufficient statistics as you go instead of computing the full phi matrix and then summing it.
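
    Roughly what I have in mind, as a toy numpy sketch (the E_q[log \pi] membership terms are left out for brevity, and all names are stand-ins):

        import numpy as np

        np.random.seed(0)
        N, K = 50, 5
        X = (np.random.rand(N, N) < 0.1).astype(np.float64)  # toy binary graph
        Elogphi = np.log(np.random.rand(K, K))    # stand-in for E[log phi]
        Elog1mphi = np.log(np.random.rand(K, K))  # stand-in for E[log(1 - phi)]

        # Still O(N^2 K^2) work, but never materializes an N^2 x K x K array.
        RespSum = np.zeros((K, K))   # sum_ij phi_ij
        RespXSum = np.zeros((K, K))  # sum_ij phi_ij * x_ij
        for i in range(N):
            for j in range(N):
                if i == j:
                    continue
                logResp = X[i, j] * Elogphi + (1.0 - X[i, j]) * Elog1mphi
                resp = np.exp(logResp - logResp.max())
                resp /= resp.sum()
                RespSum += resp
                RespXSum += resp * X[i, j]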

  3. Mike Hughes repo owner

    Thanks Will, this raises some good issues. I think you're on the right track with both of them.

    RE: Missing Data

    Probably we want (for now) to explicitly store all observations that are not missing (aka observed). Suppose there are E observed edges.

    That's easy to handle in the current proposal. We set the data field to a 1D array of size E that (in binary case) might have both ones and zeros. The corresponding indicator vectors that identify source and dest nodes remain 1D arrays of size E.

    This easily handles the case of Gaussian observed data, too. The case of discrete data is trickier; we could store it densely for now.

    RE: obsmodels and soft evidence

    First, I don't agree with your claim that "we're completely missing the E[log(1 - \phi)] term." In fact, inside BernObsModel you can clearly see that the calcLogSoftEvMatrix function computes this:

        L = np.dot(Data.X, ElogphiT) + np.dot(1.0-Data.X, Elog1mphiT)
    

    However, I do agree that with binary data the soft evidence matrix may be redundant. One possibility is to have the obsmodel instead just hand off the matrices ElogphiT and Elog1mphiT directly, and let the allocmodel deal with them if necessary.

  4. Mike Hughes repo owner

    To follow up on the missing data point, I would be OK if we allowed a sparse binary case where basically we assume that all non-stored edges are observed to be zero. But regardless, I think we should implement the "all observed edges are explicitly stored" case first, and make sure that's working.

    We may also want to think about data structures like dicts instead of 1D arrays to hold the present (x=1) and absent (x=0) observed edges. This might allow much faster look-ups to decide what the soft evidence is.
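
    For instance (names and values purely illustrative), a dict keyed by (src, dest) gives O(1) lookups and cleanly distinguishes missing pairs from observed-absent edges:

        import numpy as np

        K = 5
        src_node_id = np.array([0, 0, 2], dtype=np.int32)
        dest_node_id = np.array([1, 2, 1], dtype=np.int32)
        X = np.array([1.0, 0.0, 1.0])  # observed binary edge values
        ElogphiT = np.log(np.random.rand(K, K))    # stand-in expectations
        Elog1mphiT = np.log(np.random.rand(K, K))

        # (src, dest) -> observed value; pairs not in the dict are missing,
        # which is different from an observed absent edge (x = 0).
        edgeVal = dict(zip(zip(src_node_id.tolist(), dest_node_id.tolist()), X))

        def soft_ev_for_pair(i, j):
            x = edgeVal.get((i, j))
            if x is None:
                return None  # missing pair: contributes nothing
            return x * ElogphiT + (1.0 - x) * Elog1mphiT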

  5. Will Stephenson reporter

    Just to be clear: the issues raised in the attached PDF are things that I think only come up when we're trying to do sparse storage (probably they should have gone in a second post). Without sparse storage, I totally agree that the softEv matrix isn't missing anything, and we don't have to worry about "is this data missing or just sparse?" I'm working on the mis-named taggedXData branch to get things working without considering either of these issues. I'll put in a pull request for that before dealing with sparsity.

    Assuming the sparse format (for binary data) that we pick is to just toss out all the zeros, leaving GraphXData.X to be a hopefully short vector of 1's, then it does seem like the softEv matrix will be missing Elog1mphiT because:

        L = np.dot(Data.X, ElogphiT) + np.dot(1.0-Data.X, Elog1mphiT)

    will never have L[i] include Elog1mphiT, since Data.X[i] = 1 for all i (so the 1.0 - Data.X term is identically zero).
