Object as supported Type

Issue #3 wontfix
Kevin Davenport created an issue

First off, thank you for developing such a great tool. MongoDB would be crazy not to integrate this into their product. What you're doing seems more robust than IOPro: https://store.continuum.io/cshop/iopro/

Collections can have an array as a field, and I believe the only way I can bring that into my Python environment is as type "Object". Currently Mongo offers no way to calculate variance in the aggregation pipeline, so I simply $push values into an array and then import the dictionary results into Python via PyMongo to calculate the variance of the array for the given field. It would be excellent if we could bring in a field as an array of arrays or as an object.
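
For illustration, the workaround described above might be sketched like this (the collection and field names are hypothetical, not from an actual schema):

```python
# Hypothetical sketch: since the aggregation pipeline has no variance
# operator, $push each group's values into an array server-side...
pipeline = [
    {"$group": {"_id": "$sensor_id", "values": {"$push": "$reading"}}}
]
# Running coll.aggregate(pipeline) via PyMongo would then yield
# documents like {"_id": "s1", "values": [4.0, 5.0, 6.0]}.

# ...and compute the (population) variance of each pushed array in Python.
def variance(values):
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n
```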

Thanks again

Comments (3)

  1. David Beach repo owner

    I've given this some thought. The problem with using objects is that they would need to be inflated back into normal Python types, and the cost of doing that would likely kill the performance.

    However, there is another way to go about this that should still give good performance. By developing a new extension to Monary, we could extract two arrays from the collection: a data array and an offsets array. The data array would contain floats (or whatever numeric type you are computing the stats on) and would simply be the flattening of all the corresponding arrays in that field of each document. Since each document is expected to contain multiple values, the data array will typically be longer than the number of documents matched by the query. The offsets array would be of an integer type, indicating the starting index of each sublist within the data array. The length of the offsets array would equal the number of documents queried.

    So, to get the data from the first sublist, you could write `data[offsets[0]:offsets[1]]`.
    The second sublist would be `data[offsets[1]:offsets[2]]`.
    And in general, sublist `i` would be `data[offsets[i]:offsets[i+1]]`, with the last sublist running from `offsets[-1]` to the end of the data array.
    This strategy keeps the number of objects very low (2 arrays) while still preserving the structure of the sublists. It should be possible to implement this strategy without sacrificing speed.

    With that said, it's on the to-do list for the project, but I'm not sure how soon I (or someone else) will be able to get to it.
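
    A minimal NumPy sketch of the two-array layout described above (the names `data`, `offsets`, and `sublist` are illustrative, not an existing Monary API):

    ```python
    import numpy as np

    # Hypothetical flattened result: three documents whose array fields
    # were [1.0, 2.0], [3.0], and [4.0, 5.0, 6.0].
    data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    offsets = np.array([0, 2, 3])  # starting index of each document's sublist

    def sublist(i):
        """Return the i-th document's array as a slice of data (no copy)."""
        end = offsets[i + 1] if i + 1 < len(offsets) else len(data)
        return data[offsets[i]:end]

    # Per-document variance, computed without inflating Python objects:
    variances = [float(np.var(sublist(i))) for i in range(len(offsets))]
    ```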

  2. Kevin Davenport reporter

    That makes sense, David; the cost of Python object overhead slipped my mind. It would be great for everything to stay in NumPy.

  3. A. Jesse Jiryu Davis

    Closing based on the outcome of this discussion. PyMongo is better suited to this kind of usage; eventually, DataFrame integration for Monary might give us the best of both worlds.
