numpy2ri/pandas2ri no longer properly update ri2ro

Issue #206 resolved
Dav Clark created an issue

Previously, we were getting automatic conversion of vectors to arrays, and data.frames to pandas DataFrames in the rmagic if pandas was installed. Now, if only numpy is installed, I get errors on vectors, and pandas2ri doesn't do vector conversion.

This is (I think) a regression, and I'm not sure why the tests didn't catch it. I don't have time to fix it right now.

Comments (22)

  1. Laurent Gautier

    pandas2ri should just fail on import if pandas is not installed / not in PYTHONPATH / not in sys.path

  2. Laurent Gautier

    I think that #207 is linked to that.

    There is a regression in the new conversion system in rpy2-2.4.x, in the sense that some confusion what made in the direction R -> Python when converting objects.

    I think that I can come up with a fix reasonably quickly.

  3. Dav Clark reporter

    This also interacts somewhat with the discussion you chimed in on on the pandas issue tracker... In any case, I'm glad you figured out the issue before I even got back to explaining it! I've been away without electricity, not to mention internet. I'll likely be pretty booked through next Wednesday, but happy to work on testing things around then.

  4. Laurent Gautier

    I have fixed #207. Testing/fixing will be very appreciated. If this issue is still present, can you add a unit test triggering it ?

  5. Dav Clark reporter

    I'll try to have another look later this week. I was thinking in response to your musing, and I think it's worth documenting some user models: a technical user (like you) that understands R well enough to be comfortable with python proxies for R objects, vs. someone who is really just black-boxing some of R's functionality. This latter kind of person is the person that presents a problem demanding xx2py style conversion, and they may not have as much need for this conversion.

    But I (currently) think a good over-riding goal would be to consolidate R <-> py conversion code in the rpy2 codebase (which is what I think you're going for).

    Is the above something that would be worth spelling out in the docs? I could take a crack at that.

  6. Laurent Gautier

    I edited the comment in #207. Because I thought that I was just using the comment section as a notepad for future documentation, or discussion about the conversion system modified to use generics/single dispatch (see #197). Also because I realized this triangle of {"non-R Python objects", "rinterface objects", "robjects objects"} was not followed and less clear than what I remembered.

    That's a rewrite of these original notes. I leave up to you whether they should go to the documentation.

    So currently the situation is that there are 3 levels:

    • ri: the rinterface level. This is the low-level interface to R, exposing objects and functions defined in R's C API with eventual modifications and safeguard to ensure that no harm (segfault, concurrent access) can be done)
    • ro: the robjects level. This is the high-level interface to R, exposing the objects of rinterface with a number of convenience methods and attributes.
    • the non-R/rpy2 stuff

    Initially the conversion was designed to ensure that non-R objects are converted into something that R can use whenever needed. The reverse trip (R objects into non-R objects) was left to the user, with the expectation that it would not be an issue because the robjects (and the rinterface to a lesser extent) objects implement methods and attributes found in Python's informal interfaces. For example. R vectors all have a __len__ and a __getitem__/__setitem__, as well as export the buffer interface whenever possible, R functions have a __call__, etc... The system was also designed to let anyone change/customize the robjects layer rather easily.

    However, there was an unanticipated need to provide conversion to non-R/rpy2 objects (I am understanding that various reasons are coming into play - this is for an other post). This lead to use the ri2ro more and more for that (first numpy2ri, then pandas2ri).

    The general idea of conversion is a interesting question. I am appreciating what Julia is doing with. This is a also a question I'd need to think a bit about: the current system in rpy2 is rather ad hoc, and it appears to be "mostly working with eventual glitches".

  7. Dav Clark reporter

    OK, revisiting the issue tracker , it's clear that I was at some point aware of this rather major bug (despite what I claim in #218). In any case, it's much less of a bug now, but I'd like to leave this open until we settle on final semantics for R data.frames to python (as handled by ri2ro. For reasons I'm not clear on, calls into rpy2 will generally now return a numpy array when appropriate. Data frames, however, are not automatically converted. This leads me to suspect that ri2ro is being called selectively somewhere in the guts of rpy2, which seems like making the same decision (but differently) in two places.

    I'd argue that "core" calls to rpy2 (e.g., ro.r('blah'))should just call ri2ro on everything it's returning, or not call ri2ro at all.

    Note in particular, consider the conversion that happens here before and after a call to pandas2ri.activate():

    ro.r('data.frame(a=c(1,2,3))')
    ro.r('c(1,2,3)')
    

    Curiously, the innards of that data.frame are getting converted to numpy arrays after pandas2ri.activate()!

    Anyway, I'll check into this when I get a chance, but things are much less broken now.

  8. Dav Clark reporter

    I've fixed some of the logic for pandas2ri and numpy2ri, so that they successfully convert dataframes into appropriate datatypes. But this exposes some issues, e.g., where robjects.functions.rcall uses ri2ro and ends up converting things to pandas or numpy datatypes. I've tried disabling this call, and it seems to break lots of other things. In any case, I'm finally realizing why you think it'd be easier to whip up a quick generics solution!

    Let me know if you want me to back the maintenence branch back up. I believe things were usefullly working at 2a8f2e5, but I'm not 100% certain, as the tests for interface aren't working even in your last commit (there's another segfault issue).

    But I think if you had a look at my recent commits, it might be obvious to you where the problems remain.

  9. Laurent Gautier

    I'll look at it. Anyone interested in doing it (before me) should not hesitate though.

    I am have just pushed a refactoring of the conversion system that is using singledispatch (39049dfe5706)

  10. Laurent Gautier

    @davclark

    Let me know if you want me to back the maintenance branch back up. I believe things were usefully working at 2a8f2e5, but I'm not 100% certain, as the tests for interface aren't working even in your last commit (there's another segfault issue).

    I do not seem to experience any issue in the branch version_2.4.x. I have the following outcome:

    Ran 376 tests in 5.903s
    
    OK (skipped=1, expected failures=1)
    

    Are you building rpy2 from the source tree ? If so, can you try to rebuild/install after wiping the directory build/ ?

  11. Dav Clark reporter

    Well, that didn't do it, but I'm glad your tests are running OK. I'm running on anaconda, w/ py 2.7.8 and R 3.1.1. I'm a bit tired to diagnose this, but I recon I can go through the individual tests and figure out what's triggering the segfault.

    Also - I see you merged my changes, including removing ANY conversion from rcall. I suspect that's fine, but wanted to make sure you were aware of that change, as it's pretty deep in the core machinery. There's a comment there that could be better worded... it was changed about 3 times as I traced different problems through.

  12. Laurent Gautier

    Thanks for the note about rcall(). I grafted/merged the changes to the branch default (2.5.0-dev) because there is a bit of time to see what is going on. I might revert the changes in the branch version_2.4.x though: the change to rcall() seems break a bit the API (and point bugfix release should not, ideally).

    For the segfault, it should really not happen. File a bug report if it persists...

  13. Dav Clark reporter

    First, tests are working fine on my work laptop, so I can verify that all tests pass (apart from skips and expected fails) when ri2ro() is commented out in rcall().

    The change to rcall() was required because:

    1. once pandas2ro() "correctly" converts R data.frames to pandas DataFrames, then
    2. pandas2ri.activate() changes the behavior of the ri2ro() call in rcall() such that intermediate computations can end up with non-robjects types, so
    3. Intermediate computations involving data.frames start breaking.

    Interestingly, this problem is revealed only through the rmagic tests. For example Rpush ends up passing a pandas dataframe through conversion.py2ri, which ends up going through rcall(), which then calls ri2ro - very counterproductive!

    So, if you leave the ri2ro call in rcall, the following code returns a pandas dataframe, which then confuses rpy2:

    import numpy as np
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    datapy = np.array([(1, 2.9, 'a'), (2, 3.5, 'b'), (3, 2.1, 'c')],
                     dtype=[('x', '<i4'), ('y', '<f8'), ('z', '|S1')])
    print type(ro.conversion.py2ri(datapy)) # pandas.core.frame.DataFrame !!!
    

    If you comment out the ri2ro in rcall, then you get an rinterface.SexpVector - but ideally you'd get an robject...

    I guess I should turn this into a unit tests for robjects?

    And I hope this isn't too verbose, but I want to make sure I clearly communicate the problem, as it's a bit subtle.

    I think we can make a non-API-breaking change here by moving the default_XXX functions out of init.py, so that rcall can just use default_ri2ro (instead of ri2ro). Does that seem reasonable?

  14. Dav Clark reporter

    OK - I've done some more thinking / hacking.

    After fully refactoring the default_XXX conversions over to robjects.conversion, I realized that the reason you had those in init.py is likely because many other symbols need to be imported from other robjects sub-packages, and putting it in conversions yeilds some nasty import cycles.

    So, while it's ugly, I think the right thing is to have another set of empty functions in conversion, and then overwrite them as well from init.py.

    This will preserve the existing behavior of rcall, while still allowing for functional numpy2ri/pandas2ri at the expense of an additional 3 empty function definitions that get monkey patched from init.py, exactly parallel to the existing 3 functions in robjects.conversion, with the understanding that they should NOT EVER be changed later on.

    And now I really appreciate the utility of proper conversion / dispatch mechanisms.

  15. Laurent Gautier

    @davclark What you are experiencing is likely caused by ri2py missing from the conversion (See comment in issue 218), at least in part. In place ri2ro and/or py2ro were occasionally filling that role. A possible driver for that confusion is that shuttling back and forth between Python and ri/ro level objects (proxy to R objects) would appear to many users as a mysterious source of abysmal performances.

    The way forward that I see emerging after the rewrite using generics and single dispatch, is that will be not more ri2py/ro2py (conversion from ri/ro level to Python-no-R-proxy-object) used within rpy2.

    However, conversions will be implemented and offered to users. Practically this means that the possibility to exit the ri/ro proxys and go to Python objects will be easily accessible whenever a user wants it. In the case of "R magic" cell that conversion can be implicitly made when exiting a cell if you like.

  16. Laurent Gautier

    Default conversion functions were declared in robjects/__init__.py because initially everything was in __init__.py and I factored the code out into modules as it grew really big (IIRC). The default conversion functions stayed there because of circular imports.

    There were "empty" functions in conversion.py and they were overwritten in robjects/__init__.py. The problem is, I think, around the missing ri2py while ri2ro was sometimes filling that role.

    Note that the rewrite using generics still defines "empty" functions in conversion.py. https://bitbucket.org/lgautier/rpy2/src/a446ed23cb3989c4fb818c2410fcd0e18c6e57a3/rpy/robjects/conversion.py?at=default#cl-59

    Additional conversion set of rules are implemented as a independent generic using the current conversion rules as a template (see https://bitbucket.org/lgautier/rpy2/src/a446ed23cb3989c4fb818c2410fcd0e18c6e57a3/rpy/robjects/numpy2ri.py?at=default#cl-128 ) but users do not have to conform to that and can just use .register and add what they like.

  17. Dav Clark reporter

    OK - I've pushed something that is minimally hacky (but still hacky) to version_2.4.x. I see what you're doing in default, and that looks good. I think we can close this issue. But I'll leave it up to you.

  18. Laurent Gautier

    Thanks. If the fix you pushed to 2.4.x is both solving the issue and does not break the API (changing the return value of rcall() was looking a little radical for a bugfix point release - better safe than sorry - all possible, but legitimate, breakage can happen for 2.5.0 though), the next 2.4.x release will be on its way. (An entry about the fix in the file NEWS, under release 2.4.4, would be good).

  19. Laurent Gautier

    Move to a generics-based conversion is almost complete with 3a9eb644fcc9 (check the numpy layer). Only pandas2ri needs to be updated/fixed and this will be good to go.

  20. Log in to comment