Missing features in rpy2's pandas2ri conversion mechanisms compared to pandas

Issue #282 closed
Joris Van den Bossche created an issue

Just opening this issue to have a counterpart to https://github.com/pydata/pandas/issues/9602 on rpy2's tracker.

Summary: pandas' own tools for converting between pandas and R objects are deprecated, but there are still some aspects where those functions work better than the current pandas2ri functions in rpy2.

What has been reported so far in the pandas issue (there could be more):

  • handling of NaNs
  • index is not preserved in ri2py

Comments (14)

  1. Laurent Gautier

    Thanks. It is better to track rpy2 issues here.

    Would it be possible to add details? For example, the behavior with NaNs seems to be what one would expect:

    import pandas
    pd_dataf = pandas.DataFrame({'a': (1, 2, float('nan'))})
    from rpy2.robjects import pandas2ri
    >>> print(pandas2ri.py2ri(pd_dataf))
    0   1
    1   2
    2 NaN
  2. Jonathan Owen

    I've also noticed a difference in the conversion of a single-row DataFrame: the index is lost.

    import pandas
    import pandas.rpy.common as com
    import rpy2.robjects as ro
    from rpy2.robjects import pandas2ri
    pd = pandas.DataFrame({'VALUE':5}, index=['IndexA'])
    >>> tmp = ro.r['print'](com.convert_to_r_dataframe(pd))
    IndexA     5
    >>> tmp = ro.r['print'](pandas2ri.py2ri(pd))
    1     5
  3. Laurent Gautier

    @jrowen , @jorisvandenbossche : it is easier to have one issue per thread. I moved the index problem into its own issue (issue #285).

  4. Joris Van den Bossche reporter

    I can't look into this in detail at the moment, but in the pandas issue I posted a link to a notebook where I explored some things: http://nbviewer.ipython.org/gist/jorisvandenbossche/d0deaa53ace697e1514e

    In any case, the previous pandas conversion machinery somehow managed to retain the index.
    And the NaN value is indeed preserved in a round trip from pandas to rpy2 and back, but as you can see in the repr, it has a different 'value' (NA_real_ vs nan).

  5. Laurent Gautier

    @jorisvandenbossche : I moved the index part to issue #285. Besides that, converting to strings and then guessing the type back from the string on a round trip does not look like a good idea. Scientific folklore is full of stories of Excel "guessing" data types.

  6. Joris Van den Bossche reporter

    Yes, I agree that the guessing is not a very nice API. But could string indexes at least be preserved? Also, for a time series, all index information is currently lost. Of course, we can say it is up to the user to turn the index into a column first. Or would it be an option to add a keyword argument to the pandas2ri.py2ri function that controls whether the index is included as a column?
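
The workaround mentioned above (turning the index into a column before conversion) could look like the following pandas-only sketch. The column name `row_id` is a hypothetical choice for illustration, and the actual conversion call to R is left out, since only the reshaping step is shown here.

```python
import pandas as pd

# Single-row frame with a string index, as in the example earlier in the thread.
df = pd.DataFrame({'VALUE': [5]}, index=['IndexA'])

# Give the index a name (hypothetical: 'row_id') and move it into a regular
# column, so the labels travel with the data as an ordinary column.
flat = df.rename_axis('row_id').reset_index()

# After a round trip, the column can be turned back into the index.
restored = flat.set_index('row_id')

print(flat)
print(restored)
```

With this approach the index labels are ordinary data, so nothing has to be guessed from strings on the way back.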

  7. Laurent Gautier

    R makes a distinction between NaN (not a number) and NA (missing value).

    > NaN
    [1] NaN
    > NA_real_
    [1] NA

    I am inclined to map Python's nan to R's NaN (I do not actually have to do anything for that, as it is an IEEE standard value used by both R and Python), and to keep NA distinct from NaN.
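
To make the NaN/NA distinction concrete, here is a minimal standard-library sketch. It assumes that R encodes `NA_real_` as an IEEE NaN whose low 32-bit word is 1954, which is how R's own sources tag it. The helper names `low_word` and `looks_like_r_na` are hypothetical and are not part of rpy2 or pandas.

```python
import math
import struct

R_NA_LOW_WORD = 1954  # assumption: payload R uses to tag NA_real_


def low_word(x: float) -> int:
    """Return the low 32 bits of the IEEE 754 representation of x."""
    (bits,) = struct.unpack('<Q', struct.pack('<d', x))
    return bits & 0xFFFFFFFF


def looks_like_r_na(x: float) -> bool:
    """True if x is a NaN carrying the payload assumed for R's NA_real_."""
    return math.isnan(x) and low_word(x) == R_NA_LOW_WORD


# An ordinary Python NaN.
py_nan = float('nan')

# A NaN built from the bit pattern assumed for NA_real_ (high word 0x7FF00000).
(r_na,) = struct.unpack('<d', struct.pack('<Q', 0x7FF00000000007A2))

print(math.isnan(py_nan), looks_like_r_na(py_nan))
print(math.isnan(r_na), looks_like_r_na(r_na))
```

Both values are NaN as far as IEEE arithmetic is concerned; only the payload differs, which is why the two can be told apart on the R side.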

  8. Cesar Bonilla

    Hi, I am using rpy version 1 because of this, and I get an error related to:

    line 325, in convert_to_r_dataframe
        value = VECTOR_TYPES[value_type](value)
    KeyError: <type 'numpy.int64'>

    What should I do?

  9. Laurent Gautier

    @cbonilla20 : this is for rpy2. I think that development and support for rpy stopped quite a while ago.

  10. Laurent Gautier

    @jorisvandenbossche : My understanding is that the question about NAs is answered (NaN != NA). If there is no objection, I'll close the issue.
