Problem converting some column types from pandas

Issue #130 resolved
Dav Clark created an issue

I discovered this working on my thesis (dataset NDI_Pro2). I'll track down the details later, but want this here as a placeholder. I would assign to myself, but I can't! Also - version was actually 2.3.4 (but version tags are not up to date, it seems?)

/opt/anaconda/envs/thesis/lib/python2.7/site-packages/rpy2/robjects/pandas2ri.pyc in pandas2ri(obj)
     34     elif isinstance(obj, PandasSeries):
     35         # converted as a numpy array
---> 36         res = original_conversion(obj)
     37         # "index" is equivalent to "names" in R
     38         if obj.ndim == 1:

/opt/anaconda/envs/thesis/lib/python2.7/site-packages/rpy2/robjects/numpy2ri.pyc in numpy2ri(o)
     54         # It should be impossible to get here:
     55         else:
---> 56             raise(ValueError("Unknown numpy array type."))
     57     else:
     58         res = ro.default_py2ri(o)

Comments (12)

  1. Laurent Gautier
    • changed version to 2.3.4

    I think that one needs to have write access to be assigned an issue.

    There were fixes since 2.3.4. Check if the problem is still there with 2.3.6 (latest release)

  2. Dav Clark reporter

    Problem goes away with rpy2 2.4.6, but I have some new strange problem. I'll report more when I have more info.

  3. Dav Clark reporter

    This does it for me:

    import pandas as pd
    %load_ext rmagic
    a = pd.DataFrame(dict(dates=['05-01-2001', '04-01-2013'], not_necessary=[1, 2]))
    a.dates = pd.to_datetime(a.dates)
    %R -i a print(a$dates)
    
  4. Laurent Gautier

    I rewrote the example to function without ipython:

    import pandas as pd
    a = pd.DataFrame(dict(dates=['05-01-2001', '04-01-2013'], not_necessary=[1, 2]))
    a.dates = pd.to_datetime(a.dates) # works without this line
    
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()
    # trigger the error:
    r_dataf = pandas2ri.pandas2ri(a)
    

    The problem seems to originate from translating the pandas index to R names. For some reason R does not expand the names properly, and the result is NA values for all but the first element. R uses tricks internally, such as representing consecutive integers as a slice-like object at the C level, and here we end up with "0:1" (the string representation of the slice). I am tempted to see a bug in R; I'll try to look more into this, if no one does before me.

    import rpy2.robjects as ro
    from rpy2.robjects.vectors import ListVector, POSIXct
    # 'a' is the DataFrame built in the example above
    obj = a['dates']
    pct = POSIXct(obj)
    pct.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
    print(pct.names)
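
    To see which columns are involved before attempting the conversion, a pandas-only sketch like the following may help. It lists the datetime columns of the example frame and, as a possible stopgap, turns them into ISO date strings on a copy. The names `datetime_cols` and `safe` are made up for illustration, and whether string columns are an acceptable substitute on the R side is an assumption, not something confirmed in this issue.

```python
import pandas as pd

# Same frame as in the example above
a = pd.DataFrame(dict(dates=['05-01-2001', '04-01-2013'], not_necessary=[1, 2]))
a["dates"] = pd.to_datetime(a["dates"])

# Columns whose dtype is a datetime64 variant
datetime_cols = list(a.select_dtypes(include=["datetime64"]).columns)

# Possible stopgap (assumption): hand the dates over as plain strings
safe = a.copy()
for col in datetime_cols:
    safe[col] = safe[col].dt.strftime("%Y-%m-%d")

print(datetime_cols)
print(safe)
```

    The original frame is left untouched, so the datetime column is still available on the Python side.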
    
  5. Dav Clark reporter

    In general, pandas and R dataframes have some really different semantics surrounding indices - and this may be a point to improve things on.

    1) pandas allows repeated index names. If there's a repeated index name, the conversion via rpy2 to R results in integer rownames.

    2) Pandas allows hierarchical index names, which allows really clean separation of data and metadata. Consider how many times folks have implemented reshape in R! Semantically, indices in pandas are used the way that "regular" columns are used in R. They even have names (like columns). Currently, if I have a 2-level index, I end up with columns in R like:

    "(u'R_6RPLGTE8ecJg7M9', 'knwgbl_pre')"

    Which is a royal pain (though avoidable by converting indices to columns in pandas before conversion).
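
    A minimal sketch of that workaround, using pandas only. The level names `respondent` and `measure`, the `knwgbl_post` label, and the `score` values are hypothetical; the point is only that `reset_index()` moves the index levels into ordinary columns, so nothing tuple-like is left to be converted.

```python
import pandas as pd

# Toy frame with a 2-level row index (names and values are made up)
index = pd.MultiIndex.from_tuples(
    [("R_6RPLGTE8ecJg7M9", "knwgbl_pre"), ("R_6RPLGTE8ecJg7M9", "knwgbl_post")],
    names=["respondent", "measure"],
)
df = pd.DataFrame({"score": [3, 5]}, index=index)

# Move both index levels into regular columns; the index becomes 0..n-1
flat = df.reset_index()

print(list(flat.columns))
```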

    I know this isn't the best place to discuss this, but if you're thinking about columns and indices, I wanted to bring this to your attention.

  6. Laurent Gautier

    1) pandas allows repeated index names. If there's a repeated index name, the conversion via rpy2 to R results in integer rownames.

    Although R's data.frames are clearly reminiscent of SQL tables, R just does not have indexes. The closest thing is row names, with the twist that when two rows have the same name, the first such row is returned when called by name. Conceptually, I think that pandas indexes are like SQL indexes (the multi-indexes make this even more apparent), and an open question is why there has been no attempt to leverage projects like SQLAlchemy and provide a common interface (more on the SQL thread below).
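
    For comparison, a small pandas-only sketch of the duplicate-label case (toy data, hypothetical names): label-based lookup in pandas returns every matching row rather than only the first one.

```python
import pandas as pd

# Toy frame with a repeated index label
df = pd.DataFrame({"value": [10, 20, 30]}, index=["a", "a", "b"])

# Lookup by label returns all rows carrying that label
matches = df.loc["a"]

print(df.index.is_unique)
print(matches)
```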

    2) Pandas allows hierarchical index names, which allows really clean separation of data and metadata. Consider how many times folks have implemented reshape in R!

    I do not think that the index covers much of the metadata aspect (the AnnotatedDataFrame in Bioconductor is more complete, I think).

    Semantically, indices in pandas are used the way that "regular" columns are used in R. They even have names (like columns). Currently, if I have a 2-level index, I end up with columns in R like:

    "(u'R_6RPLGTE8ecJg7M9', 'knwgbl_pre')"

    IMHO a broader approach would be to consider entities that are hybrids between SQL indexes and views.

    I know this isn't the best place to discuss this, but if you're thinking about columns and indices, I wanted to bring this to your attention.

    I do have thoughts (see above), and while I agree that this is not the best place to discuss this, I am very much supportive of discussing ideas when and where inspiration strikes. Also, there may be no single good place to discuss this, as there are several projects that might find it interesting (DataFrame in Julia, a possible extension to the R data.frames, etc.).

  7. Dav Clark reporter

    Let's keep this simmering for now. I'll be working with Wes over the summer on pandas and R related stuff (if all goes according to plan), so I should be well positioned to make some progress here.

    I note also that the reshape function in R stats (standard lib) annotates reshaped dataframes with all kinds of stuff, as shown in the output of the following:

    wide <- reshape(Indometh, v.names = "conc", idvar = "Subject",
                     timevar = "time", direction = "wide")
    attributes(wide)
    

    So, a first step might be to look at a reasonable choice for the "best" reshape package and use its conventions for annotating different kinds of columns. Perhaps not doing this automatically for now, though...
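
    For reference, a rough pandas counterpart of the wide reshape above, as a sketch with toy values (not the real Indometh data). Here the roles of the columns are recorded in the index and column names rather than in attributes, which is one possible convention to compare against.

```python
import pandas as pd

# Toy long-format data shaped like Indometh (values are made up)
long = pd.DataFrame({
    "Subject": [1, 1, 2, 2],
    "time": [0.25, 0.5, 0.25, 0.5],
    "conc": [1.0, 0.8, 2.0, 1.6],
})

# idvar -> row index, timevar -> column index, v.names -> cell values
wide = long.pivot(index="Subject", columns="time", values="conc")

print(wide.index.name, wide.columns.name)
print(wide)
```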

  8. Laurent Gautier

    On the contrary: strike while the iron is hot, and the summer is almost here already. I'll try putting together all thoughts I had about this.

    Regarding the "best" reshape package, I'd be tempted to say that Hadley's stuff ("reshape2", "plyr") would be the first place to look. He has built on the experience gained with the original "reshape()" (which cannot change because of backward-compatibility concerns).
