Dataframe conversion has the wrong dimensions

Issue #540 closed
Craig Citro created an issue

Repro notebook: https://colab.research.google.com/drive/1b2aRPr6pUVXxay0FSCpO8umB-Hb6j2XW

When converting a dataframe from Python to R, something is transposing columns into rows.

I didn't dig too much, but I think the problem is this: when converting the pandas dataframe column, we end up here: https://bitbucket.org/rpy2/rpy2/src/4bdd4962f5f4477de36410a670511b33b7f8c201/rpy/robjects/pandas2ri.py?at=default&fileviewer=file-view-default#pandas2ri.py-127:134

The problem is this:

  • we see a column (shape is 1D) that doesn't happen to be all strings/None
  • it's of dtype object
  • we convert it to R, and then assign the index as the names

The last step is the killer: this turns our Nx1 object into a 1xN object.

Comments (6)

  2. Laurent Gautier

    What is happening is that Python/numpy do not have NA values for integers, booleans, or strings, so a column in a pandas DataFrame will be of dtype object as soon as a missing value is needed.

    Meanwhile, the columns in an R data frame must be homogeneous array types the R C-API knows about: integer, double, logical (R's booleans), character (R's strings). R's list objects (which allow heterogeneous types among array elements) cannot be columns in a data frame.

    The rpy2 converter is considering a pandas.Series in isolation and converts a Series of dtype object into the only R object it can: an R list. The R constructor for data.frame then behaves in what can be argued to be a surprising way, as shown in the R example below.

    > data.frame(x = list(1, "a", FALSE), y = 1:3)
      x.1 x..a. x.FALSE y
    1   1     a   FALSE 1
    2   1     a   FALSE 2
    3   1     a   FALSE 3
    

    The converter should probably try to guess whether a suitable R type can be found from examining all values in the Python series (e.g., the dtype might be object but if all Python values are booleans or None this should become an R logical).
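
The guessing step suggested above could look roughly like the sketch below. This is only an illustration in plain Python, not rpy2 code: the function name guess_r_type and the returned type names are hypothetical, and only None is treated as missing here.

```python
def guess_r_type(values):
    """Guess an R vector type name for a sequence of Python values.

    Hypothetical helper for illustration; not part of the rpy2 API.
    None is treated as a missing value (R's NA) and ignored when guessing.
    """
    present = [v for v in values if v is not None]
    if not present:
        # Nothing but missing values: no type can be inferred.
        return "list"
    # bool is a subclass of int in Python, so booleans are tested first
    # and excluded explicitly from the numeric checks.
    if all(isinstance(v, bool) for v in present):
        return "logical"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "integer"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in present):
        return "double"
    if all(isinstance(v, str) for v in present):
        return "character"
    # Mixed types: fall back to an R list, as the converter does today.
    return "list"


print(guess_r_type([True, None, False]))  # logical
print(guess_r_type([1, "a", False]))      # list
```

A converter built along these lines would only fall back to an R list when the values are genuinely heterogeneous, which is the case where data.frame spreads the list elements across columns.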

  3. Laurent Gautier

    pandas is using numpy.nan rather than None as a default to express missing-ness. <sigh>

    Your colab example will fail on columns where all values are numpy.nan
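
A minimal sketch of a missing-value check that accepts both conventions follows. It assumes only that numpy.nan is an ordinary Python float NaN, so float("nan") stands in for it here; the helper names is_missing and all_missing are hypothetical.

```python
import math


def is_missing(value):
    """Return True for None or a float NaN (numpy.nan is a float NaN)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def all_missing(values):
    """Return True when every value in the column is missing."""
    return all(is_missing(v) for v in values)


nan = float("nan")  # stand-in for numpy.nan in this sketch

print(is_missing(nan))                 # True
print(is_missing(None))                # True
print(all_missing([nan, nan, nan]))    # True
print(all_missing([True, nan, None]))  # False
```

A column for which all_missing is true carries no type information at all, so a type-guessing converter needs an explicit fallback for it.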
