pandas2ri.py2ri uses way too much RAM

Issue #421 new
ELToulemonde
created an issue

I'm trying to set up rpy2 on my computer.

My config:

  • Windows 10, 16 GB RAM
  • Anaconda 4.4.0 with Python 3.6
  • An independent R installation: 3.4.1

Python packages:

  • rpy2: 2.8.6
  • pandas: 0.20.3

The first thing I do is load my data set (Adult from the UCI repository):

import pandas as pd
original_data = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=[
        "Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Marital Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"],
    sep=r'\s*,\s*',
    engine='python',
    na_values="?")
original_data.head()

import sys
sys.getsizeof(original_data)
>> 34605590

Now that I have imported my data set, which is not that big, I start the R interface as shown in the rpy2 documentation:

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

Check the R memory limit (result in MB):

r('memory.limit()')
>> array([ 2047.])

2 GB, so I should be safe... I try to pass my data set to R.

r_dataframe = pandas2ri.py2ri(original_data)

---------------------------------------------------------------------------
RRuntimeError                             Traceback (most recent call last)
<ipython-input-17-5921ca574db2> in <module>()
----> 1 r_dataframe = pandas2ri.py2ri(original_data)

C:\Users\touleem\AppData\Local\Continuum\Anaconda3\lib\functools.py in wrapper(*args, **kw)
    801 
    802     def wrapper(*args, **kw):
--> 803         return dispatch(args[0].__class__)(*args, **kw)
    804 
    805     registry[object] = func

C:\Users\touleem\AppData\Local\Continuum\Anaconda3\lib\site-packages\rpy2\robjects\pandas2ri.py in py2ri_pandasdataframe(obj)
     58             od[name] = StrVector(values)
     59 
---> 60     return DataFrame(od)
     61 
     62 @py2ri.register(PandasIndex)

C:\Users\touleem\AppData\Local\Continuum\Anaconda3\lib\site-packages\rpy2\robjects\vectors.py in __init__(self, obj)
    956                                  " of type VECSXP")
    957 
--> 958             df = baseenv_ri.get("data.frame").rcall(tuple(kv), globalenv_ri)
    959             super(DataFrame, self).__init__(df)
    960

RRuntimeError: Error: cannot allocate vector of size 254 Kb

Please note that before failing, my RAM usage exploded (from 5 GB to 16 GB :O).

Why does it take so much RAM?

PS: I found a workaround: using numpy2ri.py2ri and then feeding complementary information to R to rebuild a data.frame. A sketch is below.
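
A sketch of that workaround (approximate only; the helper name and the exact handling of str columns here are illustrative, not the verbatim code):

import numpy as np
from rpy2.robjects import r, numpy2ri
from rpy2.robjects.vectors import StrVector

def pandas_to_r(df):
    # Convert each column separately with the numpy converter.
    cols = []
    for name in df.columns:
        arr = df[name].values
        if arr.dtype == object:
            # A float NaN hiding in a str column trips the converter,
            # so blank it out and force a real str dtype first.
            arr = df[name].fillna('').values.astype(str)
        cols.append(numpy2ri.py2ri(arr))
    # Rebuild the data.frame on the R side, then restore the column
    # names (data.frame() only saw positional arguments).
    rdf = r['data.frame'](*cols, stringsAsFactors=False)
    return r['setNames'](rdf, StrVector(list(df.columns)))

r_dataframe = pandas_to_r(original_data)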

Thanks for your help

Comments (4)

  1. Laurent Gautier

    That seems odd. Maybe R is making unnecessary copies when running df = baseenv_ri.get("data.frame").rcall(tuple(kv), globalenv_ri). Did you try using R to read the CSV?

    from rpy2.robjects.packages import importr

    utils = importr('utils')
    dataf = utils.read_csv(...)
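
    For instance, with the file from the original post (header=False and na_strings="?" are guesses at the right arguments; importr translates R's read.csv to read_csv, and underscores in keyword arguments map back to dots, so na_strings becomes na.strings):

    from rpy2.robjects.packages import importr

    utils = importr('utils')
    # read_csv here is R's read.csv after importr's name translation
    dataf = utils.read_csv(
        "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        header=False, na_strings="?")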
    
  2. Ben Ball

    Hello Laurent,

    I was able to run utils.read_csv OK.

    I actually tried to track this problem down, and reduced it to a specific scenario in my case:

    I have a two-column pandas DataFrame. The first column has dtype int64; the second has dtype object and contains str values. The str column has one missing entry, which ends up as a float NaN (while all the other entries are proper Python str values), as shown below.
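
    To illustrate, a minimal check (the missing entry really is a float sitting among str values):

    >>> import numpy as np
    >>> import pandas as pd
    >>> s = pd.Series(['test', np.nan])
    >>> s.dtype
    dtype('O')
    >>> [type(x) for x in s]
    [<class 'str'>, <class 'float'>]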

    Feeding this DataFrame into pandas2ri.py2ri results in the memory leak and crash described in this issue. It only happens above some minimum number of rows; I am not sure of the exact threshold, but with a small enough number of rows it is fine.

    I fixed my problem by using DataFrame.fillna to convert the NaN in my dtype object column to a blank string.
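
    In code, that fix is a one-liner ('rsc' being the str column, as in the reproduction further down):

    # replace float NaN entries with an empty string before converting
    dft['rsc'] = dft['rsc'].fillna('')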

  3. Laurent Gautier

    Thanks. If you have a self-contained example that reproduces the issue, that is always helpful.

    In the meantime, here are initial notes.

    In my experience, arrays of strings in numpy (pandas relies on numpy for its arrays) can be of dtype "<U[0-9]+", object, or "S" (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html):

    >>> import numpy as np
    >>> a_u1 = np.asarray(['a', 'b', 'c'])
    >>> a_u1
    array(['a', 'b', 'c'],
          dtype='<U1')
    >>> a_obj = np.asarray(['a', 'b', 'c'], dtype=object)
    >>> a_obj
    array(['a', 'b', 'c'], dtype=object)
    

    Having a None in the sequence changes the default dtype to object:

    >>> a = np.asarray(['a', None, 'c'])
    >>> a
    array(['a', None, 'c'], dtype=object)
    

    Now, how does rpy2 handle this?

    >>> from rpy2 import rinterface
    >>> rinterface.initr()
    >>> tuple(rinterface.StrSexpVector(a))
    ('a', 'None', 'c')
    

    Not perfect, as one might expect an NA instead of the string "None", but no float in sight.

    Or maybe you meant that missing values in your Python/pandas vector are encoded as a float (the value NaN)?

    >>> type(np.NaN)
    float
    >>> a = np.asarray(['a', np.NaN, 'c']) 
    >>> a
    array(['a', 'nan', 'c'], 
          dtype='<U3')
    >>> tuple(rinterface.StrSexpVector(a))
    ('a', 'nan', 'c')
    

    Still no float.

    Now trying the numpy converter:

    >>> from rpy2.robjects import numpy2ri
    >>> a = np.asarray(['a', np.NaN, 'c'])
    >>> numpy2ri.py2ro(a) 
    R object with classes: ('array',) mapped to:
    <StrVector - Python:0x7fdde34a6bc8 / R:0x5556b9c7e240>
    ['a', 'nan', 'c']
    >>> a = np.asarray(['a', None, 'c'])
    >>> numpy2ri.py2ro(a)
    (... calling stack printed ...)
    NotImplementedError: Conversion 'py2ri' not defined for objects of type '<class 'NoneType'>'
    

    The error is currently intended: the dtype of the numpy array is object, and I was unsure about what the conversion of a Python None to R should be (as R's NA values do have a type). Now that there is more empirical experience with the Python-R bridge, it would probably make sense to convert it to the type of the symbol NA (that is, boolean). Still no float, though.
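
    For reference, NA's type can be checked from rpy2 (it is logical in R):

    >>> from rpy2.robjects import r
    >>> tuple(r('class(NA)'))
    ('logical',)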

    From this, I am only seeing two possible paths to the behavior experienced:

    • The None raises an exception somewhere at the C level (could be in rpy2 code, or in the Cython part of pandas) that is not handled properly
    • An issue with string encoding (the above examples are with Python 3.5, and the environment setting LANG=en_US.UTF-8)
  4. Ben Ball

    Here is my minimal reproduction code:

    import pandas as pd
    import numpy as np
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()

    n = 100000  # renamed from `len` to avoid shadowing the builtin
    dft = pd.DataFrame(np.random.randint(0, n, size=(n, 1)), columns=['num'])
    rsc = ['test'] * (n - 1)
    rsc.append(np.nan)  # a single float NaN in an otherwise all-str column
    dft['rsc'] = rsc

    # This will cause the memory crash:
    rr = pandas2ri.py2ri(dft)
    

    I tried with n = 7500, and it still crashes. I notice, though, that the function call returns, but then another core on my machine spikes and eats the memory. So I suspect some internal code is doing garbage cleanup, given the delayed way it occurs after the result is returned.
