pandas2ri.py2ri uses way too much RAM

Issue #421 (resolved)
ELToulemonde created an issue

I'm trying to set up rpy2 on my computer.

My config:

  • Windows 10, 16 GB RAM
  • Anaconda: 4.4.0 with python 3.6
  • An independent R installation: 3.4.1

Python packages:

  • rpy2: 2.8.6
  • pandas: 0.20.3

The first thing I do is load my data set (Adult, from the UCI repository):

import pandas as pd
original_data = pd.read_csv(
    "http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
    names=[
        "Age", "Workclass", "fnlwgt", "Education", "Education-Num", "Martial Status",
        "Occupation", "Relationship", "Race", "Sex", "Capital Gain", "Capital Loss",
        "Hours per week", "Country", "Target"],
    sep=r'\s*,\s*',
    engine='python',
    na_values="?")
original_data.head()

import sys
sys.getsizeof(original_data)
>> 34605590

So I have imported my data set, and it is not that big... Now I start the R interface as shown in the rpy2 documentation:

from rpy2.robjects import r, pandas2ri
pandas2ri.activate()

Check the R memory limit (the result is in MB):

import rpy2.robjects as robjects
r('memory.limit()')
>> array([ 2047.])

2 GB, I should be safe... I try to pass my data set to R:

r_dataframe = pandas2ri.py2ri(original_data)

---------------------------------------------------------------------------
RRuntimeError                             Traceback (most recent call last)
<ipython-input-17-5921ca574db2> in <module>()
----> 1 r_dataframe = pandas2ri.py2ri(original_data)

C:\Users\touleem\AppData\Local\Continuum\Anaconda3\lib\functools.py in wrapper(*args, **kw)
    801 
    802     def wrapper(*args, **kw):
--> 803         return dispatch(args[0].__class__)(*args, **kw)
    804 
    805     registry[object] = func

C:\Users\touleem\AppData\Local\Continuum\Anaconda3\lib\site-packages\rpy2\robjects\pandas2ri.py in py2ri_pandasdataframe(obj)
     58             od[name] = StrVector(values)
     59 
---> 60     return DataFrame(od)
     61 
     62 @py2ri.register(PandasIndex)

C:\Users\touleem\AppData\Local\Continuum\Anaconda3\lib\site-packages\rpy2\robjects\vectors.py in __init__(self, obj)
    956                                  " of type VECSXP")
    957 
--> 958             df = baseenv_ri.get("data.frame").rcall(tuple(kv), globalenv_ri)
    959             super(DataFrame, self).__init__(df)
    960

RRuntimeError: Error: cannot allocate vector of size 254 Kb

Please note that before failing, my RAM usage exploded (from 5 GB to 16 GB :O).

Why does it take so much RAM?

PS: I found a workaround using numpy2ri.py2ri and then feeding the complementary information to R to rebuild a data.frame, as sketched below.
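
A minimal sketch of that kind of workaround (only two columns shown; illustrative, not the exact code):

from rpy2.robjects import r, numpy2ri

# Convert the columns one by one; object columns are cast to str first so
# numpy2ri sees a plain string array (NaN becomes the string 'nan').
r_dataframe = r['data.frame'](
    Age=numpy2ri.py2ri(original_data['Age'].values),
    Workclass=numpy2ri.py2ri(original_data['Workclass'].astype(str).values))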

Thanks for your help

Comments (11)

  1. Laurent Gautier

    That seems odd. Maybe R is making unnecessary copies when running df = baseenv_ri.get("data.frame").rcall(tuple(kv), globalenv_ri). Did you try using R to read the CSV?

    # importr translates the '.' in R names to '_' (read.csv -> read_csv)
    from rpy2.robjects.packages import importr

    utils = importr('utils')
    dataf = utils.read_csv(...)
    
  2. Ben Ball

    Hello Laurent,

    I was able to run utils.read_csv OK.

    I actually tried to track this problem down, and reduced it to a specific scenario in my case:

    I have a two-column pandas DataFrame: the first column has dtype int64; the second has dtype object and contains str values. That str column has one missing entry, which ends up as a float NaN (while all the other entries are proper Python str values).

    Feeding this dataframe into pandas2ri.py2ri results in the memory blow-up and crash described in this issue. This only happens above some minimum number of rows; I am not sure of the exact number, but with a small enough number of rows it is fine.

    I fixed my problem by converting the NaN values in my dtype object column to empty strings with DataFrame.fillna, as sketched below.
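
    A minimal sketch of that fix (the column names are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'num': [1, 2, 3],
                       'rsc': ['a', np.nan, 'c']})  # dtype object column with a NaN

    # Replace NaN with an empty string so every entry is a str
    df['rsc'] = df['rsc'].fillna('')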

  3. Laurent Gautier

    Thanks. If you have a self-contained example that reproduces the issue, that is always helpful.

    In the meantime, here are initial notes.

    In my experience, arrays of strings in numpy (pandas relies on numpy for arrays) can be of dtype "<U[0-9]+", or object, or "S" (https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.interface.html):

    >>> import numpy as np
    >>> a_u1 = np.asarray(['a', 'b', 'c'])
    >>> a_u1
    array(['a', 'b', 'c'],
          dtype='<U1')
    >>> a_obj = np.asarray(['a', 'b', 'c'], dtype=object)
    >>> a_obj
    array(['a', 'b', 'c'], dtype=object)
    

    Having a None in the sequence changes the default type to object:

    >>> a = np.asarray(['a', None, 'c'])
    >>> a
    array(['a', None, 'c'], dtype=object)
    

    Now, how does rpy2 handle this?

    >>> from rpy2 import rinterface
    >>> rinterface.initr()
    >>> tuple(rinterface.StrSexpVector(a))
    ('a', 'None', 'c')
    

    Not perfect, as one might expect an NA instead of the string "None", but no float in sight.

    Or maybe you meant that the missing values in your Python/pandas vector are encoded as a float (value NaN)?

    >>> type(np.NaN)
    float
    >>> a = np.asarray(['a', np.NaN, 'c']) 
    >>> a
    array(['a', 'nan', 'c'], 
          dtype='<U3')
    >>> tuple(rinterface.StrSexpVector(a))
    ('a', 'nan', 'c')
    

    Still no float.

    Now trying the numpy converter:

    >>> from rpy2.robjects import numpy2ri
    >>> a = np.asarray(['a', np.NaN, 'c'])
    >>> numpy2ri.py2ro(a) 
    R object with classes: ('array',) mapped to:
    <StrVector - Python:0x7fdde34a6bc8 / R:0x5556b9c7e240>
    ['a', 'nan', 'c']
    >>> a = np.asarray(['a', None, 'c'])
    >>> numpy2ri.py2ro(a)
    (... calling stack printed ...)
    NotImplementedError: Conversion 'py2ri' not defined for objects of type '<class 'NoneType'>'
    

    The error is currently intended: the dtype of the numpy array is object, and I was unsure about what the conversion of a Python None to R should be (as R's NA values do have a type). Now that there is more empirical experience with the Python-R bridge, it would probably make sense to convert to the type of the symbol NA (that is, logical/boolean). Still no float though.
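
    For context, R's plain NA is indeed typed; a quick check through the high-level interface (a sketch, assuming an already initialized embedded R):

    >>> from rpy2.robjects import r
    >>> r('class(NA)')[0]
    'logical'
    >>> r('class(NA_character_)')[0]
    'character'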

    From this I see only two possible paths to the behavior experienced:

    • The None item raises an exception and something at the C level is not handled properly (could be in rpy2 code, or in the Cython part of pandas).
    • An issue with string encoding (the above examples are with Python 3.5 and the environment setting LANG=en_US.UTF-8).
  4. Ben Ball

    Here is my minimal reproduction code:

    import pandas as pd
    import numpy as np
    from rpy2.robjects import pandas2ri
    pandas2ri.activate()

    n = 100000  # renamed from 'len', which shadowed the builtin
    dft = pd.DataFrame(np.random.randint(0, n, size=(n, 1)), columns=['num'])
    rsc = ['test'] * (n - 1)
    rsc.append(np.nan)  # a single missing value in an otherwise all-str column
    dft['rsc'] = rsc

    # This will cause the memory crash:
    rr = pandas2ri.py2ri(dft)
    

    I tried with a length of 7500, and it still crashes. I notice, though, that the function call returns, but then another core on my machine spikes and eats the memory. So I suspect some internal code is doing garbage cleanup, given the delayed way this occurs after the result is returned.

  5. Laurent Gautier

    I only had a little bit of time to investigate this further, and it seems that:

    • garbage collection is not involved
    • C-level on either rpy2, R, pandas, or Python sides is probably not involved.

    This seems to be caused by hard-to-reconcile differences between R vectors and Python arrays.

    from rpy2.robjects.vectors import StrVector
    
    # Explicitly cast `rsc` to an R vector of strings
    dft['rsc'] = StrVector(rsc)
    # This works:
    rr = pandas2ri.py2ri(dft)
    

    Without this, rsc is a Python list, which is converted to an R list, and R's constructor data.frame() treats a list as a collection of columns, as shown below:

    > l <- list("test", "test", NaN)
    > data.frame(l)
      X.test. X.test..1 NaN.
    1    test      test  NaN
    

    Now, when combining this with your column num, this results in an attempt to create a 100,000 × 100,000 data table, that is 10^10 cells (see the example below, and the rough check after it), which might exceed the RAM of your machine: if my back-of-envelope calculation is correct, on a 64-bit architecture your example would require at least 120 GB of RAM.

    > data.frame(a = 1:3, b = l)
      a b..test. b..test..1 b.NaN
    1 1     test       test   NaN
    2 2     test       test   NaN
    3 3     test       test   NaN
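
    A rough floor for that estimate, counting only the 8 bytes per cell taken by the pointers of an R character vector on a 64-bit machine (per-string storage and column names excluded):

    n = 100000
    cells = n * n             # the n-element list becomes n columns, recycled to n rows
    print(cells * 8 / 2**30)  # ~74.5 GiB as a bare minimum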
    

    I am unsure how this would best be handled on the rpy2 side: mapping Python list objects to R list objects seems the most natural choice, and the R constructor data.frame() does accept lists (although the combination leads to what might be unintuitive behavior).

  6. Laurent Gautier

    Revision 194937839721 (branch default, which at the time of writing means the future release 3.0.0) proposes a fix: pandas Series of dtype "O" are converted to R arrays of strings, with a warning issued about it.

    How do people feel about it, especially the part about the warnings? On one hand I like the warnings (a silent conversion might create other issues for users wondering why their arrays of Python objects are turned into strings), but I am unsure whether the warnings might turn out to be obnoxious in practice.

  7. Laurent Gautier

    As no concern about the proposed fix was expressed, I have already backported it to rpy2-2.9.x (revision 50f81309de07). If no problem is discovered during the next week or so, this will be included in rpy2-2.9.2 (with or without warnings, depending on the feedback).

  8. MichaƂ Krassowski

    Thank you Ben for reporting it (I was very mad as my computer kept crashing while converting a simple CSV-created table to R, and it was hard to debug), and thank you Laurent for the quick fix! I can confirm that it works in a simple case, though the problem persists if pandas2ri.activate() is invoked.

    This one works (warnings are emitted, data frame is created as expected):

    import pandas as pd
    from rpy2.robjects import pandas2ri
    from tempfile import NamedTemporaryFile
    
    
    table = """\
    name    position    comments
    A   1   NaN
    B   2   even
    """
    
    with NamedTemporaryFile(mode='w', delete=False) as f:
        f.write(table)
    df_pandas = pd.read_table(f.name)
    print(df_pandas)
    
    df_R = pandas2ri.py2ri(df_pandas)
    print(df_R)
    

    Output:

      name  position comments
    0    A         1      NaN
    1    B         2     even
    UserWarning: Error while trying to convert the column "name". Fall back to string conversion. The error is: Conversion 'py2ri' not defined for objects of type '<class 'pandas.core.series.Series'>'
      (name, str(e)))
    UserWarning: Error while trying to convert the column "position". Fall back to string conversion. The error is: Conversion 'py2ri' not defined for objects of type '<class 'pandas.core.series.Series'>'
      (name, str(e)))
    UserWarning: Error while trying to convert the column "comments". Fall back to string conversion. The error is: Conversion 'py2ri' not defined for objects of type '<class 'pandas.core.series.Series'>'
      (name, str(e)))
      name position comments
    1    A        1      nan
    2    B        2     even
    

    But this one does not:

    import pandas as pd
    from rpy2.robjects import pandas2ri
    from tempfile import NamedTemporaryFile
    
    pandas2ri.activate()    # so it seems this causes problems here
    
    table = """\
    name    position    comments
    A   1   NaN
    B   2   even
    """
    
    with NamedTemporaryFile(mode='w', delete=False) as f:
        f.write(table)
    df_pandas = pd.read_table(f.name)
    print(df_pandas)
    
    df_R = pandas2ri.py2ri(df_pandas)
    print(df_R)
    

    Output:

      name  position comments
    0    A         1      NaN
    1    B         2     even
      name position comments.0 comments.1
    0    A        1        NaN       even
    1    B        2        NaN       even
    

    PS. I do not like the warnings; they are very, mhm... verbose. Could we have a specific subclass (e.g. RPy2Warning or something) so one can easily suppress these warnings? Or use logging.warning instead of warnings.warn? For now I filter them by message text, as below.
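
    A sketch of that stopgap (assuming the messages are plain UserWarning instances, as in the output above):

    import warnings

    # Silence only the column-conversion warnings within this block
    with warnings.catch_warnings():
        warnings.filterwarnings(
            'ignore', message='Error while trying to convert the column')
        df_R = pandas2ri.py2ri(df_pandas)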

  9. Laurent Gautier

    I think that the second call might have to be:

    from rpy2.robjects import conversion
    
    df_R = conversion.converter.py2ri(df_pandas)
    

    I would need a better case against the warnings to consider removing them. Python/numpy/pandas columns/arrays can hold arbitrary "objects", which is not feasible in R data frames.

    The current practical approach is to cast such Python arrays to string arrays in R, as arrays of strings are often arrays of "objects" as far as pandas/numpy are concerned. However, doing so silently can cause rather hard-to-find issues. The trade-off here is to provide immediate convenience at the cost of verbosity. The warnings can be silenced either by writing one's own additional conversion rules (see the example in the documentation) or by writing one's own "pre-conversion" function, as sketched below. The suggestion to have typed warnings to facilitate filtering is quite reasonable though (now tracked as issue #445).
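
    For illustration, a hypothetical "pre-conversion" helper of that kind (not part of rpy2; casting object columns to str is just one possible policy):

    def preconvert(df):
        # Cast the object columns of a pandas DataFrame to str before
        # handing it to rpy2, so the fallback warning is not triggered.
        out = df.copy()
        for name in out.columns:
            if out[name].dtype == object:
                # NaN entries become empty strings rather than the string 'nan'
                out[name] = out[name].fillna('').astype(str)
        return out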

    Otherwise, the conversion system has a number of design flaws and should be rewritten (...some day - this has been mentioned for some time now, but the time / resources to do it never materialized). In the meantime, using localconverter() (demonstrated in the documentation) can provide quite a bit of flexibility for customization while retaining control (custom conversion limited to a block); see the sketch below.
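
    For instance (a sketch following the localconverter pattern, with the same names used elsewhere in this thread):

    from rpy2.robjects import default_converter, pandas2ri
    from rpy2.robjects import conversion

    # The pandas conversion rules are active only inside this block:
    with conversion.localconverter(default_converter + pandas2ri.converter):
        df_R = conversion.converter.py2ri(df_pandas)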

    PS: Your code does not appear to work the way advertised: on my machine, the object df_pandas has shape (2, 1).
