rpy2 fails to decode UTF-8 characters

Issue #537 resolved
Alex Hübner created an issue

When evaluating a character vector whose strings contain non-ASCII characters, rpy2 returns a different vector than R does natively.

When running it natively in R, I get the following behaviour:

> c("ä", "ö", "ü")
[1] "ä" "ö" "ü"

When running the same using rpy2, I get the following characters returned:

In [3]: robjects.r("c('ä', 'ö', 'ü')")
Out[3]:
R object with classes: ('character',) mapped to:
['Ã¤', 'Ã¶', 'Ã¼']

The locale ("LC_ALL") is set to 'en_US.UTF-8'.

I can reproduce this behaviour on both Linux and Mac OS X with Python v3.7.2, R v3.5.2 and rpy2 v3.0.1.

Is there a way to enforce UTF-8 encoding for R objects processed by rpy2?

Comments (5)

  1. Laurent Gautier

    I can reproduce.

    This seems to be local to r.__call__() as calling the R function directly works as intended:

    >>> robjects.baseenv['c']('ä', 'ö', 'ü')
    R object with classes: ('character',) mapped to:
    ['ä', 'ö', 'ü']
    

  2. Laurent Gautier

    The string reaching R for evaluation seems to preserve the encoding:

    >>> robjects.r("Encoding(c('ä', 'ö', 'ü'))")
    R object with classes: ('character',) mapped to:
    ['UTF-8', 'UTF-8', 'UTF-8']
    

    This would mean that the issue is either in the conversion when mapping the result back to Python, or in displaying the object. The latter seems to be the case (see below): your code is evaluated correctly and the resulting object did not lose its encoding, but its content is shown incorrectly.

    >>> res = robjects.r("c('ä', 'ö', 'ü')")
    >>> robjects.r.Encoding(res)
    R object with classes: ('character',) mapped to:
    ['UTF-8', 'UTF-8', 'UTF-8']
    

    The __repr__() method iterates over the elements of the array, and extracting an element wrongly assumes Latin1 (which is a bit of a problem). The snippet below demonstrates this and provides the basis for a workaround until the issue is fixed:

    >>> res[0]
    'Ã¤'
    >>> res[0].encode('Latin1').decode('utf-8')
    'ä'
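
A minimal sketch of that workaround as a reusable helper, for anyone stuck on an affected version. The function name `repair_mojibake` is hypothetical and not part of rpy2. The buggy extraction is simulated in plain Python (UTF-8 bytes read as Latin-1) so the example runs without R.

```python
def repair_mojibake(text: str) -> str:
    """Undo a UTF-8 string that was wrongly decoded as Latin-1.

    Hypothetical helper for illustration; not part of rpy2.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the string was most likely decoded
        # correctly in the first place: return it unchanged.
        return text


# Simulate the faulty extraction: UTF-8 bytes interpreted as Latin-1.
garbled = [s.encode("utf-8").decode("latin-1") for s in ("ä", "ö", "ü")]
fixed = [repair_mojibake(s) for s in garbled]
print(garbled)
print(fixed)
```

With rpy2, the same helper could be applied to each element, for example `[repair_mojibake(s) for s in res]`. Note that the round trip is a heuristic: a string that legitimately contains a sequence such as "Ã¤" would be altered as well.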
    

  3. Laurent Gautier

    The problem appears to be here:
    https://bitbucket.org/rpy2/rpy2/src/2216d3f74e1ad7d4f5355fba7f281dc3d2fad891/rpy2/rinterface_lib/conversion.py#lines-94

    The wrong integer/enum is provided for UTF-8 conversion. This is a silly mistake, but an easy fix:

    >>> import rpy2.robjects as robjects
    >>> robjects.r("c('ä', 'ö', 'ü')")
    R object with classes: ('character',) mapped to:
    ['Ã¤', 'Ã¶', 'Ã¼']
    >>> import rpy2.rinterface_lib.conversion
    >>> rpy2.rinterface_lib.conversion._CE_UTF8 = 1
    >>> robjects.r("c('ä', 'ö', 'ü')")
    R object with classes: ('character',) mapped to:
    ['ä', 'ö', 'ü']
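
To illustrate why a single wrong constant produces this symptom, here is a rough sketch in plain Python. The enum values mirror R's `cetype_t` (`CE_NATIVE = 0`, `CE_UTF8 = 1`, `CE_LATIN1 = 2`, `CE_BYTES = 3`); the `CODECS` table and `decode_r_bytes` are hypothetical and are not rpy2's actual code. Passing the Latin-1 constant is only an assumption for demonstration, since the value rpy2 used before the fix is not stated here.

```python
from enum import IntEnum


class CEType(IntEnum):
    """Encoding constants mirroring R's cetype_t."""
    NATIVE = 0
    UTF8 = 1
    LATIN1 = 2
    BYTES = 3


# Hypothetical mapping from an R encoding constant to a Python codec.
CODECS = {CEType.UTF8: "utf-8", CEType.LATIN1: "latin-1"}


def decode_r_bytes(raw: bytes, ce: int) -> str:
    """Decode bytes from an R string using the codec selected by `ce`."""
    # Assumption: anything unmapped is treated as UTF-8 (UTF-8 locale).
    return raw.decode(CODECS.get(ce, "utf-8"))


raw = "ä".encode("utf-8")                      # bytes as stored by R
right = decode_r_bytes(raw, CEType.UTF8)       # correct constant
wrong = decode_r_bytes(raw, CEType.LATIN1)     # wrong constant, for illustration
print(right, wrong)
```

The bytes held by R are identical in both calls; only the constant used to pick the decoder differs, which is consistent with `Encoding()` reporting UTF-8 while the Python side shows garbled characters.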
    
