Name conversion rule when there is a "." (dot)

Issue #274 resolved
Laurent Gautier created an issue

R packages increasingly seem to define both the "." and the "_" version of the names of their objects and parameters.

For example, in R 3.2.0 the package stats has both format_perc and format.perc. The current conversion rule for high-level utilities such as importr is to convert "." (an invalid character in Python symbols) into "_", but this obviously breaks when the "_" version is also defined in the R package (fortunately rpy2 checks for this rather than silently overriding).

What would be a good default conversion rule that keeps the automatic conversion working, without the need to specify the conversion manually for such cases?
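
To make the failure concrete, here is a minimal pure-Python sketch of the current rule and of the collision it produces. It does not call rpy2; `naive_translation` and `find_conflicts` are hypothetical names used only for this illustration.

```python
from collections import defaultdict


def naive_translation(rname):
    """Current default rule: replace each '.' with '_'."""
    return rname.replace(".", "_")


def find_conflicts(rnames):
    """Return {python_name: [r_names]} for Python names claimed by several R names."""
    mapping = defaultdict(list)
    for rname in rnames:
        mapping[naive_translation(rname)].append(rname)
    return {py: rs for py, rs in mapping.items() if len(rs) > 1}


# format.perc and format_perc both exist in the stats package of R 3.2.0.
print(find_conflicts(["format.perc", "format_perc", "t.test"]))
# {'format_perc': ['format.perc', 'format_perc']}
```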

Comments (25)

  1. Antony Lee

    I can't say I know much about R but I guess you could map . to _ and _ to __ (two underscores). Obviously this would break if someone uses two consecutive dots or variants thereof in a variable name (so you still need a check) but I doubt this happens in practice.
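
    A minimal sketch of this rule and of the edge case it mentions (plain Python, no rpy2 involved; `escape_translation` is a hypothetical name):

```python
def escape_translation(rname):
    """Map '_' to '__' first, then '.' to '_'."""
    return rname.replace("_", "__").replace(".", "_")


print(escape_translation("format.perc"))  # format_perc
print(escape_translation("format_perc"))  # format__perc

# Edge case: two consecutive dots collide with a single underscore,
# so a conflict check is still needed.
print(escape_translation("a..b") == escape_translation("a_b"))  # True
```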

  2. Laurent Gautier reporter

    I thought about this, but the trouble is that I find it rather unintuitive to have perfectly valid Python symbols change (e.g., foo_bar would become foo__bar).

    I will look into making the name conversion function easy to customize. This will help get this suggestion and others out in the wild, and help settle on a default based on collective experience.

  3. Laurent Gautier reporter

    With c87b5af58bca the conversion can be specified as a callback:

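    from rpy2.robjects.packages import importr

    def my_translation(rname):
        # '_' becomes '__' first, so that a translated '.' stays unambiguous
        return rname.replace('_', '__').replace('.', '_')

    base = importr('base',
                   symbol_r2python = my_translation)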
    
  4. Shantanu Joshi

    Searching for an error led me to this issue. My R version is 3.2 and rpy2 is 2.5.6. Sorry, I'm only reporting the error without giving much thought to the discussion above. I get the following error:

    stats = importr('stats')
    File "/Applications/BSS/lib/python2.7/site-packages/rpy2/robjects/packages.py", line 412, in importr
      version = version)
    File "/Applications/BSS/lib/python2.7/site-packages/rpy2/robjects/packages.py", line 178, in __init__
      self.__fill_rpy2r__(on_conflict = on_conflict)
    File "/Applications/BSS/lib/python2.7/site-packages/rpy2/robjects/packages.py", line 280, in __fill_rpy2r__
      super(SignatureTranslatedPackage, self).__fill_rpy2r__(on_conflict = on_conflict)
    File "/Applications/BSS/lib/python2.7/site-packages/rpy2/robjects/packages.py", line 214, in __fill_rpy2r__
      raise LibraryError(msg)
    rpy2.robjects.packages.LibraryError: Conflict when converting R symbol in the package "stats" to a Python symbol (format.perc -> format_perc while there is already format_perc)

  5. Antony Lee

    Actually, because you only need to avoid conflict between names in a single package and you know all the names that you're going to translate from the beginning, you could map both . and _ to _, except when this would lead to a conflict, in which case you map . to _ and _ to __ (or just . to __), possibly printing a warning in that case. Then symbol2rpython should be a function taking a list of R names as a single argument and returning a mapping of these names to unique Python identifiers.

    Something like:

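    import warnings

    def symbol2rpython(names):
        # first try the simple rule: '.' becomes '_'
        simple = [name.replace(".", "_") for name in names]
        if len(set(simple)) < len(names):
            # at least two R names collide: fall back to escaping '_' as '__'
            warnings.warn("name conflict after translating '.' to '_'; "
                          "translating '_' to '__' instead")
            return {name: name.replace("_", "__").replace(".", "_") for name in names}
        else:
            return dict(zip(names, simple))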
    
  6. Laurent Gautier reporter

    Conditional translation rules are something I considered, but I thought they would be quite confusing when trying to port R code to rpy2 (whether it is _ or __ would depend on the other symbols in the package). I am fine with allowing the callback to do it though.

    Once the callback operates at the level of all symbols, the way the importr() parameter robject_translations is handled has to be considered. That parameter could also disappear, but I currently like to offer the option to fix a small number of translations while writing as little code as a {'foo': 'bar'}.

  7. Laurent Gautier reporter

    @shjoshi The package stats in R-3.2.0 is creating the issue. If you are not using the offending symbol, just pass the option on_conflict="warn" when calling importr, as it will not make any difference to you. This is what I did for the unit tests as an emergency response to the change in R.

  8. Serrano Pereira

    Strangely the workaround with on_conflict="warn" works in the interactive Python interpreter, but not when I use it inside my Django application. It still raises a LibraryError exception despite having set on_conflict="warn".

    This temporary workaround works for me:

    stats = importr('stats', robject_translations={'format_perc': '_format_perc'})
    
  9. Laurent Gautier reporter

    @figure002 - this is rather strange. Does your Django application point to the same rpy2 and R as your ipython does ?

  10. Serrano Pereira

    @lgauthier Yes, I only have one version of R and rpy2 installed. And I run them both in the same virtualenv. But I just found out that the exception is only raised with an Apache + mod_wsgi setup. The exception does not occur if I run it using Django's development server. Should I create a separate issue for this?

  11. Laurent Gautier reporter

    @figure002 While I suspect that it might have more to do with mod_wsgi and Apache configuration, it might also be a bug. Thanks for filing it so it is not lost; we can always mark it as invalid later.

  12. Shantanu Joshi

    stats = importr('stats', robject_translations={'format_perc': '_format_perc'})
    

    works for me too. I'm going to use this fix until something changes.

    Thanks @figure002, @lgautier

  13. Laurent Gautier reporter

    The revision 349a02b95895 is proposing to solve the customization of translations with the following:

    def default_translation(rname):
        return rname.replace('.', '_')
    
    def my_check(symbol_mapping):
        # dict to store the Python symbol -> R symbols mapping causing problems.
        conflicts = dict()
        # iterate over a copy since the mapping is modified in the loop
        for py_symbol, r_symbols in list(symbol_mapping.items()):
            if len(r_symbols) > 1:
                # more than 1 R symbol associated with this Python symbol
                # First we delete it then an alternative translation is tried.
                del symbol_mapping[py_symbol]
                for s in r_symbols:
                    new_py_symbol = s.replace("_", "__").replace(".", "_")
                    symbol_mapping.setdefault(new_py_symbol, []).append(s)
        return conflicts
    
    base = importr('stats',
                   symbol_r2python = default_translation,
                   symbol_check_after = my_check)
    

    This is not as short as @anntzer's wish, but it would have the benefit of keeping the sanity check happening under the hood and of keeping the specification of initial alternative translation rules simpler.

    Opinions ?

  14. Serrano Pereira

    Looks good to me. The downside is that you'd always have to specify the conversion manually for such cases, but in terms of R code portability I can't think of a better solution.

  15. Laurent Gautier reporter

    @figure002 - The manual specification is a major annoyance and should go away, at least for packages shipping with R (aka "recommended packages"). The customization scheme now in rpy2-2.6.0-dev will let everyone experiment easily and propose translation logic to make this happen.

    The alternative to this would be to have package-specific translation schemes for these common R packages shipping with rpy2 and have within importr a switch that picks the right scheme from the package name. At the moment I am thinking this might cause more complication than a relatively simple translation logic that would work for the common R packages.

  16. Laurent Gautier reporter

    What would you think of the following default translation ?

    def default_translation(rname):
        pyname = rname.replace('.', '_')
        if pyname != rname:
            # a translation occurred - indicate this with a suffix
            pyname += "_rpy2tr"
        return pyname
    

    It would solve the case where R has both the . and _ variants of the same name, and the use of a suffix (rather than a prefix) would make it rather natural to use with either an interactive console (such as ipython) or an IDE (since name completion uses the prefix).
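
    Applied to a few names, the proposed rule behaves as follows (a self-contained sketch that repeats the function above; no rpy2 needed):

```python
def default_translation(rname):
    pyname = rname.replace(".", "_")
    if pyname != rname:
        # a translation occurred - indicate this with a suffix
        pyname += "_rpy2tr"
    return pyname


for rname in ("format.perc", "format_perc", "t.test"):
    print(rname, "->", default_translation(rname))
# format.perc -> format_perc_rpy2tr
# format_perc -> format_perc
# t.test -> t_test_rpy2tr
```

    Note that every dotted name gets the suffix, including names such as t.test that have no conflicting counterpart.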

  17. Serrano Pereira

    I really like how one can call stats.t_test for example and I would prefer it to stay like that. Having to call stats.t_test_rpy2tr feels ugly. I don't think package-specific translation schemes are a good solution either (it only makes things more complicated).

    I was wondering though. How often is foo.bar != foo_bar for R objects and parameters in cases where both are available? Or is there an easy way of checking that two objects are equal? If they are synonyms, then it would be safe to silently ignore duplicate names (or otherwise rename them).

  18. Laurent Gautier reporter

    Whenever in a situation where both foo.bar and foo_bar exist they appear to be different (if I am wrong about this, someone should correct me).

    What about the following ? The suffix is only added to the translation of foo.bar and only when there is also foo_bar in the same namespace/R package.

    def default_translation(rname):
        return rname.replace('.', '_')
    
    def default_check(symbol_mapping):
        # dict to store the Python symbol -> R symbols mapping causing problems.
        conflicts = dict()
        # iterate over a copy since the mapping is modified in the loop
        for py_symbol, r_symbols in list(symbol_mapping.items()):
            if len(r_symbols) > 1:
                # more than 1 R symbol associated with this Python symbol
                # First we delete it then an alternative translation is tried.
                del symbol_mapping[py_symbol]
                try:
                    idx = r_symbols.index(py_symbol)
                    # there is an R symbol identical to the proposed Python symbol;
                    # we keep that pair mapped, and change the Python symbol for the
                    # other R symbol(s)
                    for i, s in enumerate(r_symbols):
                        if i == idx:
                            symbol_mapping[py_symbol] = [s,]
                        else:
                            new_py_symbol = py_symbol + '_rpy2tr'
                            symbol_mapping.setdefault(new_py_symbol, []).append(s)
                except ValueError:
                    # I am unsure about what to do at this point.
                    conflicts[py_symbol] = r_symbols
        return conflicts
    
  19. Serrano Pereira

    That seems like a good solution. Existing code would still work and this way one could always override the default translation and/or check.

    I was also thinking about following PEP 0008 for the suffix. The convention is to use trailing underscores to avoid name conflicts. The suffix _rpy2tr makes more sense if you use it for all translations, but now we merely use it to avoid conflicts.

  20. Laurent Gautier reporter

    It would make sense to follow that PEP then.

    I can try having this in over the weekend, but anyone with a pull request before that would also be fine.

  21. Laurent Gautier reporter

    Implemented in revision d77fb78c3cf8. It will be part of rpy2-2.6.0.

    In summary, the changes are:

    • Customization of the translation through 2 functions (symbol_r2python(rname), symbol_check_after(symbol_mapping))

    • default symbol_r2python(rname) translating . into _

    • Use of PEP0008 to resolve simple situations where both foo.bar and foo_bar exist in R (become foo_bar_ and foo_bar in Python).
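
    The combined effect of these three points can be sketched in plain Python. This is only an illustration of the behaviour described above, not the code in rpy2; `resolve` is a hypothetical helper, and it assumes that anything other than the simple two-name case is reported as a conflict.

```python
from collections import defaultdict


def symbol_r2python(rname):
    """Default translation: '.' becomes '_'."""
    return rname.replace(".", "_")


def resolve(rnames):
    """Map R names to Python names (illustrative helper, not rpy2's API).

    When foo.bar and foo_bar both exist, foo_bar keeps its name and the
    translated foo.bar gets a trailing underscore, as suggested by PEP 8.
    """
    groups = defaultdict(list)
    for rname in rnames:
        groups[symbol_r2python(rname)].append(rname)
    result, conflicts = {}, {}
    for py_symbol, r_symbols in groups.items():
        if len(r_symbols) == 1:
            result[r_symbols[0]] = py_symbol
        elif len(r_symbols) == 2 and py_symbol in r_symbols:
            for s in r_symbols:
                result[s] = py_symbol if s == py_symbol else py_symbol + "_"
        else:
            # assumption: no automatic resolution for anything more complex
            conflicts[py_symbol] = r_symbols
    return result, conflicts


mapping, conflicts = resolve(["foo.bar", "foo_bar", "t.test"])
print(mapping)
# {'foo.bar': 'foo_bar_', 'foo_bar': 'foo_bar', 't.test': 't_test'}
```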

    Thanks to all.
