Eliminate wrapper C code with CFFI?

Issue #298 resolved
lampkld
created an issue

Hello,

This is a package that IMO is critical to the pydata ecosystem so thank you for maintaining it!

I noticed that the wrapper C code causes some installation errors and is likely more difficult to maintain.

RCall.jl was able to avoid any wrapper code by using Julia's ccall facilities. I think cffi offers similar functionality, making it easy to call R's C API from pure Python. If so, would it be possible to rewrite the parts of rpy2 currently written in C in Python, using cffi?

If it is technically feasible but we are short on dev time, it's possible we can make a big community push with various stakeholders. I really think this would be a boon for end users, expand the pool of possible maintainers (the set of people who know Python and R is larger than the set who know R, Python, and C), and make things easier for the current devs.

What do you think?

Thanks.

CC @Dav Clark @Laurent Gautier

Comments (6)

  1. Laurent Gautier

    This topic has come up on and off, leaving a dotted trail scattered across email correspondence and maybe posts on the mailing list or comments on issues.

    Before going into the not-so-short answer that this deserves, I'd like to point out that there might be four issues reported here, three of them nested:

    1. Is it possible to eliminate wrapper C code, with cffi proposed as the way to do it ?
    2. The wrapper C code is causing installation errors
    3. The wrapper C code is difficult to maintain
    4. How can the pool of contributors be expanded

    2/ The wrapper C code is causing installation errors

    I'll start with this because it is the easiest to "close". With the exception of Windows (discussed further in the comment to 4/), the errors seem to me to be a relatively small proportion of cases (given the number of downloads of the source package recorded on PyPI, and the number of distros packaging binaries or compilation recipes) and are almost always caused either by less usual installations of R (which the elimination of wrapper C code would not completely solve) or by an incomplete environment for compiling (e.g., no gcc, no headers for Python, etc.). Any specific issue should be reported. Otherwise, I'd need to see contradicting data to believe that this is a significant problem.

    3/ The wrapper C code is difficult to maintain

    This is not an unreasonable statement to make. There are about 9,000 lines of C code (according to https://www.openhub.net/p/rpy2/analyses/latest/languages_summary), split across several .c files glued together at build time with #include statements. Not the best layout. If anyone has ideas about how to better organize C extensions for Python, I am all ears. Besides this, that source has evolved to accommodate changes in both R and Python (the jump to Python 3 being rather significant) and, while we are striving to clean the code, there are almost certainly small parts for older R or Python versions rotting. On the other hand, the code was written to be testable and there are quite a few unit tests. This has allowed me to refactor code relatively painlessly, or to find undocumented changes in R's C API almost instantly.

    4/ How can the pool of contributors be expanded

    This is a question that has surfaced several times. The most obvious and earliest manifestation of it has been around Windows support, after I announced that I would drop support for Windows (rpy2-2.1.x) until the community of users provided resources to continue it. While there were a few patches and offers to contribute unofficial builds, and Windows users owe much to these individuals (as does the rpy2 project - we can still at least say that there is unofficial support for Windows thanks to them !), we never reached a sustained, diversified stream of contributions, Windows or not (https://www.openhub.net/p/rpy2/contributors/summary). While there were suggestions that the license or the host for the code repository were limiting factors, I mostly saw a manifestation of groupthink (BSD and GitHub were demanded respectively) from people who were not contributors, and I did not accede to the demands. Would it be better for rpy2 to do it ? Is it better for Open Source to favor a diversity of licenses and hosting solutions ? I won't be sure without a test vs control experiment, I suppose.

    Moving to a foreign function interface might allow more people to tinker (hey, no need to compile anything !) but it will likely not remove the need to know R's C interface as well as Python (it might remove the requirement to know Python's C interface but I am not certain of this). More on this in the answer to 1/.

    1/ Is it possible to eliminate wrapper C code, with cffi proposed as the way to do it ?

    Simplifying the C extension code has been on the to-do list for quite a while now (several years - a good proportion of the existence of the project). Since you are citing RCall, this answer should mention that Rif (https://github.com/lgautier/Rif.jl) predated RCall, and why it exists. Rif was created 2 years before RCall (and at a time when getting anything out of Julia meant running nightly builds). It was written with the primary intent of rewriting the C wrapper in rpy2 as a minimal C library to interface with R that would facilitate the building of language-specific bindings (through a C API or a foreign function interface). Working this out with Julia was both a way to learn it and a way to help identify general (non Python-specific) needs. Regrettably, the RCall author has never answered about why he created a competing project rather than joining efforts (*), and I found myself exploring other aspects related to rpy2, such as translating the expressivity that R packages such as dplyr and magrittr are offering to R users (I am pretty excited about what will be in rpy2-2.7.0 : https://bitbucket.org/rpy2/rpy2/src/2f2f9242e5bd1df971091ff11993f94e0427b975/doc/notebooks/dplyr.md?at=default)

    (*: after pushing, I got list-addressed arguments about BSD-above-all. Lacking the interest to engage in a ping contest with someone who probably has a lot of free time, and having practical alternatives to Julia, I asked then-contributors to Rif to look at RCall).

    Getting initially toward a smaller C-written code base and the use of a foreign function interface is still on the table, and I would agree that cffi is a prime candidate for the job (although ctypes or Cython can be seen as options).
    The next step would be to see whether completely eliminating all wrapping C is possible (I am thinking that R (finally) moving from its NAMED system to reference counting will make things easier/nicer).
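
    As a rough illustration of what calling C through a foreign function interface looks like from pure Python, here is a minimal sketch using ctypes, one of the options named above, chosen because it ships with the standard library. It is not rpy2 code: it assumes a POSIX system and calls `strlen` from the C library instead of anything in libR, so that it runs without R installed.

```python
import ctypes

# Assumption: POSIX system. CDLL(None) exposes the symbols already loaded
# into the running process, which include the C standard library.
libc = ctypes.CDLL(None)

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C function strlen() on a bytes object, with no compiled wrapper."""
    return libc.strlen(data)


print(c_strlen(b"rpy2"))
```

    No compilation step is involved, but the C signature still has to be written down by hand, which is why knowledge of the target C API remains necessary.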

    However, the following factors have contributed, and are contributing, to let other issues have a higher priority:

    • Waiting for R to move to reference counting
    • Performance. To my knowledge the cost of calls through ctypes or cffi is higher than the cost of calling Python C extensions. Maybe this is not an important factor.
    • PyPy not yet supporting Python 3.4 (rpy2's target Python is Python 3.4, with backward compatibility with Python 2.7 mostly handled by six). rpy2 does not have the resources to multiply Python version-dependent features.
    • Crashing Python/rpy2. It is currently believed to be difficult to crash the embedded R through rpy2 (and terminate the host Python process with a segfault). Moving to a foreign function interface will create many opportunities to do so, particularly when tinkering (see answer to 4/).
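
    The performance and crash-risk points can be probed with the standard library alone. The sketch below is an illustration under stated assumptions, not a benchmark of rpy2: it assumes a POSIX system, uses the C library's `abs` as a stand-in for an R API function, and the helper names `time_call` and `call_is_rejected` are made up for this example. It times a ctypes call against the builtin `abs`, and shows that declaring `argtypes` lets ctypes reject a wrong-typed argument in Python before it reaches C.

```python
import ctypes
import timeit

# Assumption: POSIX system, so CDLL(None) gives access to the C library.
libc = ctypes.CDLL(None)

# Declaring the signature (int abs(int)) is what lets ctypes check arguments.
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int


def time_call(func, number=10_000):
    """Average wall-clock seconds per call of func(-5)."""
    return timeit.timeit(lambda: func(-5), number=number) / number


def call_is_rejected(value):
    """True if ctypes refuses to pass `value` to the C function."""
    try:
        libc.abs(value)
    except ctypes.ArgumentError:
        return True
    return False


builtin_cost = time_call(abs)    # builtin, implemented in C inside CPython
ffi_cost = time_call(libc.abs)   # same operation through the FFI layer

print(f"builtin abs: {builtin_cost:.2e} s/call")
print(f"ctypes  abs: {ffi_cost:.2e} s/call")
print("wrong-typed call rejected:", call_is_rejected("not an int"))
```

    The timings depend on the machine and the Python build, so they should be measured rather than assumed. The argument check only covers types: a declared signature that does not match the real C function would still pass invalid data to C.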

    This turns out to be a rather long answer. The issue contained important questions.

  2. Laurent Gautier

    Finally some progress on this, and hopefully significantly significant progress:

    • The plan is to have rpy2-3.0.0 use cffi, in ABI mode if no compilation is available and in API mode if it is. The benefits of API mode are speed and fewer possible segfaults when the R API / library differs between the R binary used and the one targeted by rpy2.

    • I am sharing progress in a branch until it is ready to merge to default: https://bitbucket.org/rpy2/rpy2/branch/cffi

    • At the time of writing the rinterface layer is almost complete. Running pytest --cov=rpy2.rinterface --cov=rpy2.robjects tests/rinterface/ shows 86% coverage and test results: 22 failed, 179 passed, 11 skipped
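
    The "API mode if available, ABI mode otherwise" plan can be sketched as a simple import-time check. This is only an illustration of the fallback logic, not how rpy2 implements it: the module name `_rinterface_cffi_api` is hypothetical, standing for an extension module that cffi would produce at build time in API mode.

```python
import importlib.util


def pick_cffi_mode(api_module="_rinterface_cffi_api"):
    """Return "API" if a compiled module is importable, else "ABI".

    `api_module` is a hypothetical top-level module name used for
    illustration only.
    """
    # In cffi's API mode, a C extension is compiled ahead of time and then
    # imported like any other module. If it cannot be found, fall back to
    # ABI mode, where the shared library is opened at run time.
    if importlib.util.find_spec(api_module) is not None:
        return "API"
    return "ABI"


print(pick_cffi_mode())
```

    Checking for the compiled module rather than for a compiler keeps the decision cheap at import time: whether compilation was possible was already settled when the package was installed.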

  3. Laurent Gautier

    Quick update:

    • the port of the low-level interface (rpy2.rinterface) to cffi is mostly done for the ABI mode (86% test coverage, >90% of the tests passing). Ironing out the last wrinkles and the handful of failing tests results in apparently slower progress, but behind the scenes rinterface is being rewritten and the API tweaked to make it better.

    • the port of the high-level interface (rpy2.robjects) is in progress. Coverage is not great (~60%), but that is what it already is with the current latest release of rpy2 (coverage was not calculated before). Well over half of the unit tests are passing, but there remain segfaults with some of the failing ones. I suspect that the segfaults share the same cause but I have not identified it yet.

  4. Laurent Gautier

    Quick update: iterative improvements of the new rpy2.rinterface while working out adaptation of rpy2.robjects. The unit test coverage for the rinterface level is shown below (97% of the unit tests are passing):

    Selection_020.png
