rnumpy API

rpy2.robjects is in some ways a bit unwieldy to use. rnumpy is an attempt to provide a more comfortable and ergonomic environment for programmers. I like it, anyway... so I'm dumping it up here for feedback and we can see where things go from here.

Concept

The name is a bit misleading. rnumpy does depend on numpy, because that's the de facto standard API for vector-like objects in Python, and therefore a natural way to interact with R vectors. But it is more than a numpy bridge; other enhancements include:

  • More natural/smarter passing of Python objects to R (Python dicts map to R lists, Python lists have their contents autodetected and are mapped to boolean/integral/numeric/character/list types as appropriate, numpy arrays are mapped correctly by default, etc.)
  • Convenient access to NA objects
  • Convenience API for the common differences between R and Python identifiers
  • Easy assignment to variables in the R global workspace
  • R functions mapped to Python have their help pages mapped to docstrings
  • Support for multidimensional and slice-based indexing of R objects (in R style, e.g. mydataframe[:, "Foo"] works.)
  • Support for R plotting from the interactive Python prompt.
  • Python tracebacks for R errors automatically include R tracebacks.
  • "Method-style" access to object attributes
  • repr on R wrapped objects matches the output of the R print() function, for interactive convenience.
  • Controlled API for modifying R arrays directly and managing copies of large data sets.
  • IPython integration
  • ...probably some other stuff I'm forgetting.
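
The list autodetection mentioned in the first bullet can be pictured with a small pure-Python sketch. This is not rnumpy's actual code: the function name detect_r_type and the exact promotion rules (including what happens to an empty list) are assumptions made for illustration only.

```python
def detect_r_type(values):
    """Guess an R vector type for a Python list (hypothetical rules)."""
    if not values:
        return "list"  # assumption: nothing to inspect, so fall back to a list

    def is_int(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, int) and not isinstance(v, bool)

    if all(isinstance(v, bool) for v in values):
        return "logical"
    if all(is_int(v) for v in values):
        return "integer"
    if all(is_int(v) or isinstance(v, float) for v in values):
        return "numeric"
    if all(isinstance(v, str) for v in values):
        return "character"
    return "list"  # mixed contents: keep them as a generic R list


print(detect_r_type([1, 2, 3]))
print(detect_r_type([1, 2.5]))
print(detect_r_type([1, "a"]))
```

Checking bool before int matters here, since otherwise [True, False] would be classified as integers.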

API

If you're used to rpy2.robjects, then the basic rnumpy API is very familiar:

from rnumpy import *  # gives 'r', 'rcopy', 'rarray', 'rzeros', 'rones'

The r object is about what you'd expect -- it supports evaluation of arbitrary code, passed as a string:

r("c(1, 2, 3)")

Accessing R functions and variables through indexing syntax:

r["c"](1, 2, 3)

And (unlike robjects) assigning to those variables:

r["x"] = 2
r("x * 2")

And as a convenience, you can also access variables and functions through an attribute syntax:

r.c(1, 2, 3)

As a further convenience, minor mangling of names is done (only when using the attribute syntax). In particular, trailing underscores are stripped:

r.class_([1, 2, 3])  # --> "integer"

Other underscores are mapped to periods:

r.as_numeric([1, 2, 3])

And "dollar" is short for "$":

r.dollar(my_data_frame, "x")
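
Taken together, the three mangling rules amount to something like the following sketch. The helper name mangle_name is hypothetical, and how rnumpy treats edge cases such as leading underscores is not stated here, so treat this as an illustration of the rules above rather than the real implementation.

```python
def mangle_name(py_name):
    """Map a Python attribute name to an R identifier (illustrative only)."""
    if py_name == "dollar":
        return "$"
    # Strip trailing underscores first, then turn the rest into periods
    return py_name.rstrip("_").replace("_", ".")


print(mangle_name("class_"))      # class
print(mangle_name("as_numeric"))  # as.numeric
print(mangle_name("dollar"))      # $
```

The trailing-underscore rule is what lets you reach R names that are Python keywords, such as class.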

There are two basic kinds of Python wrappers for R objects. Your average R object is returned to Python as a generic "RWrapper" object. An RWrapper object doesn't have a lot of API. The main features are:

  • repr(obj) gives the output of R's print() function on the object, handy when working at the interactive Python prompt.
  • obj.NA is an object representing an NA of the same type as obj.
  • len() and simple unidimensional indexing work.
  • obj.r gives a magic attribute with some extra features:
    • obj.r[1, "Foo"] does R-style indexing (and accepts slices of the form ":", which means "everything in this axis" and is similar to leaving the argument blank in R, and "1:3", which means [1, 2, 3], similar to R but different from Python.)
    • obj.r(1, "Foo", drop=False) also does R-style indexing, but can take keyword arguments (like R's [], but unlike Python's []).
    • obj.r.fn(arg1, arg2) acts like a method call -- it's the same as r.fn(obj, arg1, arg2). So for example, you can do mymatrix.r.nrow(), or myframe.r.dollar("x").
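
The "1:3" convention is the part most likely to trip up Python users, since Python slices are zero-based and exclude the end point. A rough sketch of the translation follows; the helper name r_indices_from_slice is hypothetical, and the handling of open-ended slices and steps is an assumption, not documented rnumpy behaviour.

```python
def r_indices_from_slice(s, length):
    """Expand a Python slice object into 1-based, inclusive R indices."""
    if s.step is not None:
        raise ValueError("steps are not handled in this sketch")
    start = 1 if s.start is None else s.start      # ":" starts at the first element
    stop = length if s.stop is None else s.stop    # ":" runs to the last element
    return list(range(start, stop + 1))            # R ranges include the end point


print(r_indices_from_slice(slice(1, 3), 10))   # R-style: [1, 2, 3]
print(list(range(10))[1:3])                    # Python-style: [1, 2]
print(r_indices_from_slice(slice(None), 4))    # ":" selects everything: [1, 2, 3, 4]
```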

(There are also RClosures, which are the same as RWrappers but can also be called. The calling syntax allows keyword arguments, and does the same underscore-munging on keyword argument names as the magic r.attribute code, e.g. Python 'r.fn(a_b=True)' is the same as R 'fn(a.b=TRUE)'.)

A few specific sorts of R objects map to a different type entirely, the RArray: specifically, vectors of type logical, integer, numeric, and complex. These are mapped to Python as a special sort of numpy array. An RArray can be used exactly like a numpy array (with broadcasting arithmetic functions, multidimensional indexing, all that sort of thing), with the following additions:

  • It has an additional .NA attribute (see above)
  • It has an additional .r attribute (see above)
  • repr(myarray) gives a numpy-style representation; if you want to see the R-style pretty-printed form of an RArray at the prompt, type 'myarray.r' and hit enter.
  • It is read-only by default (see next section).

There's also the convenient function 'rcopy', which takes a Python object, copies it into R space using the standard conversion rules, and then gives you a wrapped R object back again. 'rcopy(x)' is basically identical to calling 'r.identity(x)', but less silly.

Modifying RArrays in place

Sometimes it can be very useful to avoid unnecessary copies of data structures. Like when those data structures are giant arrays.

R's views on data copying, however, are peculiar, and quite different from Python's. In Python, it's assumed that if you want to make a copy of an object you will request that explicitly, and all objects can be modified in place. In R, it's assumed that whenever you pass an object to another function, you want to give that other function a copy of the object. However, because actually copying objects all the time is inefficient, they have a workaround -- instead of copying the object every time it gets passed to another function, they basically copy the object every time you modify it.

Well, sometimes they can avoid this. But mostly, in R, every time you type <- you end up making a new copy of the objects involved.

And then when you have a Python<->R bridge, this causes a problem, because R *expects* that this is how assignment works, and you can break things if you just go modifying objects willy-nilly without copying them first.
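
The Python half of this contrast can be seen without R at all. The snippet below uses plain Python lists; the R behaviour is only described in the comments.

```python
x = [1, 2, 3]

# Python: assignment binds a second name to the same object, no copy is made
alias = x
alias[0] = 10          # x changes too

# R: after `y <- x; y[1] <- 10`, x is left unchanged, as if y had been a copy
# all along. In Python you have to ask for that copy explicitly:
r_like = list(x)
r_like[1] = 20         # x is not affected

print(x)        # [10, 2, 3]
print(r_like)   # [10, 20, 3]
```

Mutating an array that R still believes is unshared is the aliasing case above, which is exactly what R code does not expect.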

Therefore, the current setup in rnumpy is this:

  • RWrapper objects are read-only.
  • RArray objects default to read-only (this is enforced through the numpy WRITEABLE flag).
    • But, if you want to mutate an RArray object, you can call '.unseal()' on it. This might have to make a copy, but after it is unsealed you can modify it all you want. The usual idiom here is:

myarray = r.dosomething()
myarray = myarray.unseal()

If .unseal() has to make a copy, then the old array remains sealed, and the new unsealed copy is returned. If it doesn't have to make a copy, then it returns the same array, now unsealed. Either way, you just capture its return value directly. However, you cannot pass an unsealed array to R. (If you could, then we would be back in the situation where R had seen your array and your array was also unsealed, and unspeakable things might happen.) You have to seal it again first, by calling .seal(), which never makes a copy.
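
For a feel of what "sealed" means at the numpy level, here is a sketch using only numpy, not rnumpy. Flipping the WRITEABLE flag by hand stands in for sealing, and .copy() stands in for the case where .unseal() has to copy; neither is claimed to be how rnumpy implements these calls.

```python
import numpy as np

sealed = np.arange(5)
sealed.flags.writeable = False     # comparable to an RArray fresh from R

try:
    sealed[0] = 99
    write_failed = False
except ValueError:                 # numpy refuses to write to a read-only array
    write_failed = True

unsealed = sealed.copy()           # a fresh copy is writeable again
unsealed[0] = 99

print(write_failed)                # True
print(sealed[0], unsealed[0])      # 0 99
```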

The whole setup is a bit obnoxious, and I have a (really baroque and even more obnoxious, but only for me as the implementor) plan to make assignment Just Work in most cases. But this setup is safe, and perhaps will make you pay more attention to when you are making copies and when you are not...

rnumpy also provides as a convenience the functions rones, rzeros, and rarray. They work like their non-r counterparts in numpy (i.e., rones makes an array of all-1 of the specified type, rzeros makes an array of all-0 of the specified type, and rarray converts a Python list structure into an RArray), and return unsealed arrays by default. This is useful for allocating an array directly in R space, and then writing whatever you want into it directly without first constructing an object in Python and then copying it over.

Bugs

It's not a bug, it's a...well, they are... Limitations!

Well, okay, they're bugs. Check them out, write them down, send them to me, you know the drill.
