pypy-dmalcolm / pypy / doc / theory.rst

Full commit

Techniques used in PyPy

Abstract Interpretation

Abstract Interpretation is a general technique which consists of an interpreter that follows the bytecode instructions of a user program, just like a normal interpreter does, but with abstract objects instead of concrete ones. Remember that in PyPy this is done by using alternate object spaces with the same bytecode interpreter main loop.

As a theoretical example, the most abstract object space would be the one manipulating the most abstract objects that you could imagine: they are all equivalent, because we have abstracted away any information about the object. There is actually only one of them left, and we could call it "the object". In Python terms, an AbstractObjectSpace could use None for all its wrapped objects. Any operation between wrapped objects gives None again as the wrapped result -- there is nothing else it could give anyway. So when you have said that the add method of AbstractObjectSpace takes None and None and returns None you have said everything.

The point of such an object space is for example to check the bytecode. The bytecode interpreter will really run your bytecode, just with completely abstract arguments. If there is no problem then you are sure that the bytecode is valid. You could also record, during this abstract interpretation, how much the stack ever grows; that would give you a fool-proof method of computing or checking the co_stacksize argument of a code object. (There are subtleties which I won't describe here, but that's the basic idea.)

Typically, however, abstract object spaces are a (little) bit less abstract, still maintaining a minimal amount of information about the objects. For example, a wrapped object could be represented by its type. You then define the object space's add to return int when the two arguments are int and int. That way, you abstractedly call a function with the input argument's types and what the interpreter will do is a type inference. (Here also there are subtle problems, even besides the remark that integer operations can overflow and actually return longs in a real Python implementation.)

As an example of more abstract object spaces you have the ones with finite domain, i.e. with a finite number of different possible wrapped objects. For example, you can use True and False as wrapped values to denote the fact that the object is, respectively, a non-negative integer or anything else. In this way you are doing another kind of type inference that just tells you which variables will only ever contain non-negative integers.

In PyPy, the FlowObjSpace uses the abstract interpretation technique to generate a control flow graph of the functions of RPython programs.

In its more formal definition, Abstract Interpretation typically considers abstract objects that are organized in a lattice: some of these objects are more (or less) abstract than others, in the sense that they represent less (or more) known information; to say that this forms a lattice essentially means that any two abstract objects have well-defined unions and intersections (which are again abstract objects).


A "multimethod" is the generalization of the OOP notion of "method". Theoretically, a method is a "message name" and signature attached to a particular base class, which is implemented in the class or its subclasses. To do a "method call" means to send a message to an object, using a message name and actual arguments. We call "message dispatch" the operation of finding which actual implementation is suitable for a particular call. For methods, a message is dispatched by looking up the class of the "self" object, and finding an implementation in that class, or in its base classes, in a certain order.

A multimethod is a message name and signature that can have implementations that depend not only on the class of the first "self" argument, but on the class of several arguments. Because of this we cannot use Python's nice model of storing method implementations as functions, in the attributes of the class.

Here is a common implementation of multimethods: they are instances of a specific MultiMethod class, and the instances are callable (there is a __call__ operator on MultiMethod). When a MultiMethod is called, a dispatch algorithm is used to find which, among the registered implementations, is the one that should be called; this implementation is then immediately called. The most important difference with normal methods is that the MultiMethod object to call is no longer syntactically attached to classes. In other words, whereas a method is called with obj.somemethod(args), a multimethod is called much like a function, e.g. dosomething(obj1, obj2, obj3...). You have to find the MultiMethod object dosomething in some namespace; it is no longer implicitly looked up in the namespace of the "self" object.

PyPy contains two different implementations of multimethods: a quite general one written in RPython for the purposes of the StdObjSpace, and a short two-arguments-dispatching one used internally by the annotator.