they can use the existing ``threading`` module, with its associated GIL
and the complexities of real multi-threaded programming (locks,
deadlocks, races, etc.), which make this solution less attractive. The
most attractive alternative for most developers is to rely on one of the
various multi-process solutions that are outside the scope of the core
Python language. All of them require a major restructuring of the
program and often need extreme care and extra knowledge to use them.
The aim of this series of proposals is to research and implement
Transactional Memory in PyPy. This is a technique that recently came to
the forefront of the multi-core scene. It promises to offer multi-core CPU
usage in a single process.
In particular, by modifying the core of the event systems
mentioned above, we will enable the use of multiple cores without the
user needing to explicitly use the ``threading`` module.
The first proposal was launched near the start of 2012 and has covered
the fundamental research, up to the point of getting a first
version of PyPy working in a very roughly reasonable state (after
collecting about USD$27'000, which is little more than half of the money
asked; hence the present second call for donations).
We now propose fixing the remaining issues to obtain a
really good GIL-free PyPy (described in `goal 1`_ below). We
will then focus on the various new features needed to actually use multiple
cores without explicitly using multithreading (`goal 2`_ below), up to
and including adapting some existing framework libraries, for
example Twisted, Tornado, Stackless, or gevent (`goal 3`_ below).
This is a call for financial help in implementing a version of PyPy able
to use multiple processors in a single process, called PyPy-TM. The work
will be done by Armin Rigo and Remi Meier and possibly others.
We currently estimate the final performance goal to be a slow-down of
25% to 40% from the current non-TM PyPy; i.e. running a fully serial
application would take between 1.25 and 1.40x the time it takes in a
regular PyPy. This goal has
been reached already in some cases, but we need to make this result more
broadly applicable. We feel confident that we can reach this goal more
generally: the performance of PyPy-TM running any suitable
application should scale linearly or close-to-linearly with the number
of processors. This means that starting with two cores, such
applications should perform better than a regular PyPy. (All numbers
presented here are comparing different versions of PyPy which all have
the JIT enabled. A "suitable application" is one without many
conflicts; see `goal 2`_.)
You will find below a sketch of the `work plan`_. If more money than
requested is collected, then the excess will be entered into the general
funds of the PyPy project.
A possible alternative would be to replace the
Software Transactional Memory (STM) library currently used inside PyPy
with a much smaller Hardware Transactional Memory (HTM) library based on
hardware features and running on Haswell-generation processors. This
has been attempted by Remi Meier recently. However, it seems that it
fails to scale as we would expect it to: the current generation of HTM
processors is limited to run small-scale transactions. Even the default
transaction size used in PyPy-STM is often too much for HTM; and
reducing this size increases overhead without completely solving the
problem. `Virtualizing Transactional Memory`_, a paper from ISCA 2005,
focuses on this restriction of HTM generally. A CPU with support for
the virtual memory described in this
paper would certainly be better for running PyPy-HTM.
Another issue is sub-cache-line false conflicts (conflicts caused by two
independent objects that happen to live in the same cache line, which
is usually 64 bytes). This is in contrast with the current PyPy-STM,
which doesn't have false conflicts of this kind at all and might thus be
ultimately better for very-long-running transactions. We are not aware of
published research discussing the issue of sub-cache-line false conflicts.
Note that right now PyPy-STM has false conflicts within the same object,
e.g. within a list or a dictionary; but we can easily do something
about it (see `goal 2`_). Also, it might be possible in PyPy-HTM to
arrange objects in memory ahead of time so that such conflicts are very
rare; but we will never get a rate of exactly 0%, which might be
required for very-long-running transactions.
.. _`Virtualizing Transactional Memory`: http://pages.cs.wisc.edu/~isca2005/papers/08A-02.PDF
Why do it with PyPy instead of CPython?
While there have been early experiments on Hardware Transactional Memory
with CPython (`Riley and Zilles (2006)`__, `Tabba (2010)`__), there has
been none in the past few years. To the best of our knowledge,
the closest is an attempt using `Haswell on the
Ruby interpreter`__. None of these attempts tries to do the same using
Software Transactional Memory. We would nowadays consider it possible
to adapt our stmgc-c7 library for CPython, but it would be a lot of
work, starting from changing the reference-counting scheme. PyPy is
better designed to be open to this kind of research.
However, the best argument from an objective point of view is probably
that PyPy has already implemented a Just-in-Time compiler. It is thus
starting from a better position in terms of performance, particularly
for the long-running kind of programs that we target here.
.. __: http://sabi.net/nriley/pubs/dls6-riley.pdf
.. __: http://www.cs.auckland.ac.nz/~fuad/parpycan.pdf
PyPy-TM will be slower than judicious usage of existing alternatives,
based on multiple processes that communicate with each other in one way
or another. The counter-argument is that TM is not only a cleaner
solution: there are cases in which it is not feasible to organize (or
retrofit) an existing program into the particular format needed for the
alternatives. In particular, small quickly-written programs don't need
the additional baggage of cross-process communication; and large
programs can sometimes be almost impossible to turn into multi-process
versions. By contrast, TM only requires local changes in a few places;
the rest of the program should work without changes.
Platforms other than x86-64 Linux
The current solution depends on having a
huge address space available. Porting to any 32-bit
architecture would quickly run into the limitation of 2GB or 4GB of
address space. The way TM works right now would further divide this
limit by N+1, where N is the number of segments. It might be possible
to create partially different memory views for multiple threads that
each access the same range of addresses; but this would likely require
changes inside the OS. We didn't investigate this so far.
The current version relies
heavily on Linux- and clang-only features. We believe it is a suitable
restriction: a lot of multi- and many-core servers commonly available
are nowadays x86-64 machines running Linux. Nevertheless, non-Linux
solutions appear to be possible as well. OS/X (and likely the various
BSDs) seems to handle ``mmap()`` better than Linux does, and can remap
individual pages of an existing mapping to various pages without hitting
a limit of 65536 like Linux. Windows might also have a solution,
although we didn't measure it yet; but first we would need a 64-bit
Windows PyPy, which has not seen much active support.
We will likely explore the OS/X path (as well as the Windows path if
Win64 support grows in PyPy), but this is not part of the current
proposal.
It might be possible to adapt the work done on x86-64 to the 64-bit
ARMv8 as well, but we didn't investigate this so far.