Title: Asynchronous Callbacks and Intrinsic Language Parallelism
Author: Trent Nelson <email@example.com>
Type: Standards Track
The goal of this PEP is to produce a Python interpreter that, through
the introduction of new language constructs and library interfaces,
will allow code to be written that automatically exploits all
available CPU cores.
It proposes to achieve this by introducing two inter-related concepts:
new parallel language constructs and a new asynchronous library.
More and more cores are readily becoming available in both consumer
hardware and enterprise servers. Python does not have a way to
naturally exploit multiple cores. This PEP tries to address this
issue by introducing language support for parallel constructs and a
new asynchronous library.

Relationship to Existing Efforts, PEPs and Design Discussions
=============================================================

Asynchronous IO became a hot topic on python-ideas during September,
October and November of 2012. An output of those discussions was a
new library called tulip, by Guido van Rossum, that focuses on
asynchronous IO. There is also PEP-3153, 'Asynchronous IO Support',
which tackles some of the issues regarding async IO.
With regards to parallelism, the GIL has always been a contentious
issue, and the mail archives are filled with numerous attempts to
remove it (or simply lament its performance impact).
The reason the proposal here is being submitted in its own PEP is
because it specifically addresses the two issues in concert: handling
asynchronous events *and* providing truly concurrent execution of
Python code within a single interpreter.
It's also worth noting that the proposed solution is low-level enough
such that Tulip, Twisted, Tornado, PEP-3153 and any other library can
easily leverage the primitives introduced by this PEP.

Primary - Event Driven
======================

There are two use cases this PEP addresses. The primary use case is
writing software that is event-driven; for example, an SMTP or HTTP
server, which waits for client connections and then "services" them.
Note that the onus is on event-driven, not "server software"; a load-
testing client should be just as able to spin up tens of thousands of
connections to a target server, as that same server is able to handle
them.

The goal of this PEP for the primary use case is to allow Python to
use the underlying system resources as optimally as possible; if
a high-performance, multi-threaded native C/C++ library can handle
65k clients and saturate multiple gigabit links (by utilising all
cores), then Python should be able to come as close to this as
possible (minus the interpreter overhead).

Secondary - Parallel Task/Data Decomposition
============================================

The secondary use case is software that is not primarily driven by
external events or IO (although it still may perform this sort of
work (reading files, connecting to a database, etc)), but may deal
with "embarrassingly parallel" data structures, and thus would also
benefit from parallel language constructs. Such constructs would be
things like native map/reduce support and parallel task execution.
The goal of this PEP for the secondary use case is to provide elegant
language primitives that allow code to be written in such a way that
all available cores will be exploited. This use case differs from the
first in that it will be primarily a single thread of execution that
periodically runs in parallel, only to collapse back down to a single
thread of execution again. Compare this to the event-driven example,
where server/client software revolves around an event loop that will
facilitate parallel execution automatically (based on the receipt of
events), without requiring the programmer to explicitly delineate
"parallel" sections of code.
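As a point of reference, the closest stdlib approximation of this
"fan out, then collapse back" pattern today is concurrent.futures.
Under the current GIL the threads below gain no CPU parallelism --
which is precisely the gap this PEP targets -- but the shape of the
code is the same:

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent "kernels" of work that could run on separate cores.
def word_count(text):
    return len(text.split())

def checksum(data):
    return sum(data) % 256

# Fan out to workers, then collapse back to a single thread of
# execution once both results are available.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(word_count, "the quick brown fox")
    f2 = pool.submit(checksum, b"\x01\x02\x03")
    words, digest = f1.result(), f2.result()

print(words, digest)  # 4 6
```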
Requirements
============

It is important to establish the requirements for such a significant
language change up-front, before enumerating possible solutions. (It
may even be worth separating these requirements out into their own
informational PEP, which would be evaluated and approved independently
of any solutions.) Agreeing on a set of requirements for asynchronous
and parallel primitives allows multiple solutions to be evaluated and
compared on an equal footing.

Proposed solutions, including the one in this PEP, should include a
traceability matrix that maps each requirement to satisfying design
and implementation details/features.
1. The solution must not introduce a performance penalty of more than
4% when running existing Python code that hasn't been converted to
take advantage of the new primitives. (4% has been picked as it
was the overhead introduced by the garbage collector implementation
circa Python 2.0.) This is often referred to in GIL-removal threads
as "performance must not be worse for single-threaded code".
2. The solution should scale linearly in proportion to the number of
cores available. Scale in this context refers to both performance
(throughput, connection concurrency and overall latency) as well as
resource usage (memory overhead).
3. In addition to scaling linearly, there should not be an excessive
performance penalty for code written against the new primitives if
there is only one core available.
4. The solution must not rely on major backwards-incompatible changes
to the public C API, especially with regards to the GIL. That is,
the solution should be compatible with existing extension code that
relies on GIL semantics (such as acquiring and releasing the GIL).
5. The solution may introduce minor changes to the C API that extension
modules will have to accommodate in order to compile against the
target version of Python, even if said module uses none of the new
primitives. The solution should attempt to keep such changes to
the absolute minimum (i.e. adding #defines, possibly calling new
CPython initialization methods when loading, etc).
6. The solution will not "automagically parallelize" existing code.
6.1 Specifically, the solution will not suddenly allow concurrent
execution of threading.Thread() instances. These objects
(and their underlying Python object counterparts) will still
be subject to the GIL.
6.2 Parallelization will only apply to new code written against
the new async/parallel primitives.
7. The new asynchronous and parallel primitives can be used alongside
legacy code without incurring any additional performance penalties.
(Legacy code in this context refers to IO multiplexing via select/
poll/epoll/kqueue and handling "blocking" code by deferring it to
separate threads.)
8. Although the solution will present a platform-agnostic interface to
the new primitives, it shall be designed against the platform with
the best native support for underlying asynchronous and parallel
primitives. This allows such a platform (or platforms) to run at
their full potential. On platforms with lesser support for such
primitives, the solution should mimic the semantics of the interface
using traditional techniques for "simulating" asynchronicity (i.e.
select/poll, blocking threads, etc).
9. The solution should be low-level enough to facilitate easy adoption
by existing libraries such as Twisted and Tornado, yet high-level
enough to be pleasant to work with directly, without the need for
such frameworks.
10. The solution should not incur excessive cognitive overhead; the
average Python day-programmer should be able to easily grok the
core concepts and start writing high-performance code within a
single sitting. It should not expose leaky-abstractions that
require substantial background knowledge in parallel programming
or socket/file multiplexing.
11. The parallel aspect of the solution must offer tangible benefits
over and above those provided by the multiprocessing module, which
already facilitates concurrency through the use of multiple Python
interpreter processes.
12. The solution may introduce new restrictions or constraints against
code written to use the new primitives, subject to the following:
* The constraint is essential for meeting the requirements
outlined in this section (i.e. performance, maintaining
GIL semantics, etc).
* The constraint should be logical to the Python programmer
within the context of asynchronous/parallel execution.
For example, a constraint that says a nonlocal variable
cannot be affected from within a parallel execution context
is reasonable within the context of parallel programming.
Disallowing the use of generators, list comprehensions or
integer arithmetic within a parallel execution context,
however, is not.
* Any violation of the new constraints/restrictions shall be
detected and propagated as exceptions. (If this is not
possible for performance reasons, then the cost of silent
errors needs to be weighed against the performance benefits
on a case-by-case basis.)
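The "simulation" technique alluded to in [R8] is the traditional
readiness-based multiplexing that the new primitives would fall back
to on platforms lacking strong native asynchronous support. A minimal
sketch using the stdlib ``selectors`` module (which wraps select/poll/
epoll/kqueue as available):

```python
import selectors
import socket

sel = selectors.DefaultSelector()  # picks epoll/kqueue/poll/select

# A connected socket pair stands in for a real client connection.
client, server = socket.socketpair()
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

client.send(b"ping")

# The event loop: block until a registered fd is readable, then
# service it -- the "simulated" asynchronicity described in [R8].
data = b""
for key, events in sel.select(timeout=1.0):
    data = key.fileobj.recv(4)

sel.unregister(server)
client.close()
server.close()
print(data)  # b'ping'
```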

Impact of Requirements on Solution Design
=========================================

The following requirements significantly constrain the possible ways a
design could be implemented:
* Negligible performance penalties for existing code (or code
written using the new parallel and asynchronous primitives
but still executing on a single core) [R1, R3, R7].
* No change to opaque GIL semantics (from the perspective of
extension modules) [R4, R5].
These requirements rule out any solutions that attempt to remove the
GIL (in favour of fine-grained locking, for example). Such experiments
have always resulted in code that runs orders of magnitude slower in
a single-threaded context (due to the overhead incurred by locking
and unlocking mutexes).
None of these requirements are new concepts -- every time GIL removal
comes up for discussion, these requirements are cited as reasons why
a given proposal won't be feasible.

threading.Thread() is a Dead End
================================

However, the common trait of past proposals is that they focused on
the wrong level of abstraction: threading.Thread(). Discussions always
revolve around what can be done to automatically make threading.Thread
instances run concurrently.
That line of thinking is a dead end. In order for existing
threading.Thread() code to run concurrently, you'd need to replace
the GIL with fine-grained locking, such that only one thread could
access the GC
fine-grained locking, such that only one thread could access the GC
and memory allocation routines at a time (specifically with regards
to reference counting). And fine-grained locking is a dead end due
to the overhead incurred by single-threaded code.

Formalizing Concurrency Entry and Exit Points
=============================================

This is the rationale behind [R6]: threading.Thread() will not become
magically concurrent, nor will any concurrency be observed on multiple
cores unless the new primitives are explicitly used.
That requirement is significant, because it prescribes the guidelines
and constraints under which true concurrency can be achieved:
* The GIL must be held.
* The new primitives must be used.
[R11] stipulates that the solution must offer tangible benefits over
and above what could be achieved through multiprocessing. In order
to meet this requirement, the solution will need to leverage the most
optimal multicore/threading programming paradigms, such as eliminating
read/write contention for shared data, using lockless data structures
where possible, and reducing context switching overhead incurred when
excessive threads are all vying for attention.
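One of the paradigms mentioned above -- eliminating read/write
contention for shared data -- can be sketched as follows: instead of
many threads updating one lock-protected value, each thread
accumulates privately and the results are merged once at the end.
(This is a hypothetical illustration of the technique, not part of any
proposed API.)

```python
import threading

N_THREADS = 4

# Each thread gets its own slot: no locks and no cross-thread
# contention on the hot path.
partials = [0] * N_THREADS

def worker(slot, chunk):
    local = 0               # thread-private accumulator
    for x in chunk:
        local += x
    partials[slot] = local  # one write per thread, to a distinct slot

items = list(range(100))
chunks = [items[i::N_THREADS] for i in range(N_THREADS)]
threads = [threading.Thread(target=worker, args=(i, chunks[i]))
           for i in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(partials)  # single-threaded merge at the end
print(total)  # 4950
```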
So, the solution must be performant, and it must be achieved within
the confines of holding the GIL.

The Importance of New Language Constraints
==========================================

However, the final requirement, [R12], affords us the ability to
introduce new language constraints within the context of the new
primitives referred to in [R6/7], as long as the constraints are:
- Acceptable within the context of parallel execution.
- Essential for achieving both high-performance and meeting the
other strict guidelines (maintain GIL semantics, etc).
The notion of what's considered an 'acceptable' constraint will
obviously vary from person to person, and will undoubtedly be best
suited to a BDFL-style pronouncement. It's worth noting, though,
that the community's willingness to accept certain constraints will
be directly proportional to the performance improvements said
constraints afford.

For example, a possible constraint may be that Python code executing
within a parallel context is not able to affect any global/nonlocal
variables.

Solution Proposal - Quick Overview
==================================

In order to help solidify some of the concepts being presented, here's
the general overview of the solution the PEP author has in mind. Keep
in mind it's only an example, and it is being presented in order to
stimulate ideas and further discussion.
The essence of the idea is this: in addition to the main interpreter
thread, an additional thread is created for each available CPU core.
These threads idle/wait against a condition variable by default. That
is, when there's nothing to be done in parallel, they do nothing.
The main interpreter thread's behaviour when executing non-parallel code
is identical to how it behaves now when executing normal Python code;
all GIL, memory and GC semantics are preserved.
Because we have introduced new language primitives, we can detect
entry and exit from parallel execution contexts at the interpreter
level (i.e. in PyEval_EvalFrameEx), via specific op-codes, for example.
Upon detecting entry to a parallel context, the main interpreter
thread preps any relevant data structures, then signals to all other
threads to begin execution. The other threads are essentially frame
execution pipelines, performing the same role as the main
PyEval_EvalFrameEx loop, only customized to run in parallel.
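The idle-until-signalled behaviour described above can be sketched
with stdlib threads and a queue (whose internals are built on exactly
the condition variables mentioned earlier). This illustrates the
structure only; it is not the proposed implementation:

```python
import queue
import threading

tasks = queue.Queue()  # workers block here when there is no work
results = []
results_lock = threading.Lock()

def frame_pipeline():
    # Stand-in for a per-core frame execution pipeline: wait for
    # work, execute it, repeat; a None sentinel means shut down.
    while True:
        fn = tasks.get()
        if fn is None:
            break
        r = fn()
        with results_lock:
            results.append(r)

workers = [threading.Thread(target=frame_pipeline) for _ in range(4)]
for w in workers:
    w.start()

# "Entering a parallel context": the main thread hands out work...
for i in range(8):
    tasks.put(lambda i=i: i * i)

# ...then waits for completion ("exiting the parallel context").
for _ in workers:
    tasks.put(None)
for w in workers:
    w.join()

print(sorted(results))  # [0, 1, 4, 9, 16, 25, 36, 49]
```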
It will be helpful to visualize the sort of Python code that these
frame pipelines will execute. Remember that there are two use cases
this PEP targets: event-driven clients/servers, and parallel data/task
decomposition.

In either use case, the idea is that you should be able to separate
out the kernels of logic that can be performed in parallel (or
concurrently, in response to an external event). This is analogous to
GPU programming, where one loads a single "kernel" of logic (a C
function) into hundreds or thousands of GPU hardware threads, which
then perform the action against different chunks of data. The logic
is self-contained; once primed with input data, it doesn't rely on
anything else to produce its output data.
A similar approach would be adopted for Python. The kernel of logic
would be contained within a Python callable. The input would simply
be the arguments passed to the callable. The optional output would be
whatever the callable returns, which would be utilized for parallel
task/data decomposition algorithms (i.e. map/reduce). For the event
driven pattern, in lieu of a return value, the callable will invoke
one of many "parallel safe" methods in order to move the processing of
the event to the next state/stage. Such methods will always return
immediately, enqueuing the action to be performed in the background.
It is these callables that will be executed concurrently by the other
"frame pipeline" threads once the main thread signals that a parallel
context has been entered.
The concurrent frame pipelines (modified versions of
PyEval_EvalFrameEx) simply churn away executing these callables.
Execution of the callable is atomic in the sense that, once started,
it will run 'til completion.

Thus, presuming there are available events/data, all cores will be
occupied executing the frame pipelines.
This will last for a configurable amount of time. Once that time
expires, no more callables will be queued (or the threads will sleep
via some other mechanism), and the main thread will wait 'til all
parallel execution has completed.
Upon detecting this, it performs any necessary "post-parallel"
cleanup, releases the GIL, acquires it again, and then, presuming
there is more parallel computation to be done, starts the whole
cycle again.

Thus, the program flow can be viewed as a constant switch between
single-threaded execution (indistinguishable from current interpreter
behaviour) and explicit parallel execution, where all cores churn away
on a modified PyEval_EvalFrameEx that has been primed with a
"parallel" context.
During this parallel execution, the interpreter will periodically take
breaks and switch back to the single-thread model, release the GIL, do
any signal handling and anything else it needs to in order to play
nice with legacy code (and extensions). Once it acquires the GIL
again, it can un-pause the parallel execution, and all cores continue
executing the parallel callables.
This cycle repeats as often as necessary.
Exiting a parallel context is similar to the periodic breaks; the only
difference is that "parallel finalization" methods might also need to
be run (like the "reduce" part of map/reduce).
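That finalization step mirrors the familiar map/reduce shape: a
parallel map phase, followed by a single-threaded reduce once the
parallel context has been exited. Approximated with today's stdlib
(again, the threads below gain no CPU parallelism under the current
GIL; only the structure is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def kernel(x):
    return x * x

# "Parallel" phase: kernels run concurrently against chunks of data.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(kernel, range(10)))

# "Parallel finalization": a single-threaded reduce over the partials.
total = reduce(operator.add, partials)
print(total)  # 285
```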

The Role of Asynchronous Callbacks
==================================

The vast majority of discussion so far has focused on the parallel
aspect of this PEP. That is, introducing new primitives that allow
Python code to run concurrently across multiple cores as efficiently
as possible.

We've also established that threading.Thread will not be the mechanism
used for achieving concurrency. The solution must still play nice
with existing code and extensions.

The Proposed Solution
=====================

* Create the main interpreter thread and bind it to whatever
processor core it's currently running on.
* For each additional core found, create and bind another thread.
* Refactor the role of PyEval_EvalFrameEx and supporting functions.
* No change from current behavior in single-threaded case.
* Detection of a parallel context entry point (opcode?) results in
prepping every other thread's frameex pipeline with the available
work, signalling all threads to begin their frameex pipelines, then
waiting for work to complete before transitioning back to
single-threaded execution.
* This repeats every time parallel constructs are encountered.
* Other cores are idle when the program is not in a parallel context.
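The "create and bind a thread per core" step above can be sketched
with ``os.sched_setaffinity``. This call is Linux-only; on other
platforms it is absent, binding would require a platform-specific
mechanism, and the sketch degrades to plain unbound threads:

```python
import os
import threading

def bind_to_core(core_id):
    # Pin the calling thread to a single core (Linux only; at the
    # syscall level, pid 0 means "the calling thread").
    try:
        os.sched_setaffinity(0, {core_id})
    except (AttributeError, OSError):
        pass  # non-Linux or restricted environment: stay unbound

# Use the cores this process is actually allowed to run on, if known.
try:
    cores = sorted(os.sched_getaffinity(0))
except AttributeError:
    cores = list(range(os.cpu_count() or 1))

# One additional thread per available core, each binding itself.
threads = [threading.Thread(target=bind_to_core, args=(c,))
           for c in cores]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(threads) == len(cores))  # True
```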

Parallelizing the Interpreter
=============================

For now, we'll ignore the asynchronous requirements and just focus on
the concurrency. threading.Thread() is already off the table, so we
can ignore that aspect completely.
Without threading.Thread(), we need to come up with a way to run code
concurrently within the confines of parallel contexts. We'll achieve
this using a new type of thread, which we'll call ``IThread`` for
now.

Copyright
=========

This document has been placed in the public domain.
vim:set ts=8 sw=4 sts=4 et syntax=rst tw=70: