+Title: Asynchronous Callbacks and Intrinsic Language Parallelism
+Author: Trent Nelson <firstname.lastname@example.org>
+The goal of this PEP is to produce a Python interpreter that, through
+the introduction of new language constructs and library interfaces,
+will allow code to be written that automatically exploits all
+available CPU cores.
+It proposes to achieve this by introducing two inter-related concepts:
+new parallel language constructs and a new asynchronous library.
+More and more cores are readily becoming available in both consumer
+hardware and enterprise servers. Python does not have a way to
+naturally exploit multiple cores. This PEP tries to address this
+issue by introducing language support for parallel constructs and
+a supporting asynchronous library.
+Relationship to Existing Efforts, PEPs and Design Discussions
+=============================================================
+Asynchronous IO became a hot topic on python-ideas during September,
+October and November of 2012. One outcome of those discussions was
+tulip, a new async IO library by Guido van Rossum. There is also
+PEP-3153, 'Asynchronous IO Support', which tackles some of the
+issues regarding async IO.
+With regards to parallelism, the GIL has always been a contentious
+issue, and the mail archives are filled with numerous attempts to
+remove it (or simply lament its performance impact).
+The reason the proposal here is being submitted in its own PEP is
+because it specifically addresses the two issues in concert: handling
+asynchronous events *and* providing truly concurrent execution of
+Python code within a single interpreter.
+It's also worth noting that the proposed solution is low-level enough
+such that Tulip, Twisted, Tornado, PEP-3153 and any other library can
+easily leverage the primitives introduced by this PEP.
+There are two use cases this PEP addresses. The primary use case is
+writing software that is event-driven; for example, an SMTP or HTTP
+server, which waits for client connections and then "services" them.
+Note that the emphasis is on "event-driven", not "server software";
+a load-testing client should be just as able to spin up tens of
+thousands of connections to a target server as that same server is
+able to handle them.
+The goal of this PEP for the primary use case is to allow Python to
+utilize the underlying system resources as efficiently as possible; if
+a high-performance, multi-threaded native C/C++ library can handle
+65k clients and saturate multiple gigabit links (by utilizing all
+cores), then Python should be able to come as close to this as
+possible (minus the interpreter overhead).
+The secondary use case is software that is not primarily driven by
+external events or IO (although it may still perform such work:
+reading files, connecting to a database, etc.), but instead deals
+with "embarrassingly parallel" data structures, and thus would also
+benefit from parallel language constructs. Such constructs would be
+things like native map/reduce support and parallel task execution.
+The goal of this PEP for the secondary use case is to provide elegant
+language primitives that allow code to be written in such a way that
+all available cores will be exploited. This use case differs from the
+first in that it will be primarily a single thread of execution that
+periodically runs in parallel, only to collapse back down to a single
+thread of execution again. Compare this to the event-driven example,
+where server/client software revolves around an event loop that will
+facilitate parallel execution automatically (based on the receipt of
+events), without requiring the programmer to explicitly delineate
+"parallel" sections of code.
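The kind of "embarrassingly parallel" workload described here can
already be expressed with the stdlib's process-backed
``concurrent.futures`` module; the proposed constructs would aim to
make this pattern implicit within a single interpreter. A minimal
sketch for contrast (the work function and input sizes are arbitrary
stand-ins, not part of this proposal):

```python
# How an "embarrassingly parallel" map is expressed today, for
# contrast with the native constructs this PEP proposes.
# ProcessPoolExecutor sidesteps the GIL by using multiple OS
# processes, at the cost of pickling overhead and per-process memory.
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # Stand-in for per-item CPU-bound work.
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cpu_bound, [10, 100, 1000]))
    print(results)   # [285, 328350, 332833500]
```

A native primitive would remove the process boundary (and the
associated serialization cost) that this workaround requires.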
+It is important to establish the requirements for such a significant
+language change up-front, before enumerating possible solutions. (It
+may even be worth separating these requirements out into their own
+informational PEP, which would be evaluated and approved independently
+of any solutions.) Agreeing on a set of requirements for asynchronous
+and parallel primitives allows multiple solutions to be evaluated
+against each other on equal footing.
+Proposed solutions, including the one in this PEP, should include a
+traceability matrix that maps each requirement to satisfying design
+and implementation details/features.
+1. The solution must not introduce a performance penalty of more than
+ 4% when running existing Python code that hasn't been converted to
+ take advantage of the new primitives. (4% has been picked as it
+ was the overhead introduced by the garbage collector implementation
+ circa Python 2.0.) This is often referred to in GIL-removal threads
+ as "performance must not be worse for single-threaded code".
+2. The solution should scale linearly in proportion to the number of
+ cores available. Scale in this context refers to both performance
+ (throughput, connection concurrency and overall latency) as well as
+ resource usage (memory overhead).
+3. In addition to scaling linearly, there should not be an excessive
+ performance penalty for code written against the new primitives if
+ there is only one core available.
+4. The solution must not rely on major backwards-incompatible changes
+ to the public C API, especially with regards to the GIL. That is,
+ the solution should be compatible with existing extension code that
+ relies on GIL semantics (such as acquiring and releasing the GIL).
+5. The solution may introduce minor changes to the C API that extension
+ modules will have to accommodate in order to compile against the
+ target version of Python, even if said module uses none of the new
+ primitives. The solution should attempt to keep such changes to
+ the absolute minimum (i.e. adding #defines, possibly calling new
+ CPython initialization methods when loading, etc).
+6. The solution will not "automagically parallelize" existing code.
+ 6.1 Specifically, the solution will not suddenly allow concurrent
+ execution of threading.Thread() instances. These objects
+ (and their underlying Python object counterparts) will still
+ be serialized by the GIL, exactly as they are today.
+ 6.2 Parallelization will only apply to new code written against
+ the new async/parallel primitives.
+7. The new asynchronous and parallel primitives can be used alongside
+ legacy code without incurring any performance penalties. (Legacy
+ code in this context refers to IO multiplexing using select/poll/
+ epoll/kqueue and handling "blocking" code by deferring it to
+ threading.Thread() instances.)
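The "legacy" readiness-based pattern referred to in requirement 7 can
be sketched with the stdlib ``selectors`` module (a portable wrapper
over select/poll/epoll/kqueue); the socketpair here is an illustrative
stand-in for a real listening socket and client:

```python
# Sketch of the legacy IO-multiplexing pattern: register file
# descriptors with a selector and service whichever are ready.
import selectors
import socket

sel = selectors.DefaultSelector()
left, right = socket.socketpair()   # stand-in for server + client
left.setblocking(False)
right.setblocking(False)

sel.register(right, selectors.EVENT_READ)

left.send(b'ping')                  # makes `right` readable
for key, events in sel.select(timeout=1):
    data = key.fileobj.recv(1024)
    print(data)                     # b'ping'

sel.close()
left.close()
right.close()
```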
+8. Although the solution will present a platform-agnostic interface to
+ the new primitives, it shall be designed against the platform with
+ the best native support for underlying asynchronous and parallel
+ primitives. This allows such a platform (or platforms) to run at
+ their full potential. On platforms with lesser support for such
+ primitives, the solution should mimic the semantics of the interface
+ using traditional techniques for "simulating" asynchronicity (i.e.
+ select/poll, blocking threads, etc).
+9. The solution should be low-level enough to facilitate easy adoption
+ by existing libraries such as Twisted and Tornado, yet high-level
+ enough to be pleasant to work with directly, without the need for
+ an intermediary framework.
+10. The solution should not incur excessive cognitive overhead; the
+ average day-to-day Python programmer should be able to grok the
+ core concepts and start writing high-performance code within a
+ single sitting. It should not expose leaky abstractions that
+ require substantial background knowledge in parallel programming
+ or socket/file multiplexing.
+11. The parallel aspect of the solution must offer tangible benefits
+ over and above those provided by the multiprocessing module, which
+ already facilitates concurrency through the use of multiple Python
+ processes.
+12. The solution may introduce new restrictions or constraints against
+ code written to use the new primitives, subject to the following:
+ * The constraint is essential for meeting the requirements
+ outlined in this section (i.e. performance, maintaining
+ backwards compatibility, etc).
+ * The constraint should be logical to the Python programmer
+ within the context of asynchronous/parallel execution.
+ For example, a constraint that says a nonlocal variable
+ cannot be affected from within a parallel execution context
+ is reasonable within the context of parallel programming.
+ Disallowing the use of generators, list comprehensions or
+ integer arithmetic within a parallel execution context,
+ on the other hand, would not be reasonable.
+ * Any violation of the new constraints/restrictions shall be
+ detected and propagated as exceptions. (If this is not
+ possible for performance reasons, then the cost of silent
+ errors needs to be weighed against the performance benefits
+ on a case-by-case basis.)
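There is existing precedent in CPython for requirement 12's
"detect and raise" policy: mutating a dict while iterating over it is
detected and raised as a RuntimeError rather than permitted to corrupt
state silently. Illustrative only; this is not part of the proposal:

```python
# Existing precedent for detecting constraint violations at runtime:
# CPython raises RuntimeError if a dict changes size mid-iteration.
d = {'a': 1, 'b': 2}
try:
    for key in d:
        d['c'] = 3          # structural mutation during iteration
except RuntimeError as e:
    print(e)                # dictionary changed size during iteration
```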
+Impact of Requirements on Solution Design
+=========================================
+The following requirements significantly constrain the possible ways a
+design could be implemented:
+ * Negligible performance penalties for existing code (or code
+ written using the new parallel and asynchronous primitives
+ but still executing on a single core) [R1, R3, R7].
+ * No change to opaque GIL semantics (from the perspective of
+ extension modules) [R4, R5].
+These requirements rule out any solutions that attempt to remove the
+GIL (in lieu of fine-grained locking, for example). Such experiments
+have always resulted in code that runs orders of magnitude slower in
+a single-threaded context (due to the overhead incurred by the
+fine-grained locking primitives).
+None of these requirements are new concepts -- every time GIL removal
+comes up for discussion, these requirements are cited as reasons why
+a given proposal won't be feasible.
+The common trait of past proposals is that they focused on the wrong
+level of abstraction: threading.Thread(). Discussions always revolve
+around what can be done to automatically make threading.Thread()s
+run concurrently. That line of thinking is a dead end. In order
+for existing threading.Thread() code to run concurrently, you'd
+need to replace the GIL with
+fine-grained locking, such that only one thread could access the GC
+and memory allocation routines at a time (specifically with regards
+to reference counting). And fine-grained locking is a dead end due
+to the overhead incurred by single-threaded code.
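The reference-count traffic that makes fine-grained locking so
expensive is easy to observe: every binding and unbinding of a name
touches the target object's shared refcount field, which under
fine-grained locking would each require an atomic or locked update.
The stdlib's ``sys.getrefcount`` makes this visible (CPython-specific
behavior):

```python
# Every name binding bumps an object's reference count; each bump
# would need synchronization under a fine-grained locking scheme.
# sys.getrefcount reports the count (inflated by one for its own
# temporary argument reference).
import sys

obj = object()
baseline = sys.getrefcount(obj)
alias = obj                              # one more reference
print(sys.getrefcount(obj) - baseline)   # 1
del alias
print(sys.getrefcount(obj) - baseline)   # 0
```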
+This is the rationale behind [R6]: threading.Thread() will not become
+magically concurrent, nor will any concurrency be observed on multiple
+cores unless the new primitives are explicitly used.
+That requirement is significant, because it prescribes the guidelines
+and constraints under which true concurrency can be achieved:
+ * The GIL must be held.
+ * The new primitives must be used.
+At a high level, the proposed design works as follows:
+ * Create the main interpreter thread and bind it to whatever
+ processor core it's currently running on.
+ * For each additional core found, create and bind another thread.
+ * Refactor the role of PyEval_EvalFrameEx (in ceval.c) and
+   supporting functions.
+ * No change from current behavior in single-threaded case.
+ * Detection of a parallel context entry point (opcode?) results in
+ prepping every other thread's frameex pipeline with the available
+ work, signalling all threads to begin their frameex pipelines, then
+ waiting for the work to complete before transitioning back to
+ single-threaded execution.
+ * This repeats every time parallel constructs are encountered.
+ * Other cores are idle when the program is not in a parallel context.
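The fork/join control flow sketched above can be modeled with stdlib
threads. This is illustrative only: stdlib threads still serialize on
the GIL, so it demonstrates the control flow (prep per-worker
pipelines, signal, wait, collapse back to one thread), not true
multi-core execution; the worker/queue names are ours, not proposed
API:

```python
# Model of the fork/join flow: the main thread preps per-worker
# "pipelines", distributes work, waits for completion, and then
# collapses back to single-threaded execution. (Control flow only;
# stdlib threads still serialize on the GIL.)
import os
import queue
import threading

NCORES = os.cpu_count() or 2

def worker(inbox, outbox):
    # Each worker drains its pipeline, idling between parallel
    # contexts (here: blocking on the queue until a sentinel).
    while True:
        item = inbox.get()
        if item is None:        # sentinel: shut down
            return
        outbox.put(item * item)

inboxes = [queue.Queue() for _ in range(NCORES)]
results = queue.Queue()
threads = [threading.Thread(target=worker, args=(q, results))
           for q in inboxes]
for t in threads:
    t.start()

# "Parallel context entry point": distribute work round-robin, then
# wait for all of it before resuming single-threaded execution.
work = list(range(8))
for i, item in enumerate(work):
    inboxes[i % NCORES].put(item)
collected = sorted(results.get() for _ in work)
print(collected)                # [0, 1, 4, 9, 16, 25, 36, 49]

for q in inboxes:               # collapse back: stop the workers
    q.put(None)
for t in threads:
    t.join()
```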
+Parallelizing the Interpreter
+=============================
+For now, we'll ignore the asynchronous requirements and just focus on
+the concurrency. threading.Thread() is already off the table, so we
+can ignore that aspect completely.
+Without threading.Thread(), we need to come up with a way to run code
+concurrently within the confines of parallel contexts. We'll achieve
+this using a new type of thread, which we'll call an ``IThread``.
+Copyright
+=========
+This document has been placed in the public domain.
+..
+   Local Variables:
+   mode: indented-text
+   indent-tabs-mode: nil
+   sentence-end-double-space: t
+   fill-column: 70
+   End:
+ vim:set ts=8 sw=4 sts=4 et syntax=rst tw=70: