There are two use cases this PEP addresses. The primary use case is
writing software that is event-driven; for example, an SMTP or HTTP
server, which waits for client connections and then "services" them.
cores), then Python should be able to come as close to this as
possible (minus the interpreter overhead).
+Secondary - Parallel Task/Data Decomposition
The secondary use case is software that is not primarily driven by
external events or IO (although it still may perform this sort of
work (reading files, connecting to a database, etc)), but may deal
the new async/parallel primitives.
7. The new asynchronous and parallel primitives can be used alongside
- legacy code without incurring any performance penalties. (Legacy
- code in this context refers to IO multiplexing using select/poll/
- epoll/kqueue and handling "blocking" code by deferring it to
+ legacy code without incurring any additional performance penalties.
+ (Legacy code in this context refers to IO multiplexing via select/
+ poll/epoll/kqueue and handling "blocking" code by deferring it to
8. Although the solution will present a platform-agnostic interface to
Impact of Requirements on Solution Design
The following requirements significantly constrain the possible ways a
design could be implemented:
comes up for discussion, these requirements are cited as reasons why
a given proposal won't be feasible.
-The common trait of past proposals is that they focused on the wrong
-level of abstraction: threading.Thread(). Discussions always revolve
-around what can be done to automatically make threading.Thread()s run
+threading.Thread() is a Dead End
+However, the common trait of past proposals is that they focused on
+the wrong level of abstraction: threading.Thread(). Discussions always
+revolve around what can be done to automatically make threading.Thread
+instances run concurrently.
That line of thinking is a dead end. In order for existing
threading.Thread() code to run concurrently, you'd need to replace the
GIL with
to reference counting). And fine-grained locking is a dead end due
to the overhead incurred by single-threaded code.
+Formalizing Concurrency Entry and Exit Points
This is the rationale behind [R6]: threading.Thread() will not become
magically concurrent, nor will any concurrency be observed on multiple
cores unless the new primitives are explicitly used.
* The new primitives must be used.
+[R11] stipulates that the solution must offer tangible benefits over
+and above what could be achieved through multiprocessing. In order
+to meet this requirement, the solution will need to leverage the most
+effective multicore/threading programming paradigms, such as eliminating
+read/write contention for shared data, using lockless data structures
+where possible, and reducing context switching overhead incurred when
+excessive threads are all vying for attention.
+So, the solution must be performant, and that performance must be
+achieved within the confines of holding the GIL.
+The Importance of New Language Constraints
+However, the final requirement, [R12], affords us the ability to
+introduce new language constraints within the context of the new
+primitives referred to in [R6/7], as long as the constraints are:
+ - Acceptable within the context of parallel execution.
+ - Essential for achieving both high-performance and meeting the
+ other strict guidelines (maintain GIL semantics, etc).
+The notion of what's considered an 'acceptable' constraint will
+obviously vary from person to person, and will undoubtedly be best
+suited to a BDFL-style pronouncement. It's worth noting, though,
+that the community's willingness to accept certain constraints will
+be directly proportional to the performance improvements said
+For example, a possible constraint may be that Python code executing
+within a parallel context is not able to affect any global/nonlocal
+Solution Proposal - Quick Overview
+In order to help solidify some of the concepts being presented, here's
+the general overview of the solution the PEP author has in mind. Keep
+in mind it's only an example, and it is being presented in order to
+stimulate ideas and further discussion.
+The essence of the idea is this: in addition to the main interpreter
+thread, an additional thread is created for each available CPU core.
+These threads idle/wait against a condition variable by default. That
+is, when there's nothing to be done in parallel, they do nothing.
+The main interpreter thread's behaviour when executing non-parallel code
+is identical to how it behaves now when executing normal Python code;
+all GIL, memory and GC semantics are preserved.
+Because we have introduced new language primitives, we can detect
+entry and exit from parallel execution contexts at the interpreter
+level (i.e. at CFrame_EvalEx), via specific op-codes, for example.
+Upon detecting entry to a parallel context, the main interpreter
+thread preps any relevant data structures, then signals to all other
+threads to begin execution. The other threads are essentially frame
+execution pipelines, performing the same role as the main
+CFrame_EvalEx method, only customized to run in parallel.
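The idle/wake handshake described above can be sketched in today's Python, purely as an illustration. The `FramePipelines` class and its `enter_parallel` and `close` methods are hypothetical names invented for this sketch and are not part of the proposal. Because ordinary threads still contend for the GIL, the sketch shows only the condition-variable signalling, not real multicore execution.

```python
import os
import threading


class FramePipelines:
    """Hypothetical stand-in for the per-core worker threads."""

    def __init__(self, num_workers=None):
        self._cond = threading.Condition()
        self._work = []        # (callable, argument) pairs awaiting execution
        self._pending = 0      # callables queued or still running
        self._results = []
        self._shutdown = False
        count = num_workers or os.cpu_count() or 1
        self._threads = [
            threading.Thread(target=self._run, daemon=True) for _ in range(count)
        ]
        for thread in self._threads:
            thread.start()

    def _run(self):
        while True:
            with self._cond:
                # Idle here until the main thread signals work or shutdown.
                self._cond.wait_for(lambda: self._work or self._shutdown)
                if not self._work:
                    return  # shutdown requested and nothing left to run
                func, arg = self._work.pop()
            try:
                value = func(arg)
            except Exception as exc:  # record the error so the sketch cannot hang
                value = exc
            with self._cond:
                self._results.append(value)
                self._pending -= 1
                self._cond.notify_all()

    def enter_parallel(self, func, inputs):
        """Main thread: prep the work, wake the workers, wait for them."""
        items = [(func, item) for item in inputs]
        with self._cond:
            self._work.extend(items)
            self._pending += len(items)
            self._cond.notify_all()
            self._cond.wait_for(lambda: self._pending == 0)
            results, self._results = self._results, []
        return results  # in completion order, not input order

    def close(self):
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()
        for thread in self._threads:
            thread.join()
```

A caller would create one `FramePipelines` object, call `enter_parallel(func, inputs)` wherever a parallel context is entered, and call `close()` at exit. Between calls the worker threads simply block inside `wait_for`.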
+It will be helpful to visualize the sort of Python code that these
+frame pipelines will execute. Remember that there are two use cases
+this PEP targets: event-driven clients/servers, and parallel data/task
+In either use case, the idea is that you should be able to separate
+out the kernels of logic that can be performed in parallel (or
+concurrently, in response to an external event). This is analogous to
+GPU programming, where one loads a single "kernel" of logic (a C
+function) into hundreds or thousands of GPU hardware threads, which
+then perform the action against different chunks of data. The logic
+is self-contained; once primed with input data, it doesn't rely on
+anything else to produce its output data.
+A similar approach would be adopted for Python. The kernel of logic
+would be contained within a Python callable. The input would simply
+be the arguments passed to the callable. The optional output would be
+whatever the callable returns, which would be utilized for parallel
+task/data decomposition algorithms (e.g. map/reduce). For the event
+driven pattern, in lieu of a return value, the callable will invoke
+one of many "parallel safe" methods in order to move the processing of
+the event to the next state/stage. Such methods will always return
+immediately, enqueuing the action to be performed in the background.
+It is these callables that will be executed concurrently by the other
+"frame pipeline" threads once the main thread signals that a parallel
+context has been entered.
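To make the two kinds of callable concrete, here is a minimal sketch. All names in it (`count_words`, `run_map_reduce`, `Connection`, `on_line_received`) are hypothetical and chosen only for illustration. The "map" step is written serially, standing in for whatever the frame pipelines would do in parallel.

```python
import queue
from functools import reduce


# Use case: parallel task/data decomposition.
# The kernel depends only on its argument and returns its output.
def count_words(chunk):
    return len(chunk.split())


def run_map_reduce(kernel, chunks):
    # Serial stand-in for the parallel "map" step.
    partials = [kernel(chunk) for chunk in chunks]
    # The "reduce" step combines the per-chunk results.
    return reduce(lambda a, b: a + b, partials, 0)


# Use case: event-driven client/server.
class Connection:
    """Hypothetical object exposing a 'parallel safe' method."""

    def __init__(self):
        self.outbox = queue.Queue()

    def send(self, data):
        # Returns immediately; the actual I/O is assumed to be
        # performed later, in the background, from this queue.
        self.outbox.put_nowait(data)


def on_line_received(conn, line):
    # No return value: the callable moves the event to its next
    # stage by invoking a parallel-safe method instead.
    conn.send(line.upper())
```

In both cases the callable is self-contained: everything it needs arrives through its arguments, which is what would allow many instances to run side by side.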
+The concurrent frame pipelines (modified versions of CFrame_EvalEx)
+simply churn away executing these callables. Execution of the
+callable is atomic in the sense that, once started, it will run 'til
+Thus, presuming there are available events/data, all cores will be
+occupied executing the frame pipelines.
+This will last for a configurable amount of time. Once that time
+expires, no more callables will be queued (or the threads will sleep
+via some other mechanism), and the main thread will wait 'til all
+parallel execution has completed.
+Upon detecting this, it performs any necessary "post-parallel"
+cleanup, releases the GIL, acquires it again, and then, presuming
+there is more parallel computation to be done, starts the whole
+Thus, the program flow can be viewed as a constant switch between
+single-threaded execution (indistinguishable from current interpreter
+behaviour) and explicit parallel execution, where all cores churn away
+on a modified CFrame_EvalEx that has been primed with a "parallel"
+During this parallel execution, the interpreter will periodically take
+breaks and switch back to the single-thread model, release the GIL, do
+any signal handling and anything else it needs to in order to play
+nice with legacy code (and extensions). Once it acquires the GIL
+again, it can un-pause the parallel execution, and all cores continue
+executing the parallel callables.
+This cycle repeats as often as necessary.
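The alternation between parallel slices and single-threaded breaks might look roughly like the following sketch. The function `run_in_slices` and its parameters `slice_seconds` and `on_break` are assumptions made for illustration. Ordinary threads are used, so the GIL is never actually held across a slice as the proposal envisages. The sketch shows only the control flow: open a slice, stop handing out new callables, wait for in-flight callables to finish, do the serial housekeeping, and repeat.

```python
import threading
import time


def run_in_slices(callables, num_workers=2, slice_seconds=0.01, on_break=None):
    cond = threading.Condition()
    pending = list(callables)
    state = {"open": False, "active": 0}
    results = []

    def pipeline():
        while True:
            with cond:
                # Wait until a slice is open, or until no work remains.
                cond.wait_for(lambda: state["open"] or not pending)
                if not pending:
                    return
                func = pending.pop()
                state["active"] += 1
            try:
                value = func()  # once started, runs to completion
            except Exception as exc:  # record the error so the sketch cannot hang
                value = exc
            with cond:
                results.append(value)
                state["active"] -= 1
                cond.notify_all()

    threads = [threading.Thread(target=pipeline, daemon=True)
               for _ in range(num_workers)]
    for thread in threads:
        thread.start()

    breaks = 0
    while True:
        with cond:
            if not pending and state["active"] == 0:
                break
            state["open"] = True       # begin (or resume) a parallel slice
            cond.notify_all()
        time.sleep(slice_seconds)      # the configurable slice length
        with cond:
            state["open"] = False      # no new callables are started
            cond.wait_for(lambda: state["active"] == 0)
        if on_break is not None:
            on_break()                 # single-threaded housekeeping
        breaks += 1

    for thread in threads:
        thread.join()
    return results, breaks
```

Here `on_break` marks the point at which the real interpreter would release and re-acquire the GIL and handle signals. Final "reduce"-style work would run once, after the loop exits.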
+Exiting a parallel context is similar to the periodic breaks; the only
+difference is that "parallel finalization" methods might also need to
+be run (like the "reduce" part of map/reduce).
+The Role of Asynchronous Callbacks
+The vast majority of discussion so far has focused on the parallel
+aspect of this PEP. That is, introducing new primitives that allow
+Python code to run concurrently across multiple cores as efficiently
+The term "efficiently as possible" can be broken down as follows:
+We've also established that threading.Thread will not be the mechanism
+used for achieving concurrency. The solution must still play nice with