PEP: async
Title: Asynchronous Callbacks and Intrinsic Language Parallelism
Version: 0.1
Last-Modified: $Date$
Author: Trent Nelson <trent@trent.me>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 02-Dec-2012
Python-Version: 3.x

Abstract
========

The goal of this PEP is to produce a Python interpreter that, through
the introduction of new language constructs and library interfaces,
will allow code to be written that automatically exploits all
available CPU cores.

It proposes to achieve this by introducing two inter-related concepts:
new parallel language constructs and a new asynchronous library.

Rationale
=========

More and more cores are becoming available in both consumer hardware
and enterprise servers.  Python does not have a natural way to
exploit multiple cores.  This PEP tries to address this issue by
introducing language support for parallel constructs and asynchronous
callbacks.

Relationship to Existing Efforts, PEPs and Design Discussions
=============================================================

Asynchronous IO became a hot topic on python-ideas during September,
October and November of 2012.  One output of those discussions was
Tulip, a new async IO library by Guido van Rossum.  There is also
PEP-3153, 'Asynchronous IO Support', which tackles some of the
issues regarding async IO.

With regards to parallelism, the GIL has always been a contentious
issue, and the mail archives are filled with numerous attempts to
remove it (or simply lament its performance impact).

The reason the proposal here is being submitted as its own PEP is
that it specifically addresses the two issues in concert: handling
asynchronous events *and* providing truly concurrent execution of
Python code within a single interpreter.

It's also worth noting that the proposed solution is low-level enough
that Tulip, Twisted, Tornado, PEP-3153 and any other library can
easily leverage the primitives introduced by this PEP.

Use Cases
=========
Primary - Event Driven
----------------------
There are two use cases this PEP addresses.  The primary use case is
writing software that is event-driven; for example, an SMTP or HTTP
server, which waits for client connections and then "services" them.
Note that the emphasis is on event-driven, not "server software"; a
load-testing client should be just as able to spin up tens of
thousands of connections to a target server as that same server is
able to handle them.

The goal of this PEP for the primary use case is to allow Python to
exploit the underlying system resources as fully as possible; if a
high-performance, multi-threaded native C/C++ library can handle
65k clients and saturate multiple gigabit links (by utilising all
cores), then Python should be able to come as close to this as
possible (minus the interpreter overhead).

Secondary - Parallel Task/Data Decomposition
--------------------------------------------
The secondary use case is software that is not primarily driven by
external events or IO (although it may still perform such work:
reading files, connecting to a database, and so on), but deals with
"embarrassingly parallel" data structures and would therefore also
benefit from parallel language constructs.  Such constructs would be
things like native map/reduce support and parallel task execution.

The goal of this PEP for the secondary use case is to provide elegant
language primitives that allow code to be written in such a way that
all available cores will be exploited.  This use case differs from the
first in that it will be primarily a single thread of execution that
periodically runs in parallel, only to collapse back down to a single
thread of execution again.  Compare this to the event-driven example,
where server/client software revolves around an event loop that will
facilitate parallel execution automatically (based on the receipt of
events), without requiring the programmer to explicitly delineate
"parallel" sections of code.

Requirements
============
It is important to establish the requirements for such a significant
language change up-front, before enumerating possible solutions.  (It
may even be worth separating these requirements out into their own
informational PEP, which would be evaluated and approved independently
of any solutions.)  Agreeing on a set of requirements for asynchronous
and parallel primitives allows multiple solutions to be evaluated
objectively.

Proposed solutions, including the one in this PEP, should include a
traceability matrix that maps each requirement to satisfying design
and implementation details/features.

1.  The solution must not introduce a performance penalty of more than
    4% when running existing Python code that hasn't been converted to
    take advantage of the new primitives.  (4% has been picked as it
    was the overhead introduced by the garbage collector implementation
    circa Python 2.0.)  This is often referred to in GIL-removal threads
    as "performance must not be worse for single-threaded code".

2.  The solution should scale linearly in proportion to the number of
    cores available.  Scale in this context refers to both performance
    (throughput, connection concurrency and overall latency) as well as
    resource usage (memory overhead).

3.  In addition to scaling linearly, there should not be an excessive
    performance penalty for code written against the new primitives if
    there is only one core available.

4.  The solution must not rely on major backwards-incompatible changes
    to the public C API, especially with regards to the GIL.  That is,
    the solution should be compatible with existing extension code that
    relies on GIL semantics (such as acquiring and releasing the GIL).

5.  The solution may introduce minor changes to the C API that extension
    modules will have to accommodate in order to compile against the
    target version of Python, even if said module uses none of the new
    primitives.  The solution should attempt to keep such changes to
    the absolute minimum (e.g. adding #defines, possibly calling new
    CPython initialization methods when loading, etc).

6.  The solution will not "automagically parallelize" existing code.

    6.1 Specifically, the solution will not suddenly allow concurrent
        execution of threading.Thread() instances.  These objects
        (and their underlying Python object counterparts) will still
        be subject to the GIL.

    6.2 Parallelization will only apply to new code written against
        the new async/parallel primitives.

7.  The new asynchronous and parallel primitives can be used alongside
    legacy code without incurring any additional performance penalties.
    (Legacy code in this context refers to IO multiplexing via select/
    poll/epoll/kqueue and handling "blocking" code by deferring it to
    threading.Thread() instances; a sketch of this pattern follows
    this list.)

8.  Although the solution will present a platform-agnostic interface to
    the new primitives, it shall be designed against the platform with
    the best native support for the underlying asynchronous and parallel
    primitives.  This allows such platforms to run at their full
    potential.  On platforms with lesser support for such primitives,
    the solution should mimic the semantics of the interface using
    traditional techniques for "simulating" asynchronicity (e.g.
    select/poll, blocking threads, etc).

9.  The solution should be low-level enough to facilitate easy adoption
    by existing libraries such as Twisted and Tornado, yet high-level
    enough to be pleasant to work with directly without the need for
    such libraries.

10. The solution should not incur excessive cognitive overhead; the
    average Python day-programmer should be able to easily grok the
    core concepts and start writing high-performance code within a
    single sitting.  It should not expose leaky-abstractions that
    require substantial background knowledge in parallel programming
    or socket/file multiplexing.

11. The parallel aspect of the solution must offer tangible benefits
    over and above those provided by the multiprocessing module, which
    already facilitates concurrency through the use of multiple Python
    processes.

12. The solution may introduce new restrictions or constraints against
    code written to use the new primitives, subject to the following:

        * The constraint is essential for meeting the requirements
          outlined in this section (e.g. performance, maintaining
          GIL semantics, etc).

        * The constraint should be logical to the Python programmer
          within the context of asynchronous/parallel execution.
          For example, a constraint that says a nonlocal variable
          cannot be affected from within a parallel execution context
          is reasonable within the context of parallel programming.
          Disallowing the use of generators, list comprehensions or
          integer arithmetic within a parallel execution context,
          however, is not.

        * Any violation of the new constraints/restrictions shall be
          detected and propagated as exceptions.  (If this is not
          possible for performance reasons, then the cost of silent
          errors needs to be weighed against the performance benefits
          on a case-by-case basis.)
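
For concreteness, the "legacy" pattern referred to in [R7] is the
familiar combination of IO multiplexing and thread offload.  A
minimal sketch, using only the standard library::

    # Multiplex with select(), and defer "blocking" per-connection
    # work to threading.Thread() instances.
    import select
    import socket
    import threading

    def handle_conn(conn):
        data = conn.recv(4096)    # blocking call, off the main loop
        conn.sendall(data)
        conn.close()

    server = socket.socket()
    server.bind(('127.0.0.1', 8080))
    server.listen(128)

    while True:
        readable, _, _ = select.select([server], [], [])
        for sock in readable:
            conn, _addr = sock.accept()
            threading.Thread(target=handle_conn,
                             args=(conn,)).start()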


Impact of Requirements on Solution Design
=========================================

Implementation Constraints
--------------------------
The following requirements significantly constrain the possible ways a
design could be implemented:

    * Negligible performance penalties for existing code (or code
      written using the new parallel and asynchronous primitives
      but still executing on a single core) [R1, R3, R7].

    * No change to opaque GIL semantics (from the perspective of
      extension modules) [R4, R5].

These requirements rule out any solutions that attempt to remove the
GIL (in favour of fine-grained locking, for example).  Such
experiments have always resulted in code that runs significantly
slower in a single-threaded context (due to the overhead incurred by
locking and unlocking mutexes).

None of these requirements are new concepts -- every time GIL removal
comes up for discussion, these requirements are cited as reasons why
a given proposal won't be feasible.

threading.Thread() is a Dead End
--------------------------------
The common trait of past proposals, however, is that they focused on
the wrong level of abstraction: threading.Thread().  Discussions
always revolve around what can be done to automatically make
threading.Thread instances run concurrently.

That line of thinking is a dead end.  In order for existing
threading.Thread() code to run concurrently, you'd need to replace
the GIL with fine-grained locking, such that only one thread at a
time could access the GC and memory allocation routines (specifically
with regards to reference counting).  And fine-grained locking is a
dead end due to the overhead it imposes on single-threaded code.

Formalizing Concurrency Entry and Exit Points
---------------------------------------------
This is the rationale behind [R6]: threading.Thread() will not become
magically concurrent, nor will any concurrency be observed on multiple
cores unless the new primitives are explicitly used.

That requirement is significant, because it prescribes the guidelines
and constraints under which true concurrency can be achieved:

    * The GIL must be held.
    * The new primitives must be used.

Performance Requirements
------------------------
[R11] stipulates that the solution must offer tangible benefits over
and above what could be achieved through multiprocessing.  In order
to meet this requirement, the solution will need to leverage the most
optimal multicore/threading programming paradigms, such as eliminating
read/write contention for shared data, using lockless data structures
where possible, and reducing the context-switching overhead incurred
when an excessive number of threads are all vying for attention.

So, the solution must be performant, and that performance must be
achieved within the confines of holding the GIL.

The Importance of New Language Constraints
------------------------------------------
However, the final requirement, [R12], affords us the ability to
introduce new language constraints within the context of the new
primitives referred to in [R6/7], as long as the constraints are:

    - Acceptable within the context of parallel execution.
    - Essential for achieving both high-performance and meeting the
      other strict guidelines (maintain GIL semantics, etc).

The notion of what's considered an 'acceptable' constraint will
obviously vary from person to person, and will undoubtedly be best
suited to a BDFL-style pronouncement.  It's worth noting, though,
that the community's willingness to accept certain constraints will
be directly proportional to the performance improvements said
constraints offer.

For example, a possible constraint may be that Python code executing
within a parallel context is not able to affect any global/nonlocal
variables.
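
To illustrate, consider the following sketch; the exception name is
invented here, as [R12] requires only that violations be detected
and propagated as exceptions::

    counter = 0

    def tally(item):
        global counter
        counter += 1    # rebinds shared global state
        return item * 2

    # Called normally, ``tally`` behaves as usual.  Invoked from
    # within a parallel execution context, the assignment to
    # ``counter`` would instead raise something along the lines of
    # (exception name invented):
    #
    #     ParallelAssignmentError: cannot rebind global 'counter'
    #     from within a parallel execution context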

Solution Proposal - Quick Overview
----------------------------------
In order to help solidify some of the concepts being presented, here's
the general overview of the solution the PEP author has in mind.  Keep
in mind it's only an example, and it is being presented in order to
stimulate ideas and further discussion.

The essence of the idea is this: in addition to the main interpreter
thread, an additional thread is created for each available CPU core[1].
These threads idle/wait against a condition variable by default.  That
is, when there's nothing to be done in parallel, they do nothing.
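
A Python-level sketch of this idle/signal design follows.  The real
mechanism would be implemented in C inside the interpreter (and would
also bind each thread to its core), but the shape is the same::

    import multiprocessing
    import threading

    cond = threading.Condition()
    work = []      # parallel callables, published by the main thread
    done = False

    def frame_pipeline():
        # Stand-in for the per-core frame execution pipeline.
        while True:
            with cond:
                while not work and not done:
                    cond.wait()    # idle: nothing to do in parallel
                if done:
                    return
                callable_ = work.pop()
            callable_()            # runs atomically, to completion

    # One thread per spare core; on entering a parallel context, the
    # main thread would extend ``work`` and call cond.notify_all().
    threads = [threading.Thread(target=frame_pipeline)
               for _ in range(multiprocessing.cpu_count() - 1)]
    for t in threads:
        t.start()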

The main interpreter thread's behaviour when executing non-parallel
code is identical to how it behaves now when executing normal Python
code; all GIL, memory and GC semantics are preserved.

Because we have introduced new language primitives, we can detect
entry to and exit from parallel execution contexts at the interpreter
level (i.e. within PyEval_EvalFrameEx), via specific opcodes, for
example.

Upon detecting entry to a parallel context, the main interpreter
thread preps any relevant data structures, then signals all other
threads to begin execution.  The other threads are essentially frame
execution pipelines, performing the same role as the main
PyEval_EvalFrameEx function, only customized to run in parallel.

It will be helpful to visualize the sort of Python code that these
frame pipelines will execute.  Remember that there are two use cases
this PEP targets: event-driven clients/servers, and parallel data/task
decomposition.

In either use case, the idea is that you should be able to separate
out the kernels of logic that can be performed in parallel (or
concurrently, in response to an external event).  This is analogous to
GPU programming, where one loads a single "kernel" of logic (a C
function) into hundreds or thousands of GPU hardware threads, which
then perform the action against different chunks of data.  The logic
is self-contained; once primed with input data, it doesn't rely on
anything else to produce its output data.

A similar approach would be adopted for Python.  The kernel of logic
would be contained within a Python callable.  The input would simply
be the arguments passed to the callable.  The optional output would be
whatever the callable returns, which would be utilized by parallel
task/data decomposition algorithms (e.g. map/reduce).  For the
event-driven pattern, in lieu of a return value, the callable will
invoke one of many "parallel safe" methods in order to move the
processing of the event to the next state/stage.  Such methods will
always return immediately, enqueuing the action to be performed in
the background.
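
By way of illustration, here is what two such kernels might look
like, one per use case.  ``submit_write`` is not a real or proposed
API; it stands in for one of the hypothetical "parallel safe"
methods::

    def scale(point, factor):
        # Data-decomposition kernel: its input is its arguments, its
        # output is its return value (consumed by e.g. a reduce step).
        x, y = point
        return (x * factor, y * factor)

    def submit_write(conn, data):
        # Stub for a hypothetical parallel-safe method; the real
        # thing would enqueue the write and return immediately.
        ...

    def on_request(conn, request):
        # Event-driven kernel: no return value; it advances the event
        # to its next state/stage via a parallel-safe method.
        response = b'HTTP/1.1 200 OK\r\n\r\nhello'
        submit_write(conn, response)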

It is these callables that will be executed concurrently by the other
"frame pipeline" threads once the main thread signals that a parallel
context has been entered.

The concurrent frame pipelines (modified versions of
PyEval_EvalFrameEx) simply churn away executing these callables.
Execution of a callable is atomic in the sense that, once started, it
will run to completion.

Thus, presuming there are available events/data, all cores will be
occupied executing the frame pipelines.

This will last for a configurable amount of time.  Once that time
expires, no more callables will be queued (or the threads will sleep
via some other mechanism), and the main thread will wait until all
parallel execution has completed.

Upon detecting this, it performs any necessary "post-parallel"
cleanup, releases the GIL, acquires it again, and then, presuming
there is more parallel computation to be done, starts the whole
process again.

Thus, the program flow can be viewed as a constant switch between
single-threaded execution (indistinguishable from current interpreter
behaviour) and explicit parallel execution, where all cores churn
away on a modified PyEval_EvalFrameEx that has been primed with a
"parallel" callable.

During this parallel execution, the interpreter will periodically take
breaks and switch back to the single-thread model, release the GIL, do
any signal handling and anything else it needs to in order to play
nice with legacy code (and extensions).  Once it acquires the GIL
again, it can un-pause the parallel execution, and all cores continue
executing the parallel callables.

This cycle repeats as often as necessary.

Exiting a parallel context is similar to the periodic breaks; the only
difference is that "parallel finalization" methods might also need to
be run (like the "reduce" part of map/reduce).
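
The cycle described above can be condensed into the following
Python-flavoured pseudocode.  Every helper name here is invented;
the real logic would live in the C-level evaluation loop::

    def main_thread_loop():
        # Pseudocode only: all helpers below are hypothetical.
        while True:
            run_single_threaded()       # normal GIL semantics
            if not entering_parallel_context():
                continue
            prep_parallel_structures()
            signal_workers()            # all cores churn away
            while parallel_work_remains():
                pause_workers()         # periodic break
                release_gil()           # let legacy code run
                handle_signals()
                acquire_gil()
                resume_workers()
            run_parallel_finalizers()   # e.g. the "reduce" step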


The Role of Asynchronous Callbacks
==================================
The vast majority of discussion so far has focused on the parallel
aspect of this PEP.  That is, introducing new primitives that allow
Python code to run concurrently across multiple cores as efficiently
as possible.

The term "efficiently as possible" can be broken down as follows:

    - Th


We've also established that threading.Thread will not be the mechanism
used for achieving concurrency.  We still need to play nice with
existing The solution must still play nice with 



The Proposed Solution
=====================
TL;DR Version
-------------
 * Create the main interpreter thread and bind it to whatever
   processor core it's currently running on.
 * For each additional core found, create and bind another thread.
 * Refactor the role of PyEval_EvalFrameEx and supporting functions.
 * No change from current behaviour in the single-threaded case.
 * Detection of a parallel context entry point (an opcode?) results
   in prepping every other thread's frame execution pipeline with the
   available work, signalling all threads to begin their pipelines,
   then waiting for the work to complete before transitioning back to
   a single thread of control.
 * This repeats every time parallel constructs are encountered.
 * Other cores are idle when the program is not in a parallel context.

Parallelizing the Interpreter
-----------------------------
For now, we'll ignore the asynchronous requirements and just focus on
the concurrency.  threading.Thread() is already off the table, so we
can ignore that aspect completely.

Without threading.Thread(), we need to come up with a way to run code
concurrently within the confines of parallel contexts.  We'll achieve
this using a new type of thread which we'll call ``IThread``, for
"interpreter thread".



References
==========



Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
   vim:set ts=8 sw=4 sts=4 et syntax=rst tw=70: