PEP: async
Title: Asynchronous Callbacks and Intrinsic Language Parallelism
Version: 0.1
Last-Modified: $Date$
Author: Trent Nelson <trent@trent.me>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 02-Dec-2012
Python-Version: 3.x

Abstract
========

The goal of this PEP is to produce a Python interpreter that, through
the introduction of new language constructs and library interfaces,
will allow code to be written that automatically exploits all
available CPU cores.

It proposes to achieve this by introducing two inter-related concepts:
new parallel language constructs and a new asynchronous library.

Rationale
=========

More and more cores are readily becoming available in both consumer
hardware and enterprise servers.  Python does not have a way to
naturally exploit multiple cores.  This PEP tries to address this
issue by introducing language support for parallel constructs and
asynchronous callbacks.

Relationship to Existing Efforts, PEPs and Design Discussions
=============================================================

Asynchronous IO became a hot topic on python-ideas during September,
October and November of 2012.  One output of those discussions was
tulip, a new async IO library by Guido van Rossum.  There is also
PEP 3153, 'Asynchronous IO Support', which tackles some of the issues
regarding async IO.

With regards to parallelism, the GIL has always been a contentious
issue, and the mail archives are filled with numerous attempts to
remove it (or simply lament its performance impact).

The reason the proposal here is being submitted in its own PEP is
because it specifically addresses the two issues in concert: handling
asynchronous events *and* providing truly concurrent execution of
Python code within a single interpreter.

It's also worth noting that the proposed solution is low-level enough
that Tulip, Twisted, Tornado, PEP 3153 and any other library can
easily leverage the primitives introduced by this PEP.

Use Cases
=========

Primary - Event Driven
----------------------
There are two use cases this PEP addresses.  The primary use case is
writing software that is event-driven; for example, an SMTP or HTTP
server, which waits for client connections and then "services" them.
Note that the emphasis is on event-driven, not "server software"; a
load-testing client should be just as able to spin up tens of
thousands of connections to a target server as that same server is
able to handle them.

The goal of this PEP for the primary use case is to allow Python to
use the underlying system resources as efficiently as it can; if
a high-performance, multi-threaded native C/C++ library can handle
65k clients and saturate multiple gigabit links (by utilising all
cores), then Python should be able to come as close to this as
possible (minus the interpreter overhead).

Secondary - Parallel Task/Data Decomposition
--------------------------------------------
The secondary use case is software that is not primarily driven by
external events or IO (although it still may perform this sort of
work (reading files, connecting to a database, etc)), but may deal
with "embarassingly parallel" data structures, and thus, would also
benefit from parallel language constructs.  Such constructs would be
things like native map/reduce support and parallel task execution.
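
To give a feel for the kind of code this use case covers, here is how
such a decomposition is commonly expressed today with the standard
library's concurrent.futures module.  This is purely illustrative of
the pattern; it is not the interface proposed by this PEP::

    from concurrent.futures import ProcessPoolExecutor

    def word_count(line):
        # The "kernel" of logic: a pure function of its input.
        return len(line.split())

    def total_words(lines):
        # Decompose the data across worker processes, then reduce.
        with ProcessPoolExecutor() as pool:
            return sum(pool.map(word_count, lines))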

The goal of this PEP for the secondary use case is to provide elegant
language primitives that allow code to be written in such a way that
all available cores will be exploited.  This use case differs from the
first in that it will be primarily a single thread of execution that
periodically runs in parallel, only to collapse back down to a single
thread of execution again.  Compare this to the event-driven example,
where server/client software revolves around an event loop that will
facilitate parallel execution automatically (based on the receipt of
events), without requiring the programmer to explicitly delineate
"parallel" sections of code.

Requirements
============
It is important to establish the requirements for such a significant
language change up-front, before enumerating possible solutions.  (It
may even be worth separating these requirements out into their own
informational PEP, which would be evaluated and approved independently
of any solutions.)  Agreeing on a set of requirements for asynchronous
and parallel primitives allows multiple solutions to be evaluated
objectively.

Proposed solutions, including the one in this PEP, should include a
traceability matrix that maps each requirement to satisfying design
and implementation details/features.

1.  The solution must not introduce a performance penalty of more than
    4% when running existing Python code that hasn't been converted to
    take advantage of the new primitives.  (4% has been picked as it
    was the overhead introduced by the garbage collector implementation
    circa Python 2.0.)  This is often referred to in GIL-removal threads
    as "performance must not be worse for single-threaded code".

2.  The solution should scale linearly in proportion to the number of
    cores available.  Scale in this context refers to both performance
    (throughput, connection concurrency and overall latency) as well as
    resource usage (memory overhead).

3.  In addition to scaling linearly, there should not be an excessive
    performance penalty for code written against the new primitives if
    there is only one core available.

4.  The solution must not rely on major backwards-incompatible changes
    to the public C API, especially with regards to the GIL.  That is,
    the solution should be compatible with existing extension code that
    relies on GIL semantics (such as acquiring and releasing the GIL).

5.  The solution may introduce minor changes to the C API that extension
    modules will have to accommodate in order to compile against the
    target version of Python, even if said module uses none of the new
    primitives.  The solution should attempt to keep such changes to
    the absolute minimum (i.e. adding #defines, possibly calling new
    CPython initialization methods when loading, etc).

6.  The solution will not "automagically parallelize" existing code.

    6.1 Specifically, the solution will not suddenly allow concurrent
        execution of threading.Thread() instances.  These objects
        (and their underlying Python object counterparts) will still
        be subject to the GIL.

    6.2 Parallelization will only apply to new code written against
        the new async/parallel primitives.

7.  The new asynchronous and parallel primitives can be used alongside
    legacy code without incurring any additional performance penalties.
    (Legacy code in this context refers to IO multiplexing via select/
    poll/epoll/kqueue and handling "blocking" code by deferring it to
    threading.Thread() instances.)

8.  Although the solution will present a platform-agnostic interface to
    the new primitives, it shall be designed against the platform with
    the best native support for underlying asynchronous and parallel
    primitives.  This allows such a platform (or platforms) to run at
    their full potential.  On platforms with lesser support for such
    primitives, the solution should mimic the semantics of the interface
    using traditional techniques for "simulating" asynchronicity (i.e.
    select/poll, blocking threads, etc).

9.  The solution should be low-level enough to facilitate easy adoption
    by existing libraries such as Twisted and Tornado, yet high-level
    enough to be pleasant to work with directly without the need for
    such libraries.

10. The solution should not incur excessive cognitive overhead; the
    average Python day-programmer should be able to easily grok the
    core concepts and start writing high-performance code within a
    single sitting.  It should not expose leaky-abstractions that
    require substantial background knowledge in parallel programming
    or socket/file multiplexing.

11. The parallel aspect of the solution must offer tangible benefits
    over and above those provided by the multiprocessing module, which
    already facilitates concurrency through the use of multiple Python
    processes.

12. The solution may introduce new restrictions or constraints against
    code written to use the new primitives, subject to the following:

        * The constraint is essential for meeting the requirements
          outlined in this section (i.e. performance, maintaining
          GIL semantics, etc).

        * The constraint should be logical to the Python programmer
          within the context of asynchronous/parallel execution.
          For example, a constraint that says a nonlocal variable
          cannot be affected from within a parallel execution context
          is reasonable within the context of parallel programming.
          Disallowing the use of generators, list comprehensions or
          integer arithmetic within a parallel execution context,
          however, is not.

        * Any violation of the new constraints/restrictions shall be
          detected and propagated as exceptions.  (If this is not
          possible for performance reasons, then the cost of silent
          errors needs to be weighed against the performance benefits
          on a case-by-case basis.)


Impact of Requirements on Solution Design
=========================================

Implementation Constraints
--------------------------
The following requirements significantly constrain the possible ways a
design could be implemented:

    * Negligible performance penalties for existing code (or code
      written using the new parallel and asynchronous primitives
      but still executing on a single core) [R1, R3, R7].

    * No change to opaque GIL semantics (from the perspective of
      extension modules) [R4, R5].

These requirements rule out any solutions that attempt to remove the
GIL (replacing it with fine-grained locking, for example).  Such
experiments have always resulted in code that runs orders of magnitude
slower in a single-threaded context (due to the overhead incurred by
locking and unlocking mutexes).

None of these requirements are new concepts -- every time GIL removal
comes up for discussion, these requirements are cited as reasons why
a given proposal won't be feasible.

However, the common trait of past proposals is that they focused on
the wrong level of abstraction: threading.Thread().  Discussions always
revolve around what can be done to automatically make threading.Thread
instances run concurrently.

threading.Thread() is a Dead End
--------------------------------
That line of thinking is a dead end.  In order for existing
threading.Thread() code to run concurrently, you'd need to replace the
GIL with
fine-grained locking, such that only one thread could access the GC
and memory allocation routines at a time (specifically with regards
to reference counting).  And fine-grained locking is a dead end due
to the overhead incurred by single-threaded code.

Formalizing Concurrency Entry and Exit Points
---------------------------------------------
This is the rationale behind [R6]: threading.Thread() will not become
magically concurrent, nor will any concurrency be observed on multiple
cores unless the new primitives are explicitly used.

That requirement is significant, because it prescribes the guidelines
and constraints under which true concurrency can be achieved:

    * The GIL must be held.
    * The new primitives must be used.

Performance Requirements
------------------------
[R11] stipulates that the solution must offer tangible benefits over
and above what could be achieved through multiprocessing.  In order
to meet this requirement, the solution will need to leverage the most
optimal multicore/threading programming paradigms, such as eliminating
read/write contention for shared data, using lockless data structures
where possible, and reducing context switching overhead incurred when
excessive threads are all vying for attention.

So, the solution must be performant, and it must be achieved within
the confines of holding the GIL.

The Importance of New Language Constraints
-------------------------------------------
However, the final requirement, [R12], affords us the ability to
introduce new language constraints within the context of the new
primitives referred to in [R6/7], as long as the constraints are:

    - Acceptable within the context of parallel execution.
    - Essential for achieving both high-performance and meeting the
      other strict guidelines (maintain GIL semantics, etc).

The notion of what's considered an 'acceptable' constraint will
obviously vary from person to person, and will undoubtedly be best
suited to a BDFL-style pronouncement.  It's worth noting, though,
that the community's willingness to accept certain constraints will
be directly proportional to the performance improvements said
constraints offer.

For example, a possible constraint may be that Python code executing
within a parallel context is not able to affect any global/nonlocal
variables.
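
As a purely illustrative sketch (neither the decorator name nor the
exception type exist; they are assumptions made for the sake of
example), such a constraint might surface to the programmer like
this::

    counter = 0

    @parallel.callable          # hypothetical marker for parallel code
    def handle(item):
        global counter
        counter += 1            # would raise something akin to a
                                # ParallelAccessError: globals are not
                                # visible from a parallel context
        return item * 2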

Solution Proposal - Overview
============================
In order to help solidify some of the concepts being presented, here's
the general overview of the solution the PEP author has in mind.  Keep
in mind it's only an example, and it is being presented in order to
stimulate ideas and further discussion.

The essence of the idea is this: in addition to the main interpreter
thread, an additional thread is created for each available CPU core.
These threads idle/wait against a condition variable by default.  That
is, when there's nothing to be done in parallel, they do nothing.

The main interpreter thread's behaviour when executing non-parallel code
is identical to how it behaves now when executing normal Python code;
all GIL, memory and GC semantics are preserved.

Because we have introduced new language primitives, we can detect
entry and exit from parallel execution contexts at the interpreter
level (i.e. in PyEval_EvalFrameEx), via specific op-codes, for example.

Upon detecting entry to a parallel context, the main interpreter
thread preps any relevant data structures, then signals to all other
threads to begin execution.  The other threads are frame execution
pipelines, performing essentially the same role as the main
PyEval_EvalFrameEx function, only customized to run in parallel.

It will be helpful to visualize the sort of Python code that these
frame pipelines will execute.  Remember that there are two use cases
this PEP targets: event-driven clients/servers, and parallel data/task
decomposition.

In either use case, the idea is that you should be able to separate
out the kernels of logic that can be performed in parallel (or
concurrently, in response to an external event).  This is analogous to
GPU programming, where one loads a single "kernel" of logic (a C
function) into hundreds or thousands of GPU hardware threads, which
then perform the action against different chunks of data.  The logic
is self-contained; once primed with input data, it doesn't rely on
anything else to produce its output data.

A similar approach would be adopted for Python.  The kernel of logic
would be contained within a Python callable.  The input would simply
be the arguments passed to the callable.  The optional output would be
whatever the callable returns, which would be utilized for parallel
task/data decomposition algorithms (i.e. map/reduce).  For the event
driven pattern, in lieu of a return value, the callable will invoke
one of many "parallel safe" methods in order to move the processing of
the event to the next state/stage.  Such methods will always return
immediately, enqueuing the action to be performed in the background.
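
A minimal sketch of what such kernels might look like (the names used
here -- parallel, async, and the next-stage callables -- are
illustrative assumptions, not settled syntax)::

    # Data decomposition: input arrives via arguments, output leaves
    # via ``return``.
    def char_frequency(chunk):
        freq = {}
        for ch in chunk:
            freq[ch] = freq.get(ch, 0) + 1
        return freq

    # Event driven: no return value; the callable advances the event
    # to its next stage via a "parallel safe" method that returns
    # immediately.
    def request_received(sock, data, cb):
        cb.success = response_sent          # hypothetical next stage
        async.write(sock, build_response(data), cb)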

It is these callables that will be executed concurrently by the other
"frame pipeline" threads once the main thread signals that a parallel
context has been entered.

The concurrent frame pipelines (modified versions of PyEval_EvalFrameEx)
simply churn away executing these callables.  Execution of the
callable is atomic in the sense that, once started, it will run 'til
completion.

Thus, presuming there are available events/data, all cores will be
occupied executing the frame pipelines.

This will last for a configurable amount of time.  Once that time
expires, no more callables will be queued (or the threads will sleep
via some other mechanism), and the main thread will wait 'til all
parallel execution has completed.

Upon detecting this, it performs any necessary "post-parallel"
cleanup, releases the GIL, acquires it again, and then, presuming
there is more parallel computation to be done, starts the whole
process again.

Thus, the program flow can be viewed as a constant switch between
single-threaded execution (indistinguishable from current interpreter
behaviour) and explicit parallel execution, where all cores churn away
on a modified PyEval_EvalFrameEx that has been primed with a "parallel"
callable.

During this parallel execution, the interpreter will periodically take
breaks and switch back to the single-thread model, release the GIL, do
any signal handling and anything else it needs to in order to play
nice with legacy code (and extensions).  Once it acquires the GIL
again, it can un-pause the parallel execution, and all cores continue
executing the parallel callables.

As long as the parallel context is still active (i.e. hasn't been
explicitly exited), this cycle repeats over and over.  Single thread,
to concurrent parallel callable execution, to single thread, back to
concurrent parallel callable execution, etc.

Exiting a parallel context will probably require slightly different
behaviour depending on whether it was an event driven context, or
parallel task/data decomposition.  For the latter, it is likely that
some sort of a "parallel finalization" method needs to be run (like
the "reduce" part of map/reduce), typically against the "output data"
each parallel callable "returned" (i.e. via a Python ``return foo``).

Concurrent Frame Execution Pipelines and Memory, GC and Ref. Counting
---------------------------------------------------------------------

Those familiar with CPython internals will have probably noticed a
flaw in the logic presented above: you can't have multiple threads
evaluating Python frames (i.e. PyEval_EvalFrameEx) concurrently because
of the way Python currently allocates memory, counts references and
manages garbage collection.

That is 100% correct.  You can't run the interpreter concurrently
because none of the memory allocation, reference counting and garbage
collection facilities are thread safe.  Incrementing a reference to an
object translates to a simple integer increment; there is no mutex
locking and unlocking, no atomic ops or memory barriers protecting
critical sections.  In fact, not only are they not thread safe, the
assumption that there is only one, global thread of execution pervades
every aspect of the implementation.

It is *this* aspect of CPython that limits concurrent execution; not
the GIL.  The GIL is simply the tool used to enforce this pervasive
assumption that there is only a single global thread of execution.

Again, it's worth noting that prior attempts at achieving concurrency
were done by adding all sorts of fine-grained locking around critical
sections of memory allocation, ref. counting and GC code.

That has a disastrous impact on CPython's performance.  Incrementing
and decrementing a C integer variable now becomes a dance involving
mutex locking and unlocking.  Combine that with the fact that those
two actions are probably the most frequently performed actions by a
CPython interpreter, and you can appreciate how futile any approach
relying on fine-grained locking is.

If fine-grained locking is out, how can we allow multiple CPU cores
to execute a (modified) PyEval_EvalFrameEx concurrently without breaking
all of the existing memory allocation, reference counting and garbage
collection framework?

The idea offered by this PEP is as follows: each parallel thread is
set up such that it thinks it's using the normal global malloc/refcnt
facilities, when in fact it has a localized instance.

All object allocation, deallocation, reference counting and garbage
collection that takes place during concurrent CFrame_EvalEx execution
is localized to that thread, and we maintain the illusion of a single
global thread of execution.

We can do this because our design entails only ever having a single
"parallel frame execution pipeline" thread for each available core.
This means that the thread's execution won't ever be interrupted by
another identical thread trying to access the same memory facilities.

(Additionally, we can set the thread affinity such that it always runs
on the same CPU core.  This allows us to benefit from cache locality,
among other things.)
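
For illustration only: on platforms that expose CPU affinity to
Python, such as Linux, pinning looks roughly as follows (in this
proposal the pinning would actually be done in C, inside the
interpreter)::

    import os

    def pin_to_core(core):
        # Restrict the current process to a single CPU core so its
        # working set stays warm in that core's caches.  Available on
        # some Unix platforms only.
        os.sched_setaffinity(0, {core})     # 0 == the calling process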

Thus, all of the existing memory allocation, reference counting and
garbage collection facilities can be used without modification by
each thread's PyEval_EvalFrameEx pipeline.  As everything is local to that
thread/CPU, no locking is required around ref. counting primitives.
No locking means no additional overhead.

In fact, because each parallel thread will only ever be executing
short-lived "parallel friendly" callables, we could omit the garbage
collection functionality completely in lieu of free()'ing allocated
memory as soon as the refcnt hits 0.

Or we could even experiment with omitting reference counting entirely
and just automatically free all memory upon completion of the parallel
callable.  After all, the callable shouldn't be affecting any global
(nonlocal) state (i.e. it's side-effect free), and as we're providing
well defined mechanisms for communicating the results of execution
back into the program flow (via ``return ...`` or calling other
"parallel-safe" methods), the lifetime of all other objects allocated
during the callable ends as soon as that callable completes.

The Constraints Imposed by Thread-Local Object Allocation
---------------------------------------------------------
Remember that [R12] says that constraints may be introduced by the new
parallel facilities as long as the following conditions are met:

    - The constraint is essential for performance reasons.
    - The constraint is logical to your average Python programmer.
    - The benefit yielded by concurrent execution outweighs the
      inconvenience of the constraint.

The implication of thread-local object allocation facilities is that
parallel code will have no visibility to nonlocal/global objects.
Thus, parallel callables will have to rely solely on their function
arguments providing them with everything they need to perform their
work.  Likewise, they can't affect global variable state, they can
only communicate the results of their execution back to the main
program flow by either returning results (via Python ``return``),
or calling one of the "parallel friendly" methods to push execution
onto the next state/stage.

On the surface, these constraints hopefully meet the criteria in
[R12]; they're essential for performance, they're logical within the
context of parallel programming, and the benefit of concurrency
outweighs the cost of writing "parallel friendly" callable code.

There will undoubtedly be more constraints that surface once a working
prototype has been implemented, but for now, it's probably safe to
assume the biggest one will be the limited view of the universe (from
the perspective of available/known objects) presented to code that
executes within a parallel callable.

The Importance of Never Blocking
================================
The vast majority of discussion so far has focused on the parallel
aspect of this PEP.  That is, introducing new primitives that allow
Python code to run concurrently across multiple cores as efficiently
as possible.

The term "efficiently as possible", within the context of a Python
interpreter executing within a parallel context (i.e. with all CPU
cores running parallel callables), means the following:

    - Each CPU core is spending 100% of its time in userland,
      executing the modified PyEval_EvalFrameEx pipeline.
    - There is no cross-talk between cores; code in core 1 doesn't
      try to communicate with code in core 2 halfway through the
      parallel callable.
    - There is no read/write contention for shared data, which means
      no need for synchronization primitives (at least in the most
      common code paths).
    - The code within the callable never blocks.

The first three points are addressed in the previous section.  The
final point, that callable code never blocks, is discussed in this
section.

Blocking, in this context, refers to a thread calling something that
doesn't immediately/always return in a deterministic fashion.  System
calls that involve writing and reading to sockets/files are examples
of calls that can "block".

Once blocked, a thread is useless.  Not only does execution of the
current parallel callable come grinding to a halt -- the entire
pipeline stalls.  Execution only resumes when the thread "unblocks",
typically when underlying IO has been completed.

Because our implementation relies heavily on the notion of binding a
single thread to each CPU core, when one of these parallel pipeline
threads stalls, the entire CPU core stalls -- there are no other
threads ready and waiting in the wings to pick up execution.

So, blocking must be avoided at all costs.  Luckily, this is far from
being a foreign concept -- the ability to avoid blocking is critical
to the success of all networking/event libraries.

Different platforms provide different facilities for avoiding blocking
system calls.  On POSIX systems, programs primarily rely on setting a
socket or file descriptor to "non-blocking"; alternatively, a blocking
call can be deferred to a separate thread, which allows the main
program flow to continue unimpeded.
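
For example, the traditional POSIX technique of marking a socket
non-blocking looks like this in Python; it is shown only to illustrate
the existing approach, and is not part of the proposed interface::

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)         # equivalent to setting O_NONBLOCK

    try:
        sock.connect(("example.com", 80))
    except BlockingIOError:
        # The connect is now in progress; completion is typically
        # detected later via select/poll/epoll/kqueue.
        pass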

Windows, on the other hand, has much more sophisticated facilities for
achieving high-performance IO by way of excellent asynchronous IO
support (usually referred to as "overlapped IO"), IO completion ports,
thread pools, efficient locking primitives and more.

AIX has facilities similar to those of Windows XP/2003 (overlapped IO and
IOCP, but no intrinsic thread pools).  Solaris introduced a similar
API to IOCP called "event ports", which is functionally equivalent to
AIX and Windows XP/2003.

The key point to take away from this is that although all platforms
provide a means for achieving non-blocking or asynchronous calls, how
each platform actually goes about doing it is vastly different.

The Requirement for Asynchronous Callbacks
==========================================

From the perspective of the parallel callable, how the underlying
operating system goes about asynchronous or non-blocking IO is an
irrelevant implementation detail.

However, what *is* important is the set of methods made available
to parallel callables that allow potentially-blocking calls to be
made in a "parallel-safe" manner.

There are many ways to expose this functionality to programs.  In
fact, how this functionality should be exposed is exactly the topic
being discussed on python-ideas and prototyped in libraries like
tulip.  It is also the essence of existing networking libraries like
Twisted and Tornado, and is the focus of another PEP: PEP 3145.

This PEP proposes an alternate approach to async IO compared to the
libraries above.  It was first presented to python-ideas on the 30th
of November, and the core concept hasn't changed since.

    (http://mail.python.org/pipermail/python-ideas/2012-November/018076.html)

The relevant code example is as follows::

    class Callback:
        __slots__ = [
            'success',
            'failure',
            'timeout',
            'cancel',
        ]

    async module/class async:
        def getaddrinfo(host, port, ..., cb):
            ...

        def getaddrinfo_then_connect(.., callbacks=(cb1, cb2)):
            ...

        def accept(sock, cb):
            ...

        def accept_then_write(sock, buf, callbacks=(cb1, cb2)):
            ...

        def accept_then_expect_line(sock, line, callbacks=(cb1, cb2)):
            ...

        def accept_then_expect_multiline_regex(sock, regex, cb):
            ...

        def read_until(fd_or_sock, bytes, cb):
            ...

        def read_all(fd_or_sock, cb):
            return read_until(fd_or_sock, EOF, cb)

        def read_until_lineglob(fd_or_sock, cb):
            ...

        def read_until_regex(fd_or_sock, cb):
            ...

        def read_chunk(fd_or_sock, chunk_size, cb):
            ...

        def write(fd_or_sock, buf, cb):
            ...

        def write_then_expect_line(fd_or_sock, buf, callbacks=(cb1, cb2)):
            ...

        def connect_then_expect_line(..):
            ...

        def connect_then_write_line(..):
            ...

        def submit_work(callable, cb):
            ...

        def submit_blocking_work(callable, cb):
            ...

        def submit_scheduled_work(callable, schedule, cb):
            """
            Allows for periodic work to be submitted (i.e. run every
            hour).
            """
            ...

        def submit_future_work(callable, delay, cb):
            """
            Allows for work to be executed at an arbitrary point in
            the future.
            """
            ...

        def submit_synchronized_work(callable, cb):
            """
            Submits a callable to be executed on the main Python
            interpreter thread (i.e. outside of any parallel context
            execution).
            """
            ...

        def run_once(..):
            """Run the event loop once."""

        def run(..):
            """Keep running the event loop until exit."""

The two key aspects of these methods are: a) every method (except the
run_once/run ones) must be passed a callback, and b) every method
*always* returns immediately.  From the perspective of the parallel
callable, everything happens asynchronously.
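
A hypothetical fragment, reusing the Callback class and
read_until_lineglob() from the sketch above (process_request,
handle_error, sock and the timeout representation are assumptions made
purely for illustration)::

    cb = Callback()
    cb.success = process_request     # a parallel callable
    cb.failure = handle_error        # a parallel callable
    cb.timeout = 30                  # assumed to be seconds

    # Returns immediately; process_request() is later invoked in a
    # parallel context once a complete line has been received.
    async.read_until_lineglob(sock, cb)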

Other aspects of this proposed API worth noting (particularly where it
differs from conventional proposals to date):

    - There's a single flat interface to *all* asynchronous calls.
      This is in stark contrast to the object-oriented/interface
      approach used by Twisted, which relies on concepts/classes
      such as Protocols, Producers, Consumers, Transports, etc.

    - It has been written in such a way to satisfy [R8], which states
      that the interface should focus on allowing the platform with
      the best asynchronous primitives to run natively, and simulating
      the behaviour elsewhere.  This means that a function like
      getaddrinfo() is exposed via the interface, because Windows
      provides a native way to call it asynchronously.

    - Every single function bar the submit_* ones deals with IO in some
      way or another.  All possible operations you could call against
      a socket or file descriptor are accessible asynchronously.  This
      is particularly important for calls such as connect and accept,
      which may not have asynchronous counterparts on some systems.
      (In which case, the "asynchronicity" would be mimicked in the
      same fashion Twisted et al do today.)

    - There are a lot of combined methods:

        def getaddrinfo_then_connect(.., callbacks=(cb1, cb2)):
        def accept_then_write(sock, buf, callbacks=(cb1, cb2)):
        def accept_then_expect_line(sock, line, callbacks=(cb1, cb2)):
        def accept_then_expect_multiline_regex(sock, regex, cb):
        def read_until(fd_or_sock, bytes, cb):
        def read_until_lineglob(fd_or_sock, cb):
        def read_until_regex(fd_or_sock, cb):
        def read_chunk(fd_or_sock, chunk_size, cb):
        def write_then_expect_line(fd_or_sock, buf, callbacks=(cb1, cb2)):
        def connect_then_expect_line(..):
        def connect_then_write_line(..):

      These combined methods provide optimal shortcuts for the most
      common actions performed by client/server event driven software.
      The idea is to delay the need to run Python code for as long as
      possible -- or in some cases, avoid the need to call it at all,
      especially in the case of error handling, where the interface
      can detect whether the incoming bytes are erroneous or not.

      For example, write_then_expect_line() automatically handles
      three possible code paths that could occur after data has been
      written to a network client:

        - The expected line is received -> cb.success() is executed.
        - The expected line is not received.  Either the line didn't
          match the expected pattern, the line was too long, had
          invalid chars, etc.  Whatever the case, cb.failure() is
          automatically called -- which would be configured to send an
          error code (if appropriate to the protocol) and then close
          the connection.
        - A \r\n-terminated line isn't received within the timeout
          specified by cb.timeout.  cb.failure() is run with the
          failure type indicating 'timeout'.

      The general principle is to try and delay the need to execute
      Python code for as long as possible -- and, when it can't be
      delayed any longer, make sure the Python code has enough data to
      do useful work.

      This frees the Python programmer from writing repetitive IO
      processing code (i.e. is my buffer full, has a full line been
      received, has the correct pattern been detected, etc) and lets
      them focus on program logic -- "when a line matching this
      pattern is
      received, run this Python code" (i.e. the callable accessible
      from the cb.success callback).

      Additionally, by keeping all the buffer processing and input
      detection work in C, the Python interpreter doesn't need to
      switch back to Python code until there's something useful to
      do.

      This becomes an important optimization when dealing with heavily
      loaded servers.  If you've got 64k concurrent clients reading
      and writing to your server, the faster you can serve each
      request, the better you're going to perform.  Minimizing the
      number of times Python code is being executed over the course of
      a given interval will improve the overall performance.

    - Finally, there is a set of methods specifically tailored towards
      handling blocking calls:

        def submit_work(callable, cb):
        def submit_blocking_work(callable, cb):
        def submit_scheduled_work(callable, schedule, cb):
            """
            Allows for periodic work to be submitted (i.e. run every
            hour).
            """

        def submit_future_work(callable, delay, cb):
            """
            Allows for work to be executed at an arbitrary point in
            the future.
            """

        def submit_synchronized_work(callable, cb):
            """
            Submits a callable to be executed on the main Python
            interpreter thread (i.e. outside of any parallel context
            execution).
            """

      These methods serve an important role.  Because the proposed
      solution advises against using threading.Thread (or rather,
      threading.Threads inhibit the parallel capabilities of the
      interpreter, as it has to let the threads run at regular
      intervals, pausing the parallel pipelines to do so), there
      needs to be a way to submit non-IO work that may potentially
      block from within the context of a parallel callable.

      This is what the submit_* methods are for.  They allow for
      submission of arbitrary callables.  Where and how the callable
      is executed depends on which submit method is used.  Pure
      "compute" parallel callables can be submitted via submit_work.
      Blocking calls can be submitted via submit_blocking_work (this
      allows the underlying implementation to differentiate between
      blocking calls that need to be handled in a separate, on-demand
      thread, versus compute callables that can simply be queued to
      one of the pipelines).  A short sketch of this usage follows
      this list.

      Additionally, work can be submitted for processing
      "synchronously", that is, when parallel processing is paused (or
      has stopped) and the main interpreter thread is executing as a
      single thread with the GIL acquired and visibility to all
      global/nonlocal variables.  This provides a means to affect the
      global state of the program from within a parallel callable
      *without* requiring shared data synchronization primitives.

      (Note that most of these methods have equivalent counterparts in
      libraries like Twisted, i.e. reactor.callFromThread, callLater,
      etc.)
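
The following is a sketch of how these submission methods might be
used from within a parallel callable.  The database call, the stats
dictionary and the callback targets are all hypothetical and exist
only to illustrate the calling pattern::

    def fetch_user(user_id, cb):
        # A potentially-blocking call (e.g. a database query) is
        # pushed to submit_blocking_work so the parallel pipeline
        # never stalls.
        def query():
            return db.lookup(user_id)       # hypothetical blocking call

        cb.success = user_fetched           # hypothetical parallel callable
        async.submit_blocking_work(query, cb)

    def bump_counter(cb):
        # Mutating global state is deferred to the main interpreter
        # thread, outside of any parallel context.
        def update():
            stats['requests'] += 1          # hypothetical global dict

        async.submit_synchronized_work(update, cb)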


Tying It All Together
=====================

With the provision of asynchronous callbacks, we have all the pieces
we need to facilitate our main goal of exploiting multiple cores
within the context of the following two use cases:

    - Writing high-performance, event-driven client/server software.
    - Writing parallel task/data decomposition algorithms.

We are able to exploit multiple cores by introducing a set of parallel
primitives (the specifics of which we've yet to discuss) that the main
interpreter thread uses to signal the entry and exit of parallel
execution contexts.

We have introduced the concept of "parallel callables".  These are
simply Python functions that can be run concurrently across multiple
cores.  In order to achieve concurrent execution of Python code as
efficiently as possible, some constraints need to be observed by the
code within such a "parallel callable":

    - The code should never block.  This is achieved by providing an
      asynchronous facade for all possible IO operations.

    - The code will not have visibility to any global variables or
      objects (globals() will literally return an empty dict, or
      perhaps just the same dict as locals()).

    - Because no shared or global state can be accessed directly from
      within the body of the code, any "input data" the code requires
      must be made accessible via normal function arguments.

    - The lifetime of all objects created within a parallel callable
      ends when the callable finishes (that is, they are explicitly
      deallocated).

    - Within the context of parallel task/data decomposition, if a
      callable needs to communicate the result of its work back to the
      main program, it does so by simply returning that data (i.e. via
      a normal Python ``return foo`` statement).

    - Within the context of event-driven clients/servers, callables
      can use a set of async.submit_*_work() facilities to either
      communicate with the main program or affect global program
      state in some way.

    - The code within the parallel callable should be reduced to the
      simplest expression of program logic suitable for exposing as an
      atomic unit.  In general, more callables with less functionality
      are preferable to fewer callables with greater functionality.

    - Parallel callables are deterministic and side-effect free
      (important properties if they are to be executed concurrently).
      Provided with the same input,
      they'll always produce the same output.  They can't affect each
      other's state, nor do they leak state (i.e. object allocation)
      once finished, nor can they affect nonlocal/global objects
      unless the standard communication primitives are used (using
      ``return`` or async.submit*).

With these constraints in place, we propose achieving interpreter
parallelism/concurrency by creating a thread for each CPU core and,
upon a signal from the main thread that parallel computation can
begin, running a modified PyEval_EvalFrameEx pipeline in each of
those threads.

The parallel pipeline will only ever execute parallel callables.  It
will execute a single callable to completion (i.e. until the function
has finished), perform any data marshalling required by ``return``,
free all objects allocated during execution of the callable, and then
check whether or not the main thread has paused (or completed) parallel
execution.  This determines whether the parallel thread yields control
back to a waiting main interpreter thread, or whether it can pull
another callable off its queue and start the process again.

We identify that the major issue with past attempts at concurrency is
that they all try to make threading.Thread suddenly behave
concurrently.  This requires revamping a significant number of
CPython's internal structures such that they're protected by locks.

This adds an unacceptable overhead; frequent operations like incref
and decref, which are currently extremely trivial and almost free from
a "cycle cost" perspective, suddenly need to be protected with mutex
locks and unlocks.

This sort of an approach will never work with the current memory
allocation, reference counting and garbage collection implementation,
because it has all been written with the assumption that there is a
single global thread of execution at any one time.

We propose an alternate approach for achieving concurrency without
the overhead of fine-grained locking: provide each thread pipeline
with its own, localized object allocation.  This would be done by
replacing the global object head static pointers currently used with a
set of thread-local replacements.  This maintains the illusion of a
single global thread of execution from the perspective of CPython
internals.

We also suggest exploring avenues such as dropping garbage collection
altogether and simply deallocating objects (i.e. free()'ing them) as
soon as their refcnt hits 0, or even doing away with reference
counting entirely and automatically deallocating everything that was
allocated as part of the "parallel callable cleanup" steps each thread
must take before starting a new callable.

This proposal introduces a new constraint: no visibility to
global/nonlocal variables from within the context of a parallel
callable.  The callable must be written such that it has all the data
it needs from its function arguments.  If it needs to affect global
program state, it does so by returning a value or using one of the
async.* methods.

Our proposed approach also meets all of the requirements with regards
to maintaining GIL semantics.  It does this by having the main
interpreter thread periodically pause parallel execution in order to
release the GIL, as well as perform other bits of housekeeping that
can only be done with the GIL held from the single-threaded execution
context of the main interpreter thread (which, unlike parallel
threads, has access to the "global" memory state, garbage collection,
etc).

Obviously, the longer the parallel threads are permitted to run
without being paused the better, but legacy code that relies on GIL
semantics (especially extension modules) will still run without
issue in the meantime.

Because we don't introduce any fine-grained locking primitives in
order to achieve concurrency, no additional overhead is incurred by
Python code running in the normal single-threaded context or the new
parallel context.

As part of ensuring parallel callables run as efficiently as possible,
we propose a new "async facade" API.  This API exposes methods that
allow potentially-blocking system calls to *always* be executed
asynchronously.  Calls to these methods always return instantly,
ensuring that parallel callable code never blocks.

The choice to expose low-level methods asynchronously versus proposing
a higher level async API (such as Producer/Consumer/Transport/Protocol
etc) was a conscious one; just like the parallel callables, these
methods are intended to be stateless; they do not need any class
state or instance objects in order to be invoked.  They can simply be
called directly from the context of a parallel callable with the
intended target object (i.e. a socket or file descriptor) as the first
argument.

The proposed async facade also leverages the new concept of parallel
callables via the explicit callback requirements.  As long as the same
"parallel callable" constraints are adhered to by a callback (i.e.
cb.success), the execution can be scheduled for the parallel pipeline
automatically.  The more "parallel callables" used to handle program
state, the longer Python can stay executing parallel contexts.  This
is important for being able to achieve high performance across all
cores in heavily loaded situations.

Also note that we don't discuss how the backend for the async facade
should be implemented, nor do we leak any underlying platform details
into the interface.  There is no mention of IOCP, select, poll, epoll,
kqueue, non-blocking sockets/fds or AIO.  These are all implementation
details that the "parallel callable" code shouldn't be interested in.
All it needs is an asynchronous interface to all system calls that
could potentially block, as well as a means to submit general parallel
work and work that needs to be done by the main interpreter thread
outside of the parallel execution context.

The final point worth mentioning regarding the async facade is that it
isn't attempting to compete with Tulip, Twisted, Tornado or any other
async IO library.

In fact, its primary goal isn't async IO at all.  Its primary goal is
to provide parallel callables with a means to invoke potentially-blocking
system calls asynchronously, such that they don't stall the parallel
thread pipeline.

Because the new primitives being proposed are designed to exploit
multiple cores concurrently, the idea is that subsequent versions of
Twisted could be written to leverage the new capabilities quite
easily.  This applies equally to other event loops such as Tornado.

XXX TODO
========

Example Python Code Using New Parallel Primitives
-------------------------------------------------
I haven't thought much about what the new parallel primitives will
look like from a syntactical perspective.  I'm confident that the most
elegant solution will naturally present itself as part of subsequent
PEP reviews.

The primary goal of this PEP is to introduce the concept of parallel
primitives.  Such primitives have well defined entry and exit points,
which the main interpreter thread relies upon to control parallel
execution contexts.

There are two use cases that need to be catered for: event-driven
client/server programs and parallel task/data decomposition.

A possible example of what the former would look like::

    class SimpleEchoServer(async.TCPServer):
        def __init__(self):
            cb = Callback()
            cb.success = self.connection_established
            async.accept(self.sock, cb)

        @parallel.event
        def connection_established(self, cb):
            cb.success = self.line_received
            async.expect_line(self.sock, cb)

        @parallel.event
        def line_received(self, data, cb):
            cb.success = self.connection_established
            async.write(self.sock, data, cb)

    server = SimpleEchoServer()
    async.register(server)
    async.run()

The parallel task/data decomposition primitives will probably share
some similarities with an OpenMP-style interface, as it's well
suited to this sort of task::

    @parallel.map
    def linelen(line):
        return len(line)

    # Runs linelen(line) in parallel against each line in ``text``,
    # then sums the results.
    total = parallel.reduce(text.splitlines(), linelen, lambda x, y: x + y)

Underlying Implementation of Async Facade
-----------------------------------------
More discussion is warranted regarding how the async facade would be
implemented on different platforms.  POSIX platforms will use epoll
or kqueue, falling back to poll.  Windows, AIX and Solaris can use
their native AIO facilities with IOCP or event ports.

How each underlying implementation interacts with the parallel
primitives also needs to be discussed.  Are separate threads used, is
it all done in the main thread during parallel breaks, etc.

References
==========



Copyright
=========

This document has been placed in the public domain.



..
   Local Variables:
   mode: indented-text
   indent-tabs-mode: nil
   sentence-end-double-space: t
   fill-column: 70
   coding: utf-8
   End:
   vim:set ts=8 sw=4 sts=4 et syntax=rst tw=70: