Trent Nelson  committed 5d34a2e Draft


  • Parent commits 824c0e9
  • Branches async


Files changed (1)

File pep-async.txt

 within a parallel context is not able to affect any global/nonlocal
-Solution Proposal - Quick Overview
+Solution Proposal - Overview
 In order to help solidify some of the concepts being presented, here's
 the general overview of the solution the PEP author has in mind.  Keep
 in mind it's only an example, and it is being presented in order to
 These threads idle/wait against a condition variable by default.  That
 is, when there's nothing to be done in parallel, they do nothing.
-The main intepreter thread's behaviour when executing non-parallel code
+The main interpreter thread's behaviour when executing non-parallel code
 is identical to how it behaves now when executing normal Python code;
 all GIL, memory and GC semantics are preserved.
 again, it can un-pause the parallel execution, and all cores continue
 executing the parallel callables.
-This cycle repeats as often as necessary.
+As long as the parallel context is still active (i.e. hasn't been
+explicitly exited), this cycle repeats over and over.  Single thread,
+to concurrent parallel callable execution, to single thread, back to
+concurrent parallel callable execution, etc.
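The pause/resume cycle described above can be modelled in ordinary Python. This is a toy sketch only (all names invented, standard threads standing in for the proposed per-core pipelines): workers idle on a condition variable, execute callables while the context is "open", and the main thread resumes single-threaded execution once the work drains.

```python
import threading

class ParallelContext:
    # Toy model: workers idle on a condition variable when there is
    # nothing to do, mirroring the single-thread -> parallel ->
    # single-thread cycle described in the text.
    def __init__(self, nthreads=2):
        self.cond = threading.Condition()
        self.work, self.results = [], []
        self.pending = 0
        for _ in range(nthreads):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            with self.cond:
                while not self.work:
                    self.cond.wait()          # idle: nothing to do
                fn, args = self.work.pop()
            result = fn(*args)                # a "parallel callable"
            with self.cond:
                self.results.append(result)
                self.pending -= 1
                if self.pending == 0:
                    self.cond.notify_all()    # wake the main thread

    def run(self, callables):
        # Main thread: enter the parallel context, wait for the work
        # to drain, then continue single-threaded.
        with self.cond:
            self.pending = len(callables)
            self.results = []
            self.work.extend(callables)
            self.cond.notify_all()
            while self.pending:
                self.cond.wait()
            return list(self.results)

ctx = ParallelContext()
out = ctx.run([(pow, (2, i)) for i in range(5)])
assert sorted(out) == [1, 2, 4, 8, 16]
```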
-Exiting a parallel context is similar to the periodic breaks; the only
-difference is that "parallel finalization" methods might also need to
-be run (like the "reduce" part of map/reduce).
+Exiting a parallel context will probably require slightly different
+behaviour depending on whether it was an event driven context, or
+parallel task/data decomposition.  For the latter, it is likely that
+some sort of a "parallel finalization" method needs to be run (like
+the "reduce" part of map/reduce), typically against the "output data"
+each parallel callable "returned" (i.e. via a Python ``return foo``).
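For the task/data decomposition case, the "parallel finalization" step might look like the following sketch, with standard-library threads standing in for the proposed pipelines: each side-effect-free callable returns its output data, and the reduce step runs over those returns after the context exits.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import operator

def word_count(chunk):
    # Side-effect free "parallel callable": input via arguments,
    # output via return value only.
    return len(chunk.split())

chunks = ["a b c", "d e", "f g h i"]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(word_count, chunks))   # parallel phase

total = reduce(operator.add, partials)              # finalization step
assert total == 9
```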
+Concurrent Frame Execution Pipelines and Memory, GC and Ref. Counting
-The Role of Asynchronous Callbacks
+Those familiar with CPython internals will have probably noticed a
+flaw in the logic presented above: you can't have multiple threads
+evaluating Python frames (i.e. CFrame_EvalEx) concurrently because
+of the way Python currently allocates memory, counts references and
+manages garbage collection.
+That is 100% correct.  You can't run the interpreter concurrently
+because none of the memory allocation, reference counting and garbage
+collection facilities are thread safe.  Incrementing a reference to an
+object translates to a simple integer increment; there is no mutex
+locking and unlocking, no atomic ops or memory barriers protecting
+critical sections.  In fact, not only are they not thread safe, the
+assumption that there is only one, global thread of execution pervades
+every aspect of the implementation.
+It is *this* aspect of CPython that limits concurrent execution; not
+the GIL.  The GIL is simply the tool used to enforce this pervasive
+assumption that there is only a single global thread of execution.
+Again, it's worth noting that prior attempts at achieving concurrency
+were done by adding all sorts of fine grained locking around critical
+sections of memory allocation, ref. counting and GC code.
+That has a disastrous impact on CPython's performance.  Incrementing
+and decrementing a C integer variable now becomes a dance involving
+mutex locking and unlocking.  Combine that with the fact that those
+two operations are probably the most frequently performed by a
+CPython interpreter, and you can appreciate how futile any approach
+relying on fine-grained locking is.
+If fine-grained locking is out, how can we allow multiple CPU cores
+to execute a (modified) CFrame_EvalEx concurrently without breaking
+all of the existing memory allocation, reference counting and garbage
+collection framework?
+The idea offered by this PEP is as follows: each parallel thread's
+environment is set up such that it thinks it's using the normal global
+malloc/refcount facilities, when in fact it has a localized instance.
+All object allocation, deallocation, reference counting and garbage
+collection that takes place during concurrent CFrame_EvalEx execution
+is localized to that thread, and we maintain the illusion of a single
+global thread of execution.
+We can do this because our design entails only ever having a single
+"parallel frame execution pipeline" thread for each available core.
+This means that the thread's execution won't ever be interrupted by
+another identical thread trying to access the same memory facilities.
+(Additionally, we can set the thread affinity such that it always runs
+on the same CPU core.  This allows us to benefit from cache locality,
+among other things.)
+Thus, all of the existing memory allocation, reference counting and
+garbage collection facilities can be used without modification by
+each thread's CFrame_EvalEx pipeline.  As everything is local to that
+thread/CPU, no locking is required around ref. counting primitives.
+No locking means no additional overhead.
+In fact, because each parallel thread will only ever be executing
+short-lived "parallel friendly" callables, we could omit the garbage
+collection functionality completely in favor of free()'ing allocated
+memory as soon as the refcnt hits 0.
+Or we could even experiment with omitting reference counting entirely
+and just automatically free all memory upon completion of the parallel
+callable.  After all, the callable shouldn't be affecting any global
+(nonlocal) state (i.e. it's side-effect free), and as we're providing
+well defined mechanisms for communicating the results of execution
+back into the program flow (via ``return ...`` or calling other
+"parallel-safe" methods), the lifetime of all other objects allocated
+during the callable ends as soon as that callable completes.
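The "skip refcounting, free everything when the callable completes" idea can be illustrated as a pure-Python thought experiment (all names invented here): allocations made while a parallel callable runs are recorded in a thread-local arena and released in one sweep when it returns; only the return value escapes.

```python
import threading

_arena = threading.local()
stats = {"freed": 0}

def parallel_alloc(obj):
    _arena.objects.append(obj)       # every allocation is tracked
    return obj

def run_parallel_callable(fn, *args):
    _arena.objects = []
    try:
        return fn(*args)             # only the return value escapes
    finally:
        stats["freed"] += len(_arena.objects)
        _arena.objects.clear()       # bulk "free" of scratch objects

def count_positive(values):
    scratch = parallel_alloc([v for v in values if v > 0])
    return len(scratch)              # communicated via ``return``

assert run_parallel_callable(count_positive, [3, -1, 7]) == 2
assert stats["freed"] == 1           # scratch list swept on completion
```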
+The Constraints Imposed by Thread-Local Object Allocation
+Remember that [R12] says that constraints may be introduced by the new
+parallel facilities as long as the following conditions are met:
+    - The constraint is essential for performance reasons.
+    - The constraint is logical to your average Python programmer.
+    - The benefit yielded by concurrent execution outweighs the
+      inconvenience of the constraint.
+The implication of thread-local object allocation facilities is that
+parallel code will have no visibility to nonlocal/global objects.
+Thus, parallel callables will have to rely solely on their function
+arguments providing them with everything they need to perform their
+work.  Likewise, they can't affect global variable state, they can
+only communicate the results of their execution back to the main
+program flow by either returning results (via Python ``return``),
+or calling one of the "parallel friendly" methods to push execution
+onto the next state/stage.
+At surface level, these constraints hopefully meet the criteria in
+[R12]; they're essential for performance, they're logical within the
+context of parallel programming, and the benefit of concurrency
+outweighs the cost of writing "parallel friendly" callable code.
+There will undoubtedly be more constraints that surface once a working
+prototype has been implemented, but for now, it's probably safe to
+assume the biggest one will be the limited view of the universe (from
+the perspective of available/known objects) presented to code that
+executes within a parallel callable.
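A hypothetical before/after pair makes the constraint concrete: a parallel callable receives everything via arguments and communicates only via its return value, while anything that touches global state is off limits.

```python
TOTALS = {"hits": 0}                  # global state: invisible to
                                      # parallel callables

def unfriendly(record):
    TOTALS["hits"] += 1               # mutates a global -- not possible
    return record                     # from within a parallel callable

def friendly(records, pattern):
    # Everything needed arrives as function arguments; the only
    # output is the return value -- the required "parallel" shape.
    return sum(1 for r in records if pattern in r)

matches = friendly(["foo", "foobar", "baz"], "foo")
assert matches == 2
```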
+The Importance of Never Blocking
 The vast majority of discussion so far has focused on the parallel
 aspect of this PEP.  That is, introducing new primitives that allow
 Python code to run concurrently across multiple cores as efficiently
 as possible.
-The term "efficiently as possible" can be broken down as follows:
+The term "efficiently as possible", within the context of a Python
+interpreter executing within a parallel context (i.e. with all CPU
+cores running parallel callables), means the following:
-    - Th
+    - Each CPU core is spending 100% of its time in userland,
+      executing the modified CFrame_EvalEx pipeline.
+    - There is no cross-talk between cores;  code in core 1 doesn't
+      try to communicate with code in core 2 halfway through the
+      parallel callable.
+    - There is no read/write contention for shared data, which means
+      no need for synchronization primitives (at least in the most
+      common code paths).
+    - The code within the callable never blocks.
+The first three points are addressed in the previous section.  The
+final point, that callable code never blocks, is discussed in this
+section.
-We've also established that threading.Thread will not be the mechanism
-used for achieving concurrency.  We still need to play nice with
-existing The solution must still play nice with 
+Blocking, in this context, refers to a thread calling something that
+doesn't immediately/always return in a deterministic fashion.  System
+calls that involve writing and reading to sockets/files are examples
+of calls that can "block".
+Once blocked, a thread is useless.  Not only does execution of the
+current parallel callable come grinding to a halt -- the entire
+pipeline stalls.  Execution only resumes when the thread "unblocks",
+typically when underlying IO has been completed.
+Because our implementation relies heavily on the notion of binding a
+single thread to each CPU core, when one of these parallel pipeline
+threads stalls, the entire CPU core stalls -- there are no other
+threads ready and waiting in the wings to pick up execution.
-The Proposed Solution
+So, blocking must be avoided at all costs.  Luckily, this is far from
+being a foreign concept -- the ability to avoid blocking is critical
+to the success of all networking/event libraries.
+Different platforms provide different facilities for avoiding blocking
+system calls.  On POSIX systems, programs primarily rely on setting a
+socket or file descriptor to "non-blocking"; alternatively, a blocking
+call can be deferred to a separate thread, which allows the main
+program flow to continue unimpeded.
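A minimal POSIX-style illustration of that facility: once a socket is marked non-blocking, a read with no data available returns immediately (surfacing as BlockingIOError at the Python level) rather than stalling the calling thread.

```python
import socket

a, b = socket.socketpair()
a.setblocking(False)                  # equivalent to setting O_NONBLOCK
try:
    a.recv(1024)                      # no data has been sent yet
    would_have_blocked = False
except BlockingIOError:
    would_have_blocked = True         # returned instantly, no stall
finally:
    a.close()
    b.close()
assert would_have_blocked
```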
+Windows, on the other hand, has much more sophisticated facilities for
+achieving high-performance IO by way of excellent asynchronous IO
+support (usually referred to as "overlapped IO"), IO completion ports,
+thread pools, efficient locking primitives and more.
+AIX has facilities equal to those of Windows XP/2003 (overlapped IO and
+IOCP, but no intrinsic thread pools).  Solaris introduced a similar
+API to IOCP called "event ports", which is functionally equivalent to
+AIX and Windows XP/2003.
+The key point to take away from this is that although all platforms
+provide a means for achieving non-blocking or asynchronous calls, how
+each platform actually goes about doing it is vastly different.
+The Requirement for Asynchronous Callbacks
+From the perspective of the parallel callable, how the underlying
+operating system goes about asynchronous or non-blocking IO is an
+irrelevant implementation detail.
+However, what *is* important is the set of methods made available
+to parallel callables that allow potentially-blocking calls to be
+made in a "parallel-safe" manner.
+There are many ways to expose this functionality to programs.  In
+fact, how this functionality should be exposed is exactly the topic
+being discussed on python-ideas and prototyped in libraries like
+tulip.  It is also the essence of existing networking libraries like
+Twisted and Tornado, and is the focus of another PEP: PEP 3145.
+This PEP proposes an alternate approach to async IO compared to the
+libraries above.  It was first presented to python-ideas on the 30th
+of November, and the core concept hasn't changed since.
+The relevant code example is as follows::
+    class Callback:
+        __slots__ = [
+            'success',
+            'failure',
+            'timeout',
+            'cancel',
+        ]
+    class async:  # exposed as a module or class (TBD)
+        def getaddrinfo(host, port, ..., cb):
+            ...
+        def getaddrinfo_then_connect(.., callbacks=(cb1, cb2)):
+            ...
+        def accept(sock, cb):
+            ...
+        def accept_then_write(sock, buf, (cb1, cb2)):
+            ...
+        def accept_then_expect_line(sock, line, (cb1, cb2)):
+            ...
+        def accept_then_expect_multiline_regex(sock, regex, cb):
+            ...
+        def read_until(fd_or_sock, bytes, cb):
+            ...
+        def read_all(fd_or_sock, cb):
+            return read_until(fd_or_sock, EOF, cb)
+        def read_until_lineglob(fd_or_sock, cb):
+            ...
+        def read_until_regex(fd_or_sock, cb):
+            ...
+        def read_chunk(fd_or_sock, chunk_size, cb):
+            ...
+        def write(fd_or_sock, buf, cb):
+            ...
+        def write_then_expect_line(fd_or_sock, buf, (cb1, cb2)):
+            ...
+        def connect_then_expect_line(..):
+            ...
+        def connect_then_write_line(..):
+            ...
+        def submit_work(callable, cb):
+            ...
+        def submit_blocking_work(callable, cb):
+            ...
+        def submit_scheduled_work(callable, schedule, cb):
+            """
+            Allows for periodic work to be submitted (i.e. run every
+            hour).
+            """
+            ...
+        def submit_future_work(callable, delay, cb):
+            """
+            Allows for work to be executed at an arbitrary point in
+            the future.
+            """
+            ...
+        def submit_synchronized_work(callable, cb):
+            """
+            Submits a callable to be executed on the main Python
+            interpreter thread (i.e. outside of any parallel context
+            execution).
+            """
+            ...
+        def run_once(..):
+            """Run the event loop once."""
+        def run(..):
+            """Keep running the event loop until exit."""
+The two key aspects of these methods are: a) every method (except the
+run_once/run ones) must be passed a callback, and b) every method
+*always* returns immediately.  From the perspective of the parallel
+callable, everything happens asynchronously.
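That calling convention can be demonstrated with a runnable stand-in. The Callback class comes from the proposal above, but fake_submit_work() is invented here purely so the example executes; only the shape matters -- a callback goes in, the call "returns immediately", and results arrive via cb.success or cb.failure.

```python
class Callback:
    __slots__ = ['success', 'failure', 'timeout', 'cancel']

events = []

def fake_submit_work(work, cb):
    # A real backend would queue `work` and return instantly; this
    # stand-in runs it inline and fires the appropriate callback.
    try:
        cb.success(work())
    except Exception as exc:
        cb.failure(exc)

cb = Callback()
cb.success = lambda result: events.append(('success', result))
cb.failure = lambda error: events.append(('failure', error))
fake_submit_work(lambda: 2 + 2, cb)
assert events == [('success', 4)]
```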
+Other aspects of this proposed API worth noting (particularly where it
+differs from conventional proposals to date):
+    - There's a single flat interface to *all* asynchronous calls.
+      This is in stark contrast to the object-oriented/interface
+      approach used by Twisted, which relies on concepts/classes
+      such as Protocols, Producers, Consumers, Transports, etc.
+    - It has been written in such a way to satisfy [R8], which states
+      that the interface should focus on allowing the platform with
+      the best asynchronous primitives to run natively, and simulating
+      the behaviour elsewhere.  This means that a function like
+      getaddrinfo() is exposed via the interface, because Windows
+      provides a native way to call this method asynchronously.
+    - Every single function bar the submit_* ones deals with IO in some
+      way or another.  All possible operations you could call against
+      a socket or file descriptor are accessible asynchronously.  This
+      is particularly important for calls such as connect and accept,
+      which may not have asynchronous counterparts on some systems.
+      (In which case, the "asynchronicity" would be mimicked in the
+      same fashion Twisted et al do today.)
+    - There are a lot of combined methods:
+        def getaddrinfo_then_connect(.., callbacks=(cb1, cb2)):
+        def accept_then_write(sock, buf, (cb1, cb2)):
+        def accept_then_expect_line(sock, line, (cb1, cb2)):
+        def accept_then_expect_multiline_regex(sock, regex, cb):
+        def read_until(fd_or_sock, bytes, cb):
+        def read_until_lineglob(fd_or_sock, cb):
+        def read_until_regex(fd_or_sock, cb):
+        def read_chunk(fd_or_sock, chunk_size, cb):
+        def write_then_expect_line(fd_or_sock, buf, (cb1, cb2)):
+        def connect_then_expect_line(..):
+        def connect_then_write_line(..):
+      These combined methods provide optimal shortcuts for the most
+      common actions performed by client/server event driven software.
+      The idea is to delay the need to run Python code for as long as
+      possible -- or in some cases, avoid the need to call it at all,
+      especially in the case of error handling, where the interface
+      can detect whether the incoming bytes are erroneous or not.
+      For example, write_then_expect_line() automatically handles
+      three possible code paths that could occur after data has been
+      written to a network client:
+        - The expected line is received -> cb.success() is executed.
+        - The expected line is not received.  Either the line didn't
+          match the expected pattern, the line was too long, had
+          invalid chars, etc.  Whatever the case, cb.failure() is
+          automatically called -- which would be configured to send an
+          error code (if appropriate to the protocol) and then close
+          the connection.
+        - A \r\n-terminated line isn't received within the timeout
+          specified by cb.timeout.  cb.failure() is run with the
+          failure type indicating 'timeout'.
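The three code paths can be sketched with a toy dispatcher (invented for illustration; the real work would happen in C inside the interface):

```python
import re

def expect_line(received, pattern, cb, timed_out=False):
    # Dispatches to the callback the same way the text describes
    # write_then_expect_line() behaving after the write completes.
    if timed_out:
        return cb.failure('timeout')      # no line within cb.timeout
    if received is not None and re.fullmatch(pattern, received):
        return cb.success(received)       # expected line received
    return cb.failure('bad line')         # wrong/overlong/invalid line

class CB:
    def __init__(self):
        self.events = []
    def success(self, line):
        self.events.append(('success', line))
    def failure(self, why):
        self.events.append(('failure', why))

cb = CB()
expect_line('220 OK\r\n', r'220 .*\r\n', cb)
expect_line('garbage', r'220 .*\r\n', cb)
expect_line(None, r'220 .*\r\n', cb, timed_out=True)
assert cb.events == [('success', '220 OK\r\n'),
                     ('failure', 'bad line'),
                     ('failure', 'timeout')]
```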
+      The general principle is to try and delay the need to execute
+      Python code for as long as possible -- and, when it can't be
+      delayed any longer, make sure the Python code has enough data to
+      do useful work.
+      This frees the Python programmer from writing repetitive IO
+      processing code (i.e. is my buffer full, has a full line been
+      received, has the correct pattern been detected, etc.) and lets
+      them focus
+      more on program logic -- "when a line matching this pattern is
+      received, run this Python code" (i.e. the callable accessible
+      from the cb.success callback).
+      Additionally, by keeping all the buffer processing and input
+      detection work in C, the Python interpreter doesn't need to
+      switch back to Python code until there's something useful to
+      do.
+      This becomes an important optimization when dealing with heavily
+      loaded servers.  If you've got 64k concurrent clients reading
+      and writing to your server, the faster you can serve each
+      request, the better you're going to perform.  Minimizing the
+      number of times Python code is being executed over the course of
+      a given interval will improve the overall performance.
+    - Finally, there is a set of methods specifically tailored towards
+      handling blocking calls:
+        def submit_work(callable, cb):
+        def submit_blocking_work(callable, cb):
+        def submit_scheduled_work(callable, schedule, cb):
+            """
+            Allows for periodic work to be submitted (i.e. run every
+            hour).
+            """
+        def submit_future_work(callable, delay, cb):
+            """
+            Allows for work to be executed at an arbitrary point in
+            the future.
+            """
+        def submit_synchronized_work(callable, cb):
+            """
+            Submits a callable to be executed on the main Python
+            interpreter thread (i.e. outside of any parallel context
+            execution).
+            """
+      These methods serve an important role.  The proposed solution
+      advises against using threading.Thread (or rather,
+      threading.Thread instances inhibit the parallel capabilities
+      of the interpreter, which must let such threads run at regular
+      intervals, pausing the parallel pipelines to do so), so there
+      needs to be a way to submit non-IO work that may potentially
+      block from within the context of a parallel callable.
+      This is what the submit_* methods are for.  They allow for
+      submission of arbitrary callables.  Where and how the callable
+      is executed depends on what submit method is used.  Pure
+      "compute" parallel callables can be submitted via submit_work.
+      Blocking calls can be submitted via submit_blocking_work (this
+      allows the underlying implementation to differentiate between
+      blocking calls that need to be handled in a separate, on-demand
+      thread, versus compute callables that can simply be queued to
+      one of the pipelines).
+      Additionally, work can be submitted for processing
+      "synchronously", that is, when parallel processing is paused (or
+      has stopped) and the main interpreter thread is executing in as
+      a single-thread with the GIL acquired and visibility to all
+      global/nonlocal variables.  This provides a means to affect the
+      global state of the program from within a parallel callable
+      *without* requiring shared data synchronization primitives.
+      (Note that most of these methods have equivalent counterparts in
+      libraries like Twisted, i.e. reactor.callFromThread, callLater,
+      etc.)
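The synchronized-work mechanism can be modelled with ordinary queues and threads (stand-in machinery, invented names): a parallel callable can't touch global state directly, so it queues a mutation that the main interpreter thread applies later, when it has global visibility.

```python
import queue
import threading

sync_queue = queue.Queue()
GLOBAL_STATE = {'connections': 0}

def submit_synchronized_work(work):
    sync_queue.put(work)             # returns instantly; runs later

def parallel_callable():
    # No direct access to GLOBAL_STATE from here; defer the mutation
    # to the main thread instead.
    submit_synchronized_work(lambda: GLOBAL_STATE.__setitem__(
        'connections', GLOBAL_STATE['connections'] + 1))

t = threading.Thread(target=parallel_callable)
t.start()
t.join()

# Main thread, with parallel execution "paused": drain the queue.
while not sync_queue.empty():
    sync_queue.get()()
assert GLOBAL_STATE['connections'] == 1
```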
+Tying It All Together
-TL;DR Version
- * Create the main interpreter thread and bind it to whatever
-   processor core it's currently running on.
- * For each additional core found, create and bind another thread.
- * Refactor the role of CEval_FrameEx and supporting functions.
- * No change from current behavior in single-threaded case.
- * Detection of a parallel context entry point (opcode?) results in
-   prepping every other thread's frameex pipeline with the available
-   work, signalling all threads to begin their frameex pipelines, then
-   waiting for work to complete to transition back to single-thread
-   of control.
- * This repeats every time parallel constructs are encountered.
- * Other cores are idle when the program is not in a parallel context.
-Parallelizing the Interpreter
-For now, we'll ignore the asynchronous requirements and just focus on
-the concurrency.  threading.Thread() is already off the table, so we
-can ignore that aspect completely.
+With the provision of asynchronous callbacks, we have all the pieces
+we need to facilitate our main goal of exploiting multiple cores
+within the context of the following two use cases:
-Without threading.Thread(), we need to come up with a way to run code
-concurrently within the confines of parallel contexts.  We'll achieve
-this using a new type of thread which we'll call ``IThread``, for
-"interpreter thread".
+    - Writing high-performance, event-driven client/server software.
+    - Writing parallel task/data decomposition algorithms.
+We are able to exploit multiple cores by introducing a set of parallel
+primitives (the specifics of which we've yet to discuss) that the main
+interpreter thread uses to signal the entry and exit of parallel
+execution contexts.
+We have introduced the concept of "parallel callables".  These are
+simply Python functions that can be run concurrently across multiple
+cores.  In order to achieve concurrent execution of Python code as
+efficiently as possible, some constraints need to be observed by the
+code within such a "parallel callable":
+    - The code should never block.  This is achieved by providing an
+      asynchronous facade for all possible IO operations.
+    - The code will not have visibility to any global variables or
+      objects (globals() will literally return an empty dict (or maybe
+      just the same dict as locals())).
+    - Because no shared or global state can be accessed directly from
+      within the body of the code, any "input data" the code requires
+      must be made accessible via normal function arguments.
+    - The lifetime of all objects created within a parallel callable
+      ends when the callable finishes (that is, they are explicitly
+      deallocated).
+    - Within the context of parallel task/data decomposition, if a
+      callable needs to communicate the result of its work back to the
+      main program, it does so by simply returning that data (i.e. via
+      a normal Python ``return foo`` statement).
+    - Within the context of event-driven clients/servers, callables
+      can use a set of async.submit_*_work() facilities to either
+      communicate with the main program or affect global program
+      state in some way.
+    - The code within the parallel callable should be reduced to the
+      simplest expression of program logic suitable for exposing as an
+      atomic unit.  In general, more callables with less functionality
+      are preferable to fewer callables with greater functionality.
+    - Parallel callables are idempotent (an important property if they
+      are to be executed concurrently).  Provided with the same input,
+      they'll always produce the same output.  They can't affect each
+      other's state, nor do they leak state (i.e. object allocation)
+      once finished, nor can they affect nonlocal/global objects
+      unless the standard communication primitives are used (using
+      ``return`` or async.submit*).
+With these constraints in place, we propose achieving interpreter
+parallelism/concurrency by creating a thread for each CPU core and,
+upon a signal from the main thread that parallel computation can
+begin, running a modified CFrame_EvalEx pipeline in each of those
+threads.
+The parallel pipeline will only ever execute parallel callables.  It
+will execute a single callable to completion (i.e. until the function
+has finished), perform any data marshalling required by ``return``,
+free all objects allocated during execution of the callable, and then
+check whether or not the main thread has paused (or completed) parallel
+execution.  This determines whether the parallel thread yields control
+back to a waiting main interpreter thread, or whether it can pull down
+another callable off its queue and start the process again.
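That per-thread loop can be sketched in Python (invented names, no real allocator): run each callable to completion, marshal its return value, note where per-callable memory would be swept, and check whether the main thread has paused before pulling the next item off the queue.

```python
import queue

def run_pipeline(work, results, paused):
    # One pipeline's loop, as described above.
    while True:
        if paused():                  # main thread wants control back
            return
        try:
            fn, args = work.get_nowait()
        except queue.Empty:
            return                    # queue drained
        results.append(fn(*args))     # marshal the ``return`` value
        # ... per-callable memory would be bulk-freed here ...

work = queue.Queue()
for i in range(4):
    work.put((lambda n=i: n * 10, ()))
results = []
run_pipeline(work, results, paused=lambda: False)
assert sorted(results) == [0, 10, 20, 30]
```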
+We identify that the major issue with attempting concurrency in the
+past is that all approaches try to make threading.Thread suddenly
+behave concurrently.  This requires revamping a significant amount of
+CPython internal structures such that they're protected by locks.
+This adds an unacceptable overhead; frequent operations like incref
+and decref, which are currently extremely trivial and almost free from
+a "cycle cost" perspective, suddenly need to be protected with mutex
+lock and unlocks.
+This sort of an approach will never work with the current memory
+allocation, reference counting and garbage collection implementation,
+because it has all been written with the assumption that there is a
+single global thread of execution at any one time.
+We propose an alternate approach for achieving concurrency without
+the overhead of fine-grained locking: provide each thread pipeline
+with its own, localized object allocation.  This would be done by
+replacing the global object head static pointers currently used with a
+set of thread-local replacements.  This maintains the illusion of a
+single global thread of execution from the perspective of CPython
+internals.
+We also suggest exploring avenues such as dropping garbage collection
+altogether and simply deallocating objects (i.e. free()'ing them) as
+soon as their refcnt hits 0, or even doing away with reference
+counting altogether and automatically deallocating everything that was
+allocated as part of the "parallel callable cleanup" steps each thread
+must take before starting a new callable.
+This proposal introduces a new constraint: no visibility to
+global/nonlocal variables from within the context of a parallel
+callable.  The callable must be written such that it has all the data
+it needs from its function arguments.  If it needs to affect global
+program state, it does so by returning a value or using one of the
+async.* methods.
+Our proposed approach also meets all of the requirements with regards
+to maintaining GIL semantics.  It does this by having the main
+interpreter thread periodically pause parallel execution in order to
+release the GIL, as well as perform other bits of housekeeping that
+can only be done with the GIL held from the single-threaded execution
+context of the main interpreter thread (which, unlike parallel
+threads, has access to the "global" memory state, garbage collection
+and so on).
+Obviously, the longer the parallel threads are permitted to run
+without being paused the better, but legacy code that relies on GIL
+semantics (especially extension modules) will still run without
+issue in the meantime, and no performance penalties will be incurred
+for not using the new parallel primitives.
+Because we don't introduce any fine-grained locking primitives in
+order to achieve concurrency, no additional overhead is incurred by
+Python code running in the normal single-threaded context or the new
+parallel context.
+As part of ensuring parallel callables run as efficiently as possible,
+we propose a new "async facade" API.  This API exposes methods that
+allow potentially-blocking system calls to *always* be executed
+asynchronously.  Calls to these methods always return instantly,
+ensuring that parallel callable code never blocks.
+The choice to expose low-level methods asynchronously versus proposing
+a higher level async API (such as Producer/Consumer/Transport/Protocol
+etc) was a conscious one; just like the parallel callables, these
+methods are intended to be idempotent; they do not need any class
+state or instance objects in order to be invoked.  They can simply be
+called directly from the context of a parallel callable with the
+intended target object (i.e. a socket or file descriptor) as the
+first argument.
+callables via the explicit callback requirements.  As long as the same
+"parallel callable" constraints are adhered to by a callback (i.e.
+cb.success), the execution can be scheduled for the parallel pipeline
+automatically.  The more "parallel callables" used to handle program
+state, the longer Python can stay executing parallel contexts.  This
+is important for being able to achieve high performance across all
+cores in heavily loaded situations.
+Also note that we don't discuss how the backend for the async facade
+should be implemented, nor do we leak any underlying platform details
+into the interface.  There is no mention of IOCP, select, poll, epoll,
+kqueue, non-blocking sockets/fds or AIO.  These are all implementation
+details that the "parallel callable" code shouldn't be interested in.
+All it needs is an asynchronous interface to all system calls that
+could potentially block, as well as a means to submit general parallel
+work and work that needs to be done by the main interpreter thread
+outside of the parallel execution context.
+The final point worth mentioning regarding the async facade is that it
+isn't attempting to compete with Tulip, Twisted, Tornado or any other
+async IO library.
+In fact, its primary goal isn't async IO at all.  Its primary goal is
+to provide parallel callables with a means to call potentially-blocking
+system calls asynchronously, such that they don't stall the parallel
+thread pipeline.
+Because the new primitives being proposed are designed to exploit
+multiple cores concurrently, the idea is that subsequent versions of
+Twisted could be written to leverage the new capabilities quite
+easily.  This applies to any other event loops like Tornado et al.