asyncio and sqlalchemy

Mike Bayer repo owner

changed milestone to blue sky
changed component to engine
marked as proposal

sure.

read this. http://techspot.zzzeek.org/2015/02/15/asynchronous-python-and-databases/

2015-05-04T15:38:25+00:00

Mike Bayer repo owner

changed title to asyncio and sqlalchemy

removed "ORM". asyncio integration would mostly be a core-level thing with only partial ORM functionality possible.

2015-05-04T15:39:19+00:00

Fred Cox

We are using core sqlalchemy and porting from Python 2.7 with gevent to Python 3.4 with asyncio.

Currently, we have to run all the code that hits the DB connection in executor threads, since aiopg's implementation of sqlalchemy is very limited.

Having asyncio support for core sqlalchemy would be much appreciated.

If I can help, give me pointers to what sections of the code needs modification, and I'll see what I can do.

2015-06-29T22:38:18+00:00

Mike Bayer repo owner

it would require entirely separate copies of sqlalchemy/pool, sqlalchemy/engine/base, sqlalchemy/engine/result that re-implement the cursor endpoints in an IO way. The event system also needs changes. you can see about .01% of this effort at https://bitbucket.org/zzzeek/sqla_async/commits/branch/asyncio, where there's just one commit involving getting just the connection pool to be async. What is nice about asyncio is that most methods don't have to change, at least, which is why this is even possible.

However it is an incredibly massive job, and the point of my blog post was to illustrate that asyncio has basically no advantage at all in talking to relational databases. So the kind of "appreciation" that would be worthwhile here would be kind of like what it took Notch to sell Minecraft to Microsoft :).

2015-06-29T23:08:56+00:00

Fred Cox

Thanks for the update. I can certainly appreciate your reasoning.

Our use case differs from the assumption in your blog article because we can't rely on close/fast connections between DB client and DB. Our system was designed from the start to assume every machine type could live anywhere on the internet using TLS for security. Our typical ping times range from sub-ms to 25ms. Occasionally, our network providers give us horrible service and we get 100s of ms.

Your article was updated to include 5ms ping times, which introduced about 30% I/O wait. Slower latency will obviously cause a higher percentage of I/O wait.

Due to Python's GIL, it is non-trivial to use more than one core, so spending significant time waiting on the network for that 1/8 or less of the CPU capacity is painful.

Our DB clients are API servers for the bulk of our system, so it is handling a large number of connections with infrequent or unpredictable traffic. This is the appropriate case for asyncio.

The current workaround I have is a pool of threads that performs the DB executions. The code contortions are contained behind a set of methods, but it would be nice to be able to do "yield from q.execute()." We used to be able to do the equivalent with gevent, but that is still dealing with the repercussions of Python 2.7.9.

If my application wasn't using asyncio to deal with lots of connections with sporadic traffic, then using threads throughout would be the right choice.

This comment is primarily to show another perspective and perhaps guide other people in ways to deal with this type of problem.

2015-06-30T20:53:20+00:00

Mike Bayer repo owner

Our system was designed from the start to assume every machine type could live anywhere on the internet using TLS for security. Our typical ping times range from sub-ms to 25ms. Occasionally, our network providers give us horrible service and we get 100s of ms.

absolutely, if you are connecting to many relational databases over the internet, then you have the whole set of problems that async polling is good for. But I've never heard of an application that accesses databases using the direct database protocol straight over the internet. Usually if a database is "remote", it puts itself behind a web service in order to expose it to the internet. You're in HTTP / REST and async is then easy again.

Your article was updated to include 5ms ping times, which introduced about 30% I/O wait. Slower latency will obviously cause a higher percentage of I/O wait.

if you want to run the benches with a network connection that is slower, then I'd love to produce a graph of at what point the network latency has the effect that the asyncio and/or gevent systems can move that data into the database faster than threads. I tried for many days to find a combination that showed asyncio coming even close to the threaded approach, and I couldn't, and since then, while lots of people had a whole lot to say about that post (a really, whole lot in fact), nobody has worked with the actual benches to learn more about them (well one person played with them but then he didn't come up with anything new to show). I find that to be very telling about the whole issue. That post began with a lot of work building a real test. I see very few real tests of these things anywhere, especially tests that are extremely comparative at a line-by-line code level like mine. Usually you see two web apps on completely different technologies (e.g. Pyramid vs. Tornado), written completely differently. Comparisons like those tell you almost nothing.

Due to Python's GIL, it is non-trivial to use more than one core, so spending significant time waiting on the network for that 1/8 or less of the CPU capacity is painful.

the GIL explicitly releases while waiting for IO. This is why IO-bound applications aren't hindered by the GIL, only CPU-bound applications. But asyncio uses only one CPU as well. What's the system you have in mind to use multiple cores with asyncio that is somehow different than traditional threads?

Our DB clients are API servers for the bulk of our system, so it is handling a large number of connections with infrequent or unpredictable traffic. This is the appropriate case for asyncio.

Yes it is. Though unusual. The database is a server. If you have a channel with infrequent traffic, why not just use a connection pool? That way you can greatly increase the utility of each connection. DB connections aren't like clients that connect to us and we wait for them to send us an occasional message. Unless by "infrequent traffic" you mean, you're running long SELECT queries against all these databases and spending a long time waiting for results. It sounds like a very unique application.

The current workaround I have is a pool of threads that performs the DB executions. The code contortions are contained behind a set of methods, but it would be nice to be able to do "yield from q.execute()."

asyncio, like any async system, includes a transparent thread pool: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ThreadPoolExecutor. It's just one hop and you're in traditionally threaded code. I think it will run faster than a pure asyncio version.

2015-06-30T21:24:23+00:00

Fred Cox

Further clarification:

Usually if a database is "remote", it puts itself behind a web service in order to expose it to the internet. You're in HTTP / REST and async is then easy again.

The API server I'm referring to is the service in front of the DB. It doesn't happen to primarily speak HTTP/S, partly for pub/sub capability. We also don't require it to be near the DB.

if you want to run the benches with a network connection that is slower, then I'd love to produce a graph of at what point the network latency has the effect that the asyncio and/or gevent systems can move that data into the database faster than threads.

Our desire for asyncio is not about expecting better performance, it's about consistency and maintainability. If we could be sure that threads could support the 10s to 100s of thousands of non-DB connections we expect without blowing up memory with stacks, then we would have used that instead of gevent, and now asyncio.

Of course, we don't want to take a huge hit to performance for no good reason, but the long delay in getting gevent to work with 2.7.9 has made us move to asyncio and 3.4.

the GIL explicitly releases while waiting for IO. This is why IO-bound applications aren't hindered by the GIL, only CPU-bound applications. But asyncio uses only one CPU as well. What's the system you have in mind to use multiple cores with asyncio that is somehow different than traditional threads?

My point was that an unthreaded system that does any blocking will not be using the core it has available to the fullest extent. I see now that you are advocating threads for the DB calls, so that is beside the point.

To get some usage out of the other cores, we do CPU intensive operations in C in threads with the GIL turned off.

Yes it is. Though unusual. The database is a server. If you have a channel with infrequent traffic, why not just use a connection pool?

Our API server acts as the connection pool (using QueuePool) among other things.

asyncio, like any async system, includes a transparent thread pool

As mentioned earlier, that's what we're using now. It's clunkier than yield from.

2015-07-01T01:52:38+00:00

Mike Bayer repo owner

As mentioned earlier, that's what we're using now. It's clunkier than yield from.

So if I understand correctly, it seems like the infrequent/unpredictable connections you refer to are not database connections at all, and there is just some database logic that needs to interact within the same application, and integration of async with a thread pool is deemed too awkward. My take on async is that it is only appropriate when you need to poll lots of arbitrarily slow/sleepy connections and nowhere else. I'd think it's much easier for asyncio to improve it's threading integration rather than forcing the rest of the world to turn itself inside out; for a tool like SQLAlchemy it also means lots of critical patterns aren't even possible (e.g. lazy loading).

My current work with haalchemy combines async and threaded approaches very smoothly using eventlet (which I'll be switching back to gevent once their py3k support lands) and threads at the same time, using the async library for the job of attending to a large number of mostly idle connections and the threads for hitting the databases for status. One might say this is complicated, I don't think it is too bad if one understands the concepts, but IMO the choice to use async is always the more complex choice, as it is only a very special approach for a very particular situation, that of transparent select() polling of many TCP connections, and this approach should not force the rest of the application to look the same way when it isn't necessary.

Unfortunately, when the documentation for asyncio phrases things like "for times when you absolutely, positively have to use a library that makes blocking I/O calls.", as though it's a terrible thing that should be avoided, I think the maintainers of asyncio themselves might be forgetting why we are doing async in the first place, hence the focus of my blog post which attempted to bring back the roots of async programming, that of using select()-style polling across a large number of arbitrarily slow/idle connections and that's pretty much it, contrasted with silly systems like node.js that attempt to make async the solution to everything because javascript happened to be an event-oriented language, because it was written for GUIs.

2015-07-01T04:03:25+00:00

Max Nordlund

First off, sorry for bringing back this old thread back to life. But there has been quite some time and the asyncio library has come quite far.

I agree that using async IO to speed up database operations for an application running in the cloud, with fairly low latency, seems overkill. But as a web developer they are kinda neat. However, you don't want any part of your app to block the event loop, and thus you need all IO to be async in some form or another. Having a thread pool seems like the sweet spot.

Allow me to give a bit of background. I'm coming from node where I wrote a small, proprietary, ORM around 0.8/0.10. I had to use the raw C bindings for SSL support and thus needed to bridge the async and sync worlds. Sadly the bindings where blocking, so it didn't really matter for performance but it did for our sanity.

To do that I used Promises, called Futures in Python. Basically I gave the caller a Promise immediately and, later, I resolved it synchronously on the DB connection side. Then the event loop took over and handed back control to the caller.

I believe something similar can be achieved in Python by using an asyncio-queue and a normal queue to bridge the gap. The trick is to return a Future, and using the synchronous API resolve it. Maybe the async queue isn't even necessary, haven't figured out all the details yet.

If something like this could be included in SQLAlchemy, that would terrific. Perhaps using async with on the connection object to explicitly switch to async mode?

async with engine.connect() as conn:
    # Now we're in async mode, and we can use the
    # __aenter__ and __aexit__ methods for great good
    query = select([users, addresses]).where(users.c.id == addresses.c.user_id)
    result = yield from conn.execute(query)

    async for row in result: # Maybe being async isn't even needed
         pass

2017-04-18T19:21:48+00:00

Mike Bayer repo owner

If something like this could be included in SQLAlchemy, that would terrific. Perhaps using async with on the connection object to explicitly switch to async mode?

I guess you're asking for the "facade that refers to a thread pool" approach. It's not clear to me that this facade wouldn't have much less code than just porting engine/pool/etc. to async versions directly, and also marshalling to/from a thread pool is expensive; you don't want to do this per-operation. This is how twisted's database adapters work and I think it's a terrible approach.

However, the "facade on top of the threadpool" approach at least is something that doesn't need to duplicate the internals, as we see here in aiopg: https://github.com/aio-libs/aiopg/blob/master/aiopg/sa/engine.py#L76 . But that means, the "async threadpool facade" would better be as a separate project. I don't have the resources to develop and maintain it.

2017-04-18T20:44:50+00:00

Max Nordlund

I was thinking about the aiopg approach, just didn't explain good enough I guess.

But that means, the "async threadpool facade" would better be as a separate project. I don't have the resources to develop and maintain it.

Yeah I understand. I think for my purposes the aiopg library is going to be good enough for me, since I'm targeting PostgreSQL anyway. Thanks for the tip.

2017-04-19T19:40:31+00:00

Comments (11)