Support psycopg2 fast execution helpers as dialect options
Hello,
Thanks for the fantastic job you did on this library.
I recently came across a shortcoming of the psycopg2 library: apparently its executemany() function is not optimized at all and just does a loop over execute().
For SQLAlchemy, this means that your recommended way of doing massive INSERT/UPDATE statements is actually really slow when working with psycopg2.
For me, the difference between using execute(insert(), ..) or bulk_insert_mappings (which, in the end, just execute a ton of single INSERT statements) and using .insert().values() (which constructs a single INSERT ... VALUES statement) was like night and day.
Maybe you could add a warning or note regarding the use of psycopg2 in the relevant section of your docs (for instance here: http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)?
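As a minimal sketch of the difference described above (the table and rows are made up purely for illustration), compiling the two statement forms against the PostgreSQL dialect shows what psycopg2 would actually receive:

```python
from sqlalchemy import Column, Integer, MetaData, String, Table
from sqlalchemy.dialects import postgresql

# Hypothetical table, purely for illustration
metadata = MetaData()
my_table = Table(
    "my_table", metadata,
    Column("id", Integer, primary_key=True),
    Column("data", String),
)

rows = [{"data": "row %d" % i} for i in range(3)]

# "executemany" form: one template statement; psycopg2 loops over the
# parameter sets, emitting one INSERT per row
sql_many = str(my_table.insert().compile(dialect=postgresql.dialect()))

# multi-VALUES form: every row is inlined into a single statement
sql_values = str(
    my_table.insert().values(rows).compile(dialect=postgresql.dialect())
)

print(sql_many)
print(sql_values)
```

The second string contains one parenthesized parameter tuple per row, so the whole batch goes to the server as a single round trip.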
Comments (13)
-
repo owner -
Account Deactivated reporter Correct, which is why I always chunk my list to a reasonable size when creating a SQL statement. In my case (inserting 80k rows with 6 columns), I went from an insert of more than 10 minutes with executemany to merely a few seconds using the VALUES-all-in-one-INSERT approach. I still think it's worth mentioning that, in certain use cases, and given that you know what you are doing, not using executemany is actually good.
Anyways, it's up to you. Once I suspected that something fishy was going on (I mean, 10 minutes to insert merely 80k rows is a lot), I found the bug quite rapidly on Stack Overflow, and it actually helped me better understand how your library works :) Best,
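The chunking technique the reporter describes might look roughly like this sketch; the helper names, chunk size, and the connection/table passed in are illustrative assumptions, not anything from the library:

```python
from itertools import islice

def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            break
        yield chunk

def insert_in_chunks(conn, table, rows, size=1000):
    # One multi-VALUES INSERT per chunk keeps each statement a
    # manageable size while avoiding per-row round trips
    for chunk in chunked(rows, size):
        conn.execute(table.insert().values(chunk))
```

This keeps the statement-size benefits of multi-VALUES inserts without building one enormous SQL string for the entire dataset.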
-
repo owner OK, so that is a much more elaborate documentation note, e.g. "chunked INSERT..VALUES". It would be nice if psycopg2's wiki or something had some information about this that we could just link to; this side of things is more their support issue.
-
repo owner also, 80K rows taking 10 minutes still seems extremely long; a psycopg2 executemany() shouldn't be that slow unless you have giant binary values in these rows.
-
Account Deactivated reporter yes, I agree. Using pg_stat_activity, I know executemany was doing single-row INSERTs, but that still seems too slow. I am looking into my server configuration, trying to understand what the bottleneck was.
-
repo owner I'm able to insert 80000 rows in 10 seconds on a relatively low-powered laptop, with the PG database on the same machine. Are you sure you don't have a network latency problem? See below.
```python
from sqlalchemy import *
from sqlalchemy.orm import *
from sqlalchemy.ext.declarative import declarative_base
import random

Base = declarative_base()

class A(Base):
    __tablename__ = 'a'
    id = Column(Integer, primary_key=True)
    x = Column(Integer)
    y = Column(Integer)
    z = Column(Integer)
    p = Column(Integer)
    q = Column(Integer)
    r = Column(Integer)

e = create_engine("postgresql://scott:tiger@localhost/test")
Base.metadata.drop_all(e)
Base.metadata.create_all(e)

import time
now = time.time()
with e.connect() as conn:
    conn.execute(
        A.__table__.insert(),
        [
            {
                "x": random.randint(1, 1000),
                "y": random.randint(1, 1000),
                "z": random.randint(1, 1000),
                "p": random.randint(1, 1000),
                "q": random.randint(1, 1000),
                "r": random.randint(1, 1000),
            }
            for i in xrange(80000)
        ]
    )
    total_rows = conn.scalar(select([func.count(A.__table__.c.id)]))
total = time.time() - now
print "Inserted %d in %d sec" % (total_rows, total)
```
Inserted 80000 in 10 sec
-
Account Deactivated reporter 5 sec on my laptop
still not done on my remote DB (more than 2 minutes now...). Again, I see single INSERTs.
I really think the problem is just that psycopg2 does 80k single INSERT statements, which takes a long time since there is a bit of latency between me and my server (but nothing out of the ordinary). EDIT: I measured around 28ms per INSERT (tested from a SQL console). This would explain the long runtime: 28ms * 80k INSERTs = 2240s (~38 minutes).
-
repo owner over a wifi connection here to a local network server, the script above takes 3.2 minutes, which I agree is bad. Switching it to point to a MariaDB database on the same server, using either a C- or Python-based MySQL driver, takes at most 6 seconds, and that is an InnoDB table too. It's using the same methodology, i.e. 80K INSERT statements. Not really sure what's up. It feels wrong that the solution to this pretty broad issue is "SQLAlchemy docs should hardcode an awkward workaround for one specific database in the FAQ". Inserting lots of rows is going to have lots of performance issues on many platforms for many reasons. I'm not sure why this one should get special treatment.
-
Account Deactivated reporter Yes, I understand your point of view, and as I said, if you know what you are doing, you will see on your own that something wrong is going on anyway. I will raise an issue with psycopg2, but I think they have been aware of it for a long time.
Thank you for your time in any case :) -
repo owner I'm going to look out on the psycopg2 list for this conversation. I don't follow them much, but I'm sure this issue has been raised before. I was hoping they'd have some guidance at http://initd.org/psycopg/docs/faq.html#best-practices. If you can get them to add details there, we can link to it.
-
Account Deactivated reporter psycopg2 related issue here: https://github.com/psycopg/psycopg2/issues/491#issuecomment-264404522
-
repo owner - changed component to postgres
- changed title to Support psycopg2 fast execution helpers as dialect options
- changed milestone to 1.3
- removed responsible
psycopg2 has added the performance speedups as "fast execution helpers". We can use these inside of dialect.do_executemany() based on options passed to the dialect (e.g. via create_engine()), as well as via execution options. (The "Server side cursors" example should be used as a guide for an option that is enabled both via dialect-level flags and via execution options.) The "supports_sane_multirowcount" flag also must be unset appropriately for these methods, since per the discussion one (or both?) of them aren't going to report "rowcount" accurately, though their new docs don't indicate this.
-
repo owner execution options would be postgresql_batch_executemany, postgresql_values_executemany.
-
psycopg2's executemany() should still be a lot faster than a loop of execute() calls. Also, psycopg2's problem here is that it doesn't use prepared statements internally, which would help a lot. The VALUES-all-in-one-INSERT approach doesn't scale arbitrarily high: the SQL string and parameters can only be so large on the server side and have to be fully sent over the wire and parsed in memory all at once, so I'd hardly call it "correct" in the generalized sense.