Support psycopg2 fast execution helpers as dialect options

Issue #3864 new
charles pierre created an issue

Hello,

Thanks for the fantastic job you did on this library.

I recently came across a shortcoming of the current psycopg2 library: apparently its executemany() function is not optimized at all and just does a loop over execute().

With regard to SQLAlchemy, this means that your recommended way of doing massive INSERT/UPDATE statements is actually really slow when working with psycopg2.

For me, the difference between using execute(insert(), ...) or bulk_insert_mappings (which, in the end, just execute tons of single INSERT statements) and doing .insert().values() (which constructs a single INSERT ... VALUES statement) was like night and day.
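
To make the contrast concrete, here is a minimal sketch (not from the original report) using a hypothetical one-column table and a placeholder connection URL; the multi-parameter execute() form goes through the driver's executemany(), while passing the row list to values() renders one multi-row INSERT ... VALUES statement:

    from sqlalchemy import Column, Integer, MetaData, Table, create_engine

    metadata = MetaData()
    # hypothetical table used only for illustration
    t = Table("a", metadata,
              Column("id", Integer, primary_key=True),
              Column("x", Integer))

    e = create_engine("postgresql://scott:tiger@localhost/test")
    metadata.create_all(e)
    rows = [{"x": n} for n in range(10000)]

    with e.connect() as conn:
        # "executemany" form: one INSERT statement, executed once per
        # parameter set; with psycopg2 this is essentially a loop of execute()
        conn.execute(t.insert(), rows)

        # "multi-VALUES" form: a single INSERT ... VALUES (...), (...), ...
        # statement carrying every row in one round trip
        conn.execute(t.insert().values(rows))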

Maybe you could add a warning or note about this in the relevant section of your docs (for instance, here: http://docs.sqlalchemy.org/en/latest/faq/performance.html#i-m-inserting-400-000-rows-with-the-orm-and-it-s-really-slow)?

Comments (13)

  1. Mike Bayer repo owner

    psycopg2 executemany() should still be lots faster than a loop of execute() calls. Also, psycopg2's problem here is that it doesn't use prepared statements internally, which would help a lot. The VALUES-all-in-one-INSERT approach doesn't scale arbitrarily high: the SQL string and parameters can only be so large on the server side and have to be fully sent over the wire and parsed in memory all at once, so I'd hardly call it "correct" in the generalized sense.

  2. charles pierre reporter

    Correct, which is why I always chunk my list to a reasonable size when building the SQL statement (see the sketch below). In my case (inserting 80k rows with 6 columns), I went from more than 10 minutes of inserting with executemany to a mere few seconds using the VALUES-all-in-one-INSERT approach. I still think it's worth mentioning that, in certain use cases, and provided you know what you are doing, not using executemany is actually a good idea.
    Anyway, it's up to you. Once I suspected that something fishy was going on (I mean, 10 minutes to insert merely 80k rows is a lot), I found the issue quite rapidly on Stack Overflow, and it actually helped me better understand how your library works :)

    Best,
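
    A minimal sketch of the chunking approach described above, reusing the table t, engine e, and rows list from the earlier sketch (the chunk size of 1000 is an arbitrary illustration, not a recommendation):

    def chunked(rows, size):
        # yield successive slices of at most `size` parameter dicts
        for start in range(0, len(rows), size):
            yield rows[start:start + size]

    with e.connect() as conn:
        for chunk in chunked(rows, 1000):
            # each chunk becomes one INSERT ... VALUES (...), (...), ...
            # statement, keeping the SQL string a manageable size
            conn.execute(t.insert().values(chunk))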

  3. Mike Bayer repo owner

    OK, so that is a much more elaborate documentation note, e.g. "chunked INSERT..VALUES". It would be nice if psycopg2's wiki or something had some information about this that we could just link to. This side of things is more their support issue.

  4. Mike Bayer repo owner

    Also, 80k rows taking 10 minutes still seems extremely long; a psycopg2 executemany() shouldn't be that slow unless you have giant binary values in these rows.

  5. charles pierre reporter

    Yes, I agree. Using 'pg_stat_activity', I can see executemany was doing single-row INSERTs, but that still seems too slow. I am looking into my server configuration, trying to understand where the bottleneck is.

  6. Mike Bayer repo owner

    I'm able to insert 80000 rows in 10 seconds on a relatively low-powered laptop, with the PG database on the same machine. Are you sure you don't have a network latency problem? See below.

    from sqlalchemy import *
    from sqlalchemy.orm import *
    from sqlalchemy.ext.declarative import declarative_base
    import random
    
    
    Base = declarative_base()
    
    
    class A(Base):
        __tablename__ = 'a'
        id = Column(Integer, primary_key=True)
        x = Column(Integer)
        y = Column(Integer)
        z = Column(Integer)
        p = Column(Integer)
        q = Column(Integer)
        r = Column(Integer)
    
    e = create_engine("postgresql://scott:tiger@localhost/test")
    Base.metadata.drop_all(e)
    Base.metadata.create_all(e)
    
    import time
    
    now = time.time()
    
    with e.connect() as conn:
        # multi-parameter form: SQLAlchemy passes the parameter list
        # to the DBAPI cursor.executemany()
        conn.execute(
            A.__table__.insert(),
            [
                {
                    "x": random.randint(1, 1000),
                    "y": random.randint(1, 1000),
                    "z": random.randint(1, 1000),
                    "p": random.randint(1, 1000),
                    "q": random.randint(1, 1000),
                    "r": random.randint(1, 1000),
                } for i in range(80000)
            ]
        )
    
        total_rows = conn.scalar(select([func.count(A.__table__.c.id)]))
    total = time.time() - now
    
    print("Inserted %d in %d sec" % (total_rows, total))
    

    Inserted 80000 in 10 sec

  7. charles pierre reporter

    5 seconds on my laptop;
    still not done on my remote DB (more than 2 minutes now...). Again, I see single INSERTs.
    I really think the problem is just psycopg2 issuing 80k single INSERT statements, which takes a long time since there is a bit of latency between me and my server (but nothing out of the ordinary).

    EDIT: I have around 28ms per INSERT (tested from a SQL console). This would explain the long run time: 28ms * 80k INSERTs = 2240s (~37 minutes)

  8. Mike Bayer repo owner

    Over a wifi connection here to a local network server, the script above takes 3.2 minutes, which I agree is bad. Switching it to point to a MariaDB on the same server, using either a C- or Python-based MySQL driver, takes at most 6 seconds, and that is an InnoDB table too. It's using the same methodology, e.g. 80k INSERT statements. Not really sure what's up. It feels wrong that the solution to this pretty broad issue is "the SQLAlchemy docs should hardcode an awkward workaround for one specific database in the FAQ". Inserting lots of rows is going to have lots of performance issues on many platforms for many reasons. I'm not sure why this one should get special treatment.

  9. charles pierre reporter

    Yes, I understand your point of view, and as I said, if you know what you are doing you will notice on your own that something is wrong anyway. I will raise an issue with psycopg2, but I think they have been aware of it for a long time.
    Thank you for your time in any case :)

  10. Mike Bayer repo owner

    psycopg2 has added the performance speedups as fast execution helpers. We can use these inside of dialect.do_executemany() based on options passed to the dialect (e.g. via create_engine()), as well as via execution options. (The example of "Server side cursors" should be used as a guide for an option that is enabled both via dialect-level flags and via execution options.) The "supports_sane_multirowcount" flag also must be unset appropriately for these methods; as per the discussion, one (or both?) of them aren't going to report "rowcount" accurately, though their new docs don't indicate this.
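
    For context, the fast execution helpers referred to here are psycopg2.extras.execute_batch() and psycopg2.extras.execute_values(). A minimal sketch of their raw (non-SQLAlchemy) usage, assuming a table a(x, y) and placeholder connection details:

    import psycopg2
    from psycopg2.extras import execute_batch, execute_values

    conn = psycopg2.connect("dbname=test user=scott password=tiger host=localhost")
    rows = [(n, n * 2) for n in range(80000)]

    with conn, conn.cursor() as cur:
        # execute_batch(): packs many INSERT statements into far fewer
        # round trips (page_size statements per network call)
        execute_batch(cur, "INSERT INTO a (x, y) VALUES (%s, %s)", rows, page_size=100)

        # execute_values(): renders a single INSERT ... VALUES with up to
        # page_size rows per statement; the %s is replaced by the row values
        execute_values(cur, "INSERT INTO a (x, y) VALUES %s", rows, page_size=1000)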

  11. Mike Bayer repo owner

    Execution options would be postgresql_batch_executemany and postgresql_values_executemany.
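
    Purely as a sketch of how this proposal might look from the user's side (the option name below is the proposal from this comment, not a shipped API, and t and rows are reused from the earlier sketches):

    from sqlalchemy import create_engine

    e = create_engine("postgresql+psycopg2://scott:tiger@localhost/test")

    with e.connect() as conn:
        # hypothetical per-statement execution option that would select
        # psycopg2's execute_batch() inside dialect.do_executemany()
        batched = conn.execution_options(postgresql_batch_executemany=True)
        batched.execute(t.insert(), rows)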
