bulk_insert_mappings does not group inserts properly.

Issue #3940 resolved
Livio Bieri created an issue

Summary

bulk_insert_mappings does not group inserts properly.

Example

from sqlalchemy import Column, VARCHAR, SmallInteger, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import Session

Base = declarative_base()

dburl = 'mysql://root:root@127.0.0.1:3306/dirtyfeets'
engine = create_engine(dburl, echo=True)


class Whatever(Base):
    __tablename__ = 'whatever'

    id = Column('id', SmallInteger, nullable=False, index=True, primary_key=True)
    foo = Column(VARCHAR(255), nullable=False)
    bar = Column(VARCHAR(255), nullable=False)


if __name__ == "__main__":
    mapping = [{'id': 1, 'foo': '_', 'bar': '_'},
               {'id': 2, 'bar': '_'},
               {'id': 3, 'foo': '_', 'bar': '_'},
               {'id': 4, 'bar': '_'},
               {'id': 5, 'foo': '_', 'bar': '_'},
               {'id': 6, 'bar': '_'},
               {'id': 7, 'foo': '_', 'bar': '_'},
               {'id': 8, 'bar': '_'},
               {'id': 9, 'foo': '_', 'bar': '_'}]

    session = Session(bind=engine)
    session.bulk_insert_mappings(Whatever, mapping)
    session.commit()

Actual

017-03-16 16:26:41,010 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:26:41,011 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = 'utf8' and `Collation` = 'utf8_bin'
2017-03-16 16:26:41,011 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:26:41,013 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR(60)) AS anon_1
2017-03-16 16:26:41,013 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:26:41,014 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR(60)) AS anon_1
2017-03-16 16:26:41,014 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:26:41,015 INFO sqlalchemy.engine.base.Engine SELECT CAST('test collated returns' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin AS anon_1
2017-03-16 16:26:41,016 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:26:41,018 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2017-03-16 16:26:41,019 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, foo, bar) VALUES (%s, %s, %s)
2017-03-16 16:26:41,019 INFO sqlalchemy.engine.base.Engine (1, '_', '_')
2017-03-16 16:26:41,020 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, bar) VALUES (%s, %s)
2017-03-16 16:26:41,021 INFO sqlalchemy.engine.base.Engine (2, '_')
2017-03-16 16:26:41,021 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, foo, bar) VALUES (%s, %s, %s)
2017-03-16 16:26:41,022 INFO sqlalchemy.engine.base.Engine (3, '_', '_')
2017-03-16 16:26:41,022 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, bar) VALUES (%s, %s)
2017-03-16 16:26:41,023 INFO sqlalchemy.engine.base.Engine (4, '_')
2017-03-16 16:26:41,023 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, foo, bar) VALUES (%s, %s, %s)
2017-03-16 16:26:41,023 INFO sqlalchemy.engine.base.Engine (5, '_', '_')
2017-03-16 16:26:41,024 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, bar) VALUES (%s, %s)
2017-03-16 16:26:41,024 INFO sqlalchemy.engine.base.Engine (6, '_')
2017-03-16 16:26:41,025 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, foo, bar) VALUES (%s, %s, %s)
2017-03-16 16:26:41,025 INFO sqlalchemy.engine.base.Engine (7, '_', '_')
2017-03-16 16:26:41,026 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, bar) VALUES (%s, %s)
2017-03-16 16:26:41,026 INFO sqlalchemy.engine.base.Engine (8, '_')
2017-03-16 16:26:41,027 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, foo, bar) VALUES (%s, %s, %s)
2017-03-16 16:26:41,027 INFO sqlalchemy.engine.base.Engine (9, '_', '_')
2017-03-16 16:26:41,028 INFO sqlalchemy.engine.base.Engine COMMIT

Expected

EXPECTED LOG, two inserts for each group:
2017-03-16 16:31:57,187 INFO sqlalchemy.engine.base.Engine SHOW VARIABLES LIKE 'sql_mode'
2017-03-16 16:31:57,187 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:31:57,191 INFO sqlalchemy.engine.base.Engine SELECT DATABASE()
2017-03-16 16:31:57,191 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:31:57,193 INFO sqlalchemy.engine.base.Engine show collation where `Charset` = 'utf8' and `Collation` = 'utf8_bin'
2017-03-16 16:31:57,193 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:31:57,195 INFO sqlalchemy.engine.base.Engine SELECT CAST('test plain returns' AS CHAR(60)) AS anon_1
2017-03-16 16:31:57,196 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:31:57,198 INFO sqlalchemy.engine.base.Engine SELECT CAST('test unicode returns' AS CHAR(60)) AS anon_1
2017-03-16 16:31:57,198 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:31:57,200 INFO sqlalchemy.engine.base.Engine SELECT CAST('test collated returns' AS CHAR CHARACTER SET utf8) COLLATE utf8_bin AS anon_1
2017-03-16 16:31:57,200 INFO sqlalchemy.engine.base.Engine ()
2017-03-16 16:31:57,204 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2017-03-16 16:31:57,205 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, bar) VALUES (%s, %s)
2017-03-16 16:31:57,205 INFO sqlalchemy.engine.base.Engine ((2, '_'), (4, '_'), (6, '_'), (8, '_'))
2017-03-16 16:31:57,208 INFO sqlalchemy.engine.base.Engine INSERT INTO whatever (id, foo, bar) VALUES (%s, %s, %s)
2017-03-16 16:31:57,208 INFO sqlalchemy.engine.base.Engine ((1, '_', '_'), (3, '_', '_'), (5, '_', '_'), (7, '_', '_'), (9, '_', '_'))
2017-03-16 16:31:57,210 INFO sqlalchemy.engine.base.Engine COMMIT

For me it looks like it's just a simple problem where the insert interable is groupby-ed before it was sorted:

I fixed the problem in a commit but unfortunately this breaks 4 tests. I'd love to submit a PR that fixes this issue properly (assuming this behaviour is a bug) but I'd need some help to understand why the tests fail.

Failing Tests

================================================================================ FAILURES =================================================================================
__________________________________________________________________ AttachedFileShardTest.test_roundtrip ___________________________________________________________________
[gw2] darwin -- Python 2.7.13 /Users/livio/Fashwell/karl-server/src/env/bin/python
Traceback (most recent call last):
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/ext/test_horizontal_shard.py", line 170, in test_roundtrip
    'Asia', 'Tokyo')])
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/../lib/sqlalchemy/testing/assertions.py", line 214, in eq_
    assert a == b, msg or "%r != %r" % (a, b)
AssertionError: [(5, u'Asia', u'Tokyo')] != [(1, 'Asia', 'Tokyo')]
_________________________________________________________________ DistinctEngineShardTest.test_roundtrip __________________________________________________________________
[gw2] darwin -- Python 2.7.13 /Users/livio/Fashwell/karl-server/src/env/bin/python
Traceback (most recent call last):
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/ext/test_horizontal_shard.py", line 172, in test_roundtrip
    'North America', 'New York'), (3, 'North America', 'Toronto')])
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/../lib/sqlalchemy/testing/assertions.py", line 214, in eq_
    assert a == b, msg or "%r != %r" % (a, b)
AssertionError: [(4, u'North America', u'New York'), (5, u'North America', u'Toronto')] != [(2, 'North America', 'New York'), (3, 'North America', 'Toronto')]
_________________________________________________________________________ DefaultTest.test_insert _________________________________________________________________________
[gw0] darwin -- Python 2.7.13 /Users/livio/Fashwell/karl-server/src/env/bin/python
Traceback (most recent call last):
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/orm/test_unitofwork.py", line 937, in test_insert
    eq_(h1.hoho, althohoval)
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/../lib/sqlalchemy/testing/assertions.py", line 214, in eq_
    assert a == b, msg or "%r != %r" % (a, b)
AssertionError: u'im hoho' != 'im different hoho'
_________________________________________________________________ BatchInsertsTest.test_batch_interaction _________________________________________________________________
[gw0] darwin -- Python 2.7.13 /Users/livio/Fashwell/karl-server/src/env/bin/python
Traceback (most recent call last):
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/orm/test_unitofworkv2.py", line 1700, in test_batch_interaction
    {'data': 't11', 'id': 11}
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/../lib/sqlalchemy/testing/assertions.py", line 489, in assert_sql_execution
    asserter.assert_(*rules)
  File "/Users/livio/Desktop/github-sqlalchem/sqlalchemy/test/../lib/sqlalchemy/testing/assertsql.py", line 331, in assert_
    assert False, rule.errormessage
AssertionError: Testing for compiled statement 'INSERT INTO t (id, data) VALUES (:id, :data)' partial params [{'data': 't3', 'id': 3}, {'data': 't4', 'id': 4}, {'data': 't5', 'id': 5}], received u'INSERT INTO t (id, data) VALUES (:id, lower(:lower_1))' with params [{u'lower_1': 't6', 'id': 6}]
=========================================================== 4 failed, 7226 passed, 844 skipped in 84.29 seconds ===========================================================

Let my know how I can help and if this is actually a bug!

Comments (4)

  1. Mike Bayer repo owner

    hi there -

    unfortunately, the expected behavior is that the INSERT of rows proceeds in the exact order in which data records were given. This is critical so that the insert ordering of tables can be maintained.

    In order to reap the performance benefits of a bulk insert when you don't have the same keys available, make sure you pass None for those keys that aren't present, or otherwise if those columns have server-side defaults, do the sorting up front on your end. Since we are talking strictly performance optimization here, it is appropriate that those cases that require special sorting be done on the outside so that the expense of this sorting is not passed onto all users, but also the maintenance of insert order is a strict requirement in any case.

  2. Mike Bayer repo owner

    this is also documented behavior:

    http://docs.sqlalchemy.org/en/latest/orm/persistence_techniques.html#bulk-operations

    The set of objects passed to all bulk methods are processed in the order they are received. In the case of Session.bulk_save_objects(), when objects of different types are passed, the INSERT and UPDATE statements are necessarily broken up into per-type groups. In order to reduce the number of batch INSERT or UPDATE statements passed to the DBAPI, ensure that the incoming list of objects are grouped by type.

  3. Log in to comment