session.merge already does non-deterministic reconciliation for persistent objects, add this to pending

Issue #3601 resolved
Sheer El Showk created an issue

This bug was discussed on stack overview here:

http://stackoverflow.com/questions/33888539/getting-sqlalchemy-to-do-on-duplicate-key-update-inside-an-orm-cascade-in-mys?noredirect=1#comment55890922_33888539

We create a simple object hierarchy: Groups contain Users and Users have Email addresses. We want the email address to be stored uniquely even if its shared between users. Constructing two users with the same address and using session.merge() to add them has the correct behaviour (the same key is reused and no error is thrown). If, on the other hand, we add the two users to a group and then use session.merge() on the group instead them the two identical addresses lead to a unique key exception on address (due to an insert many).

Here is the relevant code:

from sqlalchemy import create_engine, Column, types
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, scoped_session
from sqlalchemy.orm import Session

from sqlalchemy import ForeignKey
from sqlalchemy.orm import relationship, backref


engine = create_engine('sqlite:///:memory:', echo=False)
Base = declarative_base()

session = scoped_session(sessionmaker(bind=engine))

class Group(Base):
    __tablename__ = "groups"
    gid = Column(types.Integer, primary_key=True)
    name = Column(types.String(255))

    users = relationship("User", backref="group")

    def __repr__(self):
        ret =  "Group(name=%r)" % self.name
        for user in self.users:
            ret += str(user)

class User(Base):
    __tablename__ = "users"
    login = Column(types.String(50), primary_key=True)
    name = Column(types.String(255))
    group_id = Column(types.Integer, ForeignKey('groups.gid'))
    address = Column(types.String(200), 
                            ForeignKey('addresses.email_address'))
    email = relationship("Address")

    def __repr__(self):
        return "User(login=%r, name=%r)\n%s" % (self.login, self.name,
                str(self.email))


class Address(Base):
    __tablename__ = 'addresses'
    email_address = Column(types.String(200), nullable=False, primary_key=True)
    #user_login = Column(types.String(50), ForeignKey('users.login'))

    def __repr__(self):
        return "<Address(email_address='%s')>" % self.email_address


Base.metadata.create_all(engine)

if __name__ == '__main__':

    # this works correctly even though we reuse a unique key
    u1 = User(login='Guy', name="Some Guy")
    u1.email=Address(email_address='nameless@yahoo.com')
    u2 = User(login='Gal', name="Some Gal")
    u2.email=Address(email_address='nameless@yahoo.com')
    session.merge(u1) 
    session.merge(u2) 
    session.commit()

    print("two users with addresses")
    for u in session.query(User):
        print(u)


    # though this is similar it ends up using insertmany and throws a unique key
    # constraint even with the merge
    u3 = User(login='Mr. User', name="A dude")
    u3.email=Address(email_address='james@yahoo.com')
    u4 = User(login='Mrs. User', name="A dudette")
    u4.email=Address(email_address='jill@yahoo.com')
    u5 = User(login='Mrs. User2', name="A dudette2")
    u5.email=Address(email_address='jill@yahoo.com')
    g1 = Group(name="G1")
    g1.users.append(u3)
    g1.users.append(u4)
    g1.users.append(u5)
    session.merge(g1) 
    session.commit()

    print(g1)

Comments (8)

  1. Mike Bayer repo owner

    here's the problem with this scenario.

    suppose Address has an extra field on it:

    class Address(Base):
        __tablename__ = 'addresses'
        email_address = Column(types.String(200), nullable=False, primary_key=True)
        some_data = Column(types.String(50))
    

    now I run this:

        u1 = User(login='Guy', name="Some Guy")
        u1.email = Address(email_address='nameless@yahoo.com', some_data='X')
        u2 = User(login='Gal', name="Some Gal")
        u2.email = Address(email_address='nameless@yahoo.com', some_data='Y')
    
        g1 = Group(users=[u1, u2])
        session.merge(g1)
        session.flush()
    

    what is in the "addresses" table after the above runs? If the answer is, "it's random", then this use case does not make sense.

  2. Mike Bayer repo owner

    basically what this issue is asking for is that as we proceed with a merge, we make a local identity map of all the objects we've seen so far, so that the two Address objects above would be merged, and then the "some_data" fields just collide and whichever one is last wins. This might be fine, perhaps as an option so that it doesn't surprise those who expect the above to raise an error.

  3. Sheer El Showk reporter

    While I appreciate the problem you're bringing up I think the core of the issue is the different behaviour in doing:

    session.merge(u1) 
    session.merge(u2) 
    session.commit()
    

    and

    session.merge(g1) 
    session.commit()
    

    I'm not sure what would happen in the first case if you add an extra field but the point is that sqlalchemy behaves differently in the two cases above and this really has nothing to do with the ambiguity you're bringing up.

  4. Mike Bayer repo owner
    • The :meth:.Session.merge method now tracks pending objects by primary key before emitting an INSERT, and merges distinct objects with duplicate primary keys together as they are encountered, which is essentially semi-deterministic at best. This behavior matches what happens already with persistent objects. fixes #3601

    → <<cset 3ec9b9f6b601>>

  5. Mike Bayer repo owner

    thanks for reporting! this is a major behavior change so is for 1.1. In 1.0, you need to uniquify on the outside, use an approach similar to the following:

    def uniquer():
        collection = {}
    
        def go(obj):
            if obj.email_address in collection:
                return collection[obj.email_address]
            else:
                collection[obj.email_address] = obj
        return go
    
    
    if __name__ == '__main__':
    
        unique_address = uniquer()
    
        u1 = User(login='Guy', name="Some Guy")
        u1.email = unique_address(Address(email_address='nameless@yahoo.com', some_data='X'))
        u2 = User(login='Gal', name="Some Gal")
        u2.email = unique_address(Address(email_address='nameless@yahoo.com', some_data='Y'))
    
        g1 = Group(users=[u1, u2])
        session.merge(g1)
        session.flush()
    
  6. Mike Bayer repo owner

    with the case you bring up with the individual merge() of u1 and u2 vs. the merge() of the group, those cases are different; if I sent a set of data A to merge(), then a set of data B to merge(), I'd totally expect the second call to merge() to overwrite anything in the first call. The second case is that there is an implicit overwriting of data within a single call and cannot be guaranteed to be deterministic, that is, the order in which it encounters objects is arbitrary. However, the already-persistent case already behaves this way, so making the change here is more consistent. Though ideally a warning should be emitted if data within a single merge() call is being overwritten by other data in the same call; that's not a trivial feature to add.

  7. Log in to comment