Add a way to fetch a model or create one if it doesn't exist

Issue #3942 wontfix
Markus Meskanen
created an issue

Django has a get_or_create() method for times when you don't know if data for an entity already exists. There are multiple implementations available for SQLAlchemy at http://stackoverflow.com/questions/2546207/does-sqlalchemy-have-an-equivalent-of-djangos-get-or-create/, but I thought it would be nice to have a native support for such widely used feature.

I'm not sure of the ideal implementation, but one that pleases my eye is to add a keyword arugment for one() etc:

session.query(Player).filter_by(id=my_player_id).one(create=True)

Or simply add a new method:

session.query(Player).filter_by(id=my_player_id).get_or_create()

Comments (16)

  1. Michael Bayer repo owner

    we have this. it's called merge().

    p1 = Player(id=some_id)
    p1 = session.merge(p1)
    

    give that a try and reopen if you can describe how this doesn't suit the "get or create" model.

  2. Markus Meskanen reporter
    • changed status to open

    http://stackoverflow.com/questions/2546207/does-sqlalchemy-have-an-equivalent-of-djangos-get-or-create/42879583?noredirect=1#comment72884826_42879583

    merge() only works if you already know in advance the primary key... if you're trying to upsert/get_or_create based on a Column with a unique constraint, you'll end up with an IntegrityError, since it'll not actually merge your new instance with the existing record. See stackoverflow.com/a/10561643/293735

    This seems like a valid complaint, although not something I would've thought of myself.

  3. Michael Bayer repo owner

    OK so let's talk about "create".

    First, tell me what you don't like about this pattern:

    p1 = session.query(Player).filter(<whatever unique thing you are looking at >).one() or Player()
    

    where I'm going is I'm guessing you want it to do the "Player()" part for you, and that's a problem. but let me know what else is wrong with the above (as well as making yourself a "def get_or_create()" that does that one liner).

  4. Markus Meskanen reporter

    I don't think you can one-line it that easily, since we're still missing session.add() to the newly created Player object, right? So we'd end up with more lines anyways, at which point it's arguably better to just create an utility function for it:

    def get_or_create(session, model, filters, kwargs):
        p1 = session.query(Player).filter(*filters).one_or_none()
        if p1 is None:
             p1 = Player(**kwargs)
            session.add(p1)
        return p1
    
    player = get_or_create(session, Player, Player.userid==5, {'name': 'Markus'})
    

    And this seems like a very common idiom, which is why I believe a built-in method would make a lot of sense. And I can't really find any downside to adding a method like get_or_create() which would make the above much prettier imo.:

    player = session.query(Player).filter(Player.userid==5).get_or_create(name='Markus')
    
  5. Michael Bayer repo owner

    "get or create" is a common idiom but the specifics IMO are not common at all. For example, the above idiom is extremely inefficient if I need to get_or_create() 50 Player objects. It's vastly better to query for all of them at once and operate on the ones missing. the pattern is just too specific to what an application needs to do for the ORM to include a "one size fits all" API for it. your get_or_create() above for example will fail if the constructor of the object has positional arguments. It's also awkward that I need to pass all the exact constructor arguments for a particular kind of object to this other method, instead of just instantiating the object. Also, why wouldn't my constructor arguments be divined from what I've put in the filter()? If I add this feature it will be about 30 seconds before someone asks for that too, because they don't want to say get_or_create(MyModel, {"x_axis": 10}, x_axis=10).

    not sure if you've seen this but we have one particular variety of get_or_create on the wiki at https://bitbucket.org/zzzeek/sqlalchemy/wiki/UsageRecipes/UniqueObject . It's a recipe so that people can tailor it as needed to specific cases.

  6. Markus Meskanen reporter

    For example, the above idiom is extremely inefficient if I need to get_or_create() 50 Player objects. It's vastly better to query for all of them at once and operate on the ones missing.

    You obviously wouldn't use this for that scenario then...? :P

    your get_or_create() above for example will fail if the constructor of the object has positional arguments.

    The self made function which I use in my own project will, but the latter example of what I'm hoping SQLAlchemy would have won't fail. And even that can be further improvised, they're just quick examples I scrapped to get the idea :) If I knew right away how to do it perfectly, I would've just pull requested ^^

    It's also awkward that I need to pass all the exact constructor arguments for a particular kind of object to this other method, instead of just instantiating the object.

    I suppose you're again referring to the self made function, not the one I'm hoping SQLAlchemy had? If not, how is get_or_create(name='Markus') any more "awkward" than Player(name='Markus')?

    Also, why wouldn't my constructor arguments be divined from what I've put in the filter()?

    You might be right about this, but I find it to make much more sense to have the two separate from each other, just to generalize the usage. And like I said, this was just a quick example I scrapped together :)

    If I add this feature it will be about 30 seconds before someone asks for that too, because they don't want to say get_or_create(MyModel, {"x_axis": 10}, x_axis=10).

    Now that is where you can answer with the "one size fits all" API argument, imo. :) And even without the support for forwarding the fields to the constructer, it would be shorter to write:

    player = session.query(Player).filter(Player.userid==5).get_or_create(userid=5)
    

    Than the existing:

    player = session.query(Player).filter(Player.userid==5).one_or_none()
    if player is None:
        player = Player(userid=5)
        session.add(player)
    

    Like I said, I don't see any downside to adding this. Surely it won't magically fix every bug ever written, and it doesn't suit querying 50 different objects, but use the right tool for the right job. And imo this tool would be nice to have in my toolset :)

  7. Michael Bayer repo owner

    You obviously wouldn't use this for that scenario then...?

    you might think that, but in reality you will see:

    for rec in my_list_of_50_recs:
    list.append(query.get_or_create(...))

    then they will complain SQLAlchemy is slow and has no way to make the above operation efficient in batch...because we've set up an expectation of "batteries included". SQLAlchemy is not really a "batteries included" kind of library. it is a foundational library. it very much is about people building on top of it what they want and that's what all my talks are about.

    but the latter example of what I'm hoping SQLAlchemy would have won't fail.

    and what does that look like? also make sure you don't miss something or make a mistake, because once it's in a release, it's forever. This is another reason I avoid "80%" APIs like this, more often than not you get them wrong and now you're stuck with them. See "bound metadata" for reference.

    I suppose you're again referring to the self made function, not the one I'm hoping SQLAlchemy had? If not, how is get_or_create(name='Markus') any more "awkward" than Player(name='Markus')?

    because it's hiding the use of the object constructor. there is no place whatsoever in SQLAlchemy where we call an object's constructor right now. this is strictly the domain of utility libraries that would be local to an application.

    You might be right about this, but I find it to make much more sense to have the two separate from each other, just to generalize the usage. And like I said, this was just a quick example I scrapped together :)

    I don't have the resources to come up with a 100% perfect API for this unfortunately. We don't do the 80% thing.

    Now that is where you can answer with the "one size fits all" API argument, imo. :) And even without the support for forwarding the fields to the constructer, it would be shorter to write: (userid is twice)

    in that example I'm still having to write "user_id=5" twice. that's ugly and redundant. Better to just be explicit in the first place? session.query(Player).filter(<whatever>).merge(Player(foo='bar')).

    Like I said, I don't see any downside to adding this.

    the downside is that it will add expectations of functionality that I cannot support.

    And imo this tool would be nice to have in my toolset :)

    OK, so the proposal would have to be way more detailed and complete than a quick example. My only point is that seemingly "simple" helper functions like these are total rabbit holes of bugs, impossible-to-fix API shortcomings, and complaints once you declare it as a permanent part of the API, encouraging everyone that "this is now the correct way to do this". Otherwise, it is such a simple pattern for anyone to tailor to their own needs in their util.py file, along with the other application-specific SQLAlchemy idioms that any project needs to have.

  8. Денис Катаев

    Usually get_or_create from django using for prevent race condition situation, after they did't get model, they try to create first, and if we get constraint error then fetch data. If we didn't get error just fetch data after create.
    This approach not for use in bulk create.

    SqlAlchemy don't have standard decision this problem.

    I think, we need new method in session class for this situation.

    Remember what in this method we need make commit and maybe rollback operations.
    This requires a good thinking about the architecture and a good documentation of the future method

    For myself ui use modified solution from this blog post.

  9. Michael Bayer repo owner

    get_or_create() as being proposed (and as in the blog post EDIT: the very last example accommodates it, at great complexity) does not solve for race conditions. A conflicting row can easily be inserted in between the time that we ask for the row vs. add one to the Session ourselves, causing an INSERT conflict (assuming proper constraints exist, silent data corruption if not). Solving for a race like this can be done in many different ways which are highly dependent on the database and scenario in use - such as using "select...for update" [edit: sorry, keep jumbling up UPDATE/INSERT w/ these issues, for pessimistic this needs to be like LOCK TABLE or serializable isolation], using Postgresql's "INSERT..ON DUPLICATE UPDATE" is another (works only on Postgresql though).

    if django claims their "get_or_create()" solves for race conditions and isn't trying to do some sort of locking or optimistic concurrency checking, then the irony would be that this bug report claiming django has "solved" this problem is in fact uncovering a bug in django :).

    I think people here have to appreciate what a big deal "get_or_create()" in the entirely generalized sense really is. People's applications will work much better if they produce this function themselves since they will be able to tailor for and know exactly what limitations they've chosen (since there will have to be some kind of limitations).

  10. Michael Bayer repo owner

    so here's the final example in the blog post:

    def get_one_or_create(session,
                          model,
                          create_method='',
                          create_method_kwargs=None,
                          **kwargs):
        try:
            return session.query(model).filter_by(**kwargs).one(), True
        except NoResultFound:
            kwargs.update(create_method_kwargs or {})
            created = getattr(model, create_method, model)(**kwargs)
            try:
                session.add(created)
                session.commit()
                return created, False
            except IntegrityError:
                session.rollback()
                return session.query(model).filter_by(**kwargs).one(), True
    

    that function is approaching the correct approach for one particular approach to this problem (there are many others). It will not work in context of an ongoing session transaction, since it is not using a savepoint and will roll back everything that has proceeded previously. it also breaks with things like zope.transaction since zope.transaction is doing transactional scope, it breaks with lots of other application-specific transaction management paradigms. the above function is very useful for the specific application that is built around it but is not sufficient for general use.

  11. Денис Катаев

    You right, but we can create some page in documentation, for concurrency problem explanation using sqlalchemy? In this page we can touch another implementations, for example django get_or_create. This can be useful for users who have used other libraries

  12. jvanasco

    FWIW, we have a high concurrency app and need to use a 'get_create_get(primary_key)' approach, as we had WAY too many conflicts on getcreate from other workers and nodes.

    I don't have the sqlalchemy code for this handy, but it looks something like this:

    def getcreateget_Foo(session, primary_key_value):
        "returns a record and if it were created"
        # first try to grab the obj
        def _get():
            return session.query(Foo).get(primary_key_value).first()
    
        record = _get()
        if record:
            return record, False
    
        # Nope?  try to create. do a wonky insert/integrity-error
        savepoint_name = 'sp_%s' % randomhash(20)
        try:
            savepoint(session, savepoint_name)
            record = Foo()
            record.primary_key = primary_key_value
            session.add(record)
            session.flush()
            savepoint_release(session, savepoint_name)
            return record, True
        except psycopg2.IntegrityError as e:
            savepoint_rollback(session, savepoint_name)
            record = _get()
            if not record:
                raise ValueError('WTF')
            return record, False
    
  13. jvanasco

    @Michael Bayer the actual SqlAlchemy code uses your native savepoint and IIRC, catches the sqlalchemy wrapped integrity error. The checkout for that is on another machine. I just quickly edited a raw psycopg2 function from a twisted app that I had checked out on this machine for the above.

    i should have said above: I'm +1 on this type of stuff not being handled within sqlalchemy core right now, because it is indeed a lot more complex. I just meant to illustrate that "getcreate" won't work as expected when you drive up concurrency.

  14. Michael Bayer repo owner

    get_or_create() is a very complex use case in the completely generalized sense, particularly in that it's expected to handle race conditions which is not possible to generalize with a single approach, and is better as something that end-user applications solve in the way that they need.

  15. Log in to comment