if object already exists in session, will not update deferred columns

Issue #870 resolved
Former user created an issue

(Put this at highest because it absolutely destroys pickling)

If an object already exists in the session when we load the object again with an undefer, the undeferred columns are selected in the SQL but are not updated on the object. Extensive test case supplied.

On a sidenote, could I get a Trac account Mike? This is my 3rd or 4th bug report in the last few days. - Chris M

Comments (12)

  1. Mike Bayer repo owner

    explain to me how this "destroys" pickling ? i dont see any usage of pickle in the testcase so im not sure what you mean.

  2. Mike Bayer repo owner

    also, this currently is defined behavior. the basic "load twice" case is the same on version 0.3. No attributes on an instance get altered in any way when the instance is already present in the session and its always been that way; while I agree this is a special case that probably should be changed, im not seeing why its a huge emergency showstopper just yet...its been this way for literally years.

    this is actually a very significant change. mapper._instance() method would need to iterate through all its MapperProperty objects for an already-loaded object, which is something it usually never does. Then, some special flag needs to be propagated through for those instances that are already loaded, so that only DeferredColumnLoaders fire off, and only when the value is not already populated; this implies that we'd need a fourth function returned from MapperProperty.create_row_processor(), corresponding to "insession_execute()", i.e. the instance was already present in the session and is not being refreshed (we use callables like that to greatly reduce the number of function calls and decisionmaking overhead when fetching many rows).

    That said it is a fairly straightforward change (now that I've worked out the steps...); I'd still like to see the real "disaster" use case and maybe you can work up some unit tests (i.e., like in test/orm/mapper.py, test/orm/query.py, test/orm/eager_relations.py) that can be part of the commit ?

  3. Former user Account Deleted
    import pickle
    
    result = session.query(Human).add_entity(Thing).join("thing").first()
    thing = session.query(Thing).options(undefer("name")).first()
    
    s = pickle.dumps(thing)
    thing = pickle.loads(s)
    
    print thing.name
    
    
    
      File "C:\sqlalchemy\lib\sqlalchemy\orm\strategies.py", line 198, in lazyload
        raise exceptions.InvalidRequestError("Parent instance %s is not bound to a Session; deferred load operation of attribute '%s' cannot proceed" % (instance.__class__, self.key))
    sqlalchemy.exceptions.InvalidRequestError: Parent instance <class '__main__.Thing'> is not bound to a Session; deferred load operation of attribute 'name' cannot proceed
    

    (I didn't include this in the test case because it wasn't necessary to demonstrate that the attributes were not being loaded; the extra complexity was not necessary. The tests for pickling behavior should cover everything.)

    Here is why it matters:

    Let's say we have a table of users with an id, a name, a password, and a profile. "profile" is huge - we definitely want to defer it for when appropriate, and we don't need it in most cases (logging in users, showing lists of users, perhaps showing comments by users, etc.)

    For every page load, we "login" the user - i.e., check their id and password from a cookie and fetch a User object. We defer the profile because it's not necessary for an instance of the logged in user.

    Now lets say this user views his own profile page - we notice the id in the URL, and load a User object with the profile undeferred to show on the page. However, our site is under massive load so we use a caching strategy (such as memcached) to make sure our site doesn't just die outright from the pressure.

    def getset(key, fn):
        """
        This function checks if "key" is in our memcached server; if it's not, we call
        fn(), set the key in memcached with the result of said function, and return the    result.
        """
    

    Let's say we are using the above getset prototype to facilitate our caching.

    def profile_controller(request, id): 
        """
        Handle a profile request. id is the id# of the user we wish to retreive
        """
    
        login = get_login(request) # Get our login so we can adjust preferences, etc.
        # ^-- If we view our own profile, then the User is in the session already.
    
        def fn():
            return User.query.options(undefer("profile")).get(id)
    
        user = getset("user:" + id, fn)
    
        return render("template", login=login, user=user) # Show results
    

    Ok, the user views their own profile. The first time it works correctly because our profile attribute is lazyloaded in the template when accessed. However, the second time it will NOT work because we pickled the object before attribute access and sent it to memcached. So, the second time the user (or actually, anyone else) views the profile the attribute will not be present in the state dictionary and we get that cryptic "bound to Session" traceback. For someone who isn't intimately familiar with how SQLAlchemy works (it surely isn't mentioned anywhere in the documentation from what I know) this bug would be very tough to squash and in my opinion the current implementation is a bit questionable. I undeferred an attribute - why isn't it loaded?

    I would most certainely be willing to write you some test cases, but first tell me what you think of the use case. Should we mention this more explicitly in the documentation? Should we require people to clear the session beforehand? That seems accident prone and unnecessary in my opinion.

  4. Mike Bayer repo owner

    OK, without all the detail, the issue is that pickling doesn't use getattr() and therefore does not load the deferred attribute. yes of course thats a valid use case, we support mapped entities being pickleable.

    But this fix really doesn't "fix" that bug, because you still have this manual step of loading with an undefer()...i don't see how ensuring that you explicitly reload the instance is so different than explicitly hitting the deferred attributes..explicit knowledge of those attributes and a special step is required in either case. If i want the entity to be pickled, i want it to happen with no extra steps no matter what. Same issue exists for relation lazy loaders.

    Which brings us to how it should be implemented anyway, which is to implement __getstate__() and __setstate__() on the instance. This can be seen as a "workaround" for this specific issue, but even if this issue didn't exist you still would probably want to do it since its easier than remembering to query() again with an undeferred option (and also, it hits lazyloaders).

    There is another ticket where Jason Kirtland has suggested having "pickle helpers" for implementing custom getstate/setstate methods; since you should pickle the _state and _instance_key attributes as well (and that is something which could in theory change with SA versions).

    I was hoping to possibly do a 0.4.1 release this weekend, and if so, I'm not really going to have time to implement this one before 0.4.1.

  5. Former user Account Deleted

    don't see how ensuring that you explicitly reload the instance is so different than explicitly hitting the deferred attributes..explicit knowledge of those attributes and a special step is required in either case.

    So how do you suggest people implement what I was trying to do? Accessing the attributes doesn't particularly "solve" the problem either; if I access every attribute, then it has to issue X amount of queries instead of it all being loaded efficiently in one query. There would be absolutely no way to actually "undefer" or "eagerload" items when the item already exists in the session and it's misleading to issue the query with those undeferred columns and then not actually use the result.

    For now in my code I'll just add a Session.clear() after my login handler, but I do think something should eventually be done about this (even if it's just a clarification in the documentation saying that this behavior isn't possible... which I would find suboptimal, but whatever)

  6. Former user Account Deleted

    In addition, this behavior happens with .all() as well, not just .first()/.get(). If we had a User loaded in the Session but then happened to get a list of all of the Users with some undeferred column, n-1 users would have the attribute loaded, and 1 wouldn't. If we sent this off to be pickled, it would be odd to have 1 object in a list be missing the requested attributes. What's someone supposed to do? iter through all of the items, explicitly hitting an attribute (and producing n queries) and then pickle? What if we don't know who is currently in the session (i.e., other unknown components manipulate our scoped session)? Should loading behavior of attributes be dependent on the queries issued beforehand?

  7. Mike Bayer repo owner

    So how do you suggest people implement what I was trying to do?

    implement __getstate__() and __setstate__(), use query.populate_existing(), or load the attributes individually.

    Accessing the attributes doesn't particularly "solve" the problem either; if I access every attribute, then it has to issue X amount of queries instead of it all being loaded efficiently in one query.

    you can issue query.populate_existing().<anything>.

    There would be absolutely no way to actually "undefer" or "eagerload" items when the item already exists in the session

    query.populate_existing() will load any number of objects, eagerloaded, whatever, and everything gets populated unconditionally.

    and it's misleading to issue the query with those undeferred columns and then not actually use the result.

    yes, this is a bug, as stated in my first comment. this will be fixed. there is nothing to argue about here.

    For now in my code I'll just add a Session.clear() after my login handler, but I do think something should eventually be done about this (even if it's just a clarification in the documentation saying that this behavior isn't possible... which I would find suboptimal, but whatever)

    the specific use case you're looking for is entirely possible right now with query.populate_existing(), as well as the __getstate__() __setstate__() approach.

    however, for the general case, where people would like pickling to "just work" without any explicit extra steps, only __getstate__() and __setstate__() solve the problem completely - it is a basic fact of Python that we have to use properties to implement the load-on-access instrumentation, and that pickle doesn't read them.

  8. Mike Bayer repo owner

    my apologies; since ive upgraded to Leopard, Firefox has broken the dropdowns on trac - this bug was not intended to be closed.

  9. Mike Bayer repo owner

    most of the code to do this is now present in mapper._instance() (with a NOTYET comment right now) - it just needs to be turned on, have some tests added, and have a little bit of extra intelligence about which attributes to load and how to "commit" their state.

    as far as pickling, deferred loaders and others are themselves now pickleable so most of the angst in this ticket about __getstate__() isn't even needed anymore.

  10. Log in to comment