Auto-Joins, Partial Selects and ORMs

Issue #627 resolved
Former user created an issue

Hello, I am writing a next-generation forum package in Python (somebody needs to beat out vBulletin and those other PHP ones, they dominate the market!) and instead of writing another SQL query builder, I decided to check out SQLAlchemy because of its good reputation and seeming power.

While I am very pleased with what it can do, I feel there is room for improvement, which is why I am bringing up these issues. If you want, I can write the patches to do this stuff but I figured I'd see what the consensus is on these ideas before I spent time writing code just to see patches rejected!

For the majority (if not all) of these examples, I am assuming the following tables for my examples.

user: id - name - pass - email

post: id - user_id - subject - text

JOIN is too hard to do with a partial select

Partial selects seem to be overlooked in SQLAlchemy - which is odd, because they are extremely important for those of us who try to write extremely efficient code for redistribution.

Let's take the following example: I want to select all the posts we have and the corresponding username... but not the email and password because they are irrelevant for the task at hand. How would we do this?

  select([user.c.name, post](user.c.id,), user.c.id == post.c.user_id) 
  # Oh come on, we have the foreign key established. This is a waste of typing, especially with huge multi-table joins!
  select([user.c.name, post](user.c.id,)).join(user, post) 
  # Nah, this creates an incorrect query. Not only will it not execute, but it tries to add the join after the cartesian product.
  join(user,post).select([user.c.name, post](user.c.id,))
  # What? selects on a join don't let you define what columns you want to retrieve. Weird...
  select([user.c.name, post](user.c.id,), from_obj=[join(user,post)](join(user,post))) 
  # Here is how you do it, but man that's ugly.

I feel as if setting a from_obj ruins the "cleanliness" of the code.

Solution 1: Auto-joins

   # Hypothetical. Syntax is up for debate - idea is what counts.
   select([post.r.user.name, post](post.r.user.id,))

   # SQLAlchemy recognizes that you are trying to join because you are using the user table as a property of post, so it creates a JOIN instead of a cartesian product.

What on earth is post.r? Basically, it would stand for "relations", or what not - name isn't important. Of course, this is all related to table metadata... while we could detect basic joins with foreign keys, what if the default behavior of X is to do a LEFT JOIN?

I propose that relationships be added to the table metadata itself instead of an arbitrary ORM. Regardless of whether I use ORM, there IS a relationship between the tables; it doesn't go away when "business objects" are removed from the picture. Adding relationship data to the table metadata itself would allow built queries and partial selects to enjoy the power of relationships without being tied to using an ORM, which can be undesired. (I bring up ORMs later in this ticket)

Full example: Select all posts with the corresponding user id and user name. We had a few users who had alternate accounts and we deleted them but left the posts because they were informative. We can't use a regular JOIN here - it will skip over those posts!

   user = Table ( 'user', ... )
   post = Table ( 'post', ..., maybe_has ('user')) # Create a "maybe" link with all the defaults - i.e, the relation entry in post is called 'user', link using the foreign keys defined to the table 'user'.

  select([post.r.user.name, post](post.r.user.id,)).execute() # specifying post by itself only specifies post's columns - not it's relation tables. You could get all of posts' relation-joins by selecting post.r !

I don't think this is too high level for the query builder. As I said earlier, the relation exists regardless of whether I am using an ORM or not.

Solution 2: select_join?

This wouldn't require any relationship data to be stored.

   select_join([user.c.name, post](user.c.id,), join=[post](user,)) # Assume all columns matching with user and post are going to be joined with each other.
   select_join([user.c.name, post](user.c.id,), outerjoin=[post](user,)) # Oh my!

ORM issues

I like the idea of object -> relation mapping, I really do. However, I believe the one present in SQLAlchemy is way too bulky for its own good. It needs to be further partitioned.

Partial Selects?

You cannot partial select with an object in any sort of convenient way. This makes no sense whatsoever. It's not possible through the objects select(), it's not possible through providing a custom select() to the objects select() (SQLAlchemy will complain about missing fields!)

Complaining about missing fields is nonsense. It's commonplace to work on subsets of data - I don't need every possible field every time I use an object. If a field isn't there, don't set it. Big deal.

Sessions and Query

Too verbose and unnecessary for projects who just want a simple row -> object map. I should be able to get the benefits of relating rows to objects without having to jump through hurdles (I believe caching of objects is completely unnecessary for a well designed application that uses SQL efficiently... this should be delegated outside the scope of the basic ORM.)

use_label doesn't group tables.

Even if I don't use an ORM, I should be able to have separation of tables within my queries. There should be an option to segregate the result set into tables, IE:

   row = select([user](post,), use_labels=True).execute().fetchone()
   # { post_id, post_user_id, post_subject, post_text, user_id, user_name, user_pass, user_email }
   # Oh shit! I have a function that just operates on the post. Why can't I just pass the post table of the result to my function, so it will work regardless of use_labels or aliases?

   row = select([post,user](post,user), segregate_tables=True).execute().fetchone()
   { post : { id, user_id, subject, text}, user : { id, name, pass, email } }

   my_func(row.post) # Alright! Cool!

In conclusion

I hope you all find my ideas interesting. I would be willing to help implement any one of these ideas. I don't know the SQLAlchemy code by heart, but I'm a very experienced programmer and DBA and would love to be apart of this project in any way I can. I mean, nobody wants me to write my own SQL package... right? Better to help others.

Comments (6)

  1. Mike Bayer repo owner

    this ticket might be more appropriate to discuss on the mailing list, since other users might have opinions on it and they tend not to read trac tickets.

    my initial take on the ideas above is that they generally rely upon new flags, switches and syntaxes that need to be documented and explained, and all of them are currently possible through more explicit means. id make a pretty strong bet that adding half a dozen new ways to do what we already can is going to achieve some higher mix of non-usage/user-confusion than we already do today. The SQL construction language tries to draw upon the look of SQL itself which is why you have things like from_obj, and that also helps people to understand whats going on; nobody's ever been confused by that however "ugly" it may seem. Syntaxes like table.r.othertable and "maybe_has" are a lot less obvious, they draw the user away from SQL hence increasing confusion, plus i dont see any way to distinguish between outer/inner joins, how to add extra ON criterion, etc. I prefer to leave less obvious/flexible techniques outside of the SA core (since the lack of something being present in the SA core does not prevent you from using it yourself and/or promoting it...these are all doable as extensions).

    also the ability to partial select with ORM works the way you want it to in release 0.4.

    with regards to non-session ORM, CRUD tools are dead simple to build. theres an example one in "Core Python" using SQLAlchemy, for example. if youd like that kind of ORM, go and build it, or just use SQLObject. SA's ORM is just one approach to ORM and does not claim to hold a monopoly on that.

    Make your post to the mailing list, and based on the response you get there, we'll have a better notion if any of these ideas are desireable to the larger userbase.

  2. Former user Account Deleted

    Replying to zzzeek:

    this ticket might be more appropriate to discuss on the mailing list, since other users might have opinions on it and they tend not to read trac tickets.

    my initial take on the ideas above is that they generally rely upon new flags, switches and syntaxes that need to be documented and explained, and all of them are currently possible through more explicit means. id make a pretty strong bet that adding half a dozen new ways to do what we already can is going to achieve some higher mix of non-usage/user-confusion than we already do today. The SQL construction language tries to draw upon the look of SQL itself which is why you have things like from_obj, and that also helps people to understand whats going on; nobody's ever been confused by that however "ugly" it may seem. Syntaxes like table.r.othertable and "maybe_has" are a lot less obvious, they draw the user away from SQL hence increasing confusion, plus i dont see any way to distinguish between outer/inner joins, how to add extra ON criterion, etc. I prefer to leave less obvious/flexible techniques outside of the SA core (since the lack of something being present in the SA core does not prevent you from using it yourself and/or promoting it...these are all doable as extensions).

    also the ability to partial select with ORM works the way you want it to in release 0.4.

    with regards to non-session ORM, CRUD tools are dead simple to build. theres an example one in "Core Python" using SQLAlchemy, for example. if youd like that kind of ORM, go and build it, or just use SQLObject. SA's ORM is just one approach to ORM and does not claim to hold a monopoly on that.

    Make your post to the mailing list, and based on the response you get there, we'll have a better notion if any of these ideas are desireable to the larger userbase.

    Yeah, I wasn't sure where was appropriate to post this, saw enhancement, whatever. I understand your philosophy on the auto-joins, I just figured it might be a neat idea to share. I can post it on the mailing lists if you wish, but I think you're right in that it would probably be best implemented as an extension. It isn't mission critical for anything, so whenever I get bored I'll draft up and API and write an extension or something :)

    Cool about the ORM. That was my main problem with it.

    What about segregating tables? I don't think it's an esoteric feature and it's simple enough to implement (not like any of the auto-join stuff) and returning multi-dimensions is very helpful in circumstances where you wouldn't want to use an ORM but use joins with conflicting names.

    Thanks for your time! Love the product.

  3. Mike Bayer repo owner

    segretate_tables is not so bad but my concern there is that it modifies the semantics of RowProxy. the current Result/RowProxy is tailored to look just like a DBAPI cursor. it does have some extra bells and whistles of course, but currently you already can pull columns off of it as attributes, i.e. row.somecolumn, which is a feature some users stuck in there long ago. since I wasnt paying attention enough at that time to make it like Table, i.e. row.c.somecolumn, now we have name-conflict issues.

    anyway heres something I like better. its probably possible that RowProxy could implement this feature independently, given a regular use_labels style query, or actually just any ClauseElement based query since its given a map of Column objects which it can use; just to provide:

    segregated_row = row.extract(users)
    

    might be an interesting way to do this. The ORM does make a lot of use out of "converted" rows so it might even be nice to make it do:

    ua = users.alias("F5B5")
    row = row.extract(users, ua)
    

    which pulls from its columns every column from the alias "F5B5" that it can tie down to the "users" table, returning a new row object. we do a lot of this "row adaptation" right now...the biggest need is to convert from an aliased version of a table down to the original table.

    if you think you dont have the Alias object lying around (which you probably should...its best to think in terms of Selectable/Columns rather than strings), we can also add:

    row = row.extract(users, prefix="F5B5_")
    

    converts things like F5B5_user_id into "user_id" / users.c.user_id.

  4. Former user Account Deleted

    Concern is valid, yeah. extract() is one option, another option I thought of would to have a method on RowProxy to maybe return an iterator of segregated tables.

       # Iterator of segregated items
       select([user,post](user,post)).execute().segregated_items()
       # This isn't consistent with use_label=, so we could add this too
       select([user,post](user,post)).execute().labeled_items()
    

    Yet another option is to add the ability to filter an item through RowProxy, like

       select([user,post](user,post)).execute(extension=Segregator())
    

    or whatever. extension= might be kind of overkill for something as simple as this; I don't really see anybody needing anything but labeled or segregated.

  5. Former user Account Deleted

    Or even better: just provide aggregate helpers like Segregator() and Labeler() and have it take a RowProxy. I'm not really thinking today.

  6. Log in to comment