1. Mike Bayer
  2. sqlalchemy

Wiki

Clone wiki

sqlalchemy / UsageRecipes / DeclarativeComputeTotalOfSubquery

Here is some example code that shows, using Declarative, how to compute a subtotal of a selected number of items. The records to use for the subtotal are selected based on a separate subquery. This code also shows how massive performance improvements can be realized by using subqueries instead of the often more familiar Python iterative loops.

For this particular query, we have a declarative class called "Host" which contains a list of computer systems, each with an associated primary key "id". Each Host record also has a field named "parent", which references another table.

We also have a separate declarative class called "HostFilesystem". Each record contains a "sizeMB" field indicating filesystem size in MB, as well as a host_id, referencing a Host id that owns this filesystem.

For this query, we want to create a subtotal of the size of all filesystems of all hosts that share the parent of name "foo". This is done as follows:

parent=session.query(Parent).filter_by(name="foo").first()
qa=session.query(Host.id).filter_by(parent=parent).subquery()
qb=session.query(func.sum(HostFilesystem.sizeMB)).filter(HostFilesystem.host_id.in_(qa))
subtotal=qb.first()[0]

Using subqueries such as this are generally much more efficient than manually iterating over each record using Python in order to calculate the subtotal iteratively. The reason why the above method is generally much more efficient is because network round-trips (as well as overall network traffic) is dramatically reduced when computation is pushed to the remote database server. For larger data sets, the difference in performance between the two methods can be extremely large.

Here is code that shows how we might perform the same task as above, but using a much slower iterative approach:

parent=session.query(Parent).filter_by(name="foo").first()
subtotal=0
for hostid in session.query(Host.id).filter_by(parent=parent):
    for hostfs in session.query(HostFilesystem).filter_by(host_id=hostid):
        subtotal += hostfs.sizeMB

While this second approach may be more familiar to you as a Python programmer, it is not ideal. If you have 1000 hosts in your database, the code immediately above will generate over 1000 database queries, whereas the first example gets the same result using only one or two queries. For optimal performance, look for opportunities to refactor iterative loops as subqueries to realize massive improvements in performance when handling larger data sets.

Updated