Pull requests

#11 Merged
Repository
Deleted repository
Branch
patch-msql-pkc-clustered (cf8e5e3cf5b0)
Repository
zzzeek/sqlalchemy sqlalchemy
Branch
master

Support clustered primary keys in mssql dialect

Author
  1. Derek Harland
Reviewers
Description

Previously I have added the ssql_clustered option to Index to support creating clustered indexes. This patch supports specifying clustered primary keys within a CREATE TABLE statement.

It supports both of the following

    Table('my_table', metadata,
          Column('x', ..., primary_key=True),
          Column('y', ..., primary_key=True),
          mssql_clustered=True)

and

    Table('my_table', metadata,
          Column('x', ...),
          Column('y', ...),
          PrimaryKeyConstraint("x", "y", mssql_clustered=True))

Patch also adds a test, and documentation.

Comments (22)

  1. Derek Harland author

    A table can support only one "clustered" index. That clustered index effectively determines how the table is actually laid out on disk.

    I think there are only three avenues for clustering

    • adding a clustered index to an existing table (which we already cover)
    • adding a clustered primary key inline in a create table

      • handled in this patch by specifying mssql_clustered on the table or the PrimaryKeyConstraint
    • adding a clustered unique key (but not a primary key) inline in a create table e.g.

    create table X (x int primary key, y int unique clustered)
    

    or

    create table X (x int, y int) unique clustered (x)
    

    So probably we should support mssql_clustered on a UniqueConstraint as well.

    Obviously clustering applies to the constraint object, so the nicest place for it to appear is only on PrimaryKeyConstraint or UniqueConstraint and not on Table … but that means the shortcut of being able to tag columns with primary_key=True disappears which is the only reason I also put it onto Table as well …

    Perhaps if clustering you should be forced to do it via an explicit PrimaryKeyConstraint ??

  2. Mike Bayer repo owner

    it would certainly be easier to just support clustering on an explicit PrimaryKeyConstraint for now - if an app wants all the PK constraints to be this way, schema attachment events can be used. but is the use case for an explicit "clustered PK index" usually just a few tables or how will this normally be used ?

  3. Derek Harland author

    Background

    Clustering a table using a particular index (be it the primary key, some other unique constraint, or even a non unique index) effectively "makes the table the index". That is:

    • in a non clustered table, the table is some big amorphous blob of data on disk and an index is e.g. a B-tree that contains row pointers into this mass of data.
    • in a table that has a clustered index, that clustered index is the table. The index/table is stored as e.g. a B-tree where the leaf pages contain actual table data rather than row pointers. (And other non clustered indexes use a different sort of pointer into this B-tree)

    This makes reads from a clustered index much faster than a non clustered index as there is no row lookup. e.g. SQLServer will ignore non clustered indexes often if the table is not large as a plain table scan may beat an index scan + lookup.

    Conversely, writes are slower as page allocation overhead is higher.

    Use Case

    So … in cases where:

    • there isn't a pressing constraint on write performance; and
    • there is a clear candidate for a unique key that will often be used (e.g. an opaque id field that is often joined to); and
    • where the table is of reasonable size so there will actually be a benefit …

    then a clustered key is a good idea. This means that in many use cases, developers may cluster every table by default.

    Other comments

    I think at the moment, if mssql_clustered also stays on Table then the only ambiguity becomes in a statement like

    Table(
        Column('x', primary_key=True),
        Column('y', unique=True),
        mssql_clustered=True
        )
    

    where its ambiguous which of the two unique constraints should be clustered.

    Bit ugly but perhaps the keyword to Table could be mssql_clustered_pk ?

  4. Mike Bayer repo owner

    right I thought maybe something like: "mssql_clustered_on=('y',)" vs. "mssql_clustered_on=('x',)" would be a way this could be set up at the table kw level, but this seems to be well within "there's more than one way to do it" territory - since the argument essentially has to poke around the table columns looking for a UniqueConstraint or PrimaryKeyConstraint that has those columns (and then, a column can be part of the primary key and a unique constraint at the same time!) and figure out which constraint actually receives the "CLUSTERED" keyword.

    There's other kinds of arguments one can put on PrimaryKeyConstraint, like a "name" for example, and we don't have table-level arguments for that either.

    What I thought should work, but unfortunately does not, is this:

    Table('t', metadata,
       Column('a', Integer, primary_key=True),
       PrimaryKeyConstraint(mssql_clustered=True)
    )
    

    and then Table would be smart enough to know that a should be added to the empty PrimaryKeyConstraint object. But it doesn't, which is unfortunate, I think it should.

    So perhaps w.r.t. this pullreq we keep it strictly on the Constraint object and then have the task of "how to set up PrimaryKeyConstraint" options be something separate.

  5. Derek Harland author

    Agree this is the best way to go … i.e. to allow user to tag columns with primary_key=True and then add remaining options via a PrimaryKeyConstraint clause.

    But your patch linked above will allow the following:

    Table('t', metadata,
       Column('a', Integer, primary_key=True),
       Column('b', Integer)
       PrimaryKeyConstraint('b', mssql_clustered=True)
    )
    

    and will generate a PKC of (b, a).

    I think that if the user has tagged any columns individually, then the PrimaryKeyConstraint should raise an exception if it also contains any columns.

    e.g. the new code should be something like

    @@ -2528,10 +2523,12 @@ class PrimaryKeyConstraint(ColumnCollectionConstraint):
         def _set_parent(self, table):
             super(PrimaryKeyConstraint, self)._set_parent(table)
    
    -        if table.primary_key in table.constraints:
    -            table.constraints.remove(table.primary_key)
    -        table.primary_key = self
    -        table.constraints.add(self)
    +        if table.primary_key is not self:
    +            table.constraints.discard(table.primary_key)
    +            table.primary_key = self
    +            table.constraints.add(self)
    +
    +        if self.columns:
    +            if [c for c in table.c if c.primary_key]:
    +                raise SomeSortOfException
    +            else:
    +                self.columns.extend(c for c in table.c if c.primary_key)
    
  6. Derek Harland author

    Haven't tested but I still think this is unclear as regards ordering of the PKC columns

    Table('t', metadata,
       Column('a', Integer, primary_key=True),
       Column('b', Integer, primary_key=True)
       Column('c', Integer, primary_key=True)
       PrimaryKeyConstraint('c', mssql_clustered=True)
    )
    

    probably generates a PKC of (c, a, b). Its more likely they wanted (a, b, c) if they meant anything at all. I think its probably an error if the PrimaryKeyConstraint columns don't match exactly those which have been tagged with primary_key=True.

    So I think the following should work

    Table('t', metadata,
       Column('a', Integer, primary_key=True),
       Column('b', Integer, primary_key=True)
       Column('c', Integer, primary_key=True)
       PrimaryKeyConstraint(mssql_clustered=True)
    )
    

    and

    Table('t', metadata,
       Column('a', Integer, primary_key=True),
       Column('b', Integer, primary_key=True)
       Column('c', Integer, primary_key=True)
       PrimaryKeyConstraint('c', 'b', 'a', mssql_clustered=True)
    )
    

    but not this

    Table('t', metadata,
       Column('a', Integer, primary_key=True),
       Column('b', Integer, primary_key=True)
       Column('c', Integer, primary_key=True)
       PrimaryKeyConstraint('c', mssql_clustered=True)
    )
    
  7. Mike Bayer repo owner

    heh. but whats the order of the cols for the second one ? also I really think it's possible a couple of users are going to get dinged by this

  8. Derek Harland author

    The ordering for the second one? I think if they have tagged columns individually, then either they can specify a PrimaryKeyConstraint with no columns, or with all the columns and the latter gives them the opportunity to reorder them into exactly what they want.

    (which is what you've said is more appropriate too!).

    Yes there is potential for dinging users :-(

  9. Mike Bayer repo owner

    I try not to ding the users on a point release but I'm not in the mood to wait for 1.0 on this. i doubt this will have too much impact.

  10. Derek Harland author

    Ok … so shall I:

    • incorporate your changes from Trac Ticket into my branch

    • rewind the mssql_clustered option on Table

    • alter my doc and test changes to match.

  11. Mike Bayer repo owner

    on this end I think the important thing is getting the mssql_clustered option into the PrimaryKeyConstraint and UniqueConstraint (and Index?). the patch in ticket 2910 can be kept separate.

  12. Derek Harland author

    I have done the following:

    • removed mssql_clustered option on Table

    • added mssql_clustered option to UniqueConstraint

    • altered documentation and tests to match

    • also patched an existing bug where an empty UniqueConstraint() (i.e. with no columns specified) would generate bad sql. Changed this in both the dialects.mssql.compiler and sql.compiler. Added test to test.sql.test_constraints

  13. Mike Bayer repo owner

    ticket 2910 is complete: http://www.sqlalchemy.org/trac/ticket/2910#comment:3

    however I've made it emit a warning and revert to the old behavior when the column lists don't match. the warning can become an exception in a future release:

    from sqlalchemy import *
    metadata = MetaData()
    
    t = Table('t', metadata,
       Column('a', Integer, primary_key=True),
       Column('b', Integer, primary_key=True),
       Column('c', Integer, primary_key=True),
       PrimaryKeyConstraint('b', 'c', mssql_clustered=True)
    )
    
    SAWarning: Table 't' specifies columns 'a', 'b', 'c' as primary_key=True, not matching locally specified columns 'b', 'c'; setting the current primary key columns to 'b', 'c'. This warning may become an exception in a future release