support range specificaiton in window function

Issue #3049 resolved
Mike Bayer repo owner created an issue

e.g., add the "RANGE BETWEEN" syntax to our long existing over() syntax.

SELECT func(x) OVER (ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND 10 FOLLOWING)

see the syntax here and here, also might think of having the named window function.

so for this we have the option for strings:

func.foo(x).over(frame="RANGE BETWEEN UNBOUNDED PRECEDING AND 10 FOLLOWING)

or maybe an elaborate generative thing

func.foo(x).over().rows(start="unbounded preceding", end="current row")

Comments (11)

  1. Mike Bayer reporter

    OK here's the syntax:

    We are shooting to make this somewhat complex syntax not a big deal to add to an over() clause. So things like calling more methods, or importing more clause constructs, trying to avoid that. Also, as this syntax accepts literal numerics, I don't want to make the same mistake we made for limit/offset, these should be rendered as bound parameters from the gate.

    Two new kw params, "range" and "rows". These are mutually exclusive, an error is raised if both are present.

    "range" and "rows" each accept a tuple.

    The tuple then contains tokens that correspond to the values at http://www.postgresql.org/docs/9.1/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS:

    UNBOUNDED PRECEDING
    value PRECEDING
    CURRENT ROW
    value FOLLOWING
    UNBOUNDED FOLLOWING
    

    that is, they are laid out freeform:

    rows=("unbounded preceding", "current row")
    rows=(25, "preceding", 50, "following")
    rows=("preceding", 100, "following")
    

    So some things to note above. We're actually going to use a tiny little "expect" algorithm here. that is, we iterate through the tokens. the first token has to be a number, "current [row]", or "[unbounded] preceding" (brackets denote an optional keyword). Based on whether that token is a number or a keyword determines what we "expect" for the next token.

    More examples:

    func.max(table.c.x).over(order_by=table.c.y, range=(25, "preceding", 50, "following"))
    
    func.max(table.c.x).over(order_by=table.c.y, range=("unbounded preceding",)))
    
    func.max(table.c.x).over(order_by=table.c.y, rows=(25, "preceding", "current row"))
    

    Here's the token parser:

    from sqlalchemy.util import string_types
    from sqlalchemy.sql import literal, ColumnElement
    import re
    
    
    def expect_window_range(tokens):
    
        value_or_keywords = re.compile(
            r"""
                (?:unbounded\ )?(preceding) |
                (?:current\ )?(row) |
                (?:unbounded\ )?(following)
            """,
            re.I | re.X
        )
        keyword_following_value = re.compile(r'(preceding|following)', re.I)
    
        def _expect(tokens, expect_value):
            try:
                token = tokens.pop(0)
            except IndexError:
                raise ValueError("token expected")
    
            if expect_value:
                if isinstance(token, int):
                    return ("value", literal(token))
                elif isinstance(token, ColumnElement):
                    return ("value", token)
    
            if isinstance(token, string_types):
                if expect_value:
                    m = value_or_keywords.match(token)
                else:
                    m = keyword_following_value.match(token)
                if m:
                    return ("token", m.group(m.lastindex))
    
            raise ValueError("invalid token: %r" % token)
    
        tokens = list(tokens)
        result = {}
        while tokens:
            if "frame_end" in result:
                raise ValueError("unexpected token: %r" % tokens[0])
    
            type_, value = _expect(tokens, True)
            if type_ == "value":
                type_, modifier = _expect(tokens, False)
            else:
                modifier = value
                value = None
    
            if "frame_start" in result:
                result["frame_end"] = (modifier, value)
            else:
                result["frame_start"] = (modifier, value)
        return result
    
    print expect_window_range((40, "preceding", "row"))
    print expect_window_range(("preceding", "unbounded following"))
    print expect_window_range((100, "following"))
    
    try:
        expect_window_range((100, "unbounded following"))
    except ValueError as err:
        print err
    
    
    try:
        expect_window_range(("unbounded preceding", 50, "row"))
    except ValueError as err:
        print err
    
    print expect_window_range(("unbounded preceding", "row"))
    

    so from that you get the frame specification, the the compiler for visit_over uses the frame spec to render the rest of the values.

  2. Phillip Cloud

    @zzzeek What do you think about syntax like this, instead of what you're showing above:

    func.sum(table.c.x).over(..., rows={'preceding': 40, 'following': 0})  # rows between 40 preceding and current row
    func.sum(table.c.x).over(..., rows={'preceding': None}) # rows between unbounded preceding and current row
    func.sum(table.c.x).over(..., range={'preceding': 2, 'following': None})  # range between 2 preceding and unbounded following
    

    The SQL to Python mapping is as follows:

    current row => 0
    unbounded preceding => None
    unbounded following => None
    No preceding key specified => None
    No following key specified => current row
    

    I think this is a more Pythonic API, and requires less work for the implementation. I'd be happy to put up a PR for this.

  3. Mike Bayer reporter

    I'm not sure what the rationale for the tuple was except that the documentation for the range at http://www.postgresql.org/docs/9.1/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS is unclear if there is ever a need to put "FOLLOWING" before "PRECEDING", or if other combinations are meaningful like "ROWS 40 PRECEDING" vs. "ROWS BETWEEN 40 PRECEDING AND CURRENT ROW". So the dictionary style here has to be validated that every possible meaningful value, per the SQL standard, not necessarily PG's behavior, can be represented.

  4. Mike Bayer reporter

    why do we need the keys "preceding" and "following" and why is missing key for "preceding" different than "following"?

    if this is just a range, let's do a range:

    func.sum(table.c.x).over(..., rows=(-40, 0))  # rows between 40 preceding and current row
    func.sum(table.c.x).over(..., rows=(None, 0)) # rows between unbounded preceding and current row
    func.sum(table.c.x).over(..., range=(-2, None))  # range between 2 preceding and unbounded following
    

    that works ?

  5. Phillip Cloud

    Nice. I like that much better. I'll look into the standard to find what it says about the necessity of BETWEEN.

  6. Phillip Cloud

    The reason I had the different meaning for missing keys is because in SQL, leaving out preceding has a different meaning than leaving out following. It's moot if we can use a tuple.

  7. Log in to comment