support range specificaiton in window function

Issue #3049 resolved

Mike Bayer repo owner created an issue 2014-05-13

e.g., add the "RANGE BETWEEN" syntax to our long existing over() syntax.

SELECT func(x) OVER (ORDER BY x RANGE BETWEEN UNBOUNDED PRECEDING AND 10 FOLLOWING)

see the syntax here and here, also might think of having the named window function.

so for this we have the option for strings:

func.foo(x).over(frame="RANGE BETWEEN UNBOUNDED PRECEDING AND 10 FOLLOWING)

or maybe an elaborate generative thing

func.foo(x).over().rows(start="unbounded preceding", end="current row")

Comments (11)

Mike Bayer reporter

OK here's the syntax:

We are shooting to make this somewhat complex syntax not a big deal to add to an over() clause. So things like calling more methods, or importing more clause constructs, trying to avoid that. Also, as this syntax accepts literal numerics, I don't want to make the same mistake we made for limit/offset, these should be rendered as bound parameters from the gate.

Two new kw params, "range" and "rows". These are mutually exclusive, an error is raised if both are present.

"range" and "rows" each accept a tuple.

The tuple then contains tokens that correspond to the values at http://www.postgresql.org/docs/9.1/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS:

UNBOUNDED PRECEDING
value PRECEDING
CURRENT ROW
value FOLLOWING
UNBOUNDED FOLLOWING

that is, they are laid out freeform:

rows=("unbounded preceding", "current row")
rows=(25, "preceding", 50, "following")
rows=("preceding", 100, "following")

So some things to note above. We're actually going to use a tiny little "expect" algorithm here. that is, we iterate through the tokens. the first token has to be a number, "current [row]", or "[unbounded] preceding" (brackets denote an optional keyword). Based on whether that token is a number or a keyword determines what we "expect" for the next token.

More examples:

func.max(table.c.x).over(order_by=table.c.y, range=(25, "preceding", 50, "following"))

func.max(table.c.x).over(order_by=table.c.y, range=("unbounded preceding",)))

func.max(table.c.x).over(order_by=table.c.y, rows=(25, "preceding", "current row"))

Here's the token parser:

from sqlalchemy.util import string_types
from sqlalchemy.sql import literal, ColumnElement
import re


def expect_window_range(tokens):

    value_or_keywords = re.compile(
        r"""
            (?:unbounded\ )?(preceding) |
            (?:current\ )?(row) |
            (?:unbounded\ )?(following)
        """,
        re.I | re.X
    )
    keyword_following_value = re.compile(r'(preceding|following)', re.I)

    def _expect(tokens, expect_value):
        try:
            token = tokens.pop(0)
        except IndexError:
            raise ValueError("token expected")

        if expect_value:
            if isinstance(token, int):
                return ("value", literal(token))
            elif isinstance(token, ColumnElement):
                return ("value", token)

        if isinstance(token, string_types):
            if expect_value:
                m = value_or_keywords.match(token)
            else:
                m = keyword_following_value.match(token)
            if m:
                return ("token", m.group(m.lastindex))

        raise ValueError("invalid token: %r" % token)

    tokens = list(tokens)
    result = {}
    while tokens:
        if "frame_end" in result:
            raise ValueError("unexpected token: %r" % tokens[0])

        type_, value = _expect(tokens, True)
        if type_ == "value":
            type_, modifier = _expect(tokens, False)
        else:
            modifier = value
            value = None

        if "frame_start" in result:
            result["frame_end"] = (modifier, value)
        else:
            result["frame_start"] = (modifier, value)
    return result

print expect_window_range((40, "preceding", "row"))
print expect_window_range(("preceding", "unbounded following"))
print expect_window_range((100, "following"))

try:
    expect_window_range((100, "unbounded following"))
except ValueError as err:
    print err


try:
    expect_window_range(("unbounded preceding", 50, "row"))
except ValueError as err:
    print err

print expect_window_range(("unbounded preceding", "row"))

so from that you get the frame specification, the the compiler for visit_over uses the frame spec to render the rest of the values.

2014-09-06T19:21:37+00:00

Mike Bayer reporter
- changed milestone to 1.0.xx
- assigned issue to
  
  sqlalchemy sprinters
- 2014-09-06T19:21:52+00:00
Mike Bayer reporter
- changed milestone to 1.2
- 2015-06-08T16:49:25+00:00
Mike Bayer reporter
- edited description
- 2015-08-31T23:21:25+00:00

Phillip Cloud

@zzzeek What do you think about syntax like this, instead of what you're showing above:

func.sum(table.c.x).over(..., rows={'preceding': 40, 'following': 0})  # rows between 40 preceding and current row
func.sum(table.c.x).over(..., rows={'preceding': None}) # rows between unbounded preceding and current row
func.sum(table.c.x).over(..., range={'preceding': 2, 'following': None})  # range between 2 preceding and unbounded following

The SQL to Python mapping is as follows:

current row => 0
unbounded preceding => None
unbounded following => None
No preceding key specified => None
No following key specified => current row

I think this is a more Pythonic API, and requires less work for the implementation. I'd be happy to put up a PR for this.

2016-05-01T23:41:19+00:00

Mike Bayer reporter
I'm not sure what the rationale for the tuple was except that the documentation for the range at http://www.postgresql.org/docs/9.1/static/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS is unclear if there is ever a need to put "FOLLOWING" before "PRECEDING", or if other combinations are meaningful like "ROWS 40 PRECEDING" vs. "ROWS BETWEEN 40 PRECEDING AND CURRENT ROW". So the dictionary style here has to be validated that every possible meaningful value, per the SQL standard, not necessarily PG's behavior, can be represented.
- 2016-05-02T02:22:39+00:00
Mike Bayer reporter
review at https://gerrit.sqlalchemy.org/#/c/69/
- 2016-05-02T02:46:51+00:00

Mike Bayer reporter

why do we need the keys "preceding" and "following" and why is missing key for "preceding" different than "following"?

if this is just a range, let's do a range:

func.sum(table.c.x).over(..., rows=(-40, 0))  # rows between 40 preceding and current row
func.sum(table.c.x).over(..., rows=(None, 0)) # rows between unbounded preceding and current row
func.sum(table.c.x).over(..., range=(-2, None))  # range between 2 preceding and unbounded following

that works ?

2016-06-07T16:35:37+00:00

Phillip Cloud
Nice. I like that much better. I'll look into the standard to find what it says about the necessity of BETWEEN.
- 2016-06-07T16:39:25+00:00
Phillip Cloud
The reason I had the different meaning for missing keys is because in SQL, leaving out preceding has a different meaning than leaving out following. It's moot if we can use a tuple.
- 2016-06-07T16:42:08+00:00
Mike Bayer reporter
- changed status to resolved
Support range specification in window function

Fixes: ~~#3049~~ Co-authored-by: Mike Bayer mike_mp@zzzcomputing.com Change-Id: Ie572095c3e25f70a1e72e1af6858e5edd89fd25e Pull-request: https://github.com/zzzeek/sqlalchemy/pull/264

→ <<cset d34063aa3221>>
- 2016-06-07T21:21:53+00:00
Log in to comment

Assignee: sqlalchemy sprinters

Type: enhancement

Priority: major

Status: resolved

Component: sql

Milestone: 1.2

Votes: 7

Watchers: 9