wrong output using ORACLE limit/offset with duplicate rows

Issue #536 resolved
Former user created an issue

Hi, I found that using limit/offset options with ORACLE DB produce wrong results output while selected column has duplicates. If I specify limit and offset the result query constructed by SQLAlchemy has the form (my table is Block and column path):

select path from (select distinct path, row_number() over (order by path desc) as ora_rn from Block) where ora_rn>1 and ora_rn<5;

Here I was looking for 3 non-duplicate rows, but the output had 3 rows with duplicates.

The correct query should be constructed in a way:

select path from (select distinct path from Block order by path desc) group by rownum,path having rownum>1 and rownum<5;

which produce 3 non-duplicated rows. It works on Oracle 10.x, but I don't know if it will works on earlier version of Oracle DB. I think the problem with SQLAlchemy query is not having having and group by keywords.

Valentin.

Comments (18)

  1. Former user Account Deleted

    Replying to guest:

    Hi, I found that using limit/offset options with ORACLE DB produce wrong results output while selected column has duplicates. If I specify limit and offset the result query constructed by SQLAlchemy has the form (my table is Block and column path):

    select path from (select distinct path, row_number() over (order by path desc) as ora_rn from Block) where ora_rn>1 and ora_rn<5;

    Here I was looking for 3 non-duplicate rows, but the output had 3 rows with duplicates.

    The correct query should be constructed in a way:

    select path from (select distinct path from Block order by path desc) group by rownum,path having rownum>1 and rownum<5;

    which produce 3 non-duplicated rows. It works on Oracle 10.x, but I don't know if it will works on earlier version of Oracle DB. I think the problem with SQLAlchemy query is not having having and group by keywords.

    Valentin.

    I checked that MySQL does job properly using limit/offset and ORACLE query constructed by SQLAlchemy should be changed to

    select path from (select distinct path, row_number() over (order by path desc) as ora_rn from Block) group by ora_rn, path having ora_rn>1 and ora_rn<5;

    I changed "where ora_rn>1 and ora_rn<5" to "group by ora_rn, path having".

    Valentin.

  2. Mike Bayer repo owner

    the query you have does not seem to maintain ordering, and in fact it seems ordering goes out the window when you combine the "DISTINCT" with the row_number() function.

    query can also be constructed manually using expressions via the "group_by" and "having" keyword arguments to select().

  3. Former user Account Deleted

    No the query does keep ordering if I'll place order by at the end

    select path from (select distinct path from Block where order by path desc) group by rownum,path having rownum>1 and rownum<5 order by path;

    here I didn't use SQLAlchemy addition and just used plain query to make distinct path's with ordering.

  4. Former user Account Deleted

    Hi, here is update on this subject. Table schema (for simplicity I removed unnecessary columns):

    CREATE TABLE Block ( ID integer, Name varchar(500) unique not null, Path varchar(500) not null, primary key(ID) ); So the path column can contains repeated values. Here is the code using limit/offset:

    t = self.getTable(dbAlias,'Block','tblk')[BR] sel = sqlalchemy.select(t.c.path,from_obj=t,distinct=True,limit=10,offset=0)[BR] print sel[BR] res = con.execute(sel)[BR] for item in res: print "result from Block",item

    and here is result:

    SELECT path FROM (SELECT DISTINCT tblk.path AS path, ROW_NUMBER() OVER (ORDER BY tblk.id) AS ora_rn FROM cms_dbs_int_global.block tblk) WHERE ora_rn>0 AND ora_rn<=10

    result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',) result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW',)

    It does not select 10 different distinct paths.

    While if I'll do

    sel = sqlalchemy.select(t.c.path,from_obj=t,distinct=True)[BR] tmp = sel.alias('tmp')[BR] q = sqlalchemy.select(as rnum',from_obj=q)[BR] sel.append_whereclause( 'rnum between %s and %s'%(0,10) )[BR] print sel[BR] res = con.execute(sel)[BR] for item in res: print "result from Block",item

    I'll get 10 different distinct paths:

    SELECT * FROM (SELECT tmp.*, rownum as rnum FROM (SELECT DISTINCT tblk.path AS path FROM cms_dbs_int_global.block tblk) tmp) WHERE rnum between 0 and 10

    result from Block ('/CSA07AllEvents/CMSSW_1_6_6-HLTSplit-A3-Stew/GEN- SIM-DIGI-RAW', 1) result from Block ('/GammaJetIsoPi0_Pt55to65_ptHat60/CMSSW_1_6_7- CSA07-4067/GEN-SIM-DIGI-RAW', 2) result from Block ('/LM3_isasdkpyt/CMSSW_1_6_5-FastSim-SUSYBSM-1234/ AODSIM', 3) result from Block ('/LM7_isasdkpyt/CMSSW_1_6_5-FastSim-SUSYBSM-1234/ AODSIM', 4) result from Block ('/LM6_isasdkpyt/CMSSW_1_6_5-FastSim-SUSYBSM-1234/ AODSIM', 5) result from Block ('/RS1GravitonZZ4Mu_1500GeV_01/CMSSW_1_6_7- CSA07-3199/GEN-SIM-DIGI-RAW', 6) result from Block ('/SingleMuPlusPt100To400/CMSSW_1_6_7-HLT-1193394942/ GEN-SIM-DIGI-RECO', 7) result from Block ('/LM2_isasdkpyt/CMSSW_1_6_5-FastSim-SUSYBSM-1234/ AODSIM', 8) result from Block ('/LM8_isasdkpyt/CMSSW_1_6_5-FastSim-SUSYBSM-1234/ AODSIM', 9) result from Block ('/RS1GravitonZZ4Mu_1500GeV_01/CMSSW_1_4_6- CSA07-2644/GEN-SIM', 10)

    As was suggested by Mike this can be done using row_number() function. According to him it is still more reliable than "rownum". similar tests work fine:

    SELECT * FROM (SELECT tmp.*, row_number() over (order by tmp.path) as ora_rn FROM (SELECT DISTINCT tblk.path AS path FROM cms_dbs_int_global.block tblk) tmp) WHERE ora_rn between 0 and 10

    This would be a little bit of extra wrapping performed in
    oracle.visit_select().

  5. Former user Account Deleted

    I just attached a patch that seems to fix the problem. I also added a few test cases to create the problem condition.

    The fix still fails on one existing test case. The error displayed is:

    FAIL: test_limit (__main__.CompileTest)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test/dialect/oracle.py", line 71, in test_limit
        assert t.c.col1 in set(c.result_map['col1']('col1')[1](1)), (t.c.col1, c.result_map)
    AssertionError: (<sqlalchemy.sql.expression._ColumnClause at 0x-48594374; col1>, {'ora_rn': ('ora_rn', (<sqlalchemy.sql.expression._Label object at 0xb7a7502c>, <sqlalchemy.sql.expression._ColumnClause at 0x-4858e014; ROWNUM>, 'ora_rn'), NullType()), 'col2': ('col2', (<sqlalchemy.sql.expression._ColumnClause at 0x-4858e294; col2>,), NullType()), 'col1': ('col1', (<sqlalchemy.sql.expression._ColumnClause at 0x-4858e2f4; col1>,), NullType())})
    

    But I don't know how to fix that. I have tested these changes with the test cases and with my application and it seems to work as desired.

    The patch was developed and tested against the 4.7 tree.

    Jim Patterson

  6. Former user Account Deleted

    Replying to guest:

    I just attached a patch that seems to fix the problem. I also added a few test cases to create the problem condition.

    Oh yeah, I forgot to mention. I tried using the row_number analytic as discussed in the email thread earlier in the year however the row_number function requires an order by and if one was not supplied by the user the logic was using oid_column.proxies which is rowid, but rowid is not valid if you use the distinct clause, so I used the Oracle recommended pagination query style using rownum. I have used that style of pagination query for many years and I have never found a case where it did not work in Oracle.

    Jim Patterson

  7. Former user Account Deleted

    I found and fixed the problem that was causing the one test case to fail under 4.7. I'll port the patch to 5.0 and test it there over the weekend.

    Thanks to Michael Bayer for the pointers on fixing the failing test case.

    Jim Patterson

  8. Mike Bayer repo owner
    • changed milestone to 0.5.0
    • changed component to oracle

    i like the patch but its a dramatic enough change that I'd like to keep it against the 0.5 series. If I have time soon I'll try to port to 0.5.

  9. Former user Account Deleted
    • changed status to open
    • removed status

    I'd like to note that the FIRST_ROWS(N) optimization hint is entirely unneeded here, and may hurt performance in some scenarios on Oracle databases.

    The whole point of FIRST_ROWS(N) is to advise the query optimizer to produce an execution plan which is optimized to return the first N rows. With most queries, this often results in misguided query plans, and worst of all, overrides the database-wide initialization parameter "optimizer_mode" without the developer's knowledge.

    The transparent addition of this hint to the query would be considered misbehavior by most DBAs, and thus I strongly advise to remove it from limit/offset query generation.

  10. Mike Bayer repo owner

    Hello -

    The entire usage scheme comes from an article on Oracle's website, which is here:

    http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html

    Don Burleson talks about it here, and seems to agree that it does improve performance for queries that only fetch the first N rows:

    http://www.dba-oracle.com/art_orafaq_oracle_first_rows_sql_optimization.htm

    We are using it as recommended in these articles, in a query where we explicitly only want the first N rows.

    I did a brief Google search and could not locate any articles or documentation suggesting this setting results in "misguided" query plans, and it is a widely referenced feature. It's been recommended by two very well known Oracle DBAs, one of whom has worked for oracle for decades, with plenty of books to their credit, as well as by Oracle itself.

    Now its not like I know any better myself, since I'm not a DBA. So what would be more helpful here are specific examples of this particular optimization, in conjunction with how SQLAlchemy is using it (i.e., a statement where we definitely know we only want the first N rows), and how performance is harmed. Or at least some references indicating its harmful effects.

  11. Former user Account Deleted

    Hi zzzeek, thanks for the two links. They were quite interesting, and I have to say I was surprised to see Tom Kyte recommending the use of FIRST_ROWS. The thing that surprises me about the use of FIRST_ROWS in this case, is that Oracle would have to fetch the entire result array of the inner query, apply rownum, and then slice the needed rows from that array. Why?

    The problem with rownum is that it is generated dynamically as the rows are fetched. This prevents the following from working:

    SELECT * FROM t WHERE rownum BETWEEN 5 AND 10;

    ... which would return no rows, since the predicate prevents any rows from showing up. This is why we need to use a subquery in order to mimic LIMIT/OFFSET behavior with rownum.

    FIRST_ROWS(n) optimizes a query to return the first N rows as fast as possible, which is fine for LIMIT behavior, but does not make sense for LIMIT/OFFSET behavior - Oracle would have to fetch all the rows up to the OFFSET+LIMIT value anyway in order to determine which rows are needed. You could say that this actually optimizes performance in some cases, but only with small offset values, with small being relative to the size of the table.

    I'm sorry I can't provide any concrete performance hit examples caused by FIRST_ROWS, but I'd like to amend my previous statements and say that FIRST_ROWS does indeed have its uses - I came off a bit negative about this hint. The thing which disturbs me the most is the silent addition of it to the query. Hints are usually used as a last resort to query optimization, and I think that it should be up to the DBAs/Developers whether to use them or not.

    If any of this needs clearing up, please say so. Thanks!

  12. Mike Bayer repo owner

    OK, so lets go with a dialect flag "optimize_limits=False", which when True will apply the FIRST_ROWS(N) optimization when LIMIT/OFFSET is used. We'll leave it off by default. Agree?

  13. Log in to comment