2.5 - Fix iter_rows() not returning empty rows at bottom of range On ReadOnly WB

#295 Merged at 0c95dd7
Repository
stephenrauch
Branch
2.5
Repository
openpyxl
Branch
2.5
Author
  1. Stephen Rauch
Reviewers
Description
  • Fix iter_rows() not returning empty rows at bottom of range On ReadOnly WB

Comments (7)

    1. Stephen Rauch author

      Yes that was the general intent. Yielding the empty row tuples should cost very little, and the consuming code is already on the hook for dealing with empty rows before or inside the block.

      In addition this would match the behavior of non-readonly version of iter_rows.

      1. CharlieC

        Yeah, there's no real performance hit unless you do something like max_row=20000000 (no limit here but standard mode will complain) but I'm not convinced that it's a good idea. Standard mode will create cells that don't exist already so you can sort of see why behaviour should differ. Standard mode can also create filler rows of the same length. We can't do this here without a value for max_col.

        Do you have a use-case where you’re flipping between modes so need orthogonality? This might make sense if you have read_only && data_only comibining with standard mode and want to zip over them.

        1. Stephen Rauch author

          I have a use for cell values and formula at the same time, so am managing two copies of the same workbook. The formulas book is not readonly since I also need cell attributes. The differing behavior on the iterator means I have to align by hand, and at the level of my code, I have an range address, not sizes. Also the differing behavior seemed odd given that consumers already had to deal with empty rows. No troubles, I am sure l can monkey patch something in or sort this in my code. Thanks.

          1. CharlieC

            I'm not declining the PR just trying to weigh it up. I'm assuming the ranges are oversized, right? My suggestion would be to check the boundaries against the worksheet and adjust if necessary. It does seem odd to have a range that extends beyond the boundaries of a worksheet. Read-only will give you ReadOnlyCells which have (nearly) all of the attributes of normal cells. I'm introducing new values_only parameter in 2.6 which will give a performance boost for many and would probably help in this situation: open the workbook twice in read-only mode, once in data-only mode, values-only mode.

            But as to whether you should be able to iterate over more rows than a worksheet currently contains I'm still undecided. The rule of consenting adults says that if the client asks for it, then they should get it even if it means 1,000,000 rows and 16,384 columns. This is, of course, that this contradicts the implicit contract of read-only mode which is to provide an immutable worksheet. Would it make sense to stop ws.iter_rows() exceeding the boundaries of the current worksheet?

            1. CharlieC

              As it happens, I think I’ve found a use case that wasn’t covered by existing tests and code where max_row < ws.max_row but there are rows missing in the source. The current implementation will stop as the last row which is less than max_row I’m working on this in the 2.6 branch, which you should soon be able to test. The code is very similar but should prevent the rows beyond sheet boundaries.

  1. CharlieC

    I've merged this but applied the same condition as in 2.6: you won't be able to iterate beyond the existing bounds of a worksheet. 2.6 has a test case which exposed a bug in the logic in some sheets due to the sparse matrix XML. I guess we should also examine the logic for column widths: min(max_col, self.max_column, 16384) though I suspect that in practice this is less of an issue.