Too many EmptyCell's retrieved from ReadOnlyWorksheet

Issue #908 resolved
Pravin Kumar
created an issue

One extra EmptyCell is retrieved when an empty range is selected. I'm not sure why any cells are retrieved at all, I thought an empty set should be returned. This only happens on the attached workbook, which is a stripped-down version of a production one I am working with. I created this workbook with Excel 2016.

Here is a minimal example, applied on the attached excel workbook:

import openpyxl

wb = openpyxl.load_workbook(filename='test_empty.xlsx', read_only=True)
ws = wb['Main']

# Should be length 2, but is length 3
ref = 'D2:E2'
print(ref)
print([x.value for x in ws[ref][0]])
print(len(ws[ref][0]))

# Should be length 5 and is length 5
ref2 = 'A2:E2'
print(ref2)
print([x.value for x in ws[ref2][0]])
print(len(ws[ref2][0]))

Outputs

D2:E2
[None, None, None]
3
A2:E2
['a', 'b', None, None, None]
5

openpyxl 2.4.8 (pip installation) Python 3.6.2 (v3.6.2:5fd33b5, Jul 8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32

Windows 10, Excel 2016

Screenshot of test_empty.xlsx: Capture.PNG

Screenshot from my actual project, debugged in PyCharm: Capture2.PNG

I'm happy to have a go at fixing this bug, if you can give me an idea of where to start. Thanks

Comments (9)

  1. CharlieC

    Thanks for the report, I'll investigate but feel free to beat me to it. NB some of this stuff is tricky!

    When it comes to EmptyCells it's worth thinking of Worksheets as tables with 16384 columns and a million rows. Any kind of sparse matrix is an implementation detail.

    When a range is requested the implementation might return a motley collection of cells in the form of uneven or "ragged" rows which can make processing them more difficult. EmptyCells are added as padding to guarantee the shape of the returned range. It also ensures consistent behaviour between read-only and standard modes because in standard mode you might be using ws.iter_rows() to create cells. Furthermore, in read-only mode due to the iterative nature it's not possible to know in advance whether result set will be empty. EmptyCells are reliable sentinels for client code that the relevant cells have neither value not style.

  2. CharlieC

    FWIW I've just pushed more useful repr methods for ReadOnlyCells and the EmptyCell. Tests for this are in tests/test_iter.py and are supposed to show the same behaviour across implementations. _cells_by_row is probably the place to start.

  3. Austin Rogers

    Hi guys, first-time contributor here, but it looks like it is an issue with _get_row in read_only.py. In the case of an all-empty row, safe_iterator yields the first non-empty cell in the row and then raises StopIteration. Thus, column can be set to a value less than min_col if there are non-empty cells earlier than the requested row-slice. In this scenario, the last line in the for-loop becomes the issue: col_counter = column + 1 sets col_counter to a value less than min_col. The final for-loop now begins yielding empty cells starting at the index col_counter despite it being smaller than min_col.

    I believe bounding the start index of the final for-loop range function to max(min_col, col_counter) fixes the issue quite nicely. I pulled down the code, made the change, and and ran a limited set of functional tests to confirm.

    for _ in range(col_counter, max_col+1):
        yield EMPTY_CELL
    

    needs to be updated to:

    for _ in range(max(min_col, col_counter), max_col+1):
        yield EMPTY_CELL
    

    I can make a pull-request if this makes sense to you, Charlie.

  4. CharlieC

    @Austin Rogers Sounds reasonable and as long as it doesn't break the existing tests should be fine. Please run tox -e py27 py34 -- -xrf and if this is okay then please submit a PR.

    Testing this stuff is tricky because OOXML allows rows to be really messy and debugging is difficult because, by design, EMPTY_CELL is singleton. If you look at the history of the code I hope you can see that I managed to clean it up over time!

  5. Log in to comment