Readonly worksheet returns empty rows as empty lists

Issue #1195 new
Charles Pigott
created an issue

After upgrading from 2.5.3 to 2.6.0, I found that empty rows (but only those without any formatting?) were now being returned as an empty list, rather than a list of empty cells, as was happening previously.

With a readonly workbook created and iterated like:

workbook_zip = openpyxl.load_workbook(filename, data_only=True, read_only=True)
for sheet_name in workbook_zip.sheetnames:
    for row in workbook_zip[sheet_name]:
         print(row)

I traced this back to the top of _get_row in worksheet/_read_only.py, which bypasses the following logic that pads the row with empty cells:

if not row:
    return ()

But it's not clear where to go from there - whether the issue is in the handling of the empty row, or the parser that produced it

I'm not entirely sure if this wasn't an intentional change, but it's a breaking change that wasn't mentioned in the changelog, so...

Comments (8)

  1. CharlieC

    I think the issue is that, in some situations, the width of the row is unknowable, so we can't even return an EMPTY_ROW. Rows themselves can be sparsely implemented so they might not even exist in a workbook. As soon as you pass in min_col and max_col, you should get EMPTY_ROWs of the right length. AFAIK this behaviour hasn't changed significantly in 2.6 but maybe I missed something.

  2. Charles Pigott reporter

    Not sure what to tell you, but this has definitely changed during the upgrade

    I've added a workaround of:

    row = row or [EMPTY_CELL] * workbook_zip[sheet_name].max_column
    

    for my own code, which seems to work well enough

    This was happening with a lot of our tests, so I can't think it's that hard to trigger, but I can get you a spreadsheet to test with if you like.

  3. CharlieC

    Well, in theory rows should be expanded by _cells_by_row() and the worksheet's max values should be respected, if they're known so that you get evenly sized rows if possible.

    But otherwise you should always be able to loop over a row. I don't really want to work through a workbook. Should be possible to write some relevant tests to see where any bug is.

  4. Jovan Kurjakov

    Hi, I am having a similar issue. But in my case ws.rows in not generating list of cells but something else. What was working with 2.5.x is not working when upgrading to 2.6.1, when using ws.rows property to fetch all rows. As consequence I changed .rows to .iter_rows(). Maybe I am missing something this is how I use it.

    def parse_average_only(excel_file):
        if not os.path.isfile(excel_file):
            raise FileNotFoundError('{} not found'.format(excel_file))
    
        wb = load_workbook(filename=excel_file, read_only=True, data_only=True)
        ws = wb[wb.sheetnames[0]]
        return _parse_average(ws)
    

    ... def _parse_average(ws) : # this is where I use ws.rows to make return values via comprehension lists

    Any suggestion that using ws.rows is wrong in this case or maybe something is lost with the object when passing it to other function ?

  5. CharlieC

    I'm still not really sure what the issue here is. This is what a worksheet in 2.5 looks like: firstly, if the dimensions are known; secondly if these are not. You can see that you get "ragged" rows if the dimensions are not known or supplied.

    for row in ws:
        print(row)
    
    (<ReadOnlyCell 'Tabelle1'.A1>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <ReadOnlyCell 'Tabelle1'.E3>)
    ws._max_column = ws._max_row = None
    for row in ws:
        print(row)
    
    (<ReadOnlyCell 'Tabelle1'.A1>,)
    []
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <ReadOnlyCell 'Tabelle1'.E3>)
    

    Behaviour in 2.6 is identical, aside from the new method for clearing the dimensions.

    from openpyxl import load_workbook
    wb = load_workbook("rows.xlsx", read_only=True)
    ws = wb.active
    for row in ws:
        print(row)
    
    (<ReadOnlyCell 'Tabelle1'.A1>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <ReadOnlyCell 'Tabelle1'.E3>)
    ws.reset_dimensions()
    for row in ws:
        print(row)
    
    (<ReadOnlyCell 'Tabelle1'.A1>,)
    []
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <ReadOnlyCell 'Tabelle1'.E3>)
    
  6. Charles Pigott reporter

    Except it's not identical -

    import sys
    import openpyxl
    print("openpyxl version:", openpyxl.__version__)
    wb = openpyxl.load_workbook(sys.argv[1], read_only=True)
    for sheet_name in wb.sheetnames:
        print("Sheet:", sheet_name)
        for row in wb[sheet_name]:
            print(row)
    

    With the linked spreadsheet:

     $ python openpyxl1195.py openpyxl1195.xlsx
    openpyxl version: 2.5.3
    Sheet: Description
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<ReadOnlyCell 'Description'.A2>, <ReadOnlyCell 'Description'.B2>, <ReadOnlyCell 'Description'.C2>, <ReadOnlyCell 'Description'.D2>, <ReadOnlyCell 'Description'.E2>, <ReadOnlyCell 'Description'.F2>, <ReadOnlyCell 'Description'.G2>, <ReadOnlyCell 'Description'.H2>, <ReadOnlyCell 'Description'.I2>, <ReadOnlyCell 'Description'.J2>, <ReadOnlyCell 'Description'.K2>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B3>, <ReadOnlyCell 'Description'.C3>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B4>, <ReadOnlyCell 'Description'.C4>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B5>, <ReadOnlyCell 'Description'.C5>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B6>, <ReadOnlyCell 'Description'.C6>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<ReadOnlyCell 'Description'.A8>, <ReadOnlyCell 'Description'.B8>, <ReadOnlyCell 'Description'.C8>, <ReadOnlyCell 'Description'.D8>, <ReadOnlyCell 'Description'.E8>, <ReadOnlyCell 'Description'.F8>, <ReadOnlyCell 'Description'.G8>, <ReadOnlyCell 'Description'.H8>, <ReadOnlyCell 'Description'.I8>, <ReadOnlyCell 'Description'.J8>, <ReadOnlyCell 'Description'.K8>)
    (<EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B10>, <ReadOnlyCell 'Description'.C10>, <ReadOnlyCell 'Description'.D10>, <ReadOnlyCell 'Description'.E10>, <ReadOnlyCell 'Description'.F10>, <ReadOnlyCell 'Description'.G10>, <ReadOnlyCell 'Description'.H10>, <ReadOnlyCell 'Description'.I10>, <ReadOnlyCell 'Description'.J10>, <ReadOnlyCell 'Description'.K10>)
    
     $ python openpyxl1195.py openpyxl1195.xlsx
    openpyxl version: 2.6.0
    Sheet: Description
    ()
    (<ReadOnlyCell 'Description'.A2>, <ReadOnlyCell 'Description'.B2>, <ReadOnlyCell 'Description'.C2>, <ReadOnlyCell 'Description'.D2>, <ReadOnlyCell 'Description'.E2>, <ReadOnlyCell 'Description'.F2>, <ReadOnlyCell 'Description'.G2>, <ReadOnlyCell 'Description'.H2>, <ReadOnlyCell 'Description'.I2>, <ReadOnlyCell 'Description'.J2>, <ReadOnlyCell 'Description'.K2>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B3>, <ReadOnlyCell 'Description'.C3>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B4>, <ReadOnlyCell 'Description'.C4>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B5>, <ReadOnlyCell 'Description'.C5>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    (<EmptyCell>, <ReadOnlyCell 'Description'.B6>, <ReadOnlyCell 'Description'.C6>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>, <EmptyCell>)
    ()
    (<ReadOnlyCell 'Description'.A8>, <ReadOnlyCell 'Description'.B8>, <ReadOnlyCell 'Description'.C8>, <ReadOnlyCell 'Description'.D8>, <ReadOnlyCell 'Description'.E8>, <ReadOnlyCell 'Description'.F8>, <ReadOnlyCell 'Description'.G8>, <ReadOnlyCell 'Description'.H8>, <ReadOnlyCell 'Description'.I8>, <ReadOnlyCell 'Description'.J8>, <ReadOnlyCell 'Description'.K8>)
    ()
    (<EmptyCell>, <ReadOnlyCell 'Description'.B10>, <ReadOnlyCell 'Description'.C10>, <ReadOnlyCell 'Description'.D10>, <ReadOnlyCell 'Description'.E10>, <ReadOnlyCell 'Description'.F10>, <ReadOnlyCell 'Description'.G10>, <ReadOnlyCell 'Description'.H10>, <ReadOnlyCell 'Description'.I10>, <ReadOnlyCell 'Description'.J10>, <ReadOnlyCell 'Description'.K10>)
    

    Note the empty rows. I suspect it's relevant that if I save the spreadsheet in Excel, I can't reproduce the issue - only when saved with LibreOffice (5.2).

  7. CharlieC

    Looks like the dimensions are missing but difficult to tell without a file. Use ws.calculate_dimension(True) to fix that.

    But back to the API: you're supposed to be able to loop over any row and this is always possible.

  8. Charles Pigott reporter

    I linked the spreadsheet above - https://www.dropbox.com/s/8ctazrhc1a9nzp2/openpyxl1195.xlsx?dl=0

    I get that, my "main" issue with the whole thing is that this was a breaking change that wasn't mentioned in the changelog

    I'll give calculate_dimension a go, to see if that can replace my previous hack

    EDIT: Nope, calculate_dimension returns A1:K10, even when unforced, and calling _calculate_dimension directly results in :

      File "[snip]/openpyxl/worksheet/_read_only.py", line 159, in _calculate_dimension
        max_col = max(max_col, cell.column)
    AttributeError: 'EmptyCell' object has no attribute 'column'
    
  9. Log in to comment