dump_lxml fails to write dimensions, causes problems for IterableWorksheet when cells are empty

Issue #393 resolved
Michael Barrientos created an issue

When using Workbook(write_only=True) with lxml installed, the output xlsx file does not contain dimensions for each sheet (i.e. it is missing the <dimension ref="A1:I458"> tag).

When such a workbook is read with openpyxl.load_workbook(use_iterators=True), this causes the rows spit out by the iterator to fail to contain the empty columns (which should be None), thus compressing the columns and making them too short if the columns are intentionally empty. This is because inside IterableWorksheet.get_squared_range(), it uses the max_col from the sheet to determine how to fill in those EMPTY_CELL values (specifically is uses max_col to populate the expected_columns field, which becomes empty when max_col is None. Because those dimensions are missing, the empty cells are not generated.

This does not happen if lxml support is forcefully shut off via environment variable; DumpWorksheet does generate the dimensions tag.

I guess this is technically correct behavior because technically the dimensions tag is optional, and the cells returned do have column information allowing reconstruction. However it is very confusing when saving the data in Excel/non-write-only-Worksheet/DumpWorksheet has one behavior when read, and reading the same data written via LXMLWorksheet has another behavior.

My naive attempt to write a pull request by adding a dim = Element('dimension', {'ref': 'A1:%s' % self.get_dimensions()}) line after writing the sheetPr tag failed, because the dimensions haven't been calculated yet at that point. Perhaps someone with more knowledge of the codebase may help.

This might also be seen as 2 bugs, the other being with the IterableWorksheet reader's handling of missing dimensions. This could still be relieved without dimensions if the reader looked at the spans="1:9" attribute on a row tag and filled in blanks appropriately.

Comments (25)

  1. Eric Gazoni

    and thank you @mbarrien for the very detailed and accurate bug description ... I wish I read it before I spent time looking at it myself ;-)

  2. CharlieC

    As you note Dimension is an optional attribute. It's missing from any implementation that streams because it has to be at the start of the XML file but the dimensions will not be known. Editing the file subsequently is very inefficient. A better approach would to include the dimensions in some metadata which could be written and read independently of the cells, but unfortunately that's not available.

    The differences between the implementations of the write_only=True mode will disappear in 2.2 when the SAX-based XML generator is removed because it's actually very slow.

    The IterableWorksheet exposes this problem from the other end: without reliable dimensions the only way to know the size of a worksheet is to iterate through every single cell. This is not really desirable if you only want to access a small part. :-(

    Could you please supply a sample file that doesn't work properly in when read_only=True?

  3. Michael Barrientos reporter

    Quick and dirty. Using openpyxl 2.1.3:

    import openpyxl
    assert openpyxl.LXML
    lx_workbook = openpyxl.Workbook(write_only=True)
    lx_sheet = lx_workbook.create_sheet(title='Sheet')
    lx_sheet.append([1, 2, 3, 4, 5])
    lx_sheet.append([None, None, 1, 2, 3])
    lx_sheet.append([1, 2, 3, None, None])
    print('LXML write-only, normal read')
    in_workbook = openpyxl.load_workbook('foo_lx.xlsx')
    in_sheet = in_workbook.get_sheet_by_name('Sheet')
    for row in in_sheet.iter_rows():
        print([cell.value for cell in row])
    print('LXML write-only, read-only')
    ro_workbook = openpyxl.load_workbook('foo_lx.xlsx', read_only=True)
    ro_sheet = ro_workbook.get_sheet_by_name('Sheet')
    for row in ro_sheet.iter_rows():
        print([cell.value for cell in row])
    from openpyxl.writer.dump_worksheet import DumpWorksheet
    dw_workbook = openpyxl.Workbook(write_only=True)
    dw_workbook._optimized_worksheet_class = DumpWorksheet
    dw_sheet = dw_workbook.create_sheet(title='Sheet')
    dw_sheet.append([1, 2, 3, 4, 5])
    dw_sheet.append([None, None, 1, 2, 3])
    dw_sheet.append([1, 2, 3, None, None])
    print('DumpWorksheet write-only, normal read')
    in_workbook = openpyxl.load_workbook('foo_dw.xlsx')
    in_sheet = in_workbook.get_sheet_by_name('Sheet')
    for row in in_sheet.iter_rows():
        print([cell.value for cell in row])
    print('DumpWorksheet write-only, read-only')
    ro_workbook = openpyxl.load_workbook('foo_dw.xlsx', read_only=True)
    ro_sheet = ro_workbook.get_sheet_by_name('Sheet')
    for row in ro_sheet.iter_rows():
        print([cell.value for cell in row])

    Output is:

    LXML write-only, normal read
    [1, 2, 3, 4, 5]
    [None, None, 1, 2, 3]
    [1, 2, 3, None, None]
    LXML write-only, read-only
    [1, 2, 3, 4, 5]
    [1, 2, 3]
    [1, 2, 3]
    DumpWorksheet write-only, normal read
    [1, 2, 3, 4, 5]
    [None, None, 1, 2, 3]
    [1, 2, 3, None, None]
    DumpWorksheet write-only, read-only
    [1, 2, 3, 4, 5]
    [None, None, 1, 2, 3]
    [1, 2, 3, None, None]

    Notice the rows in "LXML write-only, read-only" are shorter. Yes they have coordinates if I examined each cell. But the differing behavior is unexpected.

  4. CharlieC

    Thanks very much for the example. Looks to me like only the second row in read-only mode is wrong: trailing cells set to None should be omitted unless they are formatted. 2.1 uses the same code for reading ranges in standard and read-only mode so the bug is in the writer.

  5. CharlieC

    Okay, I've got a solution for missing cells at the beginning of a row but rows cells missing along the way are going to be much more difficult: [1, 2, None, None, 3]

  6. CharlieC

    While the changes solve this I'm not actually convinced if the behaviour in read-only mode with unsized worksheets should be to fill gaps. We can't add trailing cells so the third row will always be [1, 2, 3] and there is a potential performance hit on large sheets with lots of gaps. This could be configurable but I'd also like to know what client code looks like or what are the issues faced when rows only contain explicitly defined cells.

    We might add a method that will forcibly calculate the dimensions of an unsized worksheet. Unfortunately, there's no way to do this apart from going through all the cells first.

  7. Eric Gazoni

    well, at least for me, the client code relied on the presence of a dimension attribute and crashed badly if it was not present, so I'll just change the client code not to rely on an optional attribute.

  8. Eric Gazoni

    well, when I run the script attached in the comment from @mbarrien I keep getting the same output, including the "trimmed" rows for LXML write-only, read-only, but I'm unsure this is something that was addressed by your most recent commit, or if I misunderstood something

  9. CharlieC

    Ah, I see what you mean. Looks like the call to get_squared_range() doesn't do things right.

    Do you have a problem with missing cells?

  10. CharlieC

    You'll never get trailing cells because we don't know how long the rows are.

    The implementation in 2.2 is a little more robust so cells missing before the last actual cell in row will be created. Still working on this but the essence can be backported.

  11. Log in to comment