Can't read more than 65536 lines from excel file

Issue #568 wontfix
Joni Syri created an issue

Attached is excel file containing 100.000 lines. When trying to read contents with code below, reading stops after line 65536 as if end of file had been reached.

Same problem occurs on other files, so the problem doesn't seem to be related to this particular file

from openpyxl import load_workbook

wb = load_workbook('test.xlsx', read_only=True)
ws = wb.worksheets[0]
for i, row in enumerate(ws.rows):
    print(i, row[0].value)

Comments (8)

  1. CharlieC

    Can you supply a test file? There is a bug in OpenOffice / LibreOffice which creates worksheets with nominal dimensions of up to 65536 rows. openpyxl treats this information as reliable (you can check with ws.max_row as otherwise it has to parse a whole file to find the size, which we want to avoid. You can force recalculation with ws.calculate_dimension(force=True) but the simplest solution is to use iter_rows() with a max_row of 100000.

    As the bug is in the creating application I don't think that openpyxl should really try a workaround. The dimensions value is totally optional (and is missing from any application that streams data) so it's reasonable to trust it if it is there.

  2. Joni Syri reporter

    Here's test file, I forgot to attach that. Test file is made with LibreOffice, but I have while which I think is made with MS office with same problem (unfortunately I can't post it here).

    Calling ws.calculate_dimension(force=True) before iteration doesn't seem to change the behavior. Specifying manual upper limit works, but isn't very nice solution

  3. CharlieC

    Please check the "dimension" element of the XML source for the worksheet in the file you can't supply.

    I think the more recent versions of LibreOffice have fixed the bug so you might get around it by upgrading it. As I said above: it's not really openpyxl's problem is the file supplied is broken – garbage in = garbage out.

    The problem with ws.calculate_dimension(force=True) is that for it to work, all rows must be parsed so I guess it only really makes sense for unsized worksheets. You're best solution might just be to set ws.max_row = None

  4. Joni Syri reporter

    The dimension on the other file is "A1:Q65536", so it seems that's the problem.

    Setting ws.max_row = None works, so I guess that's the way to go for now.

  5. CharlieC

    Worksheet dimensions are an optional optimisation at the start of the file so that parsing code does not have read all cells to find out the size of the worksheet, as is the case for read-only mode. Worse than useless when it's wrong (we omit it ourselves in write-only mode) but that's not openpyxl's problem.

  6. Log in to comment