use_iterators=True causes load_workbook to fail

Issue #484 resolved
Matt Restko
created an issue

I am able to open a relatively large file using load_workbook(filename) but it is extremely slow taking about 30 seconds to open the file. Adding the use_iterators=True parameter causes the load to fail with the following error:

Traceback (most recent call last):
  File "store_data.py", line 345, in <module>
    main()
  File "store_data.py", line 338, in main
    data_only=True, use_iterators=True)
  File "/home/matt/.local/lib/python3.4/site-packages/openpyxl/reader/excel.py", line 149, in load_workbook
    _load_workbook(wb, archive, filename, read_only, keep_vba)
  File "/home/matt/.local/lib/python3.4/site-packages/openpyxl/reader/excel.py", line 232, in _load_workbook
    worksheet_path=worksheet_path)
  File "/home/matt/.local/lib/python3.4/site-packages/openpyxl/reader/worksheet.py", line 324, in read_worksheet
    worksheet_path, xml_source, shared_strings, style_table)
  File "/home/matt/.local/lib/python3.4/site-packages/openpyxl/worksheet/iter_worksheet.py", line 73, in __init__
    dimensions = read_dimension(self.xml_source)
  File "/home/matt/.local/lib/python3.4/site-packages/openpyxl/worksheet/iter_worksheet.py", line 37, in read_dimension
    min_col, min_row, sep, max_col, max_row = m.groups()
AttributeError: 'NoneType' object has no attribute 'groups'

Peeking into the source, it seems to be caused by a failure of the regular expression utils.ABSOLUTE_RE to detect the dimensions of the file. I can't post my file because it contains confidential information. However, I was able to unzip the file and take a look at the XML. I believe this is the tag that is causing the failure of the iterator parser:

<dimension ref="1:113"/>

Looking at the ABSOLUTE_RE expression, it seems that there is no match to be had.

I wasn't able to get past this point to figure out why the file opens normally despite this bad formatting in my file.

Comments (12)

  1. Charlie Clark

    Thanks for the report. That looks like an invalid value to me. The spec says:

    """ The row and column bounds of all cells in this worksheet. Corresponds to the range that would contain all c elements written under sheetData. Does not support whole column or whole row reference notation. """ But the schema is lax. The element is optional and ReadOnlyWorksheets work quite happily without it so you can add a workaround to catch the exception.

  2. Matt Restko reporter

    Sorry, I made a mistake in my original report. My file only opens in full access mode, not with the read_only=True attribute set.

    Thank you for your reply but I am not quite sure how to implement your suggestion. Should I modify the openpyxl code to directly insert a try/except block? Or is there a less-invasive way of doing this?

  3. Charlie Clark

    You'll have to modify the code. Something like this should do it.

    diff -r 497cde6a46c3 openpyxl/worksheet/read_only.py
    --- a/openpyxl/worksheet/read_only.py   Sat Jun 27 15:10:12 2015 +0200
    +++ b/openpyxl/worksheet/read_only.py   Sat Jun 27 16:56:36 2015 +0200
    @@ -34,6 +34,8 @@
             if element.tag == DIMENSION_TAG:
                 dim = element.get("ref")
                 m = ABSOLUTE_RE.match(dim.upper())
    +            if m is None:
    +                return
                 min_col, min_row, sep, max_col, max_row = m.groups()
                 min_row = int(min_row)
                 if max_col is None or max_row is None:
    
  4. Matt Restko reporter

    Unfortunately it seems like this workaround is not quite complete. The file now loads, but trying to iterate through the cells in the first column of the sheet causes this error:

      File "/home/matt/.local/lib/python3.4/site-packages/openpyxl/worksheet/worksheet.py", line 780, in columns
        for col_idx in range(self.max_column):
    TypeError: 'NoneType' object cannot be interpreted as an integer
    
  5. Charlie Clark

    Glad to hear it's working for you. You can now use ws.columns in this context but you'll get a warning that the sheet is unsized and how to fix it. As the data is stored row by row in the XML, this is likely to be extremely inefficient. Far better to have a means of extracting the columns from the rows as you go.

  6. Log in to comment