load_workbook fails selectively on some xlsx files

Issue #1171 closed
M. Henry Linder
created an issue

Hi,

I have a series of Excel files (.xlsx, machine-generated, I believe) that I need to parse.

Most parsing is successful, but some of them fail:

In [11]: w = openpyxl.load_workbook(filename='2018/02 February2018.xlsx')

In [12]: w = openpyxl.load_workbook(filename='2017/December2017.xlsx')

In [13]: w = openpyxl.load_workbook(filename='2018/01 January2018.xlsx')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-13d8aa58fe34> in <module>()
----> 1 w = openpyxl.load_workbook(filename='2018/01 January2018.xlsx')

/usr/lib/python3/dist-packages/openpyxl/reader/excel.py in load_workbook(filename, read_only, keep_vba, data_only, guess_types, keep_links)
    184     wb.guess_types = guess_types
    185     wb.template = wb_part.ContentType in (XLTX, XLTM)
--> 186     parser.parse()
    187     wb._sheets = []
    188

/usr/lib/python3/dist-packages/openpyxl/packaging/workbook.py in parse(self)
     45         src = self.archive.read(self.workbook_part_name)
     46         node = fromstring(src)
---> 47         package = WorkbookPackage.from_tree(node)
     48         if package.properties.date1904:
     49             self.wb.excel_base_date = CALENDAR_MAC_1904

/usr/lib/python3/dist-packages/openpyxl/descriptors/serialisable.py in from_tree(cls, node)
     73             if hasattr(desc, 'from_tree'):
     74                 #descriptor manages conversion
---> 75                 obj = desc.from_tree(el)
     76             else:
     77                 if hasattr(desc.expected_type, "from_tree"):

/usr/lib/python3/dist-packages/openpyxl/descriptors/sequence.py in from_tree(self, node)
     84
     85     def from_tree(self, node):
---> 86         return [self.expected_type.from_tree(el) for el in node]

/usr/lib/python3/dist-packages/openpyxl/descriptors/sequence.py in <listcomp>(.0)
     84
     85     def from_tree(self, node):
---> 86         return [self.expected_type.from_tree(el) for el in node]

/usr/lib/python3/dist-packages/openpyxl/descriptors/serialisable.py in from_tree(cls, node)
     90                 attrib[tag] = obj
     91
---> 92         return cls(**attrib)
     93
     94

TypeError: __init__() missing 1 required positional argument: 'id'

These files appear to be valid: I can open them in, e.g., Excel and LibreOffice. However, other tools (e.g., Apache POI-based tools like ssconvert in Gnumeric) also fail on the same files.

Can anyone help with this? I apologize that I am unable to provide the files for external use, but I am happy to give info as I can...

Thank you very much!

Comments (3)

  1. CharlieC

    Without a file, or at least the relevant XML file from an XLSX archive, there is not a lot we can do about this. It's worth noting, however, that POI is known to produce invalid XML, particularly in the styles. Just because the files work with other applications doesn't make them valid.

    I'm not 100% sure, but this looks related to parsing workbook.xml. With a little debugging, in particular identifying the node or tag that fails, you should be able to extract the relevant file, check it for sensitive information, and provide it.
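
The debugging step described above could be sketched roughly as follows, using only the standard library. This is an illustrative sketch, not part of openpyxl: the helper names `find_elements_missing_rid` and `inspect_xlsx` are hypothetical, the part path `xl/workbook.xml` is the usual location but is an assumption for these files, and the set of tags checked is a guess at which workbook elements normally carry an `r:id` attribute. The traceback only says that some element's constructor was missing `id`; it does not say which tag.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for relationship ids (the "r:id" attribute) in OOXML.
REL_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
R_ID = "{%s}id" % REL_NS

# Assumption: workbook-level elements that normally carry r:id.
CANDIDATE_TAGS = {"sheet", "externalReference", "pivotCache"}


def local_name(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_elements_missing_rid(workbook_xml):
    """Return (tag, attributes) for candidate elements that lack r:id."""
    root = ET.fromstring(workbook_xml)
    missing = []
    for el in root.iter():
        if local_name(el.tag) in CANDIDATE_TAGS and R_ID not in el.attrib:
            missing.append((local_name(el.tag), dict(el.attrib)))
    return missing


def inspect_xlsx(path, part="xl/workbook.xml"):
    """Read one part from the XLSX (zip) archive and check it."""
    with zipfile.ZipFile(path) as archive:
        return find_elements_missing_rid(archive.read(part))
```

Calling `inspect_xlsx('2018/01 January2018.xlsx')` on a failing file would list any candidate elements without an `r:id`. If it returns an empty list, the problem lies elsewhere, and the extracted `workbook.xml` itself is still the file to review for sensitive data and attach.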
