Issue with corruption of xlsx inside a docx

Issue #1194 resolved
Baboin Arthur created an issue

Hello,

I've got an issue with the new release of openpyxl (2.6.0), with 2.5.14 I was able to edit xlsx within a docx (I unzip the docx and work with the xlsx directly).
But with the new update this is not possible as an error message occur every time i try to open the file from word (cf the annexes)
after digging into the xml of word i found that nothing changed just the xlsx.
When trying to open the xlsx excel offer me to repair the file, but even after repairing the xlsx word believe that the file is broken.

Can you fix that?
P.S I'm sorry to not been able to share with you the docx file involved as they are confidential but I've joined an corrupted xlsx produced by the 2.6.0. you should also be able to reproduce the issue just by:
- unziping a docx
- update the xlsx of any charts with openpyxl (word/embeddings)
- zip the directory and rename the extension to docx

Best regards,

Comments (12)

  1. CharlieC

    I've no idea about the error in Word, and won't be able to do anything about it directly but Excel doesn't like the table you've created in the worksheet.

    Can you please provide the code you used to create this file?

  2. Baboin Arthur reporter

    I cannot share with you the full code to produce the output but basicaly here is what is done:

    class DummyClass(object):
        def load_new_xlsx(self, xlsx_path):
            self.book = openpyxl.load_workbook(xlsx_path)
            self.book_path = xlsx_path
        def update_xlsx(self, data, xlsx: str = None):
            """
            update an xlsx file with data
            :param data:list of data to update [1,2,3] or [(1,2,3),(4,5,6)] (list of value will update second row, list of tuple let you modify each row)
            :param xlsx: if set load the xlsx instead of using the loaded one
            :return: None
            """
            if xlsx:
                self.load_new_xlsx(xlsx) # basicaly 
            if self.book is None:
                error("No xlsx loaded cannot update None")
                return
            sheet = self.book.active
            for i in range(len(data)):
                dat = data[i]
                if isinstance(dat, tuple) or isinstance(dat, CommentedSeq) or isinstance(dat, list):
                    for y in range(len(dat)):
                        value = dat[y]
                        if value is None:
                            pass
                        else:
                            sheet.cell(row=i + 2, column=y + 1).value = value
                else:
                    sheet.cell(row=i + 2, column=2).value = dat
    
            self.book.save(self.book_path)
            info("{} has been successfully updated".format(self.book_path))
    
    if __name__ == '__main__':
        data=[{"legend": "XXX",
                                                     "values": [("Open", 23, "8a5c3d"), ("Closed", 20, "4286f4")],
                                                     }]
        d=DummyClass()
        d.update_embedded_xlsx("excel.xlsx",
                                      [(data[0], data[1], data[2] if len(data) > 2 else None) for data in
                                       values[0].get("values")])
    
  3. CharlieC

    I need a file and code than I can run locally otherwise I cannot really investigate this beyond looking at the table.

  4. Baboin Arthur reporter

    here is a working code who produce the bug

    import openpyxl
    from ruamel.yaml.comments import CommentedSeq
    
    
    class DummyClass(object):
        def load_new_xlsx(self, xlsx_path):
            self.book = openpyxl.load_workbook(xlsx_path)
            self.book_path = xlsx_path
    
        def update_xlsx(self, data, xlsx: str = None):
            """
            update an xlsx file with data
            :param data:list of data to update [1,2,3] or [(1,2,3),(4,5,6)] (list of value will update second row, list of tuple let you modify each row)
            :param xlsx: if set load the xlsx instead of using the loaded one
            :return: None
            """
            if xlsx:
                self.load_new_xlsx(xlsx)
            if self.book is None:
                error("No xlsx loaded cannot update None")
                return
            sheet = self.book.active
            for i in range(len(data)):
                dat = data[i]
                if isinstance(dat, tuple) or isinstance(dat, CommentedSeq) or isinstance(dat, list):
                    for y in range(len(dat)):
                        value = dat[y]
                        if value is None:
                            pass
                        else:
                            sheet.cell(row=i + 2, column=y + 1).value = value
                else:
                    sheet.cell(row=i + 2, column=2).value = dat
    
            self.book.save("updated_excel.xlsx")
    
    
    if __name__ == '__main__':
        d = DummyClass()
        d.update_xlsx([("Open", 23, "8a5c3d"), ("Closed", 20, "4286f4")], "excel.xlsx")
        exit(0)
    

    you can use any excel sheet to produce the output

  5. CharlieC

    OK, the problem seems to be related to first column heading being set to spaces. Excel doesn't like this, which is why it will set a default value if you clear the cell. Unfortunately, if you use spaces Excel has a problem synchronising the table with the worksheet and relies on xml:space being set to preserve whitespace in the sharedString.xml. In openpyxl 2.6 we switched to using inline strings instead for performance reasons and there is no way to replicate the setting. So the only solution is to give the column heading a name or simply remove declare the table as not having a header row.

    NB. this is really a bug or "feature" of Excel allowing you to pretend that a column is called " " and this is invisible, openpyxl produces valid OOXML and even down to duplicating the value of the cell in the column.

  6. Baboin Arthur reporter

    Sorry for the late replied (got a lot of urgent work to do)

    I'm not sure to understand exactly what you are saying but basically, from what i understand i should either set the first line with data (A1,B1,C1 ...) instead of not touching it or the other option is to unset the heading of rows. (not sure how to do it but I think i will use with this )
    Am i right?

    Thank you very much for your reply.

  7. CharlieC

    Yes, basically deactivating the table header is the way to deal with this, and it makes sense if you think about it. See this screenshot. Sorry, but my local Excel is in German.

    Bildschirmfoto 2019-02-21 um 12.55.58.png

  8. Baboin Arthur reporter

    Thank you very much for your reply :)

    In fact i also got a similar issue with openpyxl==2.5.14 when i got some headers set but not all of them, just filling the missing one with a title patched the issue.

    another time thank you.

    P.S you can close the issue.

  9. Baboin Arthur reporter

    In case other people has some similar problems,
    i found out that the heading is not always at the first line it can be any cell (I mean i got the bug presented before with a excel with a serie starting at F11 but with a blank title, F1 was filled with a title)

  10. Log in to comment