File with external links corrupts

Issue #711 resolved
Gericke Potgieter
created an issue

I seemed to have chanced on a bug - if not please advise.

Environment: Python 3.5.2 Openpyxl version: 2.4.0

Replication Process:

Opening a file with one or more external link, then saving it.

Code used for replication:

def test_external_link_corruption():
    file_path = 'has external links.xlsx'

    tfile = load_workbook(file_path)
    sh = tfile['Sheet1']
    print('RANDOM BORDER 1')
    print(sh['f6'].border)
    print(sh['f6'].border.top)
    print(sh['f6'].border.left)
    print(sh['f6'].border.bottom.color)
    print(sh['f6'].border.right.color)

    print('RANDOM BORDER 2')
    print(sh['f40'].border)
    print(sh['f40'].border.top)
    print(sh['f40'].border.left)
    print(sh['f40'].border.bottom.color)
    print(sh['f40'].border.right.color)
    tfile.save(file_path)

Replicated Symptom:

On opening the file in Excel, it requires repair, removing the external formula

Excel Error.jpg

Original Process:

Opening an excel file, running many functions, saving it, then reopening it for further processing.

Original Symptom:

The program generates the following error:

 Traceback (most recent call last):
  File "/home/gericke/PycharmProjects/xxxx/ss_integrity/test_base.py", line 1229, in test_all
    build_results(file_name, all_test_counts, test_results, get_stats, get_con, test_file_names)
  File "/home/gericke/PycharmProjects/xxxx/ss_integrity/report_results.py", line 471, in build_results
    results_overview(file_name, all_results, test_file_names[2])
  File "/home/gericke/PycharmProjects/xxxx/ss_integrity/report_results.py", line 23, in results_overview
    full_workbook = load_workbook(file_name, read_only=False)
  File "/usr/local/lib/python3.5/dist-packages/openpyxl/reader/excel.py", line 161, in load_workbook
    parser.parse()
  File "/usr/local/lib/python3.5/dist-packages/openpyxl/packaging/workbook.py", line 56, in parse
    read_external_link(self.archive, rel.Target)
  File "/usr/local/lib/python3.5/dist-packages/openpyxl/workbook/external_link/external.py", line 189, in read_external_link
    deps = get_dependents(archive, link_path)
  File "/usr/local/lib/python3.5/dist-packages/openpyxl/packaging/relationship.py", line 128, in get_dependents
    src = archive.read(filename)
  File "/usr/lib/python3.5/zipfile.py", line 1232, in read
    with self.open(name, "r", pwd) as fp:
  File "/usr/lib/python3.5/zipfile.py", line 1255, in open
    zinfo = self.getinfo(name)
  File "/usr/lib/python3.5/zipfile.py", line 1199, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'xl/worksheets/_rels/sheet1.xml.rels' in the archive"

Please find attached a sample file used for replicating the error. The 'has external links.xlsx" file is opened, then saved, which results in a broken file which I attach as "has external links - broken.xlsx."

Comments (8)

  1. Gericke Potgieter reporter

    I took some time to run comparisons on the test file. The workbook saved via Openpyxl seems to be missing the calcChain.xml file. I tried tracing the issue with debug, but I am not familiar enough with the code to identify precisely where the problem comes in.

    comparison.jpg

  2. CharlieC

    Thanks for the report. I can't replicate the Python error using the file but looks like the the thing that Excel doesn't like is related to the id for the external links.

    FWIW if you want to compare Excel files use Microsoft's Productivity Tool.

  3. Gericke Potgieter reporter

    Hi Charlie, thanks for the tip, I am using the productivity tool now with great results.

    As you can see, I updated the issue attaching a broken version of the file. I tried tracing the process and could only find that the calcChain.xml file is included when opened and excluded when saved. However, after reading up on the role of calcChain.xml I doubt that it is the core problem.

    What I also found (whether relevant or not you may know better than me), was that the externalLink1.xml.rels file in the broken workbook opens up in a text editor, whereas the same file in the unbroken workbook opens as an XML file in my browser. Is there perhaps an encoding issue that could result in this problem?

  4. Gericke Potgieter reporter

    Thanks, I spent some more time tracing the error, and can confirm that the link is incorrect.

    Openpyxl proceeds to create an incorrect relative path for the external link file because it reads it in the wrong order from the workbook.xml file. In the example unbroken workbook, the reference to the external link file is rId2, but in the broken file it is rId1. this causes the program to read the path incorrectly as rId1 actually refers to "/xl/worksheets/sheet1.xml"

    I tried writing some workarounds with limited success. My last attempt was to copy the externalReferences node from workbook.xml and then put it back after the file was saved, but this is only effective for a single external reference.

    Any other suggestions?

  5. CharlieC

    Well, the order of reading doesn't really matter but this stuff is unnecessarily difficult. 2.4 is largely an attempt to tidy this up and make such things much less ad hoc. In particular the link containers (RelationshipLists) should be able to assign the ids automatically but there is an additional layer of indirection which probably isn't helping.

  6. Log in to comment