parser.pivot_cashes crash

Issue #1159 resolved
Alexey Gaidiukov
created an issue

Some history: switched from 2.4.x to 2.5.12 and now can't load an xlsx file that contain pivots. Original file is 10MB size, ~40k lines. On the 16G machine load_workbook() fails with MemoryError after ~5-10mins, on 32G machine it loads ok, although memory footprint looks excessive for 10MB input.

So I tried to remove pivot caches by unchecking "save data with file" in the pivot table options in Excel. The file became 6MB in size, and load_workbook() now fails almost immediately with:

        wb = openpyxl.load_workbook(filename=xlsfile, data_only=True)
      File "C:\Anaconda\lib\site-packages\openpyxl\reader\excel.py", line 224, in load_workbook
        pivot_caches = parser.pivot_caches
      File "C:\Anaconda\lib\site-packages\openpyxl\packaging\workbook.py", line 127, in pivot_caches
        records = get_rel(self.archive, cache.deps, cache.id, RecordList)
      File "C:\Anaconda\lib\site-packages\openpyxl\packaging\relationship.py", line 153, in get_rel
        rel = next(deps.find(cls.rel_type))
    StopIteration

Comments (7)

  1. CharlieC

    I understand the memory issue. Any kind of data source usually comes with a cache of the data. Because I'm not sure how important the cache is the library is conservative it tries to keep the caches around and these can get pretty big as the parsed objects are not optimised; less than putting the XML tree in memory but not much.

    You can disable this by setting keep_links=False when opening the workbook. This should resolve the memory issue but not the missing object one.

    Depending on feedback relating to these files we can change the behaviour because I suspect the caches are of little relevance in most scenarios.

  2. Alexey Gaidiukov reporter

    CharlieC - thanks for your quick reply and the fix! Regarding the Memory issue with pivot cache, I have tried keep_links=False and it didn't help at all. Commenting out the two lines in excel.py that deal with pivot_cache helped though, so I'm using that work-around for now. I agree that in general the caches would be a rarely used feature, so making loading pivot cache optional would really help general usability.

  3. CharlieC

    Disabling the cache is easy enough to do but there's a bit more to it with PivotTables, unforunately. But I think that a flag keep_cache with a default of False probably makes more sense.

    Caching things like Pivot Table calculations or External Worksheets probably only makes sense in non-active applications, such as the old Excel viewer. These are volatile elements by design.

  4. Log in to comment