OpenXml (xlsx) out of memory with millions of hidden rows...

Issue #894 new
Jim Hargrave created an issue

This file has a sheet with over 1 million spreadsheet rows, only 93 of which are visible. When okapi runs the "merge" step, it tries to hold all of those rows in memory: the Java object representation of the hidden rows' data, plus the XML rendered from them, all buffered before the rendered XML is written out to the file.

I assume the hidden rows are all being stored in a large object and kept in memory. Is there a way to break these up so they can be garbage collected more efficiently?

Comments (8)

  1. Denis Konovalyenko

    @Jim Hargrave (OLD) , it seems to me that the reason behind this implementation lies in the way the XLSX parts are constructed (some of them can be adjusted only after all string items have been processed). So I am not sure there is an easy way to deal with the out-of-memory error. Perhaps we could make the flow stricter (disallow translating hidden content via the bPreferenceTranslateExcelHidden parameter altogether), dump all hidden string items to a temporary location on the file system, and keep in memory only the items available for translation. Then, when we reach the adjustment point, we could assemble all string items again and return to the existing handling course.

    @Chase Tingley , if you have a chance, could you please share your thoughts on this as well?
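    A rough sketch of the spill-to-disk idea above (the class and method names are hypothetical, not existing Okapi API): hidden rows are serialized to a temp file as they are parsed, so their Java object representations can be garbage collected, and are streamed back only at the adjustment point.

    ```java
    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.Closeable;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    /** Hypothetical spill buffer: stores serialized hidden-row XML on disk
     *  instead of keeping row objects on the heap. */
    public class HiddenRowSpill implements Closeable {
        private final Path file;
        private final BufferedWriter writer;

        public HiddenRowSpill() throws IOException {
            file = Files.createTempFile("okapi-hidden-rows", ".xml");
            writer = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
        }

        /** Called while parsing: row XML goes to disk, the object becomes collectable. */
        public void spill(String rowXml) throws IOException {
            writer.write(rowXml);
            writer.newLine();
        }

        /** Called at the adjustment point: stream rows back one at a time
         *  instead of loading them all at once. */
        public BufferedReader replay() throws IOException {
            writer.flush();
            return Files.newBufferedReader(file, StandardCharsets.UTF_8);
        }

        @Override
        public void close() throws IOException {
            writer.close();
            Files.deleteIfExists(file);
        }
    }
    ```

    The trade-off Jim raised below still applies: this moves the pressure from heap to temp storage.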

  2. Chase Tingley

    How much memory are you using for this?

    Denis is right, we need to fully parse the worksheet because portions of it can be rewritten on the fly.

  3. Jim Hargrave reporter

    Offloading these big chunks to the filesystem seems like our best option, though some services might have a limited amount of temp storage available. Another option is to increase our service's max memory, but that might be costly (we pay per gigabyte of memory).

  4. hao

    We faced a similar issue while processing an xlsx with millions of empty but formatted cells. I understand there must be a reason okapi has to hold them all in memory, but it's really painful when it leads to OOM in our services.

    Is there any way or configuration by which okapi can at least reject these kinds of 'large' files?

    And thanks for your great work!
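    No such configuration exists in okapi as far as this thread shows, but a pre-flight size guard in the calling service might look roughly like this (a hypothetical helper, and note that a byte-size check is only a crude proxy, since xlsx is compressed and a small file can still expand into millions of cells):

    ```java
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class LargeFileGuard {
        /** Hypothetical pre-flight check, not an existing okapi parameter:
         *  refuse to process inputs above a configured size in bytes. */
        public static void checkSize(Path input, long maxBytes) throws IOException {
            long size = Files.size(input);
            if (size > maxBytes) {
                throw new IllegalArgumentException(
                        "Input is " + size + " bytes, exceeds limit of " + maxBytes);
            }
        }
    }
    ```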

  5. Chase Tingley

    Relevant comments from #730:

    From me:

    I have looked at this file and it produces an 1100 MB XLIFF file if you give it enough memory. Gadzooks.

    The filter will get through this if you give it enough memory, and does so quickly (in my testing with tikal, it took < 30s to parse and then another 35 or so to write the XLIFF), which suggests the slowdown is the GC locking up. However, there are a couple of things we can look at here:

    - Are we buffering more data in the filter than we have to? (Seems likely.)
    - For truly gigantic documents like this, are there stream-based improvements to the framework we could make that would lead to less of the data being held in memory? If the pipeline fully extracts the set of okapi events before calling the next stage, it looks like it's holding roughly 2 million TUs in memory here....

    From Jim:

    @tingley Another layer we should look at is the PipedInputStream (InputStreamFromOutputStream) classes that we use for merging. I added this code some time back to prevent unnecessary disk access and to speed up processing. But it does impose more memory limitations because of the buffers the pipe streams need.

    Not sure if this is related but worth looking at the code and optimizing if anyone sees a problem.
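    For reference, the JDK pipe classes behind this pattern use a fixed in-memory buffer and block the writer when it fills, so the buffer size directly trades memory for throughput. A minimal standalone illustration (buffer size chosen arbitrarily, not the value okapi uses):

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    public class PipeDemo {
        /** Writes data on one thread and reads it on another through a
         *  bounded in-memory pipe; the writer blocks when the buffer is full. */
        public static String roundTrip(byte[] data, int bufferSize) throws Exception {
            PipedOutputStream out = new PipedOutputStream();
            // The pipe holds at most bufferSize bytes in memory at a time.
            PipedInputStream in = new PipedInputStream(out, bufferSize);

            Thread producer = new Thread(() -> {
                try (OutputStream o = out) {
                    o.write(data);
                } catch (IOException ignored) {
                }
            });
            producer.start();

            ByteArrayOutputStream result = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                result.write(buf, 0, n);
            }
            producer.join();
            return result.toString("UTF-8");
        }
    }
    ```

    Shrinking the pipe buffer caps the memory the merge step spends on buffering, at the cost of the producer thread stalling more often.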

  6. Patrick Huy

    I did some investigation. It seems that most of the memory consumption comes from keeping the StAX XML events around. This is unfortunately happening in multiple places. Each XML event object requires something like 176 bytes, and there are a lot of them. The StAX events created for the file from this issue alone take ~350 MB of heap space on my machine.

    Finding a different solution for retaining the data temporarily (streaming it again from the file? keeping it in memory as a string?) could improve the situation
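    The keep-it-as-a-string option might look roughly like this with the standard javax.xml.stream API (a sketch, not the current filter code): re-serialize a run of events into one String so the per-event objects can be collected, then re-parse the String later if events are needed again.

    ```java
    import java.io.StringReader;
    import java.io.StringWriter;
    import javax.xml.stream.XMLEventReader;
    import javax.xml.stream.XMLEventWriter;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamException;
    import javax.xml.stream.events.XMLEvent;

    public class EventBuffering {
        /** Re-serializes a document's StAX events into a single String so the
         *  per-event objects (~176 bytes each) become garbage-collectable. */
        public static String eventsToString(String xml) throws XMLStreamException {
            XMLEventReader reader = XMLInputFactory.newInstance()
                    .createXMLEventReader(new StringReader(xml));
            StringWriter sw = new StringWriter();
            XMLEventWriter writer = XMLOutputFactory.newInstance()
                    .createXMLEventWriter(sw);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                writer.add(event); // event object is free for GC after this
            }
            writer.close();
            reader.close();
            return sw.toString();
        }
    }
    ```

    One String per retained region is far denser than a list of event objects, at the cost of re-parsing when the events are consumed again.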
