Large OpenXML spreadsheets with many rows cause out of memory error…

Issue #730 duplicate
Jim Hargrave (OLD) created an issue

I have a private file that triggers this error. It has 250,000 rows and produces over 14 million XML parser events. This eventually causes an out of memory error, but the filter appears to be stuck for a long time before the error is produced.

I can supply the private file if someone would like to try to reproduce this in their environment.

java.lang.OutOfMemoryError: GC overhead limit exceeded
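
For diagnosis, here is a minimal sketch (assuming Okapi's standard IFilter API and a placeholder file name) that streams the spreadsheet through the OpenXML filter and counts the events it emits; running it under a constrained heap (for example with -Xmx) should show the same GC behaviour described above:

    import net.sf.okapi.common.Event;
    import net.sf.okapi.common.LocaleId;
    import net.sf.okapi.common.resource.RawDocument;
    import net.sf.okapi.filters.openxml.OpenXMLFilter;

    import java.io.File;

    public class CountEvents {
        public static void main(String[] args) throws Exception {
            File input = new File("big-spreadsheet.xlsx"); // hypothetical path
            OpenXMLFilter filter = new OpenXMLFilter();
            RawDocument doc = new RawDocument(input.toURI(), "UTF-8", LocaleId.ENGLISH);
            long events = 0;
            long textUnits = 0;
            try {
                filter.open(doc);
                // Pull events one at a time; nothing is retained here,
                // so memory growth points at buffering inside the filter/pipeline.
                while (filter.hasNext()) {
                    Event e = filter.next();
                    events++;
                    if (e.isTextUnit()) {
                        textUnits++;
                    }
                }
            } finally {
                filter.close();
                doc.close();
            }
            System.out.println("events=" + events + ", text units=" + textUnits);
        }
    }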

Comments (7)

  1. Chase Tingley

    I have looked at this file and it produces an 1100 MB XLIFF file if you give it enough memory. Gadzooks.

    The filter will get through this if you give it enough memory, and does so quickly (in my testing with tikal it took under 30s to parse and another 35 or so to write the XLIFF), which suggests the apparent hang is the GC thrashing rather than the filter itself. However, there are a couple of things we can look at here:

    * Are we buffering more data in the filter than we have to? (Seems likely.)
    * For truly gigantic documents like this, are there stream-based improvements to the framework that would keep less of the data in memory? If the pipeline fully extracts the whole set of Okapi events before calling the next stage, it looks like it is holding roughly 2 million text units in memory here (see the sketch below).
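
    A rough illustration of why that second point matters (this is not Okapi's actual pipeline code; IFilter and Consumer stand in for the real stages):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.function.Consumer;

        import net.sf.okapi.common.Event;
        import net.sf.okapi.common.filters.IFilter;

        public class EventHandoff {
            // Buffers every event before the next stage runs: for a document
            // like this, roughly 2 million text units stay reachable at once.
            static void batched(IFilter filter, Consumer<Event> nextStage) {
                List<Event> all = new ArrayList<>();
                while (filter.hasNext()) {
                    all.add(filter.next());
                }
                for (Event e : all) {
                    nextStage.accept(e);
                }
            }

            // Hands each event straight to the next stage: peak memory is one
            // event plus whatever the writer itself buffers.
            static void streamed(IFilter filter, Consumer<Event> nextStage) {
                while (filter.hasNext()) {
                    nextStage.accept(filter.next());
                }
            }
        }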

  2. Jim Hargrave (OLD) reporter

    @tingley Another layer we should look at is the pipe-stream code (PipedInputStream / InputStreamFromOutputStream) that we use for merging. I added that code some time back to avoid unnecessary disk access and to speed up processing, but it does add memory pressure because of the buffers the pipe streams need.

    Not sure if this is related, but it is worth looking at the code and optimizing if anyone sees a problem.
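
    For reference, a bare-bones sketch of the pipe pattern in question, using only java.io classes (the actual InputStreamFromOutputStream wrapper in the codebase may differ in detail): a writer thread produces while the reader consumes, avoiding a temp file at the cost of an in-memory buffer per pipe.

        import java.io.IOException;
        import java.io.PipedInputStream;
        import java.io.PipedOutputStream;
        import java.nio.charset.StandardCharsets;

        public class PipeSketch {
            public static void main(String[] args) throws Exception {
                PipedOutputStream out = new PipedOutputStream();
                // The pipe size is the trade-off: a bigger buffer keeps the
                // writer from blocking, but every concurrent pipe holds this
                // much on the heap.
                PipedInputStream in = new PipedInputStream(out, 64 * 1024);

                Thread writer = new Thread(() -> {
                    try {
                        // A merge step would write the re-assembled document here.
                        out.write("merged output goes here".getBytes(StandardCharsets.UTF_8));
                        out.close();
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                writer.start();

                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    System.out.write(buf, 0, n); // the reader side consumes the stream
                }
                System.out.flush();
                in.close();
                writer.join();
            }
        }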
