High memory usage when saving a write-only workbook

Issue #986 wontfix
Marcus Mo created an issue

We are experiencing high memory usage when creating an Excel file with a high number of rows (anywhere from 10,000 to 1 million+ rows).

We have attached a test script that reproduces the problem on our machines. It performs the following steps:

  • Generates data
  • Creates an openpyxl Workbook
  • Inserts the data into a worksheet
  • Saves the workbook

e.g. python test.py --write_only 10000 10 will generate 10000 rows with 10 columns each and use Workbook(write_only=True). See --help for more details. If running without mprof, comment out the @profile decorators.
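
For context, the data-generation step can be sketched as follows (illustrative only, not the attached script; using a generator keeps the input data itself from inflating the measurements):

```python
import random
import string

def generate_rows(n_rows, n_cols, width=12):
    """Yield one row of random fixed-width strings at a time.

    Yielding rather than building a full list keeps the input data out
    of memory, so any growth mprof reports comes from the workbook.
    """
    alphabet = string.ascii_letters + string.digits
    for _ in range(n_rows):
        yield ["".join(random.choice(alphabet) for _ in range(width))
               for _ in range(n_cols)]

rows = list(generate_rows(5, 3))
```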

Python version: 2.7

Openpyxl version: 2.4.9

OS: Ubuntu 17.10

Checking for lxml:

→ dpkg --list |awk '/ii/ && /libxml/'
ii  libxml-parser-perl                              2.44-2build3                                      amd64        Perl module for parsing XML files
ii  libxml-twig-perl                                1:3.50-1                                          all          Perl module for processing huge XML documents in tree mode
ii  libxml-xpathengine-perl                         0.13-1                                            all          re-usable XPath engine for DOM-like trees
ii  libxml2:amd64                                   2.9.4+dfsg1-4ubuntu1.2                            amd64        GNOME XML library
ii  libxml2-dev:amd64                               2.9.4+dfsg1-4ubuntu1.2                            amd64        Development files for the GNOME XML library
ii  libxml2-utils                                   2.9.4+dfsg1-4ubuntu1.2                            amd64        XML utilities
ii  libxmlsec1:amd64                                1.2.24-4                                          amd64        XML security library
ii  libxmlsec1-nss:amd64                            1.2.24-4                                          amd64        Nss engine for the XML security library
ii  python-lxml                                     4.0.0-1                                           amd64        pythonic binding for the libxml2 and libxslt libraries
lxml.etree.LXML_VERSION = (3, 4, 0, 1)
lxml.etree.LIBXML_VERSION = (2, 9, 4)
lxml.etree.LIBXML_COMPILED_VERSION = (2, 9, 4)
openpyxl.LXML = True

We can then use memory-profiler to gather memory usage (mprof run test.py --write_only 10000 10) and generate graphs (mprof plot), some of which are also attached.

Using Workbook(write_only=True) does seem to help with memory usage, but when we call wb.save() the memory usage increases dramatically.
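
For reference, the pattern we use is roughly the following (simplified; the path and data are placeholders):

```python
from openpyxl import Workbook

def save_rows(rows, path):
    # Write-only mode: appended rows are serialised immediately and
    # cannot be read back, which keeps the worksheet object small.
    wb = Workbook(write_only=True)
    ws = wb.create_sheet()
    for row in rows:
        ws.append(row)
    # The memory spike happens here, during save.
    wb.save(path)

save_rows([["a", "b"], ["c", "d"]], "demo.xlsx")
```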

It is possible we are using openpyxl incorrectly, or that there is some other dependency we are missing. In production we use PyPy and lxml-cffi, but this exhibits the same behaviour as CPython with lxml.

The test script generates random strings. In reality, we create Excel files with a mixture of strings, dates and numbers.

Thanks!

Comments (8)

  1. CharlieC

    Several things. First of all, lots of unique strings will eat up memory. Excel defaults to using a pointer system for strings, and we imitate this using a list with an embedded dict to reduce duplicates. Obviously this has to be kept in memory to be fast, but it will eat memory once you get towards a million unique strings: Excel just isn't suited to this kind of task. It is possible to write the strings inline, but the XML for this is so horrible that I've avoided it so far and have no plans to implement it.
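
    The pointer system described above can be pictured with a minimal sketch (this imitates the idea only; it is not openpyxl's actual IndexedList):

```python
class StringTable:
    """Minimal sketch of a deduplicating string table: a list for
    ordering plus an embedded dict for O(1) lookups. Every unique
    string must stay resident for lookups to be fast."""

    def __init__(self):
        self._strings = []
        self._index = {}   # string -> position in self._strings

    def add(self, value):
        """Return the shared index for value, adding it if unseen."""
        if value not in self._index:
            self._index[value] = len(self._strings)
            self._strings.append(value)
        return self._index[value]

table = StringTable()
first = table.add("hello")
again = table.add("hello")   # duplicate: same index, no new storage
```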

    My own tests with PyPy and lxml suggest they don't play nicely together and that lxml runs a lot slower. But without lxml you'll be using our own implementation of xmlfile, which unfortunately doesn't stream very well, meaning a large worksheet will create an even larger element tree in memory.
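
    The distinction can be illustrated with stdlib ElementTree (a sketch of the two strategies, not openpyxl's actual writer): the tree approach holds every element until the end, while the streaming approach serialises and discards each row as it goes.

```python
import xml.etree.ElementTree as ET

def write_tree(rows, stream):
    # Tree approach: the whole document exists in memory before any
    # bytes are written out.
    root = ET.Element("sheetData")
    for i, text in enumerate(rows, 1):
        ET.SubElement(root, "row", r=str(i)).text = text
    stream.write(ET.tostring(root))

def write_streaming(rows, stream):
    # Streaming approach: each row element is serialised and released
    # immediately, so peak memory stays at roughly one row.
    stream.write(b"<sheetData>")
    for i, text in enumerate(rows, 1):
        elem = ET.Element("row", r=str(i))
        elem.text = text
        stream.write(ET.tostring(elem))
    stream.write(b"</sheetData>")
```

    Both produce identical bytes; only the peak memory profile differs.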

    Kind of a perfect storm when combining the two. If you really do have millions of unique strings in your workbooks then XLSX is the wrong format.

  2. CharlieC

    Looking at this again, I think the problem you're seeing must be related to how lxml runs in your environment. writer/strings.py is as memory-optimised for the structure as possible. openpyxl's own memory use will rise as you add random strings to the worksheet and the deduplicated list of strings builds up.

    In a PyPy context you're best off finding a way to replace xmlfile with something that can stream, as et_xmlfile unfortunately keeps the tree in memory. We didn't test for this at the time. But rather than revisit that and go back to some kind of SAX implementation in general, you could probably write one for just this case and see how that goes. You'll still hit memory issues with the worksheet object itself, of course, but they might be less severe.

  3. CharlieC

    I've just run a test using your code, and for 10 million cells I get an increment of about 1 GB when saving the file. This corresponds to the size of sharedStrings.xml in the archive, which is held in memory while it is written out (a BytesIO buffer is still much smaller than the XML would be if held in memory as a tree). You could easily modify the code to write to a temporary file, which would leave you with just the 2 GB for the IndexedList.
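
    A sketch of that modification (a hypothetical helper, not openpyxl code): spool the serialised part to a temporary file on disk, then copy it into the zip archive in chunks, so the full XML never sits in memory at once.

```python
import tempfile
import zipfile

def add_part_from_disk(archive_path, arcname, serialise, chunk=64 * 1024):
    # Spool the serialised part to disk instead of a BytesIO buffer,
    # then stream it into the archive chunk by chunk.
    with tempfile.TemporaryFile() as tmp:
        serialise(tmp)            # writes the XML bytes to the spool file
        tmp.seek(0)
        with zipfile.ZipFile(archive_path, "a") as zf:
            with zf.open(arcname, "w") as dest:
                while True:
                    block = tmp.read(chunk)
                    if not block:
                        break
                    dest.write(block)
```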

  4. CharlieC

    No plans to change any code at the moment. xlsxwriter in constant-memory mode is probably the better option for this kind of task. Better still, avoid XLSX for anything with more than 1 million unique strings.

  5. Marcus Mo reporter

    Thanks for looking at this again. As an update, we are looking to optimise the environment in which we run openpyxl (i.e. moving away from PyPy). Rudimentary observations show that the latest openpyxl and lxml versions use much less memory. Nevertheless, we provide other formats for exporting such data which, as you say, are more suitable than XLSX.

  6. CharlieC

    If memory is an issue then you should really look at xlsxwriter's constant-memory mode, which streams by using its own handcrafted tag management and can therefore handle all the necessary XML for inline strings.
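
    A minimal sketch of that mode (filename and data are placeholders; note that rows must be written in order, since each finished row is flushed once the next one starts):

```python
import xlsxwriter

# constant_memory flushes each finished row to a temp file on disk and
# writes strings inline, so no shared-string table builds up in memory.
workbook = xlsxwriter.Workbook("big.xlsx", {"constant_memory": True})
worksheet = workbook.add_worksheet()
for row_num, row in enumerate([["alpha", 1], ["beta", 2]]):
    for col_num, value in enumerate(row):
        worksheet.write(row_num, col_num, value)
workbook.close()
```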

    In terms of speed, openpyxl + lxml and xlsxwriter are about the same; at some point it becomes more about the quality of the API, and that is what I've chosen to focus on.

  7. Marcus Mo reporter

    Thanks for the tips.
    We'll take a look at xlsxwriter again. We didn't choose it originally because we preferred the openpyxl API.
