memory footprint and read only

Issue #949 resolved
Matt King created an issue

I'm probably missing the point of read_only mode, but I have the following function that I'm profiling with memory_profiler:

@profile
def create_csv(fn):
    from openpyxl import load_workbook
    wb = load_workbook(filename=fn, read_only=True)
    ws = wb.active

    cell_count = 0
    for row in ws.rows:
        for cell in row:
            cell_count += 1
    print(cell_count)

As you can see, memory use goes up to about 1 GB when loading the workbook:

Line #    Mem usage    Increment   Line Contents
================================================
    19   32.273 MiB   32.273 MiB   @profile
    20                             def create_csv(fn):
    21   38.848 MiB    6.574 MiB       from openpyxl import load_workbook
    22 1102.047 MiB 1063.199 MiB       wb = load_workbook(filename=fn, read_only=True)
    23 1102.047 MiB    0.000 MiB       ws = wb.active
    24
    25 1102.047 MiB    0.000 MiB       cell_count = 0
    26 1102.047 MiB -36564981.566 MiB       for row in ws.rows:
    27 1102.047 MiB -365645323.121 MiB           for cell in row:
    28 1102.047 MiB -329080799.082 MiB               cell_count += 1
    29  518.207 MiB -583.840 MiB       print(cell_count)

Now if I change to read_only=False and run again:

Line #    Mem usage    Increment   Line Contents
================================================
    20   28.062 MiB   28.062 MiB   @profile
    21                             def create_csv(fn):
    22   38.367 MiB   10.305 MiB       from openpyxl import load_workbook
    23 1136.023 MiB 1097.656 MiB       wb = load_workbook(filename=fn, read_only=False)
    24 1136.027 MiB    0.004 MiB       ws = wb.active
    25
    26 1136.027 MiB    0.000 MiB       cell_count = 0
    27 1136.266 MiB    0.238 MiB       for row in ws.rows:
    28 1136.266 MiB    0.000 MiB           for cell in row:
    29 1136.266 MiB    0.000 MiB               cell_count += 1
    30 1136.281 MiB    0.016 MiB       print(cell_count)

I don't see much difference. This is a 120 MB Excel file with some fairly large text in some of the cells. I've tried agate-excel and pyexcel, and they each profile at about 1 GB to process the same file.

My goal is to convert to CSV without a large memory overhead, but I guess unzipping and reading the XML has to be done somehow.
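
For reference, the conversion itself is just a streaming loop along these lines (a rough sketch; the file paths are placeholders):

import csv
from openpyxl import load_workbook

def xlsx_to_csv(xlsx_path, csv_path):
    # Stream rows from the read-only worksheet straight into the CSV,
    # so only one row of cell values is held at a time
    wb = load_workbook(filename=xlsx_path, read_only=True)
    ws = wb.active
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in ws.rows:
            writer.writerow([cell.value for cell in row])
    wb.close()

The per-row work is cheap; as the profiles above show, the big cost is loading the workbook in the first place.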

The contents of my xlsx look like this:

  Length      Date    Time    Name
---------  ---------- -----   ----
     1258  01-11-2018 21:00   [Content_Types].xml
      531  01-11-2018 21:00   _rels/.rels
      641  01-11-2018 21:00   xl/_rels/workbook.xml.rels
      886  01-11-2018 21:00   docProps/app.xml
      940  01-11-2018 21:00   docProps/core.xml
    10140  01-11-2018 21:00   xl/theme/theme1.xml
      375  01-11-2018 21:00   xl/workbook.xml
 22932034  01-11-2018 21:00   xl/worksheets/sheet1.xml
396420174  01-11-2018 21:00   xl/sharedStrings.xml
      823  01-11-2018 21:00   xl/styles.xml
---------                     -------
419367802                     10 files
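
A listing like the one above can also be produced with Python's zipfile module, roughly like this:

import zipfile

def list_xlsx_members(path):
    # An .xlsx file is just a zip archive, so the member sizes show where
    # the bulk of the data lives (here, xl/sharedStrings.xml)
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            print("%10d  %s" % (info.file_size, info.filename))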

Comments (9)

  1. CharlieC

    I suspect that the problem is related to the strings. Excel traditionally uses a lookup table (sharedStrings.xml) for text, on the assumption that a typical spreadsheet contains little text and that what text there is will be largely repetitive. Excel is singularly unsuited to dealing with large numbers of unique strings.

    openpyxl converts the lookup table into a keyed list, which essentially doubles the memory requirements. The key point in your situation is that we have to keep this list in memory even though the sheets themselves are only read on demand. This is why memory use is so similar between the two modes.

    I can't think of any quick fix for this without replacing the structure, which for convenience is read/write: position is used when reading, and it acts like a dictionary when writing (see the sketch below for the idea). I'm sure this could be improved upon, but, as I said above, this is really the wrong file format for this data. You could probably monkey patch the library to use just a list when reading, which should reduce memory use considerably, but you're still going to have several hundred MB of strings in memory.

    This is a very unusual use case, so I wouldn't be particularly keen on making changes in the library for it.
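
    To illustrate the idea (a simplified sketch, not the actual openpyxl code): the shared strings end up in a structure that supports both lookup by position (needed when reading cells) and lookup by value (needed to deduplicate when writing), so you pay for a list and a dict over the same strings.

        # Simplified sketch of the read/write structure, not the real IndexedList
        class SharedStrings:
            def __init__(self, values=()):
                self._values = list(values)  # position -> string, used when reading
                self._index = {v: i for i, v in enumerate(self._values)}  # string -> position, used when writing

            def __getitem__(self, position):
                return self._values[position]

            def add(self, value):
                # Reuse the existing position for a duplicate, otherwise append
                if value not in self._index:
                    self._index[value] = len(self._values)
                    self._values.append(value)
                return self._index[value]

    Reading only ever needs the list part, which is why a plain list (or a monkey patch along those lines) gets most of the saving; the strings themselves still have to sit in memory, though.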

  2. Barry Hurley

    The memory overhead can be significant for some of our use cases as well. Do you think there are general gains to be made in that area? There may be scope for us to work on an improvement; I'm just wondering whether you think it would be worth it.

  3. CharlieC

    If an Excel file has lots and lots of unique strings then it's going to use a lot of memory, and it's the wrong file format: text doesn't need much in the way of typing. Replacing the current IndexedList implementation with a plain list when reading will save about 50% and should be fairly painless (we know that we have to accept duplicates). Apart from that, read-only mode has a tiny footprint. I would rather invest resources in reducing the footprint of the standard, in-memory mode, where I think a sixfold reduction is possible, but that is substantially more work.
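
    If anyone wants to put numbers on that for their own data, something like the following gives a rough comparison of the two layouts (just a sketch; the generated strings stand in for the shared strings, and the ratio depends on how long your strings are):

        import tracemalloc

        def peak_memory(build):
            # Peak bytes allocated while build() runs; the result is kept
            # alive until the peak is read, so it is included in the figure
            tracemalloc.start()
            result = build()
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            return peak

        def plain_list():
            return ["unique string %d" % i for i in range(1000000)]

        def indexed_list():
            values = ["unique string %d" % i for i in range(1000000)]
            return values, {v: i for i, v in enumerate(values)}

        print(peak_memory(plain_list), peak_memory(indexed_list))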

  4. Marcus Mo

    FWIW, we are also seeing large memory usage (on the order of GBs) when saving a workbook, even when using Workbook(write_only=True), which may also be related to the use of (random) strings. Writing to a WriteOnlyWorksheet has improvements over the standard Worksheet, but wb.save() is where memory usage increases dramatically. Happy to provide test code and benchmark results (in a new ticket/email).

    Our production use case is to create Excel files with ~1 million rows and ~25 columns, with a mixture of strings, dates and numbers (no styles).
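
    A stripped-down version of our write path looks roughly like this (the column values here are made up; the real rows have ~25 mixed columns):

        from datetime import date
        from openpyxl import Workbook

        wb = Workbook(write_only=True)
        ws = wb.create_sheet()
        for i in range(1000000):
            # Real rows carry a mixture of strings, dates and numbers
            ws.append(["row %d" % i, date(2018, 1, 11), i * 1.5])
        wb.save("big.xlsx")  # this is where we see memory climb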

  5. CharlieC

    With a 20 MB sharedStrings.xml, memory use drops from around 70 MB to 6 MB when using a simple list, so I've done this. This is a big improvement when seen in isolation, but I suspect it won't make much difference in most use cases.
