Export to XLSX consumes lots of memory

Issue #1004 new
David Platten created an issue

Exporting a large amount of (CT) data to XLSX consumes a large amount of server memory. The memory consumption increases as the “Write study” count increases during the export. This is causing me some issues, with memory usage in excess of 1 GB during exports.

The memory consumption is related to the way the Workbook is created in line 706 of export_common.py:

book = Workbook(temp_xlsx, {"strings_to_numbers": False})

This default mode means that XlsxWriter holds all cell data in memory. From reading the XlsxWriter docs (https://xlsxwriter.readthedocs.io/working_with_memory.html) this can be addressed by changing the above line to:

book = Workbook(temp_xlsx, {"strings_to_numbers": False, "constant_memory": True})

The docs state that the trade-off when using 'constant_memory' mode is that you won’t be able to take advantage of any new features that manipulate cell data after it is written. Currently the add_table() method doesn’t work in this mode and merge_range() and set_row() only work for the current row.

We don’t currently use add_table(), merge_range() or set_row() so I think it is fine to implement this change.

On my system the memory usage with the “constant_memory” option remains at 238 MB regardless of how many studies are being exported.

This will be implemented for CT, radiography and mammography [and consider nuc med]

Need to consider the order of the columns in the exports from this branch compared with those from the previous version.

Status and progress

CT xlsx export now uses the new Pandas code; the tests work for the new code

  • [1st March 2024: done] Duplicate the common code file, and reinstate the original version so that the RF exports continue to work as they did
  • The CT code now works with single- and dual-source systems

DX: implement new method and check that tests run

  • Pandas code implemented and tests pass

MG: implement new method and check that tests run

  • The existing code is structured differently to the DX and CT code: the csv and xlsx code is intertwined

RF: don't do anything - the code needs an overhaul or rethink

NN: consider looking at this

Comments (52)

  1. David Platten reporter

    It is worth noting that the csv writer does not suffer from the same high memory footprint because by default it writes the file to disk as the export progresses.

  2. David Platten reporter

    This small change has broken the “Summary” and “All data” worksheets, which is why the rf test is now failing. Hooray for tests.

    The change I have made writes each row of a worksheet to the file. Updating individual cells doesn’t work nor does updating rows that have already been written…

    For my change to be workable the export code will need to be rewritten so that the rows in a sheet are written out in order from top to bottom.

    @Ed McDonagh - what do you think about this issue?

  3. David Platten reporter

    For the “Summary” page I am inclined to create a pandas data frame containing the info we need, and then write that to the worksheet. The same thing would work for the “All data” sheet.

  4. Ed McDonagh

    I think a lot of the export code could do with being optimised, particularly the RF, and we should be making use of pandas data frames as you suggest.

    Do you have capacity to “do enough” to get this working sufficiently well for release?

    I’ve not noticed this on Linux, but I should look more closely to see if it is working in the same way.

  5. David Platten reporter

    I think I should probably look at issue #927 again and get that implemented with the current development code. That issue’s branch already uses the “constant_memory” setting, and also greatly speeds up the export process.

  6. David Platten reporter

    I won’t be able to get this working for a 1.0 release, but will implement it for whatever comes after that.

  7. Ed McDonagh

    Ok. I need to see how much of an issue this is in Linux, and then we can document it for this release and fix it for the next one.

  8. David Platten reporter

    Summary sheet now populated using Pandas data frames. Note: the exported acquisition data and standard acquisition data are incorrect at the moment if the user has filtered by those fields. This needs fixing. Refs issue #1004 and issue #927

    → <<cset bf954d537e31>>

  9. David Platten reporter

    I think that the exported acquisition data and standard acquisition data are now correct, but I have not thoroughly tested this, nor have I written any tests. I have only been testing this with CT data - the other modalities need checking. Refs issue #1004 and issue #927

    → <<cset f876a7660dd9>>

  10. David Platten reporter

    CT xlsx export now being written out using Pandas data frames. To do: include standard study names in the export; refactor out some code to export_common.py; extend to other modalities. Refs issue #1004 and issue #927

    → <<cset 64c8cfb0604d>>

  11. David Platten reporter

    Standard names now being written out correctly for CT in all data sheet. Acquisition data being duplicated - need to fix. This work is incomplete, so [skip ci]. Refs issue #1004 and issue #927

    → <<cset 1be07ff164d2>>

  12. David Platten reporter

    Acquisition data being written out correctly for CT. I need to update some of the dataframe datatypes after creation to reduce memory consumption. We may have an issue with a limit on the number of sheets allowed in a spread sheet. Still incomplete so [skip ci]. Refs issue #1004 and issue #927

    → <<cset 3851ac79df5a>>

  13. David Platten reporter

    Fixing some bugs that became apparent when exporting some real data. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927

    → <<cset a21b37b46e6a>>

  14. David Platten reporter

    Making use of the text_and_date_formats method - I've changed the default column number for the study date - this probably has implications for the other modalities. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927

    → <<cset ec22c837f4e4>>

  15. David Platten reporter

    Reverted export_common.py to develop version (except for the OpenREM version check). CT export code uses the new export_common_pandas.py file; all other exports use the original export code at the moment. Refs issue #1004

    → <<cset a3c08dce802c>>

  16. David Platten reporter

    Modified the CT export code in preparation for refactoring it out for more general use. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004

    → <<cset 17379e53ca58>>

  17. David Platten reporter

    Using refactored pandas code for the DX modality. Not fully working (order of fields), so no point in running the tests [skip ci]. Refs issue #1004

    → <<cset 668eefa339fb>>

  18. David Platten reporter

    Changed order of fields in the exported files, and sorting them by descending date and time. This has broken the column formatting in the Excel file (dates, values etc), so need to fix that. Not fully working so no point in running the tests [skip ci]. Refs issue #1004

    → <<cset 53c1fe979ac7>>

  19. David Platten reporter

    Fixed bug that caused export failure if there were no standard name entries in the data being exported. Also replaced standard name status with refactored routine. Also fixed column formatting for some scenarios. Not fully working so no point in running the tests [skip ci]. Refs issue #1004

    → <<cset 5133118afd3e>>

  20. David Platten reporter

    Including modality in the export file name, rather than using a hard-written 'ct'. Not fully working so no point in running the tests [skip ci]. Refs issue #1004

    → <<cset af8e33bb2f10>>

  21. David Platten reporter

    Updated CT export code to include two x-ray sources. Realised the export code fails if there are no accession numbers in the data - needs looking at. Refs issue #1004

    → <<cset cef9fa128984>>

  22. David Platten reporter

    Filter information was entirely missing from radiographic xlsx exports. Partially-enabled this now. If an exposure has more than one filter (such as the DX-Im-Carestream_DR7500-2.dcm file) then two rows will be present in the data queryset - one for each filter (lines 1283 and 1448 of export_common_pandas.py). These multiple rows need to be merged into a single row in export_using_pandas in a similar way to how dual source CT is candled. As the code currently stands only one filter will be present in the exported xlsx file for exposures that actually had more than one filter in-use. The tests will still fail, so [skip ci]. Refs issue #1004

    → <<cset 823012c18542>>

  23. David Platten reporter

    Refactored quite a bit of the export_using_pandas.py into individual methods to avoid duplication. Added code to ensure CT dual source dealt with correctly for studies without accession numbers. Tests will still fail due to radiographic filter info not being correctly exported yet, so [skip ci]. Refs issue #1004

    → <<cset b4e67676f875>>

  24. David Platten reporter

    Added code to simplify filter material name in the output, and to calculate the mean filter thickness from the min and max values. Tests will still fail due to radiographic filter info not being correctly exported yet, so [skip ci]. Refs issue #1004

    → <<cset b87b35737814>>

  25. David Platten reporter

    Radiographic filter material and thickness are now combined correctly in the xlsx output file. Tests will still fail due to ordering of fields, so [skip ci] Refs issue #1004

    → <<cset 59ed28f13fe5>>

  26. David Platten reporter

    Radiographic filter thickness is now combined in a way that missing data results in a blank in the xlsx file, rather than the text 'nan'. Tests will still fail due to ordering of fields, so [skip ci] Refs issue #1004

    → <<cset 74a4a5a6360e>>

  27. David Platten reporter

    Adjusted number of decimal places for the filter thickness to match the test decimal places. Running the DX export test on its own passes. Refs issue #1004

    → <<cset 87b9c752694d>>

  28. David Platten reporter

    Split the combined csv and xlsx mammo export code into separate routines so that I can update xlsx one to use the Pandas dataframe method. No functional changes at the moment. Refs issue #1004

    → <<cset 134055bc3f84>>

  29. Log in to comment