Export to XLSX consumes lots of memory
Exporting a large amount of (CT) data to XLSX consumes a large amount of server memory. The memory consumption increases as the “Write study” count increases during the export. This is causing me some issues, with memory usage in excess of 1 GB during exports.
The memory consumption is related to the way the Workbook is created in line 706 of export_common.py
:
book = Workbook(temp_xlsx, {"strings_to_numbers": False})
This default mode means that XlsxWriter holds all cell data in memory. From reading the XlsxWriter docs (https://xlsxwriter.readthedocs.io/working_with_memory.html) this can be addressed by changing the above line to:
book = Workbook(temp_xlsx, {"strings_to_numbers": False, "constant_memory": True})
The docs state that the trade-off when using 'constant_memory'
mode is that you won’t be able to take advantage of any new features that manipulate cell data after it is written. Currently the add_table()
method doesn’t work in this mode and merge_range()
and set_row()
only work for the current row.
We don’t currently use add_table()
, merge_range()
or set_row()
so I think it is fine to implement this change.
On my system the memory usage with the “constant_memory” option remains at 238 MB regardless of how many studies are being exported.
This will be implemented for CT, radiography and mammography [and consider nuc med]
Need to consider the order of the columns in the exports from this branch compared with those from the previous version.
Status and progress
CT xlsx export now uses the new Pandas code; the tests work for the new code
- [1st March 2024: done] Duplicate the common code file, and reinstate the original version so that the RF exports continue to work as they did
- The CT code now works with single- and dual-source systems
DX: implement new method and check that tests run
- Pandas code implemented and tests pass
MG: implement new method and check that tests run
- The existing code is structured differently to the DX and CT code: the csv and xlsx code is intertwined
- The MG code has been separated into a method for csv and for xlsx
- Need to implement the new Pandas method for the xlsx MG export code
RF: don't do anything - the code needs an overhaul or rethink
NN: consider looking at this
Comments (77)
-
reporter -
reporter It is worth noting that the csv writer does not suffer from the same high memory footprint because by default it writes the file to disk as the export progresses.
-
reporter This small change has broken the “Summary” and “All data” worksheets, which is why the rf test is now failing. Hooray for tests.
The change I have made writes each row of a worksheet to the file. Updating individual cells doesn’t work nor does updating rows that have already been written…
For my change to be workable the export code will need to be rewritten so that the rows in a sheet are written out in order from top to bottom.
@Ed McDonagh - what do you think about this issue?
-
reporter For the “Summary” page I am inclined to create a pandas data frame containing the info we need, and then write that to the worksheet. The same thing would work for the “All data” sheet.
-
I think a lot of the export code could do with being optimised, particularly the RF, and we should be making use of pandas data frames as you suggest.
Do you have capacity to “do enough” to get this working sufficiently well for release?
I’ve not noticed this on Linux, but I should look more closely to see if it is working in the same way.
-
reporter I think I should probably look at issue #927 again and get that implemented with the current development code. That issue’s branch already uses the “constant_memory” setting, and also greatly speeds up the export process.
-
Good plan. I'd forgotten about that issue 🤦
-
reporter I won’t be able to get this working for a 1.0 release, but will implement it for whatever comes after that.
-
Ok. I need to see how much of an issue this is in Linux, and then we can document it for this release and fix it for the next one.
-
reporter Summary sheet now populated using Pandas data frames. Note: the exported acquisition data and standard acquisition data are incorrect at the moment if the user has filtered by those fields. This needs fixing. Refs issue #1004 and issue #927
→ <<cset bf954d537e31>>
-
reporter I think that the exported acquisition data and standard acquisition data are now correct, but I have not thoroughly tested this, nor have I written any tests. I have only been testing this with CT data - the other modalities need checking. Refs issue #1004 and issue #927
→ <<cset f876a7660dd9>>
-
reporter Updated the generate_sheets method to use a dataframe - this avoids iterating through exams within a queryset of studies. Refs issue #1004 and issue #927
→ <<cset 105259f2bfb0>>
-
reporter Added missing code to check modality. Refs issue #1004 and issue #927
→ <<cset 0442d7c62362>>
-
reporter Checking if standard names are enabled before trying to create a list of standard acquisition names. Refs issue #1004 and issue #927
→ <<cset 2112abbe95d6>>
-
reporter CT xlsx export now being written out using Pandas data frames. To do: include standard study names in the export; refactor out some code to export_common.py; extend to other modalities. Refs issue #1004 and issue #927
→ <<cset 64c8cfb0604d>>
-
reporter Standard names now being written out correctly for CT in all data sheet. Acquisition data being duplicated - need to fix. This work is incomplete, so [skip ci]. Refs issue #1004 and issue #927
→ <<cset 1be07ff164d2>>
-
reporter Acquisition data being written out correctly for CT. I need to update some of the dataframe datatypes after creation to reduce memory consumption. We may have an issue with a limit on the number of sheets allowed in a spread sheet. Still incomplete so [skip ci]. Refs issue #1004 and issue #927
→ <<cset 3851ac79df5a>>
-
reporter Nearly there. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927
→ <<cset 1affa1424545>>
-
reporter Fixing some bugs that became apparent when exporting some real data. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927
→ <<cset a21b37b46e6a>>
-
reporter Making use of the text_and_date_formats method - I've changed the default column number for the study date - this probably has implications for the other modalities. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927
→ <<cset ec22c837f4e4>>
-
reporter Writing out the blank accession number section. Still incomplete so [skip ci]. Refs issue #1004 and issue #927
→ <<cset 0ed7fc1acbb9>>
-
-
assigned issue to
-
assigned issue to
-
reporter - edited description
-
reporter CT xlsx export file time and date columns now correctly formatted for all sheets [skip ci]. Refs issue #1004
→ <<cset 3f6981d28650>>
-
reporter Updated CT xlsx export test file. Refs issue #1004
→ <<cset 4a3e934f16f9>>
-
reporter Reverted export_common.py to develop version (except for the OpenREM version check). CT export code uses the new export_common_pandas.py file; all other exports use the original export code at the moment. Refs issue #1004
→ <<cset a3c08dce802c>>
-
reporter - edited description
-
reporter Started to implement Pandas export for DX. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset b82977162256>>
-
reporter Modified the CT export code in preparation for refactoring it out for more general use. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset 17379e53ca58>>
-
reporter Refactored out the pandas code for more general use. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset d632a2b984f6>>
-
reporter Using refactored pandas code for the DX modality. Not fully working (order of fields), so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset 668eefa339fb>>
-
reporter Changed order of fields in the exported files, and sorting them by descending date and time. This has broken the column formatting in the Excel file (dates, values etc), so need to fix that. Not fully working so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset 53c1fe979ac7>>
-
reporter Fixed bug that caused export failure if there were no standard name entries in the data being exported. Also replaced standard name status with refactored routine. Also fixed column formatting for some scenarios. Not fully working so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset 5133118afd3e>>
-
reporter Including modality in the export file name, rather than using a hard-written 'ct'. Not fully working so no point in running the tests [skip ci]. Refs issue #1004
→ <<cset af8e33bb2f10>>
-
reporter - edited description
-
reporter - edited description
-
reporter - edited description
-
reporter - edited description
-
reporter - edited description
-
reporter Updated CT export code to include two x-ray sources. Realised the export code fails if there are no accession numbers in the data - needs looking at. Refs issue #1004
→ <<cset cef9fa128984>>
-
reporter Updated export code to cope when there are no accession numbers in the data. Refs issue #1004
→ <<cset 6b20e341d608>>
-
reporter - edited description
-
reporter Filter information was entirely missing from radiographic xlsx exports. Partially-enabled this now. If an exposure has more than one filter (such as the DX-Im-Carestream_DR7500-2.dcm file) then two rows will be present in the data queryset - one for each filter (lines 1283 and 1448 of export_common_pandas.py). These multiple rows need to be merged into a single row in export_using_pandas in a similar way to how dual source CT is candled. As the code currently stands only one filter will be present in the exported xlsx file for exposures that actually had more than one filter in-use. The tests will still fail, so [skip ci]. Refs issue #1004
→ <<cset 823012c18542>>
-
reporter Refactored quite a bit of the export_using_pandas.py into individual methods to avoid duplication. Added code to ensure CT dual source dealt with correctly for studies without accession numbers. Tests will still fail due to radiographic filter info not being correctly exported yet, so [skip ci]. Refs issue #1004
→ <<cset b4e67676f875>>
-
reporter Added code to simplify filter material name in the output, and to calculate the mean filter thickness from the min and max values. Tests will still fail due to radiographic filter info not being correctly exported yet, so [skip ci]. Refs issue #1004
→ <<cset b87b35737814>>
-
reporter Radiographic filter material and thickness are now combined correctly in the xlsx output file. Tests will still fail due to ordering of fields, so [skip ci] Refs issue #1004
→ <<cset 59ed28f13fe5>>
-
reporter Radiographic filter thickness is now combined in a way that missing data results in a blank in the xlsx file, rather than the text 'nan'. Tests will still fail due to ordering of fields, so [skip ci] Refs issue #1004
→ <<cset 74a4a5a6360e>>
-
reporter Adjusted number of decimal places for the filter thickness to match the test decimal places. Running the DX export test on its own passes. Refs issue #1004
→ <<cset 87b9c752694d>>
-
reporter - edited description
-
reporter Split the combined csv and xlsx mammo export code into separate routines so that I can update xlsx one to use the Pandas dataframe method. No functional changes at the moment. Refs issue #1004
→ <<cset 134055bc3f84>>
-
reporter Reporting correct figure for blank accession numbers in the task update. Need to chunk blank accession number data [skip ci]. Refs issue #1004
→ <<cset 2d0d9491fb8f>>
-
reporter Ensuring field name lists are correct. May refactor some of this. Still need to chunk blank accession number data [skip ci]. Refs issue #1004
→ <<cset e9841779346f>>
-
repo owner Removed creation of Windows-specific temporary files in debug mode [skip ci]. Refs issue #1004
→ <<cset 34c667faad31>>
-
reporter Chunking exports for studies with no accession numbers. I have restructered the code so that I can refactor large elements to reduce duplication (not done yet) [skip ci]. Refs issue #1004
→ <<cset 50e5533dcd4e>>
-
reporter Refactored the chunking export code [skip ci]. Refs issue #1004
→ <<cset 942348c11f34>>
-
reporter Addressing some codacy issues [skip ci]. Refs issue #1004
→ <<cset 5104ae983c2a>>
-
reporter Need to change the call to run_in_background in the mgxlsx1 method in exportviews.py because xlsx=True is no longer required, and breaks the export if left in. Cannot commit at the moment due to change in network security at work.
job = run_in_background( exportMG2excel, "export_mg", request.GET, pid=pid["pidgroup"], name=pid["include_names"], patid=pid["include_pat_id"], user=request.user.id, # Need to remove this line xlsx=True, )
-
reporter Removing unrequired parameter from call to mammo xlsx export [skip ci]. Refs issue #1004
→ <<cset 605a5250f3c6>>
-
reporter Added new tool to check for status of standard name mapping and use it throughout the code base [skip ci]. Sort-of refs issue #1004
→ <<cset 5d4060d5338e>>
-
reporter - edited description
-
Comma separated image view modifiers. refs issue #1004
→ <<cset 9d7eb39e8324>>
-
Use pk to remove duplicates. refs issue #1004
→ <<cset ce5fb6476309>>
-
Added sort by pk for consistent output. refs issue #1004
→ <<cset ce1d97f31ca8>>
-
pandas nm export. refs issue #1004
→ <<cset e0cd4733a142>>
-
fix datetime format bug. refs issue #1004
→ <<cset 0c6d95027032>>
-
fix date maxrows. refs issue #1004
→ <<cset deb9583bb1c9>>
-
reporter I have tested the mammo export and found a couple of issues. I have fixed these, but am unable to commit the changes on my work laptop due to some new network security features. I will commit them from home later.
-
reporter Fixing two problems that were causing mammography exports to fail on my test system. Moving the 'Mode' field to the category list rather than value list for mammo exports. Removing 'Filter thickness (mm)' field from list for MG as well as DX modalities. Refs issue #1004
→ <<cset 59421fafc4b5>>
-
Merged in issue1029mammoperpulse (pull request #602)
Fixes
#1020and#1029. Export of per-event detail mammo is outstanding. Refs #1004→ <<cset 8d238cdd74b4>>
-
no standard names in nm export. refs issue #1004
→ <<cset cef41bfb740e>>
-
reporter Removing standard name headings from nuclear medicine export files because standard name feature is not available for this modality yet. Refs issue #1004
→ <<cset 4d1316fb90d1>>
-
multiple pulse export mg. refs issue #1004
→ <<cset 424917c62914>>
-
total agd left and right. refs issue #1004
→ <<cset d4c41fcc0e19>>
-
patient date of birth and refactor date. refs issue #1004
→ <<cset 9cf33cb30640>>
-
data export naming inconsistency. refs issue #1004
→ <<cset cfaca95a20c0>>
-
get short filter and target name case insensitive. refs issue #1004
→ <<cset e994d6cf104e>>
-
fix codacy issue. refs issue #1004
→ <<cset 0712d7bfd063>>
- Log in to comment
Added constant_memory option to creation of Workbook to ensure memory usage during export remains constant. Refs issue #1004
→ <<cset fd8ad42c3fde>>