Issue #1004 new

David Platten created an issue 2023-08-23

Exporting a large amount of (CT) data to XLSX consumes a large amount of server memory. The memory consumption increases as the “Write study” count increases during the export. This is causing me some issues, with memory usage in excess of 1 GB during exports.

‌

The memory consumption is related to the way the Workbook is created in line 706 of export_common.py:

book = Workbook(temp_xlsx, {"strings_to_numbers": False})

‌

This default mode means that XlsxWriter holds all cell data in memory. From reading the XlsxWriter docs (https://xlsxwriter.readthedocs.io/working_with_memory.html) this can be addressed by changing the above line to:

book = Workbook(temp_xlsx, {"strings_to_numbers": False, "constant_memory": True})

‌

The docs state that the trade-off when using 'constant_memory' mode is that you won’t be able to take advantage of any new features that manipulate cell data after it is written. Currently the add_table() method doesn’t work in this mode and merge_range() and set_row() only work for the current row.

‌

We don’t currently use add_table(), merge_range() or set_row() so I think it is fine to implement this change.

‌

On my system the memory usage with the “constant_memory” option remains at 238 MB regardless of how many studies are being exported.

‌

This will be implemented for CT, radiography and mammography [and consider nuc med]

‌

Need to consider the order of the columns in the exports from this branch compared with those from the previous version.

‌

Status and progress

CT xlsx export now uses the new Pandas code; the tests work for the new code

[1st March 2024: done] Duplicate the common code file, and reinstate the original version so that the RF exports continue to work as they did
The CT code now works with single- and dual-source systems

DX: implement new method and check that tests run

Pandas code implemented and tests pass

MG: implement new method and check that tests run

The existing code is structured differently to the DX and CT code: the csv and xlsx code is intertwined
The MG code has been separated into a method for csv and for xlsx
Need to implement the new Pandas method for the xlsx MG export code

RF: don't do anything - the code needs an overhaul or rethink

NN: consider looking at this

Comments (77)

David Platten reporter
Added constant_memory option to creation of Workbook to ensure memory usage during export remains constant. Refs issue #1004

→ <<cset fd8ad42c3fde>>
- 2023-08-23T15:43:26+00:00
David Platten reporter
It is worth noting that the csv writer does not suffer from the same high memory footprint because by default it writes the file to disk as the export progresses.
- 2023-08-23T15:58:05+00:00
David Platten reporter
This small change has broken the “Summary” and “All data” worksheets, which is why the rf test is now failing. Hooray for tests.

The change I have made writes each row of a worksheet to the file. Updating individual cells doesn’t work nor does updating rows that have already been written…

For my change to be workable the export code will need to be rewritten so that the rows in a sheet are written out in order from top to bottom.

@Ed McDonagh - what do you think about this issue?
- 2023-08-24T08:22:00+00:00
David Platten reporter
For the “Summary” page I am inclined to create a pandas data frame containing the info we need, and then write that to the worksheet. The same thing would work for the “All data” sheet.
- 2023-08-24T08:27:36+00:00
Ed McDonagh
I think a lot of the export code could do with being optimised, particularly the RF, and we should be making use of pandas data frames as you suggest.

Do you have capacity to “do enough” to get this working sufficiently well for release?

I’ve not noticed this on Linux, but I should look more closely to see if it is working in the same way.
- 2023-08-24T08:34:39+00:00
David Platten reporter
I think I should probably look at issue #927 again and get that implemented with the current development code. That issue’s branch already uses the “constant_memory” setting, and also greatly speeds up the export process.

‌
- 2023-08-24T08:36:13+00:00
Ed McDonagh
Good plan. I'd forgotten about that issue 🤦
- 2023-08-24T08:39:11+00:00
David Platten reporter
I won’t be able to get this working for a 1.0 release, but will implement it for whatever comes after that.
- 2023-08-24T08:45:55+00:00
Ed McDonagh
Ok. I need to see how much of an issue this is in Linux, and then we can document it for this release and fix it for the next one.
- 2023-08-24T08:47:49+00:00
David Platten reporter
Summary sheet now populated using Pandas data frames. Note: the exported acquisition data and standard acquisition data are incorrect at the moment if the user has filtered by those fields. This needs fixing. Refs issue #1004 and issue #927

→ <<cset bf954d537e31>>
- 2023-08-25T11:24:15+00:00
David Platten reporter
I think that the exported acquisition data and standard acquisition data are now correct, but I have not thoroughly tested this, nor have I written any tests. I have only been testing this with CT data - the other modalities need checking. Refs issue #1004 and issue #927

→ <<cset f876a7660dd9>>
- 2023-08-25T14:03:58+00:00
David Platten reporter
Updated the generate_sheets method to use a dataframe - this avoids iterating through exams within a queryset of studies. Refs issue #1004 and issue #927

→ <<cset 105259f2bfb0>>
- 2023-08-25T14:56:46+00:00
David Platten reporter
Added missing code to check modality. Refs issue #1004 and issue #927

→ <<cset 0442d7c62362>>
- 2023-08-25T15:58:33+00:00
David Platten reporter
Checking if standard names are enabled before trying to create a list of standard acquisition names. Refs issue #1004 and issue #927

→ <<cset 2112abbe95d6>>
- 2023-08-29T08:46:26+00:00
David Platten reporter
CT xlsx export now being written out using Pandas data frames. To do: include standard study names in the export; refactor out some code to export_common.py; extend to other modalities. Refs issue #1004 and issue #927

→ <<cset 64c8cfb0604d>>
- 2023-08-31T15:18:37+00:00
David Platten reporter
Standard names now being written out correctly for CT in all data sheet. Acquisition data being duplicated - need to fix. This work is incomplete, so [skip ci]. Refs issue #1004 and issue #927

→ <<cset 1be07ff164d2>>
- 2023-09-04T14:28:14+00:00
David Platten reporter
Acquisition data being written out correctly for CT. I need to update some of the dataframe datatypes after creation to reduce memory consumption. We may have an issue with a limit on the number of sheets allowed in a spread sheet. Still incomplete so [skip ci]. Refs issue #1004 and issue #927

→ <<cset 3851ac79df5a>>
- 2023-09-04T16:26:28+00:00
David Platten reporter
Nearly there. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927

→ <<cset 1affa1424545>>
- 2023-09-05T15:59:18+00:00
David Platten reporter
Fixing some bugs that became apparent when exporting some real data. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927

→ <<cset a21b37b46e6a>>
- 2023-09-06T15:41:48+00:00
David Platten reporter
Making use of the text_and_date_formats method - I've changed the default column number for the study date - this probably has implications for the other modalities. Still need to update the blank accession number section, so [skip ci]. Refs issue #1004 and issue #927

→ <<cset ec22c837f4e4>>
- 2023-09-07T15:21:57+00:00
David Platten reporter
Writing out the blank accession number section. Still incomplete so [skip ci]. Refs issue #1004 and issue #927

→ <<cset 0ed7fc1acbb9>>
- 2023-09-08T13:07:40+00:00
Ed McDonagh
- assigned issue to
  
  David Platten
- 2024-02-16T14:12:57+00:00
David Platten reporter
- edited description
- 2024-02-16T14:15:45+00:00
David Platten reporter
CT xlsx export file time and date columns now correctly formatted for all sheets [skip ci]. Refs issue #1004

→ <<cset 3f6981d28650>>
- 2024-02-16T15:41:15+00:00
David Platten reporter
Updated CT xlsx export test file. Refs issue #1004

→ <<cset 4a3e934f16f9>>
- 2024-02-16T16:27:54+00:00
David Platten reporter
Reverted export_common.py to develop version (except for the OpenREM version check). CT export code uses the new export_common_pandas.py file; all other exports use the original export code at the moment. Refs issue #1004

→ <<cset a3c08dce802c>>
- 2024-03-01T15:08:10+00:00
David Platten reporter
- edited description
- 2024-03-04T08:45:38+00:00
David Platten reporter
Started to implement Pandas export for DX. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset b82977162256>>
- 2024-03-04T17:09:30+00:00
David Platten reporter
Modified the CT export code in preparation for refactoring it out for more general use. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset 17379e53ca58>>
- 2024-03-05T11:44:47+00:00
David Platten reporter
Refactored out the pandas code for more general use. Not fully working, so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset d632a2b984f6>>
- 2024-03-05T12:32:47+00:00
David Platten reporter
Using refactored pandas code for the DX modality. Not fully working (order of fields), so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset 668eefa339fb>>
- 2024-03-05T14:42:31+00:00
David Platten reporter
Changed order of fields in the exported files, and sorting them by descending date and time. This has broken the column formatting in the Excel file (dates, values etc), so need to fix that. Not fully working so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset 53c1fe979ac7>>
- 2024-03-06T10:51:54+00:00
David Platten reporter
Fixed bug that caused export failure if there were no standard name entries in the data being exported. Also replaced standard name status with refactored routine. Also fixed column formatting for some scenarios. Not fully working so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset 5133118afd3e>>
- 2024-03-06T13:18:28+00:00
David Platten reporter
Including modality in the export file name, rather than using a hard-written 'ct'. Not fully working so no point in running the tests [skip ci]. Refs issue #1004

→ <<cset af8e33bb2f10>>
- 2024-03-06T13:30:30+00:00
David Platten reporter
- edited description
- 2024-04-09T15:30:06+00:00
David Platten reporter
- edited description
- 2024-04-09T15:38:44+00:00
David Platten reporter
- edited description
- 2024-04-09T15:41:23+00:00
David Platten reporter
- edited description
- 2024-04-12T11:05:16+00:00
David Platten reporter
- edited description
- 2024-04-12T13:12:41+00:00
David Platten reporter
Updated CT export code to include two x-ray sources. Realised the export code fails if there are no accession numbers in the data - needs looking at. Refs issue #1004

→ <<cset cef9fa128984>>
- 2024-04-15T15:51:07+00:00
David Platten reporter
Updated export code to cope when there are no accession numbers in the data. Refs issue #1004

→ <<cset 6b20e341d608>>
- 2024-04-16T13:17:27+00:00
David Platten reporter
- edited description
- 2024-04-16T13:18:53+00:00
David Platten reporter
Filter information was entirely missing from radiographic xlsx exports. Partially-enabled this now. If an exposure has more than one filter (such as the DX-Im-Carestream_DR7500-2.dcm file) then two rows will be present in the data queryset - one for each filter (lines 1283 and 1448 of export_common_pandas.py). These multiple rows need to be merged into a single row in export_using_pandas in a similar way to how dual source CT is candled. As the code currently stands only one filter will be present in the exported xlsx file for exposures that actually had more than one filter in-use. The tests will still fail, so [skip ci]. Refs issue #1004

→ <<cset 823012c18542>>
- 2024-04-17T10:15:56+00:00
David Platten reporter
Refactored quite a bit of the export_using_pandas.py into individual methods to avoid duplication. Added code to ensure CT dual source dealt with correctly for studies without accession numbers. Tests will still fail due to radiographic filter info not being correctly exported yet, so [skip ci]. Refs issue #1004

→ <<cset b4e67676f875>>
- 2024-04-18T10:47:38+00:00
David Platten reporter
Added code to simplify filter material name in the output, and to calculate the mean filter thickness from the min and max values. Tests will still fail due to radiographic filter info not being correctly exported yet, so [skip ci]. Refs issue #1004

→ <<cset b87b35737814>>
- 2024-04-18T16:24:27+00:00
David Platten reporter
Radiographic filter material and thickness are now combined correctly in the xlsx output file. Tests will still fail due to ordering of fields, so [skip ci] Refs issue #1004

→ <<cset 59ed28f13fe5>>
- 2024-04-23T11:33:42+00:00
David Platten reporter
Radiographic filter thickness is now combined in a way that missing data results in a blank in the xlsx file, rather than the text 'nan'. Tests will still fail due to ordering of fields, so [skip ci] Refs issue #1004

→ <<cset 74a4a5a6360e>>
- 2024-04-23T12:43:02+00:00
David Platten reporter
Adjusted number of decimal places for the filter thickness to match the test decimal places. Running the DX export test on its own passes. Refs issue #1004

→ <<cset 87b9c752694d>>
- 2024-04-23T15:42:36+00:00
David Platten reporter
- edited description
- 2024-04-23T15:55:33+00:00
David Platten reporter
Split the combined csv and xlsx mammo export code into separate routines so that I can update xlsx one to use the Pandas dataframe method. No functional changes at the moment. Refs issue #1004

→ <<cset 134055bc3f84>>
- 2024-04-24T09:29:53+00:00
David Platten reporter
Reporting correct figure for blank accession numbers in the task update. Need to chunk blank accession number data [skip ci]. Refs issue #1004

→ <<cset 2d0d9491fb8f>>
- 2024-04-25T08:38:30+00:00
David Platten reporter
Ensuring field name lists are correct. May refactor some of this. Still need to chunk blank accession number data [skip ci]. Refs issue #1004

→ <<cset e9841779346f>>
- 2024-04-26T16:02:31+00:00
OpenREM repo owner
Removed creation of Windows-specific temporary files in debug mode [skip ci]. Refs issue #1004

→ <<cset 34c667faad31>>
- 2024-04-28T08:43:39+00:00
David Platten reporter
Chunking exports for studies with no accession numbers. I have restructered the code so that I can refactor large elements to reduce duplication (not done yet) [skip ci]. Refs issue #1004

→ <<cset 50e5533dcd4e>>
- 2024-04-30T11:00:19+00:00
David Platten reporter
Refactored the chunking export code [skip ci]. Refs issue #1004

→ <<cset 942348c11f34>>
- 2024-04-30T11:18:37+00:00
David Platten reporter
Addressing some codacy issues [skip ci]. Refs issue #1004

→ <<cset 5104ae983c2a>>
- 2024-04-30T12:53:14+00:00

David Platten reporter

Need to change the call to run_in_background in the mgxlsx1 method in exportviews.py because xlsx=True is no longer required, and breaks the export if left in. Cannot commit at the moment due to change in network security at work.

job = run_in_background(
    exportMG2excel,
    "export_mg",
    request.GET,
    pid=pid["pidgroup"],
    name=pid["include_names"],
    patid=pid["include_pat_id"],
    user=request.user.id,
    # Need to remove this line xlsx=True,
)

‌

2024-05-21T12:59:57+00:00

David Platten reporter
Removing unrequired parameter from call to mammo xlsx export [skip ci]. Refs issue #1004

→ <<cset 605a5250f3c6>>
- 2024-05-21T14:13:10+00:00
David Platten reporter
Added new tool to check for status of standard name mapping and use it throughout the code base [skip ci]. Sort-of refs issue #1004

→ <<cset 5d4060d5338e>>
- 2024-05-21T14:38:01+00:00
David Platten reporter
- edited description
- 2024-05-22T09:33:48+00:00
Jordan Suter
Comma separated image view modifiers. refs issue #1004

→ <<cset 9d7eb39e8324>>
- 2024-07-25T11:32:28+00:00
Jordan Suter
Use pk to remove duplicates. refs issue #1004

→ <<cset ce5fb6476309>>
- 2024-07-26T08:33:55+00:00
Jordan Suter
Added sort by pk for consistent output. refs issue #1004

→ <<cset ce1d97f31ca8>>
- 2024-07-26T12:31:21+00:00
Jordan Suter
pandas nm export. refs issue #1004

→ <<cset e0cd4733a142>>
- 2024-08-09T10:04:51+00:00
Jordan Suter
fix datetime format bug. refs issue #1004

→ <<cset 0c6d95027032>>
- 2024-08-12T14:13:58+00:00
Jordan Suter
fix date maxrows. refs issue #1004

→ <<cset deb9583bb1c9>>
- 2024-08-13T08:17:14+00:00
David Platten reporter
I have tested the mammo export and found a couple of issues. I have fixed these, but am unable to commit the changes on my work laptop due to some new network security features. I will commit them from home later.

‌
- 2024-08-21T13:20:47+00:00
David Platten reporter
Fixing two problems that were causing mammography exports to fail on my test system. Moving the 'Mode' field to the category list rather than value list for mammo exports. Removing 'Filter thickness (mm)' field from list for MG as well as DX modalities. Refs issue #1004

→ <<cset 59421fafc4b5>>
- 2024-08-21T16:22:12+00:00
Ed McDonagh
Merged in issue1029mammoperpulse (pull request #602)

Fixes ~~#1020~~ and ~~#1029~~. Export of per-event detail mammo is outstanding. Refs #1004

→ <<cset 8d238cdd74b4>>
- 2024-08-22T06:28:06+00:00
Jordan Suter
no standard names in nm export. refs issue #1004

→ <<cset cef41bfb740e>>
- 2024-08-23T14:23:15+00:00
David Platten reporter
Removing standard name headings from nuclear medicine export files because standard name feature is not available for this modality yet. Refs issue #1004

→ <<cset 4d1316fb90d1>>
- 2024-08-27T15:57:59+00:00
Jordan Suter
multiple pulse export mg. refs issue #1004

→ <<cset 424917c62914>>
- 2024-09-03T09:13:03+00:00
Jordan Suter
total agd left and right. refs issue #1004

→ <<cset d4c41fcc0e19>>
- 2024-09-03T09:13:03+00:00
Jordan Suter
patient date of birth and refactor date. refs issue #1004

→ <<cset 9cf33cb30640>>
- 2024-09-03T09:13:03+00:00
Jordan Suter
data export naming inconsistency. refs issue #1004

→ <<cset cfaca95a20c0>>
- 2024-09-03T11:50:53+00:00
Jordan Suter
get short filter and target name case insensitive. refs issue #1004

→ <<cset e994d6cf104e>>
- 2024-09-03T11:50:54+00:00
Jordan Suter
fix codacy issue. refs issue #1004

→ <<cset 0712d7bfd063>>
- 2024-09-03T13:55:15+00:00
Log in to comment

Assignee: David Platten

Type: enhancement

Priority: major

Status: new

Component: Export: All

Milestone: 1.0.0

Votes: 0

Watchers: 1