Improve memory efficiency of chart DataFrame creation
No description provided.
Comments (26)
-
reporter -
reporter Updating changes file [skip ci]. Refs issue #870 → <<cset 5c8106f95adf>>
-
reporter Updated Plotly to latest version. This solves the multi-index error that Ed and I have seen. I've also upgraded the Plotly JavaScript file. Upgrading Plotly has changed the data that is returned for the tests a little, so I've updated the test files too. They all run fine on my local computer. I have also updated pandas to the latest version. Refs issue #870 (sort of) → <<cset e03f1fa4eff2>>
-
reporter Making weekday column categorical to save memory. Refs issue #870 → <<cset e3d298ebadef>>
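The saving from making the weekday column categorical can be illustrated with a small self-contained pandas sketch (the data below is made up for illustration and is not OpenREM's actual schema; pandas is assumed to be installed):

```python
import pandas as pd

# A long column of repeated weekday strings, as produced when every
# study row carries its weekday name as a plain Python string.
weekdays = pd.Series(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] * 10_000
)

# A categorical stores each unique label once and keeps only a small
# integer code per row.
as_category = weekdays.astype("category")

object_bytes = weekdays.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")
assert category_bytes < object_bytes
```

With only seven possible weekday values, the per-row cost drops from a full Python string object to a single small integer code, which is why this change helps most on columns with few distinct values.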
-
reporter Matching the requirements files. Updating spacing for some charts. Simplifying weekday dataframe creation. Refs issue #870 → <<cset ad98287563d2>>
-
reporter - changed status to resolved
Merged in issue870improveDataFrameMemoryEfficiency (pull request #426)
Using values_list rather than values to read the Django queryset into the DataFrame. This reduces the memory requirement by around 50%. It is also faster (3 s vs 5 s for 600,000 radiographic studies when plotting the workload chart). https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe/29990874
Approved-by: Ed McDonagh ed@mcdonagh.org.uk
Fixes issue #870 → <<cset 2f90276772f2>>
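The memory difference comes from `values()` yielding one dict per row, repeating the key strings and paying dict overhead, while `values_list()` yields plain tuples. A stdlib-only sketch of the per-row overhead (the row contents are made up for illustration):

```python
import sys

# values() returns an iterable of dicts: every row repeats the field
# names as keys and pays the dict container overhead.
row_as_dict = {"pk": 1, "study_date": "2021-05-01", "modality": "DX"}

# values_list() returns plain tuples: just the cell values, no keys.
row_as_tuple = (1, "2021-05-01", "DX")

dict_bytes = sys.getsizeof(row_as_dict)
tuple_bytes = sys.getsizeof(row_as_tuple)

print(f"dict row:  {dict_bytes} bytes")
print(f"tuple row: {tuple_bytes} bytes")
assert tuple_bytes < dict_bytes
```

As the linked Stack Overflow answer describes, the tuples from `values_list()` can then be fed to `pandas.DataFrame.from_records()` together with an explicit column list.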
-
reporter - changed status to open
I'd like to make some further improvements to this.
-
Can you run Black on this one then? The chart functions and tests need doing.
-
reporter Commented DataFrame creation more thoroughly. Removed loop through names. Refs issue #870 → <<cset ae0b018b2c14>>
-
reporter Ran Black. Refs issue #870 → <<cset 9a0ad203ef73>>
-
reporter Trying to appease Codacy. Refs issue #870 → <<cset 66edf9858c80>>
-
reporter @Ed McDonagh Codacy is happy now.
-
reporter Removed the custom ordering on the query set before creation of chart DataFrame. This custom ordering isn't needed by the chart routines (it is used to order the summary view tables in the order specified by the user), and slows down the DataFrame creation. In some of my tests this reduces the DataFrame creation time from 10 s to 7 s. Refs issue #870 → <<cset de73ed7ac117>>
-
reporter - changed status to resolved
Merged in issue870improveDataFrameCreationMemoryEfficiency (pull request #436)
Issue870improveDataFrameCreationMemoryEfficiency
Approved-by: Ed McDonagh
Fixes issue #870. Refs issue #881 → <<cset 8597e2d03d33>>
-
reporter - changed status to open
I would like to further improve the memory efficiency of the chart DataFrame creation.
-
reporter In the `create_dataframe` method of `chart_functions.py` the value fields can be made `float32` once they have been scaled by the multiplier; this halves the memory requirement for these fields (they are `float64` otherwise):

```python
for idx, value_field in enumerate(field_dict["values"]):
    if data_point_value_multipliers:
        df[value_field] *= data_point_value_multipliers[idx]
    df[value_field] = df[value_field].astype("float32")
```

Also, the `uid` field, if present, can be made `int32` rather than `int64` once the DataFrame has been created. This is on the assumption that the OpenREM database will never have more than 2 billion or so entries in a table (could do with making these values unsigned really):

```python
if uid:
    df[uid] = df[uid].astype("int32")
```
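A quick self-contained check of the halving claim (pandas is assumed to be installed; the column name and multiplier are illustrative, not OpenREM's actual values):

```python
import pandas as pd

# Value columns arrive as float64 by default.
df = pd.DataFrame({"value": [1.5, 2.5, 3.5, 4.5]})
bytes_f64 = df["value"].memory_usage(index=False)

# Downcast after the multiplier has been applied, as suggested above.
df["value"] = (df["value"] * 0.001).astype("float32")
bytes_f32 = df["value"].memory_usage(index=False)

assert df["value"].dtype == "float32"
assert bytes_f32 * 2 == bytes_f64  # 4 bytes per value instead of 8
```

Applying the multiplier before the downcast matters: scaling in float64 and then truncating keeps the full precision of the multiplication, whereas downcasting first would accumulate rounding error.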
-
reporter Made primary key field an unsigned 32-bit int (max value 4294967295). Made value fields 32-bit floats once they've been scaled by their multiplier. Refs issue #870 → <<cset 8852b6df3cfe>>
-
reporter Made primary key field an unsigned 32-bit int that can cope with NaN values. Reduced precision of chart data tests now we're using single-precision values. Refs issue #870 → <<cset 5fefbb612000>>
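Coping with NaN in an unsigned integer column relies on pandas' nullable extension dtypes, since a plain numpy `uint32` column cannot hold missing values. A minimal sketch (assumes pandas 1.0 or later; the data is illustrative):

```python
import pandas as pd

# A primary-key column read into a DataFrame may contain NaN where a
# related table has no entry. The column is float64 because of the NaN.
pk = pd.Series([1.0, 2.0, None])

# The capitalised "UInt32" is pandas' nullable extension dtype: 4 bytes
# per value plus a mask, and missing entries become pd.NA.
pk_u32 = pk.astype("UInt32")

assert str(pk_u32.dtype) == "UInt32"
assert pk_u32.isna().sum() == 1
assert pk_u32.iloc[0] == 1
```

The lowercase `"uint32"` spelling would select the plain numpy dtype and raise on the missing value, so the capitalisation is significant here.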
-
reporter Made precision of dx and mg chart tests more appropriate. Refs issue #870 → <<cset fd50443779b8>>
-
reporter Updating pandas and matplotlib versions to reflect my test system - these are the latest releases. Refs issue #870 → <<cset 4b12a8b5b7db>>
-
reporter Pandas 1.2.4 requires Python 3.7.1 or above. Refs issue #870 → <<cset fe08b59ea086>>
-
reporter The latest Matplotlib is not compatible with Python 3.6. Downgrading to the newest version that is. Refs issue #870 → <<cset 8f13fa3e9598>>
-
reporter My test system has Python 3.9.5 and the following packages installed:
Django 2.2.23
OpenREM 1.0.0.dev0
Pillow 8.2.0
XlsxWriter 1.2.8
amqp 2.6.1
billiard 3.6.4.0
celery 4.4.7
certifi 2020.12.5
chardet 3.0.4
cycler 0.10.0
defusedxml 0.6.0
django-crispy-forms 1.9.0
django-filter 2.2.0
django-js-reverse 0.9.1
django-qsstats-magic 1.1.0
django-solo 1.1.3
flower 0.9.5
gunicorn 20.0.4
humanize 2.0.0
idna 2.10
kiwisolver 1.3.1
kombu 4.6.11
matplotlib 3.4.2
mock 4.0.2
numpy 1.19.4
pandas 1.2.4
pip 20.3.4
pkg-resources 0.0.0
plotly 4.14.1
prometheus-client 0.8.0
psycopg2-binary 2.8.6
pydicom 2.0.0
pynetdicom 1.5.7
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2021.1
requests 2.23.0
retrying 1.3.3
scipy 1.5.4
setuptools 44.1.1
six 1.16.0
sqlparse 0.4.1
testfixtures 6.14.0
tornado 6.1
urllib3 1.25.11
vine 1.3.0
xlrd 1.2.0
-
reporter Refactored some chart test methods to reduce duplication. Reduced number of decimal places in some test data. Refs issue #870 and refs issue #686 → <<cset 05da801c9524>>
-
reporter Updated pandas, numpy and scipy to their latest versions. Refs issue #870 → <<cset c4ca33924b63>>
-
reporter - changed status to resolved
Merged in issue870ChartDataFrameMemoryOptimisation (pull request #455)
Issue870ChartDataFrameMemoryOptimisation
Approved-by: Ed McDonagh
Fixes issue #870 → <<cset e077d9dd088b>>
-
Using values_list rather than values to read the Django queryset into the DataFrame. This reduces the memory requirement by around 50%. It is also faster (3 s vs 5 s for 600,000 radiographic studies when plotting the workload chart). https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe/29990874. Refs issue #870 → <<cset 721a1d698509>>