Improve memory efficiency of chart DataFrame creation
No description provided.
Comments (26)
-
reporter -
reporter Updating changes file [skip ci]. Refs issue #870 → <<cset 5c8106f95adf>>
-
reporter Updated Plotly to latest version. This solves the multi-index error that Ed and I have seen. I've also upgraded the Plotly JavaScript file. Upgrading Plotly has changed the data that is returned for the tests a little, so I've updated the test files too. They all run fine on my local computer. I have also updated pandas to the latest version. Refs issue #870 (sort of) → <<cset e03f1fa4eff2>>
-
reporter Making weekday column categorical to save memory. Refs issue #870 → <<cset e3d298ebadef>>
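The saving from making the weekday column categorical can be illustrated with a small self-contained pandas sketch (the data below is made up for illustration and is not OpenREM's actual schema; pandas is assumed to be installed):

```python
import pandas as pd

# A long column of repeated weekday strings, as produced when every
# study row carries its weekday name as a plain Python string.
weekdays = pd.Series(
    ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"] * 10_000
)

# A categorical stores each unique label once and keeps only a small
# integer code per row.
as_category = weekdays.astype("category")

object_bytes = weekdays.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object dtype:   {object_bytes:,} bytes")
print(f"category dtype: {category_bytes:,} bytes")
assert category_bytes < object_bytes
```

With only seven possible weekday values, the per-row cost drops from a full Python string object to a single small integer code, which is why this change helps most on columns with few distinct values.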
-
reporter Matching the requirements files. Updating spacing for some charts. Simplifying weekday dataframe creation. Refs issue #870 → <<cset ad98287563d2>>
-
reporter - changed status to resolved
Merged in issue870improveDataFrameMemoryEfficiency (pull request #426)
Using values_list rather than values to read the Django queryset into the DataFrame. This reduces the memory requirement by around 50%. It is also faster (3 s vs 5 s for 600,000 radiographic studies when plotting the workload chart). https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe/29990874
Approved-by: Ed McDonagh ed@mcdonagh.org.uk
Fixes issue #870 → <<cset 2f90276772f2>>
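The memory difference comes from `values()` yielding one dict per row, repeating the key strings and paying dict overhead, while `values_list()` yields plain tuples. A stdlib-only sketch of the per-row overhead (the row contents are made up for illustration):

```python
import sys

# values() returns an iterable of dicts: every row repeats the field
# names as keys and pays the dict container overhead.
row_as_dict = {"pk": 1, "study_date": "2021-05-01", "modality": "DX"}

# values_list() returns plain tuples: just the cell values, no keys.
row_as_tuple = (1, "2021-05-01", "DX")

dict_bytes = sys.getsizeof(row_as_dict)
tuple_bytes = sys.getsizeof(row_as_tuple)

print(f"dict row:  {dict_bytes} bytes")
print(f"tuple row: {tuple_bytes} bytes")
assert tuple_bytes < dict_bytes
```

As the linked Stack Overflow answer describes, the tuples from `values_list()` can then be fed to `pandas.DataFrame.from_records()` together with an explicit column list.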
-
reporter - changed status to open
I'd like to make some further improvements to this.
-
Can you run Black on this one then? The chart functions and tests need doing.
-
reporter Commented DataFrame creation more thoroughly. Removed loop through names. Refs issue #870 → <<cset ae0b018b2c14>>
-
reporter Ran Black. Refs issue #870 → <<cset 9a0ad203ef73>>
-
reporter Trying to appease Codacy. Refs issue #870 → <<cset 66edf9858c80>>
-
reporter @Ed McDonagh Codacy is happy now.
-
reporter Removed the custom ordering on the query set before creation of chart DataFrame. This custom ordering isn't needed by the chart routines (it is used to order the summary view tables in the order specified by the user), and slows down the DataFrame creation. In some of my tests this reduces the DataFrame creation time from 10 s to 7 s. Refs issue #870 → <<cset de73ed7ac117>>
-
reporter - changed status to resolved
Merged in issue870improveDataFrameCreationMemoryEfficiency (pull request #436)
Issue870improveDataFrameCreationMemoryEfficiency
Approved-by: Ed McDonagh
Fixes issue #870. Refs issue #881 → <<cset 8597e2d03d33>>
-
reporter - changed status to open
I would like to further improve the memory efficiency of the chart DataFrame creation.
-
reporter In the `create_dataframe` method of `chart_functions.py` the value fields can be made `float32` once they have been scaled by the multiplier; this halves the memory requirement for these fields (they are `float64` otherwise):

```python
for idx, value_field in enumerate(field_dict["values"]):
    if data_point_value_multipliers:
        df[value_field] *= data_point_value_multipliers[idx]
    df[value_field] = df[value_field].astype("float32")
```

Also, the `uid` field, if present, can be made `int32` rather than `int64` once the DataFrame has been created. This is on the assumption that the OpenREM database will never have more than 2 billion or so entries in a table (could do with making these values unsigned really):

```python
if uid:
    df[uid] = df[uid].astype("int32")
```
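A quick self-contained check of the halving claim (pandas is assumed to be installed; the column name and multiplier are illustrative, not OpenREM's actual values):

```python
import pandas as pd

# Value columns arrive as float64 by default.
df = pd.DataFrame({"value": [1.5, 2.5, 3.5, 4.5]})
bytes_f64 = df["value"].memory_usage(index=False)

# Downcast after the multiplier has been applied, as suggested above.
df["value"] = (df["value"] * 0.001).astype("float32")
bytes_f32 = df["value"].memory_usage(index=False)

assert df["value"].dtype == "float32"
assert bytes_f32 * 2 == bytes_f64  # 4 bytes per value instead of 8
```

Applying the multiplier before the downcast matters: scaling in float64 and then truncating keeps the full precision of the multiplication, whereas downcasting first would accumulate rounding error.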
-
reporter Made primary key field an unsigned 32-bit int (max value 4294967295). Made value fields 32-bit floats once they've been scaled by their multiplier. Refs issue #870 → <<cset 8852b6df3cfe>>
-
reporter Made primary key field an unsigned 32-bit int that can cope with NaN values. Reduced precision of chart data tests now we're using single-precision values. Refs issue #870 → <<cset 5fefbb612000>>
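Coping with NaN in an unsigned integer column relies on pandas' nullable extension dtypes, since a plain numpy `uint32` column cannot hold missing values. A minimal sketch (assumes pandas 1.0 or later; the data is illustrative):

```python
import pandas as pd

# A primary-key column read into a DataFrame may contain NaN where a
# related table has no entry. The column is float64 because of the NaN.
pk = pd.Series([1.0, 2.0, None])

# The capitalised "UInt32" is pandas' nullable extension dtype: 4 bytes
# per value plus a mask, and missing entries become pd.NA.
pk_u32 = pk.astype("UInt32")

assert str(pk_u32.dtype) == "UInt32"
assert pk_u32.isna().sum() == 1
assert pk_u32.iloc[0] == 1
```

The lowercase `"uint32"` spelling would select the plain numpy dtype and raise on the missing value, so the capitalisation is significant here.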
-
reporter Made precision of dx and mg chart tests more appropriate. Refs issue #870 → <<cset fd50443779b8>>
-
reporter Updating pandas and matplotlib versions to reflect my test system - these are the latest releases. Refs issue #870 → <<cset 4b12a8b5b7db>>
-
reporter Pandas 1.2.4 requires Python 3.7.1 or above. Refs issue #870 → <<cset fe08b59ea086>>
-
reporter The latest Matplotlib is not compatible with Python 3.6. Downgrading to the newest version that is. Refs issue #870 → <<cset 8f13fa3e9598>>
-
reporter My test system has Python 3.9.5 and the following packages installed:
Django 2.2.23
OpenREM 1.0.0.dev0
Pillow 8.2.0
XlsxWriter 1.2.8
amqp 2.6.1
billiard 3.6.4.0
celery 4.4.7
certifi 2020.12.5
chardet 3.0.4
cycler 0.10.0
defusedxml 0.6.0
django-crispy-forms 1.9.0
django-filter 2.2.0
django-js-reverse 0.9.1
django-qsstats-magic 1.1.0
django-solo 1.1.3
flower 0.9.5
gunicorn 20.0.4
humanize 2.0.0
idna 2.10
kiwisolver 1.3.1
kombu 4.6.11
matplotlib 3.4.2
mock 4.0.2
numpy 1.19.4
pandas 1.2.4
pip 20.3.4
pkg-resources 0.0.0
plotly 4.14.1
prometheus-client 0.8.0
psycopg2-binary 2.8.6
pydicom 2.0.0
pynetdicom 1.5.7
pyparsing 2.4.7
python-dateutil 2.8.1
pytz 2021.1
requests 2.23.0
retrying 1.3.3
scipy 1.5.4
setuptools 44.1.1
six 1.16.0
sqlparse 0.4.1
testfixtures 6.14.0
tornado 6.1
urllib3 1.25.11
vine 1.3.0
xlrd 1.2.0
-
reporter Refactored some chart test methods to reduce duplication. Reduced number of decimal places in some test data. Refs issue #870 and refs issue #686 → <<cset 05da801c9524>>
-
reporter Updated pandas, numpy and scipy to their latest versions. Refs issue #870 → <<cset c4ca33924b63>>
-
reporter - changed status to resolved
Merged in issue870ChartDataFrameMemoryOptimisation (pull request #455)
Issue870ChartDataFrameMemoryOptimisation
Approved-by: Ed McDonagh
Fixes issue #870 → <<cset e077d9dd088b>>
-
Using values_list rather than values to read the Django queryset into the DataFrame. This reduces the memory requirement by around 50%. It is also faster (3 s vs 5 s for 600,000 radiographic studies when plotting the workload chart). https://stackoverflow.com/questions/11697887/converting-django-queryset-to-pandas-dataframe/29990874. Refs issue #870 → <<cset 721a1d698509>>