Improve memory efficiency of chart DataFrame creation

Issue #870 resolved
David Platten created an issue

No description provided.

Comments (26)

  1. David Platten reporter

    Updated Plotly to latest version. This solves the multi-index error that Ed and I have seen. I've also upgraded the plotly JavaScript file. Upgrading Plotly has changed the data that is returned for the tests a little, so I've updated the test files too. They all run fine on my local computer. I have also updated pandas to the latest version. Refs issue #870 (sort of)

    → <<cset e03f1fa4eff2>>

  2. David Platten reporter

    Merged in issue870improveDataFrameMemoryEfficiency (pull request #426)

    Using values_list rather than values to read the Django queryset into the DataFrame. This reduces the memory requirement by around 50%. It is also faster (3 s vs 5 s for 600,000 radiographic studies when plotting the workload chart). https://stackoverflow.co

    Approved-by: Ed McDonagh ed@mcdonagh.org.uk

    Fixes issue #870

    → <<cset 2f90276772f2>>
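
    The saving from values_list over values can be illustrated outside Django: values() yields one dict per row while values_list() yields one plain tuple per row. A minimal pandas-only sketch (row contents and column names are hypothetical, not OpenREM's actual fields):

    ```python
    import sys

    import pandas as pd

    # Hypothetical rows standing in for the queryset result: values() would
    # yield one dict per row, values_list() one plain tuple per row
    rows_as_dicts = [{"pk": i, "modality": "DX", "total_dap": 1.5} for i in range(1000)]
    rows_as_tuples = [(i, "DX", 1.5) for i in range(1000)]
    columns = ["pk", "modality", "total_dap"]

    # Both forms produce the same DataFrame, but the tuple form avoids a
    # per-row dict with its repeated key objects and hash-table overhead
    df_from_dicts = pd.DataFrame.from_records(rows_as_dicts)
    df_from_tuples = pd.DataFrame.from_records(rows_as_tuples, columns=columns)

    dict_bytes = sum(sys.getsizeof(r) for r in rows_as_dicts)
    tuple_bytes = sum(sys.getsizeof(r) for r in rows_as_tuples)
    print(df_from_dicts.equals(df_from_tuples), tuple_bytes < dict_bytes)
    ```

    The per-row Python-object overhead is what dominates before the data reaches numpy storage, which is consistent with the roughly 50% saving reported above.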

  3. David Platten reporter

    Removed the custom ordering on the queryset before creation of the chart DataFrame. This custom ordering isn't needed by the chart routines (it is only used to order the summary view tables as specified by the user) and slows down the DataFrame creation. In some of my tests this reduces the DataFrame creation time from 10 s to 7 s. Refs issue #870

    → <<cset de73ed7ac117>>

  4. David Platten reporter
    • changed status to open

    I would like to further improve the memory efficiency of the chart DataFrame creation.

  5. David Platten reporter

    In the create_dataframe method of chart_functions.py the value fields can be made float32 once they have been scaled by the multiplier; this halves the memory requirement for these fields (they are float64 otherwise):

    for idx, value_field in enumerate(field_dict["values"]):
        if data_point_value_multipliers:
            # Apply the per-field multiplier, then drop to single precision;
            # float32 needs half the memory of pandas' default float64
            df[value_field] *= data_point_value_multipliers[idx]
            df[value_field] = df[value_field].astype("float32")
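
    The halving can be checked with a small pandas-only sketch (the column name and multiplier are made up for illustration):

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical value column; pandas stores Python floats as float64
    df = pd.DataFrame({"total_dap": np.random.rand(100_000)})
    bytes_float64 = df["total_dap"].memory_usage(index=False)

    df["total_dap"] *= 1000.0  # made-up multiplier
    df["total_dap"] = df["total_dap"].astype("float32")
    bytes_float32 = df["total_dap"].memory_usage(index=False)

    # The float32 column occupies exactly half the bytes of the float64 one
    print(bytes_float64, bytes_float32)
    ```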
    

    Also, the uid field, if present, can be made int32 rather than int64 once the DataFrame has been created. This is on the assumption that the OpenREM database will never have more than 2 billion or so entries in a table (could do with making these values unsigned really):

    if uid:
        # int32 halves the memory of the default int64 and is sufficient
        # while table row counts stay below ~2 billion
        df[uid] = df[uid].astype("int32")
    

  6. David Platten reporter

    Made primary key field an unsigned 32-bit int that can cope with NaN values. Reduced precision of chart data tests now we're using single-precision values. Refs issue #870

    → <<cset 5fefbb612000>>
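
    The dtype used here is presumably pandas' nullable "UInt32" extension type, which, unlike plain numpy uint32, can hold missing values. A minimal sketch (column name and values are hypothetical):

    ```python
    import pandas as pd

    # A hypothetical primary-key column with one missing entry; as loaded it
    # is float64, because numpy integer dtypes cannot represent NaN
    df = pd.DataFrame({"pk": [1.0, 2.0, None, 4_000_000_000.0]})

    # "UInt32" (capitalised) is pandas' nullable unsigned 32-bit extension
    # dtype: NaN becomes <NA>, and values up to 2**32 - 1 still fit
    df["pk"] = df["pk"].astype("UInt32")
    print(df["pk"].dtype, df["pk"].isna().sum())
    ```

    This keeps the 4-byte-per-value footprint while still allowing NaN/<NA> in the primary key field.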

  7. David Platten reporter

    My test system has Python 3.9.5 and the following packages installed:

    Django 2.2.23
    OpenREM 1.0.0.dev0
    Pillow 8.2.0
    XlsxWriter 1.2.8
    amqp 2.6.1
    billiard 3.6.4.0
    celery 4.4.7
    certifi 2020.12.5
    chardet 3.0.4
    cycler 0.10.0
    defusedxml 0.6.0
    django-crispy-forms 1.9.0
    django-filter 2.2.0
    django-js-reverse 0.9.1
    django-qsstats-magic 1.1.0
    django-solo 1.1.3
    flower 0.9.5
    gunicorn 20.0.4
    humanize 2.0.0
    idna 2.10
    kiwisolver 1.3.1
    kombu 4.6.11
    matplotlib 3.4.2
    mock 4.0.2
    numpy 1.19.4
    pandas 1.2.4
    pip 20.3.4
    pkg-resources 0.0.0
    plotly 4.14.1
    prometheus-client 0.8.0
    psycopg2-binary 2.8.6
    pydicom 2.0.0
    pynetdicom 1.5.7
    pyparsing 2.4.7
    python-dateutil 2.8.1
    pytz 2021.1
    requests 2.23.0
    retrying 1.3.3
    scipy 1.5.4
    setuptools 44.1.1
    six 1.16.0
    sqlparse 0.4.1
    testfixtures 6.14.0
    tornado 6.1
    urllib3 1.25.11
    vine 1.3.0
    xlrd 1.2.0
