Median values for charts

Issue #241 resolved
David Platten created an issue

It would be useful if median values were available on charts.

Unfortunately django aggregation doesn't include median, as it isn't included in the underlying database. It should be possible to add a median class to Aggregate, but I haven't worked out how to do it yet.

Comments (77)

  1. David Platten reporter

    This is something that other users of django have wanted to do. I can't find a definitive solution. http://stackoverflow.com/questions/942620/missing-median-aggregate-function-in-django has a function that will calculate the median given a queryset and the field to calculate the median on, but it doesn't work within an aggregate or annotate call:

    def median_value(queryset,term):
        count = queryset.count()
        return queryset.values_list(term, flat=True).order_by(term)[int(round(count/2))]
    
  2. David Platten reporter

    There's a 2009 discussion in the Google Groups Django users group about custom aggregate objects: https://groups.google.com/forum/#!topic/django-users/vVprMpsAnPo. It includes an example of implementing a MyMax user-defined aggregate function. Not sure if the example given there will work with django 1.6 that is currently being used by OpenREM. If it does then it may be possible to use the technique to implement a median function.

  3. David Platten reporter

    I have tried the above method, but the median for the DAP values comes out as zeros. It works for kVp and uAs - I think there's a problem with the calculation method when the values are all less than 1.

    I've also tried removing the database functions and aggregates above, and using just a single one from here: https://wiki.postgresql.org/wiki/Aggregate_Median. This works for kVp and uAs, but not for the DAP. Again, I suspect the problem is with all the DAP values being less than 1.

  4. David Platten reporter

    Hi @edmcdonagh. The database functions I'm looking at to begin with are probably postgresql-specific, but it should be possible to come up with a database-neutral version once I've worked out how to get it working properly.

    It would also be good if I can find a way of automatically adding the required database function and aggregate without the user having to log in to their database and manually add it.

  5. David Platten reporter

    The median values being returned for the DAP are being rounded to integers, and I can't see why. I have been able to obtain median values by making the database function multiply the values by 10^10. This ensures that all decimal places in the stored DAP values are moved to the left of the decimal point. The returned values will have to be divided by 10^10 to make them correct in views.py

    CREATE FUNCTION _final_median(anyarray) RETURNS float8 AS $$ 
      WITH q AS
      (
         SELECT val
         FROM unnest($1) val
         WHERE VAL IS NOT NULL
         ORDER BY 1
      ),
      cnt AS
      (
        SELECT COUNT(*) AS c FROM q
      )
      SELECT (AVG(val)::float8) * 10000000000
      FROM 
      (
        SELECT val FROM q
        LIMIT  2 - MOD((SELECT c FROM cnt), 2)
        OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) - 1, 0)
      ) q2;
    $$ LANGUAGE sql IMMUTABLE;
    
    CREATE AGGREGATE median(anyelement) (
      SFUNC=array_append,
      STYPE=anyarray,
      FINALFUNC=_final_median,
      INITCOND='{}'
    );
    
  6. David Platten reporter

    Added new django aggregate function to calculate median. The database needs to have a median function added (included in medianAggregateSQL.txt) - this is postgresql-specific at the moment. References issue #241.

    → <<cset 1a436ab72cc2>>

  7. David Platten reporter

    Added median data as a second series to the plot of acquisition DAP per protocol. Renamed the chart option to 'mean and median' rather than just 'mean'. References issue #241.

    → <<cset e6ac8cfa6b7c>>

  8. David Platten reporter

    It looks like this method of adding a new aggregate function won't work for MySQL or sqlite. MySQL can have additional aggregates added, but from a brief search it appears they have to be written in C or similar - it certainly doesn't let you use SQL.

  9. David Platten reporter

    Added postgresql-specific SQL back as it seems that other databases won't be able to use this. I've removed the float8 conversions, as it seems to be faster without. References issue #241.

    → <<cset 5fe3b0273d21>>

  10. David Platten reporter

    @edmcdonagh, I'd really like to have median values on the charts. I'm pretty sure that there is some European patient dose audit guidance coming out soon that is going to say that medians should be used rather than means. However, I don't like the fact that my current solution is specific to PostgreSQL. Do you have any suggestions as to how medians could be calculated in a database-neutral way?

  11. Ed McDonagh

    From following your investigations, it would seem the pragmatic path to follow would be to recommend PostgreSQL (as django do, at least to some degree), and if you have Postgres you will have medians, and if you don't, then you won't (it will need to fail gracefully). If you don't, then a simple export to Excel will sort you out.

  12. David Platten reporter

    Thanks for the reply @edmcdonagh. We could have a system-wide flag that is set to 1 if it is a PostgreSQL server, and zero if it is anything else. The flag could then be checked to see whether to try the median calculations or not. Something like this:

    if 'postgresql' in settings.DATABASES['default']['ENGINE']:
        median_available = 1
    else:
        median_available = 0
    

    I think that this may have to be stored in the user profile, because the html templates need to know whether they should include JavaScript that expects medians or not. Something like this could then be used in the html templates:

    {% if request.user.userprofile.median_available %}
      HTML and JavaScript code that includes plotting of median data
    {% else %}
      HTML and JavaScript code that does not include plotting of median data
    {% endif %}
    
  13. David Platten reporter

    Added DAP per acquisition JavaScript chart files for mean, mediand and combined plot. Updated dxfiltered.html to point to the combined one. References issue #241.

    → <<cset a13bfcc85b76>>

  14. David Platten reporter

    Added new userProfile field called median_available, set to False by default and cannot be edited by the user. Added code to views.py to check status of median_available and also if PostgreSQL is being used in order to determine if median values can be calcualted. Pass the result to the dx plot calculations. References issue #241.

    → <<cset d8f8b7d04456>>

  15. David Platten reporter

    Added user choice to plot either mean, median or both. Incorporated this option into views.py and dxfiltered.html so that the required DAP per acquisition protocol plot is produced. References issue #241.

    → <<cset 3a814629f53e>>

  16. David Platten reporter

    Added code to create a median acquisition DAP over time plot. The median is used if the user has selected 'median' or 'both' in the 'Averages' choice. A plot with median and mean would be too busy. References issue #241.

    → <<cset 41c5b118675a>>

  17. David Platten reporter

    I have just installed OpenREM 0.7.0b1 using SQLite as the database and then populated it with several hundred fake radiographic studies. The plots etc all worked as they should. I then copied my median updates over, ran a schemamigration and restarted the server: all still worked fine. As expected, the "Average" choice drop-down does not appear, and medians are not (indeed cannot) be plotted. So it seems my code does fail gracefully when a database other than PostgreSQL is being used.

  18. David Platten reporter

    Initial experience of upgrading to Django 1.8 is that it works fine. I looked at https://docs.djangoproject.com/en/1.8/topics/migrations/#upgrading-from-south and https://docs.djangoproject.com/en/1.8/howto/upgrade-version/ and then did:

    1. pip install -U Django

    2. Commented out south from the INSTALLED_APPS section of the settings.py file

    3. Deleted all the numbered migration files in the migrations folder, including the .pyc files

    4. python manage.py makemigrations

    5. python manage.py migrate --fake-initial

    6. python manage.py runserver --install

    OpenREM is now running with Django 1.8.

    The above tests were carried out on a 0.7.0b1 system with an SQLite backend containing a few hundred radiographic studies. I haven't yet tested it with PostgreSQL.

    The only complaint I've seen is from the command-line server with the following message, relating to a depreciated bit of code in the pagination package:

    ...\site-packages\django\core\handlers\wsgi.py:126:
    RemovedInDjango19Warning: MergeDict is deprecated, use dict.update() instead.
      self._request = datastructures.MergeDict(self.POST, self.GET)
    
  19. Ed McDonagh

    That's really good news. So different from when I tried last time, before the painful model renaming!

    The next thing to understand would be the upgrade from 0.6 to 0.7 with a corresponding upgrade from 1.6 to 1.8, plus the additional models, your data migration for the time of day, and the sql thing from this issue...

  20. David Platten reporter

    I've just tested upgrading Django from 1.6 to 1.8 with a system that uses PostgreSQL. The database has 7000+ CT and 18000+ Radiographic examinations. The OpenREM code was already up-to-date with this issue241MedianValues branch, so this is only a test of whether Django 1.8 works OK with PostgreSQL and my custom median calculations. It worked, using the same method as above.

  21. David Platten reporter

    Mean and median were being calculated when just median selected. Altered code so that only required averages are calculated, saving time. References issue #241.

    → <<cset 3faafb1a56c6>>

  22. David Platten reporter

    Added Django 1.8 migration file to add median function to PostgreSQL database. Not sure what will happen if you're using something else. References issue #241.

    → <<cset 73c90733c02c>>

  23. David Platten reporter

    Added a check for PostgreSQL use in the median migration file so that the median functions are only executed if PostgreSQL is being used. Untested. References issue #241.

    → <<cset d8c15b56f2bb>>

  24. David Platten reporter

    Commit 0f927c4 has a working migration file for use with a version 0.6 to 0.7 upgrade.

    0.7 will install Django 1.8. The user must:

    1. Manually remove / rename all numbered migration files in their migration folder

    2. Run python manage.py makemigrations

    3. Run python manage.py migrate --fake-initial

    4. Rename the 00xx_python manage.py file to 0002_python manage.py

    5. Run python manage.py migrate to apply the migrations

    6. Finally run python manage.py makemigrations then python manage.py migrate.

    You can now restart the server.

  25. David Platten reporter

    I've just tested this using an SQLite-based 0.6 install of OpenREM, populated with some test data. The upgrade to Django 1.8 and the above migration works for this too. There is no median function available with the SQLite install, as expected. No errors were shown.

  26. Ed McDonagh

    Should step 4 be 4. Rename the 00xx_add_median_function.py file to 0002_add_median_function.py?

    If so, is this necessary for step 3. to work? Or could the file be named 0002 in the first place as we are deleting all previous migrations so we know this is the first post initial one?

  27. David Platten reporter

    Hi @edmcdonagh, I've just checked. You can't have any numbered .py files when the first makemigrations is run - you receive a screen full of errors if the new 0002 file is present. However, the simplified list of commands below works fine:

    Manually remove / rename all numbered migration files in their migration folder. Leave the new 0002_add_median_function.py.unactive file there.

    1. Run python manage.py makemigrations

    2. Run python manage.py migrate --fake-initial

    3. Rename the 0002_add_median_function.py.unactive file to 0002_add_median_function.py

    4. Run python manage.py makemigrations

    5. Run python manage.py migrate to apply the migrations

    You can now restart the server.

  28. David Platten reporter

    Hi @edmcdonagh - it would be good to put a build on there once the Median branch has been merged back into develop - are you OK to do that?

  29. Ed McDonagh

    Added new django aggregate function to calculate median. The database needs to have a median function added (included in medianAggregateSQL.txt) - this is postgresql-specific at the moment. References issue #241.

    → <<cset 1a436ab72cc2>>

  30. Ed McDonagh

    Added median data as a second series to the plot of acquisition DAP per protocol. Renamed the chart option to 'mean and median' rather than just 'mean'. References issue #241.

    → <<cset e6ac8cfa6b7c>>

  31. Ed McDonagh

    Added postgresql-specific SQL back as it seems that other databases won't be able to use this. I've removed the float8 conversions, as it seems to be faster without. References issue #241.

    → <<cset 5fe3b0273d21>>

  32. Ed McDonagh

    Added DAP per acquisition JavaScript chart files for mean, mediand and combined plot. Updated dxfiltered.html to point to the combined one. References issue #241.

    → <<cset a13bfcc85b76>>

  33. Ed McDonagh

    Added new userProfile field called median_available, set to False by default and cannot be edited by the user. Added code to views.py to check status of median_available and also if PostgreSQL is being used in order to determine if median values can be calcualted. Pass the result to the dx plot calculations. References issue #241.

    → <<cset d8f8b7d04456>>

  34. Ed McDonagh

    Added user choice to plot either mean, median or both. Incorporated this option into views.py and dxfiltered.html so that the required DAP per acquisition protocol plot is produced. References issue #241.

    → <<cset 3a814629f53e>>

  35. Ed McDonagh

    Added code to create a median acquisition DAP over time plot. The median is used if the user has selected 'median' or 'both' in the 'Averages' choice. A plot with median and mean would be too busy. References issue #241.

    → <<cset 41c5b118675a>>

  36. Ed McDonagh

    Mean and median were being calculated when just median selected. Altered code so that only required averages are calculated, saving time. References issue #241.

    → <<cset 3faafb1a56c6>>

  37. Ed McDonagh

    Added Django 1.8 migration file to add median function to PostgreSQL database. Not sure what will happen if you're using something else. References issue #241.

    → <<cset 73c90733c02c>>

  38. Ed McDonagh

    Added a check for PostgreSQL use in the median migration file so that the median functions are only executed if PostgreSQL is being used. Untested. References issue #241.

    → <<cset d8c15b56f2bb>>

  39. David Platten reporter

    Just tried upgrading from 0.6 to 0.7.0b2.

    makemigrations failed at the first attempt because 'south' hasn't been commended out in settings.py. Once I commented that out it worked as expected.

    There may need to be a migration file for fresh 0.7 installations that creates the PostgreSQL database functions.

  40. David Platten reporter

    Hi @edmcdonagh. I've never liked my implementation of the PostgreSQL median function. I specifically don't like the fact that the values that it returns are 10,000,000,000 larger than they should be.

    This morning I've removed this "feature" from my test install, and everything, including the calculation of the median of values that are all less than 1.0, works fine. I'm going to commit my changes to the custom database function, and also the necessary changes to views.py, but will wait until the #332 branch is merged back into develop.

  41. David Platten reporter

    There are no implications for non-PostgreSQL databases.

    Users who are already using the median function (anyone using beta versions of 0.7) will have to reapply a bit of SQL that will replace their existing median function with an updated one. This can be done via a migration file.

    Users who are still on 0.6 won't be affected, as they don't have the median function yet. The migration file for these users includes the creation of the database median function, which can be adjusted so that it doesn't include the *10000000000 bit.

    What do you think?

  42. David Platten reporter

    Added updated median function for users that are already using the median database function. Removed need for median values to be multiplied by 10,000,000,000. References issue #241

    → <<cset 343fcc2e3fed>>

  43. David Platten reporter

    @edmcdonagh, the djp demo site is up-to-date with the current develop branch, including the revised median database function.

  44. David Platten reporter

    Case-insensitive setting: removed unnecessary annotation of the QuerySet; doing this in the Pandas DataFrame instead. I've also just realised that we don't need the database median function anymore, as the averages are calculated by Altair: medians will now work with any database (although I've not verified that this works). References issue #477 and #241

    → <<cset af235ae18b13>>

  45. Log in to comment