New table of study summary information

Issue #759 resolved
Ed McDonagh created an issue

As suggested by @dplatten it would simplify things enormously when doing filtering in the web interface if as many database calls as possible were to just one table.

There are various things we need to think about:

  1. If the data already has a place in one of the other tables, as determined by relevant DICOM template, that would remain and a copy of that data would be added to the summary table. Agree/Disagree?
  2. Some summary data is not currently held, rather it is computed each time it is needed. For example the number of Axial events in a study. This could be computed on import (or update) and stored in the table.
  3. Every study starts with a GeneralStudyModuleAttrtable. The new table could have a one-to-one relationship with that table, or it could simply be extra fields in that table. Thoughts?
  4. When do we do it? Do it now, and fix the Specific Event CT filtering vs charts bug by simplifying the event filtering code, or put it aside for the next major release and battle through to find the current bug?
  5. What goes in the table… some ideas:
number_of_events = models.IntegerField(blank=True, null=True)
number_of_axial = models.IntegerField(blank=True, null=True)
number_of_spiral = models.IntegerField(blank=True, null=True)
number_of_stationary = models.IntegerField(blank=True, null=True)
number_of_const_angle = models.IntegerField(blank=True, null=True)
total_dlp = models.DecimalField(max_digits=16, decimal_places=8, blank=True, null=True)
total_dap = models.DecimalField(max_digits=16, decimal_places=8, blank=True, null=True)
total_rp_dose = models.DecimalField(max_digits=16, decimal_places=8, blank=True, null=True)
total_agd_left = models.DecimalField(max_digits=16, decimal_places=8, blank=True, null=True)
total_agd_right = models.DecimalField(max_digits=16, decimal_places=8, blank=True, null=True)
total_agd_both = models.DecimalField(max_digits=16, decimal_places=8, blank=True, null=True)  # for legacy

We could also consider adding in some event level data such as the acquisition protocol, but that would have to be some sort of a list as a string, so probably not worth it.

Feedback please @dplatten , @LuukO , anyone else reading this!

Comments (126)

  1. David Platten

    I appreciate that the current database is set up to reflect the DICOM templates, but it is far from ideal when it comes to accessing and processing the data in OpenREM, particularly for charts. I think there is a good argument for migrating to a flatter database arrangement that is designed to provide high performance, rather than the database structure reflecting the DICOM templates.

    I’m thinking of a flat study-level table that contains everything that is study-level including patient-related data like ID, age, gender (if we collect that), height and weight. I’d also like a flat acquisition-level table that contains everything to do with each acquisition, avoiding the need to find things like kVp values elsewhere. We could have an acquisition-level table per modality.

    One of the problems with my above suggestion is that I am not a database expert, and am not certain that the above would result in faster queries. What I do know is that producing the data for the current charts is challenging with the current database structure, and would be greatly simplified with a flatter database structure.

  2. Ed McDonagh reporter

    I am not a database expert either, and I wish we had one to help us!

    I think there probably is quite a lot we could put in fewer tables. But the problem is always when there is, or can be, multiplicity. For example with the kVp example, it is almost always just the one kVp per exposure, but some devices detail the kVp of every single pulse in a fluoroscopy exposure.

    A more common example is where there can be two planes in fluoro (and x-ray it turns out) and left and right are separate in mammo. For these, if we can assume it is only ever one or two, then maybe we can handle that by duplicating the fields in the same table. Not sure.

    From the little I know about databases, having a flatter structure with fewer one-to-many lookups would certainly be faster, as well as much easier to code!

  3. Ed McDonagh reporter

    Not sure what conclusion to draw from those articles… flatter better to program with but leading to compromise and trouble down the lane; normalised better to load each table (thinner), and more flexibility.

    I don’t think it is simple! I am sure we can do better than what we currently have.

  4. Ed McDonagh reporter

    Added summary fields as proposed to GeneralStudyModuleAttr, modified rdsr to populate number_of_events, tested with CT and RF. Refs #759. As per branch name, this is an experiment to see how to best proceed.

    → <<cset b82543434830>>

  5. Ed McDonagh reporter

    Mocked up a view to populate an existing database with summary information as a more friendly alternative to trying to do a data migration. Refs #759

    → <<cset d807d04d3e51>>

  6. Ed McDonagh reporter

    @David Platten - looking at the display of DAP and RP doses over delta weeks, do you only consider the A plane in a bi-plane system?

  7. Ed McDonagh reporter

    @David Platten I think that in this line in rdsr.py: rdsr.py#lines-1521 we look at both planes of a bi-plane system separately and get the same answer, so we record the pk of each study in the time period twice.

    In the detail view, the count ignores that table and therefore gets the right answer (one study): views.py#lines-853

    In the filtered view we use rffiltered.html#lines-146 to get that table then rffiltered.html#lines-233 to count the number of studies, hence the wrong answer (two studies) for bi-plane systems.

  8. Ed McDonagh reporter

    @David Platten It adds up both planes of DAP and RP Dose as desired. Just the number of studies reported in the filtered view is wrong. I think.

  9. Ed McDonagh reporter

    @David Platten @Luuk I've migrated a copy of my production database using the new fields in GeneralStudyModuleAttr and created identical servers with the pre and post migrated databases to time the differences in rendering each of the modality filtered pages.

    The migration itself, the process and user interface needs a little more work, and a decision as to whether to run the migration as one task per modality or one task per study - on my limited VM on my laptop that keeps crashing it took days to do the CT…

    In terms of number of studies in the database:

    CT Fluoroscopy Mammography Radiography
    217114 2012 83584 55787

    I’ve loaded CT then RF etc for both servers at the same time, and used the Chrome Network tools to record the wait and load times for each. The view is set to 200 studies. I have then repeated this four times, and below you can see the mean times and the proportion of one over the other.

    I still don’t like how slow the system is even with the summary fields, but I’m pleased with the relative increase in speed. I’d like to push on with this to have it in the next release.

    What do you think of the approach? I’ll need to tidy up a bit before I can suggest that you guys try it!

  10. David Platten

    @Ed McDonagh it’s great to see the speed increase in the filtered views. However, I’m more interested in whether the summary fields help to speed up chart data calculations…

  11. Ed McDonagh reporter

    For CT

    I am confident that the following will be faster if we use the summary data:

    • DLP per study
    • Events per study
    • Study DLP over time

    I think the others (except maybe DLP per requested procedure?) make use of event level data or have nothing to gain from new summary data?

    For Fluoro

    • DAP per study

    For Mammography

    • None of the current charts

    For Radiography

    • DAP per study

    Is that hopeful or depressing @David Platten ? From reading the links you sent about normalised vs de-normalised (flatter) databases, I am not convinced that trying to record acquisition information at study level can work in a maintainable way.

    It might be worth though having ‘average’ summary values at event level? That way we can bring kV etc into that table which would reduce the need to to handle units that record kV per pulse within an exposure for example.

  12. David Platten

    @Ed McDonagh that’s good with me. I’d like to try out the summary table with one of the charts. Should I wait for you to tidy up the migration process before I try?

  13. Ed McDonagh reporter

    Do you want to test it with summary data just at study level (as implemented), or possible event level summary data as mentioned in the final sentence?

  14. Ed McDonagh reporter

    Great. Let me tidy up a bit. I’ll try not to let it drag on. I think I’m going to go with the ‘one task per study’ option rather than the ‘one task per modality type’ option.

  15. Ed McDonagh reporter

    Added upgrade status singleton to db model, set automatically when each modality complete, then doesn't update on homepage anymore. Refs #759

    → <<cset 46b840200774>>

  16. Ed McDonagh reporter

    Hi @David Platten - can you try this branch out now please? A migration is required of course. Other than that, it should be self-explanatory. You will need to restart Celery to pick up the new tasks.

  17. David Platten

    The RF progress bar has stopped moving. The Celery log contains the following:

    [2019-07-30 11:06:24,005: ERROR/MainProcess] Task remapp.tools.populate_summary.populate_summary_rf[bb5877d4-17ca-48c9-95b1-f7ac6776bfa7] raised unexpected: IndexError('list index out of range',)
    Traceback (most recent call last):
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\celery\app\trace.py", line 240, in trace_task
        R = retval = fun(*args, **kwargs)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\celery\app\trace.py", line 438, in __protected_call__
        return self.run(*args, **kwargs)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\openrem\remapp\tools\populate_summary.py", line 173, in populate_summary_rf
        populate_summary_study_level('RF', study.pk)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\celery\local.py", line 188, in __call__
        return self._get_current_object()(*a, **kw)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\celery\app\trace.py", line 439, in __protected_call__
        return orig(self, *args, **kwargs)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\celery\app\task.py", line 428, in __call__
        return self.run(*args, **kwargs)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\openrem\remapp\tools\populate_summary.py", line 29, in populate_summary_study_level
        populate_dx_rf_summary(study)
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\openrem\remapp\extractors\extract_common.py", line 172, in populate_dx_rf_summary
        accum_int_a = planes[0].accumintegratedprojradiogdose_set.get()
      File "c:\pythonvirtualenvs\openrem-dev\lib\site-packages\django\db\models\query.py", line 201, in __getitem__
        return list(qs)[0]
    IndexError: list index out of range
    

    The MG progress bar has also stopped progressing (almost at the end - 49 studies to go). The Celery log contains a series of lines like this, each with a different Study UID:

    [2019-07-30 11:06:34,315: WARNING/Worker-1] Study UID 1.2.840.113619.redacted. Unable to set summary total_agd values
    

    The number of entries of this type corresponds to the number of studies left on the progress bar.

  18. Ed McDonagh reporter

    Blast. Had an answer written then pressed buttons without this box in focus and lost it all!

    First issue indicates you have an RF study with no planes in it, or no accumxraydose_set. I had an IndexError catch for the second one (as most are single plane), but not the first. I have now added such a catch if you’d like to check out, stop all the tasks and run it again.

    The second error either indicates that you have mammo studies with no projectionxrayradiationdose_set.get().accumxraydose_set.get().accummammographyxraydose_set or something else is tripping up. Are you able to find some of the studies and have a look at what is there?

  19. Ed McDonagh reporter

    By the way, it should be fairly safe to kill and restart the tasks (and purge the queue) - they should restart where they have left off and redo any that were half done - the check for being done is the last field that is populated.

  20. David Platten

    I just tried to commit the “IndexError” catch, and have just realised you’ve already updated it. I’ll stop my updates and try again with the fix.

  21. David Platten

    Added an IndexError exception to catch the error thrown when plane[0] data does not exist in an RF study. I haven't tested this yet as I am waiting for my DX migration to complete (220000 studies to go...). References issue #759.

    → <<cset d968b7b869dc>>

  22. David Platten

    Added further IndexError exception to catch error thrown if there are no accumulated x-ray dose sets present in a study. My RF migrations are now progressing. References issue #759

    → <<cset cddb6475bc8d>>

  23. David Platten

    Some of my mammo studies have empty accumulated AGD. They are “QA-raw uniformity” studies. Not sure how to make the code skip over these and still increment the progress.

  24. David Platten

    I think my RF migration is now complete, but the progress bar is 10 studies short. I think these are the ones with the missing accumulated data. So, as for the mammo, how do we make the progress bar increment for these?

  25. David Platten

    I’m now looking at using some of the study summary data in the charts. The CT study DLP chart was initially broken due to None type appearing in the series name list. I’ve fixed this.

    Some timings:

    CT DLP per study: 3.0 s vs. 5.5 s

    CT # events per study: 2.7 s vs. 5.4 s

    So around a 50 % reduction.

    Can ctradiationdose__ctirradiationeventdata__mean_ctdivol be put into the summary table?

  26. David Platten

    Scrap my request earlier - I now realise that the mean_ctdivol value is per event, not per study. Sorry - being dim.

  27. David Platten

    Now on to the DX charts.

    Using the summary total_dap_a instead of projectionxrayradiationdose__accumxraydose__accumintegratedprojradiogdose__dose_area_product_total

    in the DAP per study type reduces the time from 22.5 s to 6 s.

  28. David Platten

    Hmm. What should a chart of median DAP per study for a bi-plane fluoro system show? Is it total DAP from the whole study? I think so, in which case I’d like a “total_dap” field in the summary table that is the sum of total_dap_a and total_dap_b. It is likely to be the same argument for a dual-plane DX system - at the moment I’ve used total_dap_a for the DX plots, but I suspect that this is incorrect if you have a dual-plane system.

  29. David Platten

    Changed chart code to take advantage of the summary data available at study level. Replaced None with empty string in series names for bar chart of average data. Added some elapased time print statements (to be removed) later. Used total_dap_a for total DAP in DX charts (this isn't really correct for the bi-plane system we know of). Removed most of the prefetching from the CT chart routine as I believe this is pointless if the data is in the main table. Still to do: RF charts, as there is currently no total_dap field. References issue #759

    → <<cset 758bd45adc12>>

  30. Ed McDonagh reporter

    I should add, my laptop is currently out of use - repeated BSODs, attempted Windows reinstall today, no change, so Dell should be coming out to change the motherboard and SSD tomorrow. Until then (and until I have rebuilt everything), I won’t be able to easily do any coding!

  31. Ed McDonagh reporter

    If you made the changes to models.py and to extract_common.py, you might like to run a little function in a manage.py shell to reset all the RF and DX studies so that you can run the migration again:

    from django.db.models import Q
    from remapp.models import SummaryFields, UpgradeStatus, GeneralStudyModuleAttr
    
    dxrf = GeneralStudyModuleAttr.objects.filter(Q(modality_type__exact='DX') | Q(modality_type__exact='CR') | Q(modality_type__exact='RF'))
    for study in dxrf:
        study.number_of_events_a = None
        study.save()
    
    upgrade_status = UpgradeStatus.get_solo()
    upgrade_status.from_0_9_1_summary_fields = False
    upgrade_status.save()
    
    rf_status = SummaryFields.objects.get(modality_type__exact='RF')
    rf_status.complete = False
    rf_status.save()
    
    dx_status = SummaryFields.objects.get(modality_type__exact='DX')
    dx_status.complete = False
    dx_status.save()
    
  32. David Platten

    @Ed McDonagh I’ll give your suggestions a try at some point - not able to do much coding for the next week or so either.

  33. Ed McDonagh reporter

    @David Platten - changes made, works with the Tosh bi-plane study and doesn’t cause errors with the single plane studies.

  34. Ed McDonagh reporter

    Merged in issue759AsManyTasks (pull request #315)

    Fixes #759 Summary fields in GeneralStudyModuleAttr, Refs #768 Update charts to use new study level summary fields

    Approved-by: David Platten

    Thanks @dplatten and @LuukO for reviewing (and adding to!) I'm leaving the branch open as you are working in it David - not sure if you want to start a new branch to finish ref #768 off, from develop?

    → <<cset 890828d285b3>>

  35. Log in to comment