Simple visualizations of near-current opioid data.
Near current? Yes, I am pulling data from the CDC's WONDER (Wide-ranging Online Data for Epidemiologic Research), which contains Multiple Cause of Death files to 2016. Although the Division of Vital Statistics has reported the 2017 figures (see https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm), these have not made it into WONDER yet.
The analysis is pretty straightforward in R's tidyverse.
- The file prep only requires (1)
filterto remove the data's embedded totals and (2)
selectto get the variables of interest
- A single facet-wrapped plot. Note: I'm not using qplot; I want to get the full flexibility of ggplot.
But note! Gah, the tidyverse is still not quite there! Unlike pandas, which
can skip headers and footers,
readr can only skip headers. And I found a
dsicussion on some tidyverse support board that said this would be forever so,
as one can reasonably hack it using some pipeline that involves read_lines and
filters; also it would be kinda messy to fix properly. Hmm. I think it is fair
to say that R != python.
It is immediately obvious that NM has many counties with small populations. These are the blank graphs in the faceted display: their populations are so low that the data is suppressed. Of the 5 that have data, Rio Arriba stands out. It has a mortality rate of 86.3 per 100,000 standard population in 2016. That is crazily high: the national average is 19.8. Española is largest city in Rio Arriba and it is famous for multi-generational opioid abuse.
It is going to be interesting to see what the 2017 figues are like. According to the email I got back from the CDC yesterday, those will be released in WONDER later this week.
I'm mainly interested in this because of my upcoming call with the Community Data Platform people on Nantucket. Nantucket is a small county and so its results are suppressed. Barnstable (Cape Cod), Bristol (Taunton), and Plymouth counties are bad: 50 and above
If I want to avoid blanks in my plots, I probably need to look at states. I
viz_state.R to get
Wow. Poor West Virginia. Though several other states show horribly steep curves (OH, PA, NH, KY).
Take a look at WV's counties. Or look at top-20 counties in US. Also, it might be interesting to see whether these are particularly strong on fentanyl and other synthetic opioids. (Actually given that fentanyl deaths are a comparaively recent phenomenon, I might hold the latter till I have the 2017 figures.)
Worst counties in US
I tweaked the US plot:
- Use crude rate instead of age-adjusted rate. Crude rate is easier to explain, and the differences between the two rates is small.
- Filter down to the 12 counties with the highest average death rate (measured over last 3 years)
- Color by state
- Size points by population
Latter shows that although NM and WV have the highest rates, the actual numbers of people impacted are massive in OH and MD.
I shall put this on hold till I get the 2017 data.
Where is the data
Data does not belong in the git repository. It can be obtained at the link above. I use the same ICD-10 codes to identify opioid deaths as are used in the CDC's own reports (see Hedegaard et al "Drug Overdose Deaths in the United States, 1999–2017"), specifically:
X40 (Accidental poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics)
X41 (Accidental poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified)
X42 (Accidental poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified)
X43 (Accidental poisoning by and exposure to other drugs acting on the autonomic nervous system)
X44 (Accidental poisoning by and exposure to other and unspecified drugs, medicaments and biological substances)
X60 (Intentional self-poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics)
X61 (Intentional self-poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified)
X62 (Intentional self-poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified)
X63 (Intentional self-poisoning by and exposure to other drugs acting on the autonomic nervous system)
X64 (Intentional self-poisoning by and exposure to other and unspecified drugs, medicaments and biological substances)
X85 (Assault by drugs, medicaments and biological substances)
Y10 (Poisoning by and exposure to nonopioid analgesics, antipyretics and antirheumatics, undetermined intent)
Y11 (Poisoning by and exposure to antiepileptic, sedative-hypnotic, antiparkinsonism and psychotropic drugs, not elsewhere classified, undetermined intent)
Y12 (Poisoning by and exposure to narcotics and psychodysleptics [hallucinogens], not elsewhere classified, undetermined intent)
Y13 (Poisoning by and exposure to other drugs acting on the autonomic nervous system, undetermined intent)
Y14 (Poisoning by and exposure to other and unspecified drugs, medicaments and biological substances, undetermined intent)
I have been reading Knaflic "Storytelling with Data", and that has led to some tweaks to my plots. I can't go full SWD because at the moment I don't really know what story I am telling. Perhaps it is something to do with issues of rates vs absolute counts. This matters because I suspect that crisis response does not scale in the same way that population scales. If county A has 10 times the population of county B, I'm guessing (and it is a guess) that the number of deaths at which local services in A get overloaded is not 10 times the threshold in B but substantially less.
Anyway, some minor cleaning up on the plots
Make the titles bigger and clearer
Point sizes are a function of deaths (which is what the story is about) not population.
2017 data from CDC
The CDC have put the 2017 data into WONDER. I pull the new data. In doing so I discover that there is actually a radio button that automatically selects the ICD-10 codes for opioid deaths. So no more need to remember X40-44 etc.
I rerun the "wrotsy counties" analysis. Rio Arriba slips to #2. Baltimore gets worse. A county in WV has an astronomical rate of 160 per 100,000.
The folks at FRIAM had expressed an interest in this, so I compose a post on my blog.
I would like to create some choropleths but I am not up to speed on R's mapping abilities. I spend the morning tinkering. I think I am ready to try in earnest, but my project directory is getting messy. I think it is time for me to actually use RStudio's project functionality. (That's my next task.)
OK, so I have some reasonable choropleths. Matching the map data to the CDC was a problem as the only common field is county name and its format is different in the two sources. Converting one to the other proved difficult in the tidyverse pipeline (I was bumping up against dplyr's "quosures") and so I end up using sapply().
Got myself a neat map of MA. Barnstable is surprisingly bad. I thought it was wealthy and problem free? By the way, this is the first time I noticed a big gap between the age-adjusted rate (50 per 100k) and the crude rate (40 per 100k). Is that a reflection of old/retiree communities?
Next: add some town names. Or put it on OpenMap.
Next steps, revised
Yes, it would be interesting to get Open Map data, particularly as it can localize the map nicely without adding too much visual noise. However I have a couple of other things I want to look at too:
Wikipedia's page on chloropleths briefly mentioned a color scheme that is amenable to bivariate visualization. What? how is that possible? I need to investigate and trial. Perhaps look at an obesity dataset too? Ed (Angel) has already commented on a possible link (admittedly based on an n of 1).
sf or sp for geocomputation in R? Former is newer and (I think) is better integrated with the tidyverse; however some legacy packages require sp. I think I shall use sf and worry about legacy issues as they arise
leaflet for interactive maps?
And what about python? And what about putting some of these maps in dashboards?
First step: stick to R. Use sf (and tmap?) to create a lower-48 map of counties. Switch between obesity (in the tmap demo) and opioids.
Second step: try datashader and plotly in python.
Court Case: Rio Arriba vs Purdue et al
Tom Johnson brought this case to my attention. See
NOTES.txt in the Medicare
Martinez: "Continued Improvement in Drug Overdose Death Rankings"
So NM has fallen from #2 in 2014 to #17 in 2017 says DOH. Really? I think some of those numbers might be cherry-picked. Let's go see.
OK, so I cannot duplicate the numbers using my 2017 data pull. It looks like my assumption that the radio button that I thought selected the ICD-10 codes for opioids (see 12/07 notes) doesn't. When I do it longhand (see 12/05) I get numbers that correspond (i) with the CDC report and (ii) the raw rankings that the press release quotes (though the latter does require me to omit DC from the rankings).
However the claims a couple of paras on seem sketchy: how can NM get 4% reduction in natural and semisynth, 9% reduction in heroin, and 6% reduction in synth, when the overall AAR has only gone down from 25.2 to 24.8, i.e. 1.6%? (Thiese are 2017 vs 2016). OK, I'm going to have to pull the raw numbers from WONDER.
Later that day...
Gah! Getting the data to match is a problem! I am getting confused between "underlying cause of death" (one only) and "multiple causes of death" (up to 20). Take a break, have a cuppa, and then lay out the exact WONDER queries that enables me to duplicate the numbers in the CDC report (Data Brief 329).
Table 1: simple counts (e.g. 2017 deaths = 70,237)
- From section #6 "underlying cause of death" Select "UCD - ICD-10 Codes" X40-44, X60-64, X85, Y10-Y14 - Do not select anything under section #7 "multiple causes of death"
OK, while that is fresh in my mind I shall run this for state and produce
- "Multiple Cause of Death, 1999-2017, by state.txt"
Table 2: don't care.
Table 3: this is in two parts because of the double accounting problems!
- Start with the underlying COD selection as in table 1 and add multiple COD, specifically T40.0-40.4, T40.6 - Running without any additional grouping gives the results in the table's "any opioids" column - Running with "multiple cause of death" grouping gives the breakdowns by heroin, synthetic, etc. **NOTE** double accounting means that the sum of these breakdowns exceeds the value in the "any opioids" column!
OK, I'm only going to run this query for NM. I'm not really interested in this info for other states. The two files I get are
- "Multiple Cause of Death, 1999-2017, NM any opioid.txt" - "Multiple Cause of Death, 1999-2017, NM by opioid.txt"
Table 4: don't care
OK, let's rerun my latest script,
The top-line claim "NM is getting better" is true. Yes NM is ranked #17. However the breakdown into fentanyl, heroin etc is a little sketchy: these come from the "multiple cause of death" fields (up to 20 of them) on a death certificate so there's lots of double accounting. Unusually though, I'm getting fewer deaths in aggregate on these fields. Which is odd, because NM is meant to be good at filling these fields in (I saw some old CDC report that had the coverage rate at ~98%). So something is obviously adrift and I've not resolved what
I shall dig into this further after the New Year. It's going to be a chance to play with RMarkdown too.
But first I want to tinker with Tableau Public. OK, re-pull the above data, omitting the suppressed records. (Why? because Tableau Public has limits on the number of records and I don't want to use up my limit on stuff that I will immediately remove).
NOTE: I finally resolved the issue of getting 73,990 opioid deaths in 2017 instead of 70,237: when I choose the "drug-induced causes" button in WONDER I must omit "all other drug induced causes". The latter pulls in some non-opioid deaths in which I am not interested.
By the end of the day I've got a few plots: a table of death rate by year and state and "bump" plots of states' ranks over time. Also maybe a "worst county" plot, though I'm not sure this has saved to the public server (too big?). I've also remembered why I can't stand Tableau. It's too hard to do some simple stuff: rank in particular.
Yeah Tableau Public has real issues with a large dataset. I actually have to pull out the county data if I want the workbook to display online.
Back to R and proper programming.
Much better! I have been able to write up my working on the Dept of Health
claims quite nicely. See