Non-ASCII strings in DICOM cause extractor routines to fail (2)
Possibly related to Issue #256
When importing Philips SCs with special characters in the StudyDescription, I get the following error on line 144 of ct_philips.py:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)
A quick workaround was to replace
commentstudydescription = get_value_kw('StudyDescription', dataset, char_set=ch)
by
commentstudydescription = get_value_kw('StudyDescription', dataset, char_set=ch).encode('utf-8')
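A minimal sketch of the failure mode and the workaround, written for Python 3 rather than the Python 2 code base this issue was raised against. The string value and the helper name `to_ascii_or_none` are illustrative assumptions, not taken from ct_philips.py:

```python
# Illustrative value only; the StudyDescription that triggered the error
# is not quoted in the report. "\u00cf" is the character Ï.
description = "CT MASTO\u00cfD"


def to_ascii_or_none(text):
    """Return ASCII bytes, or None if the text holds non-ASCII characters."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        # Same class of error as in the report: Ï is outside range(128).
        return None


ascii_bytes = to_ascii_or_none(description)  # None, the ASCII codec fails
utf8_bytes = description.encode("utf-8")     # the workaround: explicit UTF-8
print(ascii_bytes, utf8_bytes)
```

Encoding explicitly to UTF-8 avoids the implicit ASCII conversion, but it yields bytes rather than text, so whatever consumes the value has to expect that.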
Comments (13)
-
-
One way of dealing with this may be to encode as per the stated SpecificCharacterSet, but if it complains with a UnicodeEncodeError, then try again with the character set set to utf-8 as a best-endeavour guess.
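The fallback idea could look roughly like the sketch below, applied to the decoding side. It is not the OpenREM implementation: the function name `decode_with_fallback` and the mapping table are assumptions made for illustration, and the table covers only three DICOM defined terms.

```python
# Partial, assumed mapping from DICOM SpecificCharacterSet defined terms
# to Python codec names. A real implementation needs the full list.
DICOM_TO_PYTHON = {
    "": "ascii",              # tag absent or empty: default repertoire
    "ISO_IR 100": "latin_1",  # Latin alphabet No. 1
    "ISO_IR 192": "utf_8",    # Unicode in UTF-8
}


def decode_with_fallback(raw, specific_character_set=""):
    """Decode raw bytes with the declared character set, else guess UTF-8."""
    codec = DICOM_TO_PYTHON.get(specific_character_set, "ascii")
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        # Best-endeavour guess; undecodable bytes become U+FFFD.
        return raw.decode("utf-8", errors="replace")


print(decode_with_fallback(b"CT MASTO\xc3\x8fD"))            # falls back to UTF-8
print(decode_with_fallback(b"CT MASTO\xcfD", "ISO_IR 100"))  # declared set works
```

One limitation of this approach: latin_1 maps every byte value to a character, so decoding with a declared ISO_IR 100 never raises and the fallback never triggers, even when the bytes are really UTF-8.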
-
reporter SpecificCharacterSet was equal to "ISO_IR 100". So either the Philips Brilliance 64 doesn't use this tag, or PACS is modifying it? Your suggestion to take utf-8 as a fallback sounds good.
-
reporter - changed component to Import: All
-
@tcdewit do you have some examples of this? It'd be good to test it with .decode()
-
reporter - attached sc_philips_unicode_error.dcm
-
Thanks @tcdewit - that'll be a really useful test case for issue #503, as it demonstrates that I need to make sure all the strings are unicode! Can you confirm the correct presentation of the different fields? With a single extra 'u' it will import into the issue503 branch, but I am getting variations in what is currently displayed:
- Study description: Ear/MASTOID
- Requested procedure: CT MASTOÏD
- Within the comment field: <StudyDescription SRData="CT MASTOÃD" />
- Series 'acquisition protocol' field: MASTOID 0,55 MM
-
reporter To me it's confusing as well... different encodings seem to be mixed.
Study description (0008,1030) seems to use utf-8 (hex code C38F for Ï, which turns into Ã when interpreted as latin-1). (Almost) all other fields are in latin-1 format (hex code CF for Ï) or use I instead of Ï on purpose.
Btw, I might have asked this before, but why does OpenREM use ProtocolName (0018,1030) as study description instead of (0008,1030)?
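The hex codes quoted above can be checked with a few lines of Python 3. This is only a verification sketch; the variable names are illustrative:

```python
char = "\u00cf"  # the character Ï

# Byte representation of Ï in each encoding, as upper-case hex.
utf8_hex = char.encode("utf-8").hex().upper()
latin1_hex = char.encode("latin-1").hex().upper()

# Reading the UTF-8 bytes as latin-1 yields two characters:
# Ã (U+00C3) followed by the invisible control character U+008F.
misread = char.encode("utf-8").decode("latin-1")

print(utf8_hex, latin1_hex, [hex(ord(c)) for c in misread])
```

The second character of the misread pair is a non-printing control code, which is why only the Ã is visible in the comment field.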
-
I don't know about the study description. Maybe it wasn't there in the Philips Dose Info images I based the code on? We use StudyDescription in rdsr.py, and in dx.py we go through a list of possible tags to fill that in.
I guess we should change it to prefer StudyDescription - feel free to create an issue, branch from develop (in openrem/openrem) and create a PR!
-
reporter Actually it turns out that StudyDescription (0008,1030) is the field used for the comment field; that explains the Ã, since the encoding is wrong (utf-8). I suggest keeping everything the same (but with the extra 'u' everywhere).
-
- changed status to resolved
-
-
assigned issue to
- changed milestone to 0.8.0
-
assigned issue to
-
This will still need work to improve, as I have had a similar issue from elsewhere. But at least it should import now.
-
That shouldn't have been a problem as far as I can see @tcdewit, as long as the character set used matched the declared character set in the header.
What was the string, and what was the value of SpecificCharacterSet (0008,0005) - or does it even exist?