Non-ASCII strings in DICOM cause extractor routines to fail (2)

Issue #476 resolved
Tim de Wit
created an issue

Possibly related to Issue #256

When importing Philips SCs with special characters in the StudyDescription, I get the following error on line 144 of ct_philips.py:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-9: ordinal not in range(128)

A quick workaround was to replace

commentstudydescription = get_value_kw('StudyDescription', dataset, char_set=ch)

by

commentstudydescription = get_value_kw('StudyDescription', dataset, char_set=ch).encode('utf-8')

Comments (13)

  1. Ed McDonagh

    That shouldn't have been a problem as far as I can see @Tim de Wit, as long as the character set used matched the declared character set in the header.

    What was the string, and what was the value of SpecificCharacterSet (0008,0005) - or does it even exist?

  2. Ed McDonagh

    One way of dealing with this may be to encode as per the stated SpecificCharacterSet and, if that raises UnicodeEncodeError, to try again with the character set set to utf-8 as a best-endeavour guess.
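
A minimal sketch of that fallback idea, written for Python 3 and for decoding raw bytes rather than encoding (the thread itself concerns Python 2 code). The function name `decode_with_fallback` and its parameters are hypothetical and are not part of OpenREM or pydicom. Note that a single-byte codec such as latin-1 accepts every byte value, so in this sketch the fallback only triggers for stricter declared codecs such as ascii.

```python
def decode_with_fallback(raw, declared_codec, fallback_codec="utf-8"):
    """Decode raw bytes with the declared codec, else fall back to UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        return raw.decode(declared_codec)
    except (UnicodeDecodeError, LookupError):
        # Declared codec rejected the bytes (or is unknown to Python):
        # make a best-endeavour guess, replacing anything undecodable.
        return raw.decode(fallback_codec, errors="replace")


# UTF-8 bytes under a declared ASCII character set: the fallback is used.
from_utf8 = decode_with_fallback(b"CT MASTO\xc3\x8fD", "ascii")

# Latin-1 bytes under a declared Latin-1 character set: no fallback needed.
from_latin1 = decode_with_fallback(b"CT MASTO\xcfD", "latin-1")

print(from_utf8 == from_latin1)
```

Both calls yield the same text here because each byte string is read with a codec that matches how it was written.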

  3. Tim de Wit reporter

    SpecificCharacterSet was equal to "ISO_IR 100". So either the Philips Brilliance 64 doesn't use this tag, or PACS is modifying it? Your suggestion to take utf-8 as a fallback sounds good.

  4. Ed McDonagh

    Thanks @Tim de Wit - that'll be a really useful test case for issue #503 as it demonstrates that I need to make sure all the strings are unicode! Can you confirm the correct presentation of the different fields? With a single extra 'u' it will import into the issue503 branch, but I am getting variations in what is currently displayed:

    • Study description: Ear/MASTOID
    • Requested procedure: CT MASTOÏD
    • Within the comment field: <StudyDescription SRData="CT MASTOÏD" />
    • Series 'acquisition protocol' field: MASTOID 0,55 MM

  5. Tim de Wit reporter

    To me it's confusing as well... different encodings seem to be mixed.
    Study description (0008,1030) seems to use utf-8 (hex-code C38F for Ï, which turns into Ã followed by a control character when interpreted as latin-1).

    (Almost) all other fields are in latin-1 format (hex-code CF for Ï) or use I instead of Ï on purpose. Btw I might have asked this before, but why does openrem use protocolname (0018,1030) as study description instead of (0008,1030)?
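
The byte values quoted in this comment can be checked with a few lines of Python 3. This is only an illustrative check of the two encodings of Ï and does not read any DICOM file.

```python
char = "\u00cf"  # Ï, LATIN CAPITAL LETTER I WITH DIAERESIS

utf8_bytes = char.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = char.encode("latin-1")  # one byte in Latin-1 (ISO_IR 100)

print(utf8_bytes.hex().upper())    # C38F
print(latin1_bytes.hex().upper())  # CF

# Reading the UTF-8 bytes as Latin-1 gives two characters:
# U+00C3 (Ã) followed by the control character U+008F.
misread = utf8_bytes.decode("latin-1")
print([hex(ord(c)) for c in misread])  # ['0xc3', '0x8f']
```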

  6. Ed McDonagh

    I don't know about the study description. Maybe it wasn't there in the Philips Dose Info images I based the code on? We use StudyDescription in rdsr.py, and in dx.py we go through a list of possible tags to fill that in.

    I guess we should change it to prefer StudyDescription - feel free to create an issue, branch from develop (in openrem/openrem) and create a PR!

  7. Tim de Wit reporter

    Actually it turns out that StudyDescription (0008,1030) is the field used for the comment field; that explains the Ï, since the encoding is wrong (utf-8). I suggest keeping everything the same (but with the extra 'u' everywhere).
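
The "position 8-9" in the original error message is consistent with this diagnosis. The sketch below is in Python 3 and assumes the StudyDescription value was "CT MASTOÏD" stored as UTF-8, which the thread does not confirm byte for byte. It decodes those bytes as Latin-1 and then forces an ASCII encode, as an implicit Python 2 conversion would.

```python
# Assumed value: the text as a UTF-8 writer would store it.
raw = "CT MASTO\u00cfD".encode("utf-8")

# Read back using Latin-1, as declared by ISO_IR 100.
misdecoded = raw.decode("latin-1")

try:
    misdecoded.encode("ascii")
except UnicodeEncodeError as err:
    error_span = (err.start, err.end)  # half-open range of offending characters
    message = str(err)

print(error_span)
print(message)
```

Two characters fail to encode (indices 8 and 9) because the single Ï has become two characters after the mismatched decode. A correctly decoded Ï would produce an error for one character at position 8 only.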
