Non-ASCII strings in DICOM cause extractor routines to fail

Issue #256 resolved
Ed McDonagh created an issue

As reported by Samuli Hel: Google Group discussion

A protocol name containing ä causes the error: You must not use 8-bit bytestring.

Comments (8)

  1. Ed McDonagh reporter

    from __future__ import unicode_literals removes the error, but the import fails silently (from Google Groups discussion)

  2. Ed McDonagh reporter

    From Eivind https://groups.google.com/forum/#!topic/openrem/rh55ulcQPl8

    1. Adding from __future__ import unicode_literals to rdsr.py somehow makes the if dataset.SOPClassUID .. check (near the end of rdsr.py) fail, so extraction fails silently.

      • Solution: appending [:] to SOPClassUID -> SOPClassUID[:] (both instances) makes the script run, but it eventually fails anyway (due to the "strange" characters).
    2. get_value_kw(tag,dataset) in get_values.py must be modified to handle both strings and bytes, and to encode characters properly.

      • Solution: after def get_value_kw(tag,dataset) , adding
    # guarantee byte string in UTF8 encoding 
    _u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t 
    

    and, after the if value != '': line,

    value = format(_u8(value))
    return value.decode('latin-1', 'replace')
    

    (I guess changing "latin-1" with something else would work also.) I tried this using Postgresql (default settings), and it seems to work quite well.

  3. Ed McDonagh reporter

    Added unicode encoding to a couple of strings instead of using the str function which can't handle non-ASCII letters. Refs #256 - possibly fixed. Needs more testing.

    → <<cset 1216374ba525>>

  4. Ed McDonagh reporter

    Made use of get_value_kw to cover the case of non-ASCII characters in series description. Refs #256. Also ensures that we don't store '' to an integer field. Refs #316

    → <<cset 3af97675ec37>>
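
The workaround in comment 2 and the two changesets above share one idea: normalise every DICOM value to text before it reaches the database, and never store '' in an integer field. The following is a minimal sketch of that idea, not the code from either changeset. It is written for Python 3, whereas the thread concerns Python 2; to_text and int_or_none are hypothetical helper names, and latin-1 as the fallback encoding is an assumption carried over from comment 2.

```python
# Illustrative sketch only: to_text and int_or_none are hypothetical
# helpers, not functions from OpenREM or pydicom.


def to_text(value, encoding="latin-1"):
    """Return value as text, whether it arrives as bytes or as text."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # errors="replace" swaps undecodable bytes for U+FFFD instead of
        # raising UnicodeDecodeError. latin-1 maps every byte, so this
        # only matters when another encoding (e.g. UTF-8) is passed in.
        return value.decode(encoding, "replace")
    return str(value)


def int_or_none(value):
    """Return an int, or None when the value is missing or empty."""
    text = to_text(value)
    if text is None or text.strip() == "":
        return None  # store NULL rather than '' in an integer field
    return int(text)


if __name__ == "__main__":
    print(to_text(b"Sch\xe4del"))  # latin-1 encoded bytes
    print(to_text("Sch\u00e4del"))  # already text
    print(int_or_none(""))
```

Decoding at the boundary keeps the rest of the extractor working with a single text type, so no later call has to guess whether it was handed bytes or text.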
