OpenREM issue #256 - Non-ASCII strings in DICOM cause extractor routines to fail

Issue #256 resolved

Ed McDonagh created an issue 2015-08-07

As reported by Samuli Hel: Google Group discussion

A protocol name containing ä causes the error: You must not use 8-bit bytestring.

Comments (8)

Ed McDonagh reporter
Adding from __future__ import unicode_literals removes the error, but the import then fails silently (from the Google Groups discussion).
- 2015-08-12T08:25:04+00:00
Ed McDonagh reporter
From Eivind: https://groups.google.com/forum/#!topic/openrem/rh55ulcQPl8
1. Adding from __future__ import unicode_literals to rdsr.py somehow makes the if dataset.SOPClassUID .. test (near the end of rdsr.py) fail, so extraction fails silently.
  - Solution: appending [:] to SOPClassUID, giving SOPClassUID[:] (both instances), makes the script run, but it eventually fails because of the "strange" (non-ASCII) characters.
2. get_value_kw(tag,dataset) in get_values.py must be modified to handle both strings and bytes, and to handle character encoding properly.
  - Solution: after def get_value_kw(tag,dataset), add
```
# guarantee byte string in UTF8 encoding 
_u8 = lambda t: t.encode('UTF-8', 'replace') if isinstance(t, unicode) else t 
```
and after the if value != '': line, add
```
value = format(_u8(value))
return value.decode('latin-1', 'replace')
```
(I guess replacing "latin-1" with another encoding would also work.) I tried this using PostgreSQL (default settings), and it seems to work quite well.
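The idea in the snippets above, accepting either bytes or text and always returning text, can be sketched as follows. This is a minimal illustration written for Python 3 (the original patch targets Python 2, where unicode is a separate type); to_text is a hypothetical name and is not OpenREM's get_value_kw, and the latin-1 default simply mirrors the codec chosen in the comment above.

```python
# Hypothetical helper, not OpenREM's actual get_value_kw code.
def to_text(value, encoding="latin-1"):
    """Return value as text; bytes are decoded with `encoding`."""
    if isinstance(value, bytes):
        # "replace" substitutes U+FFFD for undecodable bytes instead of raising.
        return value.decode(encoding, "replace")
    return str(value)


raw_protocol = b"Sk\xe4ll"        # latin-1 bytes for "Skäll"
print(to_text(raw_protocol))      # decoded from bytes
print(to_text("Skäll"))           # already text, returned unchanged

# Pitfall: encoding text as UTF-8 and then decoding those bytes as
# latin-1 does not round-trip non-ASCII characters.
mojibake = "ä".encode("utf-8").decode("latin-1")
print(mojibake == "ä")            # False
```

Decoding only when the value is still bytes avoids the encode-then-decode round trip shown at the end, which would alter any non-ASCII character that had already been decoded correctly.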
- 2015-09-02T14:53:05+00:00
Ed McDonagh reporter
Added unicode encoding to a couple of strings instead of using the str function, which can't handle non-ASCII letters. Refs #256 - possibly fixed. Needs more testing.

→ <<cset 1216374ba525>>
- 2015-09-02T20:55:44+00:00
Ed McDonagh reporter
Added a latin-1 decode, which refs #256 and looks to fix it, but wouldn't work for other character sets.

→ <<cset 26e8564ad802>>
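
One way around the limitation noted above would be to choose the codec from the DICOM Specific Character Set attribute (0008,0005) rather than hard-coding latin-1. The sketch below is an assumption-laden illustration, not the change made in this changeset: DICOM_TO_PYTHON_CODEC and decode_dicom_text are hypothetical names, only a handful of defined terms are mapped, and the fallback to latin-1 for unknown terms is an arbitrary choice.

```python
# Hypothetical mapping of a few DICOM Specific Character Set defined terms
# to Python codec names. The full list in the DICOM standard is longer.
DICOM_TO_PYTHON_CODEC = {
    "ISO_IR 6": "ascii",        # default repertoire when the tag is absent
    "ISO_IR 100": "latin-1",    # Latin alphabet No. 1
    "ISO_IR 144": "iso8859_5",  # Cyrillic
    "ISO_IR 192": "utf-8",      # Unicode in UTF-8
}


def decode_dicom_text(raw, specific_character_set=None, fallback="latin-1"):
    """Decode raw bytes using the codec implied by the character set term."""
    term = specific_character_set or "ISO_IR 6"
    codec = DICOM_TO_PYTHON_CODEC.get(term, fallback)
    # "replace" keeps the import going when a byte cannot be decoded.
    return raw.decode(codec, "replace")


print(decode_dicom_text(b"Sk\xe4ll", "ISO_IR 100"))
print(decode_dicom_text(b"Sk\xc3\xa4ll", "ISO_IR 192"))
```

Libraries such as pydicom ship their own character set handling, so in practice it may be preferable to rely on that rather than maintain a table like this by hand.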
- 2015-09-03T12:47:49+00:00
Ed McDonagh reporter
- changed status to open
- 2015-09-03T14:06:53+00:00
Ed McDonagh reporter
Corrected a mistake that meant the changes made in 26e8564ad802 would never work! Refs #256 and hopefully fixes it.

→ <<cset ae724b09b510>>
- 2015-09-10T08:36:41+00:00
Ed McDonagh reporter
- changed status to resolved
Confirmed working by Eivind 10th September by email.
- 2015-10-05T21:31:26+00:00
Ed McDonagh reporter
Made use of get_value_kw to cover the case of non-ASCII characters in series description. Refs #256. Also ensures that we don't store '' to an integer field. Refs #316

→ <<cset 3af97675ec37>>
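
The second point above, not storing '' in an integer field, can be illustrated with a small guard. This is only a sketch of the general technique; int_or_none is a hypothetical name and the actual change in this changeset may look different.

```python
# Hypothetical guard: convert a possibly empty DICOM value to int or None
# so that an empty string is never written to an integer database field.
def int_or_none(value):
    if value is None or value == "":
        return None
    return int(value)


print(int_or_none(""))     # None, so the field can be stored as NULL
print(int_or_none("12"))   # 12
```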
- 2015-11-27T13:04:11+00:00

Assignee: Ed McDonagh

Type: bug

Priority: major

Status: resolved

Component: Import: All

Milestone: 0.7.0

Votes: 0

Watchers: 1
