dataDictionary "invalid start byte" on Ubuntu/64 bit
I'm running into a decode error when using UTF-8 mode on Linux and hope you can help out.
I'm attempting to execute the below code on the attached data file:
import savReaderWriter as srw
h = srw.SavHeaderReader("/home/user/Desktop/July 2004 Selective Exposure.sav", ioUtf8=True)
d = h.dataDictionary()
On Windows/64 bit, it works perfectly. On Ubuntu/64 bit, it returns the following:
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    d = h.dataDictionary()
  File "/home/user/src/savreaderwriter/savReaderWriter/savHeaderReader.py", line 105, in dataDictionary
    metadata = dict([(item, getattr(self, item)) for item in items])
  File "/home/user/src/savreaderwriter/savReaderWriter/header.py", line 62, in wrapper
    uresult[uS(k)][uS(i)] = uS(uL(j))
  File "/home/user/src/savreaderwriter/savReaderWriter/header.py", line 48, in <lambda>
    uS = lambda x: x.decode("utf-8") if isinstance(x, bytes_) else x
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte
I think the culprit is an apostrophe character in the values of variables like "iraq8a". I've tried both the bleeding-edge repository version and the pip install version of savReaderWriter.
Any suggestions that don't involve modifying the original dataset? Re-saving in SPSS seems to do away with the problem, but I've got hundreds of datasets to manage so am looking for something a little more programmatic.
Love the tool, thanks for your help!
Comments (3)
-
repo owner -
repo owner I forgot to mention that this works both for v3.3.0 and for the current HEAD revision 917d136e5a.
-
repo owner - changed status to invalid
This is expected behavior, though I am going to improve the error message, or perhaps issue a warning when the encodings are incompatible.
Hi,
This means that the file encoding and the interface encoding are not compatible. The .sav was created with an older version of SPSS (I believe SPSS v13 only knew codepage mode, Unicode was introduced in SPSS v14 IIRC). And now you are trying to open it assuming a UTF-8 encoding. That works fine for the ASCII subset, but not for accented characters and other non-ASCII characters. Maybe the variable labels were typed in MS Office. I see these fancy hyphens and quotes (why oh why did people have to invent different kinds of quotes?). Anyway, below is the code that worked on my computer. You need to generate a Windows locale on your Ubuntu system. May I include your data in the savReaderWriter test data?
I think I could improve the program a bit by at least raising a clearer error message, no?
Best wishes, Albert-Jan
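For readers who land here with the same traceback, the mismatch between the file encoding and the interface encoding can be reproduced without savReaderWriter at all. The sketch below is not the code referred to in the reply. It is a plain-Python illustration that assumes the labels were written in the Windows-1252 codepage, where byte 0x92 is the curly apostrophe. The sample bytes and the helper name decode_label are made up for the example.

```python
# -*- coding: utf-8 -*-
# Hypothetical value label; the bytes are made up for illustration and
# are not taken from the attached .sav file.
raw = b"don\x92t know"

# Decoding as UTF-8 fails: 0x92 is a continuation byte, so it cannot
# start a UTF-8 sequence ("invalid start byte").
try:
    raw.decode("utf-8")
    err = None
except UnicodeDecodeError as exc:
    err = exc


def decode_label(raw_bytes, fallback="cp1252"):
    """Try UTF-8 first, then fall back to a codepage.

    The cp1252 fallback is an assumption about how the file was written;
    use whatever codepage the .sav file was actually created with.
    """
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode(fallback)


text = decode_label(raw)
print(repr(text))
```

The same bytes that break the UTF-8 decoder decode cleanly once the right codepage is assumed, which is why re-saving the file in a newer SPSS (in Unicode mode) also makes the error disappear.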