dataDictionary "invalid start byte" on Ubuntu/64 bit

Issue #27 invalid
Former user created an issue

I'm running into a decode error when using UTF-8 mode on Linux and hope you can help out.

I'm attempting to execute the below code on the attached data file:

import savReaderWriter as srw

h = srw.SavHeaderReader("/home/user/Desktop/July 2004 Selective Exposure.sav", ioUtf8=True)

d = h.dataDictionary()

On Windows/64 bit, it works perfectly. On Ubuntu/64 bit, it returns the following:

Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    d = h.dataDictionary()
  File "/home/user/src/savreaderwriter/savReaderWriter/savHeaderReader.py", line 105, in dataDictionary
    metadata = dict([(item, getattr(self, item)) for item in items])
  File "/home/user/src/savreaderwriter/savReaderWriter/header.py", line 62, in wrapper
    uresult[uS(k)][uS(i)] = uS(uL(j))
  File "/home/user/src/savreaderwriter/savReaderWriter/header.py", line 48, in <lambda>
    uS = lambda x: x.decode("utf-8") if isinstance(x, bytes_) else x
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 3: invalid start byte

I think the culprit is an apostrophe character in the values of variables like "iraq8a". I've tried both the bleeding-edge repository version and the pip-installed version of savReaderWriter.

Any suggestions that don't involve modifying the original dataset? Re-saving in SPSS seems to do away with the problem, but I've got hundreds of datasets to manage so am looking for something a little more programmatic.

Love the tool, thanks for your help!

Comments (3)

  1. Albert-Jan Roskam repo owner

    Hi,

    This means that the file encoding and the interface encoding are not compatible. The .sav file was created with an older version of SPSS (I believe SPSS v13 only knew codepage mode; Unicode mode was introduced in SPSS v16, IIRC), and now you are trying to open it assuming a UTF-8 encoding. That works fine for the ASCII subset, but not for non-ASCII characters. Maybe the variable labels were typed in MS Office: I see these fancy hyphens and quotes (why oh why did people have to invent different kinds of quotes?). Anyway, below is the code that worked on my computer. You need to generate a Windows locale on your Ubuntu system. May I include your data in the savReaderWriter test data?

    I think I could improve the program a bit by at least raising a clearer error message, no?
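
To illustrate what a clearer error message might look like, here is a minimal Python 3 sketch. It is not savReaderWriter's code: `decode_label` is a hypothetical helper that wraps the same kind of `bytes.decode` call seen in the traceback and re-raises with a hint about `ioLocale`.

```python
def decode_label(raw, encoding="utf-8"):
    """Decode bytes, replacing the bare UnicodeDecodeError with a hint.

    Hypothetical helper for illustration only; not part of savReaderWriter.
    """
    if not isinstance(raw, bytes):
        return raw  # already text, nothing to decode
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        raise ValueError(
            "Cannot decode %r as %s (byte 0x%02x at position %d). "
            "The file was probably saved in a codepage such as cp1252; "
            "try opening it with a matching ioLocale instead of ioUtf8=True."
            % (raw, encoding, raw[exc.start], exc.start)
        ) from exc
```

Calling `decode_label(b"don\x92t")` would then fail with a message that names the byte and suggests the likely cause, rather than only "invalid start byte".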

    Best wishes, Albert-Jan
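
Whether the Windows locale is actually installed can be checked from Python before opening any files. The sketch below uses only the standard `locale` module; `locale_is_available` is a hypothetical helper name, and checking `LC_CTYPE` is an assumption about which category matters here.

```python
import locale


def locale_is_available(name):
    """Return True if setlocale accepts `name` for LC_CTYPE on this system."""
    previous = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        return False
    finally:
        # Always restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, previous)


# Example: probe the locale name used in the session that follows.
have_cp1252 = locale_is_available("en_US.cp1252")
```

If `have_cp1252` is false, the locale probably still needs to be generated with `localedef`.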

    antonia@antonia-HP-2133 ~/Desktop $ lsb_release -irc
    Distributor ID: LinuxMint
    Release:    14
    Codename:   nadia
    antonia@antonia-HP-2133 ~/Desktop $ uname -a
    Linux antonia-HP-2133 3.5.0-17-generic #28-Ubuntu SMP Tue Oct 9 19:32:08 UTC 2012 i686 i686 i686 GNU/Linux
    antonia@antonia-HP-2133 ~/Desktop $ python --version
    Python 2.7.3
    antonia@antonia-HP-2133 ~/Desktop $ sudo localedef -f CP1252 -i en_US /usr/lib/locale/en_US.cp1252
    antonia@antonia-HP-2133 ~/Desktop $ cat issue_27.py 
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    from __future__ import print_function
    from os.path import expanduser
    import codecs
    import pprint
    import unicodedata
    import savReaderWriter as rw
    
    #sudo localedef -f CP1252 -i en_US /usr/lib/locale/en_US.cp1252
    savFileName = expanduser("~/Downloads/July 2004 Selective Exposure.sav")
    with rw.SavHeaderReader(savFileName, ioLocale="en_US.cp1252") as header:
        print("SPSS version: %s.%s.%s" % header.spssVersion) 
        print("File encoding: %s" % header.fileEncoding) 
        metadata = header.dataDictionary(True)
        print("Compatible encoding: %s" % header.isCompatibleEncoding())
        #pprint.pprint(metadata.varLabels)
        report = unicode(header)
    
    with codecs.open("report.txt", "wb", encoding="utf-8") as outfile:
        outfile.write(report)
    
    print("Offending characters:")
    for offending_character in sorted(set([c for c in report if ord(c) > 127])):
        name = unicodedata.name(offending_character)
        print("%s - %s" % (offending_character, name))
    
    antonia@antonia-HP-2133 ~/Desktop $ python issue_27.py 
    SPSS version: 13.0.0
    File encoding: cp1252
    Compatible encoding: True
    Offending characters:
    – - EN DASH
    — - EM DASH
    ’ - RIGHT SINGLE QUOTATION MARK
    … - HORIZONTAL ELLIPSIS
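
The link between these four characters and the original traceback can be shown with the standard library alone. The Python 3 sketch below encodes each character as cp1252 and then tries to read the resulting single byte as UTF-8; the variable names are illustrative.

```python
import unicodedata

# EN DASH, EM DASH, RIGHT SINGLE QUOTATION MARK, HORIZONTAL ELLIPSIS
offending = "\u2013\u2014\u2019\u2026"

rows = []
for ch in offending:
    raw = ch.encode("cp1252")  # one byte per character in this codepage
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    rows.append((raw[0], unicodedata.name(ch), valid_utf8))

for byte, name, valid_utf8 in rows:
    print("0x%02x  %-28s valid UTF-8: %s" % (byte, name, valid_utf8))
```

The right single quotation mark comes out as byte 0x92, the byte named in the `UnicodeDecodeError`. None of the four bytes can be decoded as UTF-8 on its own, which is why `ioUtf8=True` fails on this file.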
    
  2. Albert-Jan Roskam repo owner

    This is expected behavior, though I am going to improve the error message. Or perhaps issue a warning in case of incompatible encodings.
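
A warning along those lines could be sketched as follows. This is only an illustration under stated assumptions: `warn_if_incompatible` is a hypothetical function, not the library's `isCompatibleEncoding`, and it simply compares normalized codec names, which is stricter than necessary (a pure-ASCII file would decode fine either way).

```python
import codecs
import warnings


def warn_if_incompatible(file_encoding, interface_encoding):
    """Warn when the two encoding names refer to different codecs.

    Hypothetical helper; returns True if the names match after normalization.
    """
    # codecs.lookup normalizes aliases, e.g. "UTF8" and "utf_8" to "utf-8".
    file_codec = codecs.lookup(file_encoding).name
    interface_codec = codecs.lookup(interface_encoding).name
    compatible = file_codec == interface_codec
    if not compatible:
        warnings.warn(
            "File encoding %s differs from interface encoding %s; "
            "non-ASCII text may fail to decode."
            % (file_codec, interface_codec)
        )
    return compatible
```

For the file in this issue, `warn_if_incompatible("cp1252", "utf-8")` would emit the warning and return `False`.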
