Problem with special characters in Python 2

Issue #30 new
gritschel created an issue

Hi AJ,

this issue is about a problem with special characters that can occur in an SPSS file. I'm doing the same thing as always: reading data, reading metadata, writing everything. The writing part seems to have a bug.

When I try

from __future__ import division, print_function
#from __future__ import absolute_import, unicode_literals

import numpy as np
import os
import sys
import savReaderWriter as sav

directory = "C:\\Users\\RitschelG\\Projekte\\spss_to_pandas"
spss_filename = r"test.sav"
spss_file = os.path.join(directory, spss_filename)

# read SPSS file data
ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale=ioLocale, rawMode=False)
with data:
    allData = data.all()
variables = allData[0]
records = allData[1:]

# read SPSS file metadata
with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale=ioLocale) as header:
    metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?

# write (unmodified) data to SPSS file
spss_file_out = os.path.join(directory, 'out.sav')
with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale=ioLocale,
                   mode=b'wb', refSavFileName=None, **metadata) as writer:
    for i, record in enumerate(records):
        writer.writerow(record)

I get

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-1-ca2b3672f18a> in <module>()
     26 spss_file_out = os.path.join(directory, 'out.sav')
     27 with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale=ioLocale,
---> 28                    mode=b'wb', refSavFileName=None, **metadata) as writer:
     29     for i, record in enumerate(records):
     30         writer.writerow(record)

C:\Users\RitschelG\AppData\Local\Continuum\32bit\Anaconda\lib\site-packages\savreaderwriter-3.3.0-py2.7.egg\savReaderWriter\savWriter.pyc in __init__(self, savFileName, varNames, varTypes, valueLabels, varLabels, formats, missingValues, measureLevels, columnWidths, alignments, varSets, varRoles, varAttributes, fileAttributes, fileLabel, multRespDefs, caseWeightVar, overwrite, ioUtf8, ioLocale, mode, refSavFileName)
    185             self.missingValues = missingValues
    186             self.measureLevels = measureLevels
--> 187             self.columnWidths = columnWidths
    188             self.alignments = alignments
    189             self.varSets = varSets

C:\Users\RitschelG\AppData\Local\Continuum\32bit\Anaconda\lib\site-packages\savreaderwriter-3.3.0-py2.7.egg\savReaderWriter\header.pyc in columnWidths(self, varColumnWidths)
    704             if retcode:
    705                 msg = "Error setting variable column width: '%s'"
--> 706                 checkErrsWarns(msg % varName.decode(), retcode)
    707 
    708     def _setColWidth10(self):

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)

It seems that a codec has to be specified explicitly there.

Cheers, Gerhard

(This was tested with 32-bit Python 2 on Windows.)
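
The traceback ends at `varName.decode()` in header.py. In Python 2, calling `.decode()` on a unicode object first encodes it implicitly with the default ASCII codec, which would explain a UnicodeEncodeError for u'\xfc'. The following is a minimal sketch of that mechanism, assuming `varName` is already unicode when `ioUtf8=True`. The helper `to_text` is hypothetical and not part of savReaderWriter; it only illustrates decoding bytestrings while leaving unicode untouched.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function


def to_text(name, encoding="utf-8"):
    """Return `name` as text, decoding only if it is a bytestring.

    Hypothetical helper for illustration; not part of savReaderWriter.
    """
    if isinstance(name, bytes):  # `bytes` is `str` in Python 2
        return name.decode(encoding)
    return name  # already unicode text, nothing to decode


# The implicit step Python 2 performs inside u"...".decode():
# encoding a non-ASCII character with the ASCII codec fails.
try:
    u"\xfcber".encode("ascii")
except UnicodeEncodeError as exc:
    print("implicit ASCII encode fails:", exc.reason)

# Bytestrings are decoded, unicode strings are passed through unchanged.
print(to_text(b"\xc3\xbcber") == u"\xfcber")
print(to_text(u"\xfcber") == u"\xfcber")
```

With a guard like this, the error message could be formatted with the variable name regardless of whether it arrives as bytes or as unicode.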

Comments (6)

  1. Albert-Jan Roskam repo owner

    Btw, the ValueError is because locale.setlocale does not accept a unicode string in Python 2.7.

    In [1]: import locale
    In [2]: locale.setlocale(locale.LC_CTYPE, "en_US.UTF-8")
    Out[2]: 'en_US.UTF-8'
    
    In [3]: locale.setlocale(locale.LC_CTYPE, u"en_US.UTF-8")
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-10-d8708a417ef8> in <module>()
    ----> 1 locale.setlocale(locale.LC_CTYPE, u"en_US.UTF-8")
    
    /usr/lib/python2.7/locale.pyc in setlocale(category, locale)
        544     if locale and type(locale) is not type(""):
        545         # convert to string
    --> 546         locale = normalize(_build_localename(locale))
        547     return _setlocale(category, locale)
        548 
    
    /usr/lib/python2.7/locale.pyc in _build_localename(localetuple)
        451 
        452     """
    --> 453     language, encoding = localetuple
        454     if language is None:
        455         language = 'C'
    
    ValueError: too many values to unpack
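
A minimal sketch of one way to avoid this ValueError, assuming the locale name may arrive as either bytes or unicode: coerce it to the interpreter's native `str` type before it reaches `locale.setlocale`. The helper name `as_native_str` is hypothetical and not part of savReaderWriter.

```python
import sys


def as_native_str(value, encoding="ascii"):
    """Coerce a locale name to the native `str` type of the interpreter.

    Hypothetical helper for illustration; not part of savReaderWriter.
    """
    if isinstance(value, str):
        return value  # already the native string type
    if sys.version_info[0] == 2:
        return value.encode(encoding)  # unicode -> bytestring (Python 2 str)
    return value.decode(encoding)  # bytes -> text (Python 3 str)


# Usage idea (left commented out because the named locale must exist on the system):
# import locale
# locale.setlocale(locale.LC_CTYPE, as_native_str(u"en_US.UTF-8"))
```

Locale names are plain ASCII in practice, so the default `encoding="ascii"` is an assumption that should hold for names such as "german" or "de_DE.cp1252".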
    
  2. gritschel reporter

    Ah, good to know. Thanks for the explanation. But I was not aware that I was passing a unicode string there. Where does ioLocale get converted to unicode? locale.setlocale was called with the string

    "" if ioLocale is None else ioLocale
    

    where ioLocale was defined via

    ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
    
  3. gritschel reporter

    Re: unicode_literals and interface in bytes. So the **metadata trick won't always work with ioUtf8=True.

    • I think it would be a really big improvement if the **metadata trick always worked and if one could guarantee that unmodified input produces the same output.

    • In my opinion the metadata part of your module is the main selling point, because as far as I know, none of the other possibilities to export SPSS data (csv, xls, dta) works really well with the metadata.

    • To be honest, I don't quite understand why you originally opted for the byte interface. I would probably have chosen unicode strings instead -- especially when it comes to Python 3 compatibility. (But it is not my intention to judge your decision. I hope you don't get me wrong ...)

    • ioUtf8=False seems to be a way around these issues for the moment. But it has the ugly side effect that SPSS always triples the length of character strings when reading in data exported from savReaderWriter.
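
A rough illustration of why string widths can grow when codepage data is read in Unicode mode. It assumes the tripling reflects the worst-case UTF-8 length of a character that occupies a single byte in a codepage such as cp1252; the thread does not state the exact rule SPSS applies, so this is only a plausible explanation. The function name `byte_widths` is made up for the example.

```python
from __future__ import print_function


def byte_widths(ch):
    """Return (bytes in cp1252, bytes in UTF-8) for a single character."""
    return len(ch.encode("cp1252")), len(ch.encode("utf-8"))


# An ASCII letter, a German umlaut (u-umlaut) and the euro sign.
for ch in (u"a", u"\xfc", u"\u20ac"):
    print(repr(ch), byte_widths(ch))
```

Each of these characters takes one byte in cp1252, while the euro sign needs three bytes in UTF-8, so a width of three times the codepage width is always sufficient.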

  4. Albert-Jan Roskam repo owner

    Hi,

    • Below is a demo of unicode_literals.
    • It is also possible to copy metadata with mode="cp" and refSavFile.
    • The decision to use bytes was taken when Python 2.6 was still around. Bytestrings are the default strings in Python 2. Besides, with binary data, bytes will still be faster. Unicode strings would still have to be converted into bytestrings; there's no way around that. Still, the b" prefixes are indeed annoying ;-)
    • You could try using ioUtf8=True and then use SET UNICODE=OFF LOCALE = "english_us.65001" inside SPSS (I am not 100% sure whether this will work; I have no Windows machine here to try). That way you read the file in codepage mode, but with a UTF-8 codepage. Alternatively, you could use the SPSS command ALTER TYPE <strvarnames> (A=AMIN). to shrink the tripled variables in SPSS again. Maybe even simply: SET ERRORS = NONE. ALTER TYPE ALL (A=AMIN). SET ERRORS = LISTING.

    All in all, this issue seems to be a duplicate of the long-standing issue #20 "interface in bytes" (though I am certainly never going to drop bytes altogether, as is suggested in that issue).

    # without unicode_literals
    antonia@antonia-HP-2133 /tmp $ cat btest.py
    from __future__ import print_function; print(type("some string without u prefix"))
    antonia@antonia-HP-2133 /tmp $ python btest.py
    <type 'str'>
    antonia@antonia-HP-2133 /tmp $ python3.4 btest.py
    <class 'str'>   # str means unicode in python3!
    
    # with unicode_literals
    antonia@antonia-HP-2133 /tmp $ cat utest.py
    from __future__ import unicode_literals, print_function; print(type("some string without u prefix"))
    antonia@antonia-HP-2133 /tmp $ python utest.py
    <type 'unicode'>
    antonia@antonia-HP-2133 /tmp $ python3.4 utest.py
    <class 'str'>
    
  5. gritschel reporter
    • Below is a demo of unicode_literals.

      • Thanks (I think I roughly knew that, though ...)
    • It is also possible to copy metadata with mode="cp" and refSavFile.

      • I know, thanks. (I haven't tried it, yet.)
    • The decision to use bytes was taken when Python 2.6 was still around. Bytestrings are the default strings in Python 2. Besides, with binary data, bytes will still be faster. Unicode strings would still have to be converted into bytestrings; there's no way around that. Still, the b" prefixes are indeed annoying ;-)

      • I know. But thanks. :-)
    • You could try using ioUtf8=True and then use SET UNICODE=OFF LOCALE = "english_us.65001" inside SPSS (I am not 100% sure whether this will work; I have no Windows machine here to try). That way you read the file in codepage mode, but with a UTF-8 codepage. Alternatively, you could use the SPSS command ALTER TYPE <strvarnames> (A=AMIN). to shrink the tripled variables in SPSS again. Maybe even simply: SET ERRORS = NONE. ALTER TYPE ALL (A=AMIN). SET ERRORS = LISTING.

      • Thanks for these ideas. I thought about the same things. I will try them ...
    • All in all, this issue seems to be a duplicate of the long-standing issue #20 "interface in bytes" (though I am certainly never going to drop bytes altogether, as is suggested in that issue).

      • I did not want to suggest this. I'm fine with the byte interface. :-) The issue was/is about ioUtf8=True not working properly in the output part. Actually, at the moment it seems that one should only use ioUtf8=False -- with the implication that SPSS triples character strings.