Problem with special characters in Python 2
Hi AJ,
this issue is about a problem with special characters that can occur in an SPSS file. I'm doing the same thing as always: reading data, reading metadata, writing everything. The writing part seems to have a bug.
When I try
from __future__ import division, print_function
#from __future__ import absolute_import, unicode_literals
import numpy as np
import os
import sys
import savReaderWriter as sav
directory = "C:\\Users\\RitschelG\\Projekte\\spss_to_pandas"
spss_filename = r"test.sav"
spss_file = os.path.join(directory, spss_filename)
# read SPSS file data
ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale=ioLocale, rawMode=False)
with data:
    allData = data.all()
variables = allData[0]
records = allData[1:]
# read SPSS file metadata
with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale="german") as header:
    metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?
# write (unmodified) data to SPSS file
spss_file_out = os.path.join(directory, 'out.sav')
with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale=ioLocale,
                   mode=b'wb', refSavFileName=None, **metadata) as writer:
    for i, record in enumerate(records):
        writer.writerow(record)
I get
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-1-ca2b3672f18a> in <module>()
26 spss_file_out = os.path.join(directory, 'out.sav')
27 with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale=ioLocale,
---> 28 mode=b'wb', refSavFileName=None, **metadata) as writer:
29 for i, record in enumerate(records):
30 writer.writerow(record)
C:\Users\RitschelG\AppData\Local\Continuum\32bit\Anaconda\lib\site-packages\savreaderwriter-3.3.0-py2.7.egg\savReaderWriter\savWriter.pyc in __init__(self, savFileName, varNames, varTypes, valueLabels, varLabels, formats, missingValues, measureLevels, columnWidths, alignments, varSets, varRoles, varAttributes, fileAttributes, fileLabel, multRespDefs, caseWeightVar, overwrite, ioUtf8, ioLocale, mode, refSavFileName)
185 self.missingValues = missingValues
186 self.measureLevels = measureLevels
--> 187 self.columnWidths = columnWidths
188 self.alignments = alignments
189 self.varSets = varSets
C:\Users\RitschelG\AppData\Local\Continuum\32bit\Anaconda\lib\site-packages\savreaderwriter-3.3.0-py2.7.egg\savReaderWriter\header.pyc in columnWidths(self, varColumnWidths)
704 if retcode:
705 msg = "Error setting variable column width: '%s'"
--> 706 checkErrsWarns(msg % varName.decode(), retcode)
707
708 def _setColWidth10(self):
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 0: ordinal not in range(128)
It seems one has to specify a codec there explicitly.
Cheers, Gerhard
(This was tested on Python 2, 32-bit, under Windows.)
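A note on why the failing line raises an encode error although it calls decode: in Python 2, calling .decode() on an object that is already unicode first encodes it implicitly with the ASCII codec, and that fails for a character such as u'\xfc'. The sketch below is written so that it also runs on Python 3. It reproduces the implicit step and shows one possible way to build the message safely. The helper name to_text and the UTF-8 default are assumptions for illustration only, not part of savReaderWriter.

```python
from __future__ import print_function


def to_text(value, encoding="utf-8"):
    """Return text for either bytes or text input (hypothetical helper)."""
    if isinstance(value, bytes):
        # Decode with an explicit codec instead of relying on the ASCII default.
        return value.decode(encoding)
    return value  # already text, nothing to decode


var_name = u"\xfc"  # the character named in the traceback (u-umlaut)

# The implicit step that Python 2 performs inside unicode.decode():
try:
    var_name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

msg = u"Error setting variable column width: '%s'"
safe_msg = msg % to_text(var_name)                # text passes through unchanged
safe_msg_from_bytes = msg % to_text(b"\xc3\xbc")  # UTF-8 bytes are decoded first
print(ascii_ok, safe_msg == safe_msg_from_bytes)
```

Checking the type before decoding keeps the error path usable for both byte and unicode variable names.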
Comments (6)
-
repo owner -
repo owner Btw, the ValueError is because locale.setlocale does not accept a unicode string in Python 2.7.
In [1]: import locale
In [2]: locale.setlocale(locale.LC_CTYPE, "en_US.UTF-8")
Out[2]: 'en_US.UTF-8'
In [3]: locale.setlocale(locale.LC_CTYPE, u"en_US.UTF-8")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-10-d8708a417ef8> in <module>()
----> 1 locale.setlocale(locale.LC_CTYPE, u"en_US.UTF-8")
/usr/lib/python2.7/locale.pyc in setlocale(category, locale)
544 if locale and type(locale) is not type(""):
545 # convert to string
--> 546 locale = normalize(_build_localename(locale))
547 return _setlocale(category, locale)
548
/usr/lib/python2.7/locale.pyc in _build_localename(localetuple)
451
452 """
--> 453 language, encoding = localetuple
454 if language is None:
455 language = 'C'
ValueError: too many values to unpack
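The check at line 544 of the traceback explains the failure: any locale argument whose type is not the native str is treated as a (language, encoding) tuple, so a unicode string ends up being unpacked character by character. A minimal sketch of a workaround, assuming the caller controls the argument, is to coerce the name to the native string type before calling locale.setlocale. The helper name native_str is hypothetical, and the "C" locale is used only because it is available everywhere.

```python
from __future__ import print_function

import locale
import sys


def native_str(name, encoding="ascii"):
    """Coerce a locale name to the interpreter's native str type (hypothetical helper)."""
    if sys.version_info[0] == 2 and not isinstance(name, str):
        return name.encode(encoding)   # Python 2: unicode -> byte string
    if sys.version_info[0] >= 3 and isinstance(name, bytes):
        return name.decode(encoding)   # Python 3: bytes -> text
    return name


# "C" always exists; a real call would pass e.g. "german" or "de_DE.cp1252".
result = locale.setlocale(locale.LC_CTYPE, native_str(u"C"))
print(result)
```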
-
reporter Ah, good to know. Thanks for the explanation. But I was not aware that I am passing a unicode string there. Where does ioLocale get converted to unicode? locale.setlocale was called with the string
"" if ioLocale is None else ioLocale
where ioLocale was defined via
ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
-
reporter Re: unicode_literals and interface in bytes. So the **metadata trick won't always work with ioUtf8=True.
- I think that it would be a really big improvement if the **metadata trick always worked and if one could guarantee that unmodified input produces the same output.
- In my opinion the metadata part of your module is the main selling point, because as far as I know, none of the other ways to export SPSS data (csv, xls, dta) works really well with the metadata.
- To be honest, I don't quite understand why you originally opted for the byte interface. I would probably have chosen unicode strings instead, especially when it comes to Python 3 compatibility. (But it is not my intention to judge your decision. I hope you don't take this the wrong way ...)
- ioUtf8=False seems to be a way around these issues for the moment. But it has the ugly side effect that SPSS always triples the length of character strings when reading in data exported from savReaderWriter.
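As a rough illustration of where a factor of three can come from (an assumption about the cause, not something confirmed in this thread): a character that occupies one byte in a code page such as cp1252 can need up to three bytes in UTF-8, so a string width that must hold any such character in UTF-8 has to be three times as large. The sample characters below are chosen purely for demonstration.

```python
from __future__ import print_function

samples = [u"a", u"\xfc", u"\u20ac"]  # 'a', u-umlaut, euro sign

for ch in samples:
    cp1252_len = len(ch.encode("cp1252"))
    utf8_len = len(ch.encode("utf-8"))
    print(repr(ch), cp1252_len, utf8_len)

# Worst case over every character that cp1252 can represent
worst = 0
for byte in range(256):
    try:
        ch = bytes(bytearray([byte])).decode("cp1252")
    except UnicodeDecodeError:
        continue  # a few byte values are undefined in cp1252
    worst = max(worst, len(ch.encode("utf-8")))
print(worst)
```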
-
repo owner Hi,
- Below is a demo of unicode_literals.
- It is also possible to copy metadata with mode="cp" and refSavFileName.
- The decision to use bytes was taken when Python 2.6 was still around. Bytestrings are the default strings in Python 2. Besides, with binary data, bytes will still be faster. Unicode strings would still have to be converted into bytestrings, there's no way around that. Still, the b" prefixes are indeed annoying ;-)
- You could try using ioUtf8=True and then use
SET UNICODE=OFF LOCALE = "english_us.65001"
inside SPSS (I am not 100 % sure whether this will work, I have no Windows here to try). That way you read it in codepage mode, but with a UTF-8 codepage. Alternatively, you could use the SPSS command
ALTER TYPE <strvarnames> (A=AMIN).
to shrink the tripled variables in SPSS again. Maybe even simply:
SET ERRORS = NONE.
ALTER TYPE ALL (A=AMIN).
SET ERRORS = LISTING.
All in all, this issue seems to be a duplicate of long-standing issue #20 "interface in bytes" (though I am certainly never going to drop the bytes altogether, as is suggested in this issue).
# without unicode_literals
antonia@antonia-HP-2133 /tmp $ cat btest.py
from __future__ import print_function; print(type("some string without u prefix"))
antonia@antonia-HP-2133 /tmp $ python btest.py
<type 'str'>
antonia@antonia-HP-2133 /tmp $ python3.4 btest.py
<class 'str'> # str means unicode in python3!
# with unicode_literals
antonia@antonia-HP-2133 /tmp $ cat utest.py
from __future__ import unicode_literals, print_function; print(type("some string without u prefix"))
antonia@antonia-HP-2133 /tmp $ python utest.py
<type 'unicode'>
antonia@antonia-HP-2133 /tmp $ python3.4 utest.py
<class 'str'>
-
reporter -
Below is a demo of unicode_literals.
- Thanks (I think I roughly knew that, though ...)
-
It is also possible to copy metadata with mode="cp" and refSavFileName.
- I know, thanks. (I haven't tried it, yet.)
-
The decision to use bytes was taken when Python 2.6 was still around. Bytestrings are the default strings in Python 2. Besides, with binary data, bytes will still be faster. Unicode strings would still have to be converted into bytestrings, there's no way around that. Still, the b" prefixes are indeed annoying ;-)
- I know. But thanks. :-)
-
You could try using ioUtf8=True and then use SET UNICODE=OFF LOCALE = "english_us.65001" inside SPSS (I am not 100 % sure whether this will work, I have no Windows here to try). That way you read it in codepage mode, but with a UTF-8 codepage. Alternatively, you could use the SPSS command ALTER TYPE <strvarnames> (A=AMIN). to shrink the tripled variables in SPSS again. Maybe even simply: SET ERRORS = NONE. ALTER TYPE ALL (A=AMIN). SET ERRORS = LISTING.
- Thanks for these ideas. I thought about the same things. I will try them ...
-
All in all, this issue seems to be a duplicate of long-standing issue #20 "interface in bytes" (though I am certainly never going to drop the bytes altogether, as is suggested in this issue).
- I did not want to suggest this. I'm fine with the byte interface. :-)
The issue was/is about ioUtf8=True not working properly in the output part. Actually, at the moment, it seems one should only use ioUtf8=False, with the implication that SPSS triples character strings.
-
Re: unicode_literals, I've always avoided that one. See also: http://stackoverflow.com/questions/809796/any-gotchas-using-unicode-literals-in-python-2-6.
As outlined in the docs, the interface is in bytes. I do intend to make it polymorphic some day. A few header items are ready now, but many to go. So the **metadata trick won't always work with ioUtf8=True. The error message indeed contains an error.
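Since the interface is in bytes, one conceivable stopgap for the **metadata round trip is to encode every text string inside the metadata dictionary before handing it to the writer. The sketch below is only an illustration under assumptions: encode_nested and encode_metadata are hypothetical helpers, the sample dictionary is made up, and whether SavWriter accepts the result for every header item has not been verified here.

```python
from __future__ import print_function


def encode_nested(obj, encoding="utf-8"):
    """Recursively encode text to bytes in dicts, lists and tuples (hypothetical helper)."""
    if isinstance(obj, bytes):
        return obj  # already bytes
    if isinstance(obj, type(u"")):
        return obj.encode(encoding)
    if isinstance(obj, dict):
        return dict((encode_nested(k, encoding), encode_nested(v, encoding))
                    for k, v in obj.items())
    if isinstance(obj, list):
        return [encode_nested(item, encoding) for item in obj]
    if isinstance(obj, tuple):
        return tuple(encode_nested(item, encoding) for item in obj)
    return obj  # numbers, None, etc. stay unchanged


def encode_metadata(metadata, encoding="utf-8"):
    """Encode the values only; top-level keys stay native so ** unpacking still works."""
    return dict((key, encode_nested(value, encoding))
                for key, value in metadata.items())


# Made-up sample resembling part of a data dictionary
sample = {"varNames": [u"gr\xf6\xdfe"], "columnWidths": {u"gr\xf6\xdfe": 10}}
encoded = encode_metadata(sample)
print(encoded)
```

The top-level keys are left untouched on purpose, because keyword names passed via ** must be native strings.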