Issue #9 invalid

Non-ASCII variable names

Anonymous created an issue


First of all: the module is great!

Now my "problem": I have to deal with SAV files that have non-ASCII characters (German umlauts) in their variable names. The file and its data are read perfectly, but all of those variables are renamed to "v1", "v2", ... (all other information is fine: valueLabels, variableLabels, ...)

It looks to me as if these names are already converted right after the self.spssio.spssGetVarNames call. Is there anything I/we could do about this?

Thanks for any answer, Axel

Comments (3)

  1. Albert-Jan Roskam repo owner


    Thanks ;-).

    -- What platform (OS and architecture) are you using?
    -- You could try specifying ioLocale="de_DE.cp1252", or perhaps simply "german" will also work.
    -- Or you could try using ioUtf8=True.
    -- If you can send me a (non-confidential) dataset, I can have a look.

    Best wishes, Albert-Jan

  2. Albert-Jan Roskam repo owner

    Okay, I checked it. It works now, though I'd still like to check the file written by the code below in SPSS (PSPP complains about record type 7, subtype 10). Note the comment about the possible need to define or generate a locale under Linux.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    # Q: "variable names with non-ASCII characters are returned as v1, v2, v3, etc"
    # A: Assuming the file was created in codepage mode (default in SPSS until very recently),
    #    setting ioLocale to the proper (OS-dependent) locale specification should do the trick.
    #    Note that the I/O has its own locale -the locale of the host system is not affected.
    #    I had to generate a German locale with the (Windows) codepage 1252 on my Linux machine first
    #    http://documentation.basis.com/BASISHelp/WebHelp/inst/character_encoding.htm
    #    sudo localedef -f CP1252 -i de_DE /usr/lib/locale/de_DE.cp1252
    #    locale -a | grep de_DE
    # Python 2.7.3 (default, Apr 10 2013, 05:09:49) [GCC 4.7.2] on linux2
    import os
    import savReaderWriter as rw
    # header reader - the correct way
    kwargs = dict(savFileName=os.path.expanduser("~/Downloads/german.sav"),
                  ioLocale="de_DE.cp1252", ioUtf8=True)
    with rw.SavHeaderReader(**kwargs) as header:
        print header.isCompatibleEncoding()
        print header.ioLocale
        print header.varNames
        print " ".join(header.varNames)
    # reader
    kwargs = dict(savFileName=os.path.expanduser("~/Downloads/german.sav"), 
                  ioLocale="de_DE.cp1252", ioUtf8=True, returnHeader=True)
    data  = rw.SavReader(**kwargs)
    with data:
        print data.all()
    # writer
    kwargs = dict(savFileName=os.path.expanduser("~/Downloads/german_out.sav"), 
                  varNames=[u'\xfcberhaupt'], varTypes={u'\xfcberhaupt':0},
                  ioLocale="de_DE.cp1252", ioUtf8=True)
    with rw.SavWriter(**kwargs) as writer:
        print writer.ioLocale


    [u'python', u'programmieren', u'macht', u'\xfcberhaupt', u'v\xf6llig', u'spa\xdf']
    python programmieren macht überhaupt völlig spaß
    [[u'python', u'programmieren', u'macht', u'\xfcberhaupt', u'v\xf6llig', u'spa\xdf'], [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]

    Without specifying the ioLocale and ioUtf8 in the "header" example, you would get:

    NOTE. SPSS Statistics data file 'german.sav' is written in a character encoding (cp1252) incompatible with the current ioLocale setting. It may not be readable. Consider changing ioLocale or setting ioUtf8=True.
    ['python', 'programmieren', 'macht', 'v1', 'v2', 'v3']
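
The fallback names can be mimicked with plain Python codecs. This is only a loose analogy: the real renaming happens inside the SPSS I/O library, and the helper `decode_or_placeholder` is hypothetical, not part of savReaderWriter. It assumes the names are stored as cp1252 bytes and that any name which cannot be decoded with the active encoding is replaced by a numbered placeholder.

```python
# -*- coding: utf-8 -*-
# Illustration only: plain Python codecs, no savReaderWriter or SPSS I/O calls.

names = [u"python", u"programmieren", u"macht",
         u"\xfcberhaupt", u"v\xf6llig", u"spa\xdf"]

# Bytes as they would look in a file written in codepage mode (cp1252 assumed).
raw_names = [name.encode("cp1252") for name in names]


def decode_or_placeholder(raw_names, encoding):
    """Decode each name; substitute v1, v2, ... when decoding fails.

    Hypothetical helper that imitates the symptom described in this issue.
    """
    result = []
    counter = 0
    for raw in raw_names:
        try:
            result.append(raw.decode(encoding))
        except UnicodeDecodeError:
            counter += 1
            result.append(u"v%d" % counter)
    return result


# An encoding that cannot represent the umlauts loses those names.
wrong = decode_or_placeholder(raw_names, "ascii")
# The matching codepage round-trips every name.
right = decode_or_placeholder(raw_names, "cp1252")

print(wrong)
print(right == names)
```

The point of the sketch is only that the stored bytes are fine; the names are lost when they are interpreted with an encoding that does not match the one the file was written in.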

    Long story short: I think it should have been sufficient to specify only ioLocale. I would like to check how this behaves on Windows. In any case, this may be worth describing in the PyPI documentation.
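
Regarding the locale-generation note in the script above: before passing a name such as "de_DE.cp1252" as ioLocale, it may help to check whether the operating system knows that locale at all. The following is a minimal sketch using only the standard `locale` module. The helper `locale_is_available` is hypothetical, and it is an assumption that a name accepted by Python's `setlocale` is also accepted by the SPSS I/O library.

```python
import locale


def locale_is_available(name, category=locale.LC_CTYPE):
    """Return True if the OS accepts `name` for the given locale category.

    Hypothetical helper; the process locale is restored before returning.
    """
    previous = locale.setlocale(category)  # query the current setting
    try:
        locale.setlocale(category, name)
        return True
    except locale.Error:
        return False
    finally:
        locale.setlocale(category, previous)  # always restore


# The "C" locale exists everywhere; the German one depends on the system.
print(locale_is_available("C"))
print(locale_is_available("de_DE.cp1252"))
```

If the second check fails on Linux, generating the locale with localedef as described in the script comments is the likely next step.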
