Long labels truncated to 138-142 characters

Issue #37 resolved
George Mironov created an issue

Long labels (UTF-8, Cyrillic) are truncated after writing. The maximum length is not fixed: some labels are 138 characters long after truncation, others 141, and so on.

Comments (9)

  1. George Mironov reporter

    Someone advised me to update the SPSSIO library. Where can I get the latest version of spssio for Linux x64 (CentOS 6)?

  2. George Mironov reporter

    After updating spssio to version 23, savReaderWriter throws an error:

    Traceback (most recent call last):
      File "/home/forapp/fom/releases/20151001095544/lib/spss/writer.py", line 50, in <module>
        with SavWriter(params['filename'], params['variables'], params['types'], params['values'], params['labels'], params['formats'], params['missings'], params['measures'], params['columns'], params['aligns'], None, params['roles'], None, None, None, multRespDefs, None, True, True) as writer:
      File "/home/forapp/src/savreaderwriter/savReaderWriter/savWriter.py", line 186, in __init__
        super(Header, self).__init__(savFileName, ioUtf8, ioLocale)
      File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 29, in __init__
        self.spssio = self.loadLibrary()
      File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 112, in loadLibrary
        spssio = self._loadLibs("lin64")
      File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 89, in _loadLibs
        return [load(os.path.join(path, lib)) for lib in libs][-1]
      File "/usr/local/lib/python2.7/ctypes/__init__.py", line 365, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: libgsk8iccs_64.so: cannot open shared object file: No such file or directory
    
  3. Albert-Jan Roskam repo owner

    Hi,

    Could you post minimal code to reproduce this? I am not sure whether you mean Value Labels or Variable Labels. You could write:

    • in Codepage mode, with a Russian ioLocale. Possibly you need to generate that locale (ru_RU.cp1251??) on your system with locale-gen.
    • in Unicode mode, ioUtf8=True. This is probably easiest, but users with older SPSS versions can't use your .sav files.

    The newer I/O libraries support encrypted .sav files, which require extra work to load. The current version does not require setting LD_LIBRARY_PATH and similar.
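
    Whether a codepage locale has already been generated can be checked from Python before writing. The sketch below uses only the standard `locale` module; the helper name `locale_available` is hypothetical, and it assumes that codepage mode depends on a locale installed on the system, as the reply above suggests.

```python
import locale


def locale_available(name):
    """Return True if the C library accepts `name` as an LC_CTYPE locale."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # The locale has not been generated on this system.
        return False
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # always restore


if __name__ == "__main__":
    for candidate in ("ru_RU.CP1251", "ru_RU.UTF-8"):
        print(candidate, locale_available(candidate))
```

    If the check returns False for the Russian locale, generating it first (or switching to ioUtf8=True) is probably the next step.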

    Regards, Albert-Jan

  4. George Mironov reporter

    Hello, Albert-Jan.

    In this code the labels get truncated ("english" to 256 chars, "russian" to 137 chars):

    # -*- coding: utf-8 -*-
    import json
    from savReaderWriter import SavWriter
    
    raw = """
    {
      "labels": {
        "english": "Guildhall is a building in the City of London, off Gresham and Basinghall Streets, that has been used as a town hall for several hundred years. It remains the ceremonial and administrative centre of the City of London and its Corporation. This photograph shows the interior of its main room, a medieval great hall dating back to 1411.",
        "russian": "С четырнадцати лет проходил обучение в футбольной академии «Сконто». В восемнадцать лет Александр начал профессиональную карьеру, став игроком рижского «Олимп»."
      },
      "records": [["none","none"]],
      "types": { "english": 2000, "russian": 2000 },
      "variables": ["english","russian"]
    }
    """
    
    params = json.loads(raw)
    
    
    with SavWriter("out.sav", params['variables'], params['types'], None, params['labels'], None, None, None, None, None, None, None, None, None, None, None, None, True, True) as writer:
      for record in params['records']:
        writer.writerow(record)
    
  5. Albert-Jan Roskam repo owner

    Hello George,

    The length of Variable Labels is limited to 256 bytes. That is a limitation/characteristic of the .sav file format. The limits are stored in sav.MAXLENGTHS (see below). Both the English and the Russian variable labels are truncated to 255 bytes (255, not 256; maybe the last one is a terminating null byte?). In English (which uses just single-byte ASCII chars) the number of bytes equals the number of characters, but not in Russian (that text is 160 chars, 298 bytes when encoded in UTF-8). This also explains why the truncated length in characters varies from label to label: the cut happens at a fixed number of bytes, and the mix of one-byte and two-byte characters differs per label. Below is a slightly modified version of your code. I think it would be nice if savReaderWriter issued a warning when labels or other values are truncated.

    unicode mode

    albertjan@debian:~/nfs/Public/savreaderwriter$ uname -a
    Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-2 (2015-04-13) x86_64 GNU/Linux
    albertjan@debian:~/nfs/Public/savreaderwriter$ cat mironov.py 
    # -*- coding: utf-8 -*-
    
    from __future__ import print_function 
    import json
    import savReaderWriter as sav
    
    raw = """
    {
      "labels": {
        "english": "Guildhall is a building in the City of London, off Gresham and Basinghall Streets, that has been used as a town hall for several hundred years. It remains the ceremonial and administrative centre of the City of London and its Corporation. This photograph shows the interior of its main room, a medieval great hall dating back to 1411.",
        "russian": "С четырнадцати лет проходил обучение в футбольной академии «Сконто». В восемнадцать лет Александр начал профессиональную карьеру, став игроком рижского «Олимп»."
      },
      "records": [["none","none"]],
      "types": { "english": 2000, "russian": 2000 },
      "variables": ["english","russian"]
    }
    """
    
    params = json.loads(raw)
    
    kwargs = dict(savFileName="out.sav",
                  varNames=params['variables'],
                  varTypes=params['types'], 
                  varLabels=params['labels'], 
                  ioUtf8=True)   ## Unicode mode
    
    with sav.SavWriter(**kwargs) as writer:
      for record in params['records']:
        writer.writerow(record)
    
    with sav.SavHeaderReader(kwargs["savFileName"]) as header:
        for lang, text in sorted(header.varLabels.items()):
            print("%s: %d bytes, %d chars" % (lang, len(text), len(text.decode("utf-8"))))
    
    print(sav.MAXLENGTHS['SPSS_MAX_VARLABEL'])
    albertjan@debian:~/nfs/Public/savreaderwriter$ python2.7 mironov.py 
    english: 255 bytes, 255 chars
    russian: 255 bytes, 137 chars
    (256, 'Variable label')
    albertjan@debian:~/nfs/Public/savreaderwriter$ python3.4 mironov.py 
    b'english': 255 bytes, 255 chars
    b'russian': 255 bytes, 137 chars
    (256, 'Variable label')
    

    codepage mode

    The code is partially omitted because it is largely the same. The point is that with a single-byte encoding such as cp1251 you can stuff more Russian chars in a label.

    kwargs = dict(savFileName="out.sav",
                  varNames=params['variables'],
                  varTypes=params['types'], 
                  varLabels=params['labels'], 
                  ioLocale="ru_RU.CP1251")  ## codepage mode
    
    with sav.SavWriter(**kwargs) as writer:
      for record in params['records']:
        writer.writerow(record)
    
    with sav.SavHeaderReader(kwargs["savFileName"]) as header:
        for lang, text in sorted(header.varLabels.items()):
            print("%s: %d bytes, %d chars" % (lang, len(text), len(text.decode("cp1251"))))
    

    The output:

    english: 255 bytes, 255 chars
    russian: 255 bytes, 255 chars
    

    But even in codepage mode you're missing "роком рижского «Олимп».". Alas! :-)
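
    Until such a warning exists, the byte length can be checked before writing. This is a minimal sketch in plain Python, not part of savReaderWriter: the names `truncate_label` and `check_labels` are hypothetical, and the 255-byte limit is an assumption taken from the output above rather than from the file format documentation.

```python
# -*- coding: utf-8 -*-
import warnings

# Assumption: 255 usable bytes, matching the lengths printed above
# (sav.MAXLENGTHS itself reports 256 for variable labels).
MAX_LABEL_BYTES = 255


def truncate_label(label, encoding="utf-8", max_bytes=MAX_LABEL_BYTES):
    """Return (label cut to max_bytes in `encoding`, was_truncated)."""
    raw = label.encode(encoding)
    if len(raw) <= max_bytes:
        return label, False
    # errors="ignore" drops a multi-byte character that was cut in half.
    cut = raw[:max_bytes].decode(encoding, errors="ignore")
    return cut, True


def check_labels(var_labels, encoding="utf-8"):
    """Warn about every label that does not fit; return the fitted labels."""
    fitted = {}
    for name, label in var_labels.items():
        fitted[name], truncated = truncate_label(label, encoding)
        if truncated:
            warnings.warn("label of %r is %d bytes in %s; only %d chars fit"
                          % (name, len(label.encode(encoding)),
                             encoding, len(fitted[name])))
    return fitted
```

    Calling check_labels(params['labels']) before constructing the SavWriter would report which labels are about to be cut, and by how much, in the encoding that will actually be used.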

    Best wishes, Albert-Jan

  6. George Mironov reporter

    Thanks for answering!

    May I increase the sav.MAXLENGTHS value? I tried setting something like the line below, but it does not work :)

    sav.MAXLENGTHS['SPSS_MAX_VARLABEL'] = (2560, 'Variable label')
    
  7. Albert-Jan Roskam repo owner

    Hi, you're welcome! Unfortunately, the MAXLENGTHS are a restriction of the .sav format. It is impossible to change that, even though you could modify the values of the dictionary that savReaderWriter stores the maximum lengths in. Variable labels that long usually make the output in SPSS very hard to read anyway. You could try storing information in a FILE LABEL, though I really do not know the maximum length of that. String fields have a limit of around 32000 bytes, I think, so you could also consider saving the variable labels in a separate little dataset.
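
    The separate-dataset idea could look roughly like the sketch below. It reuses only the SavWriter keyword arguments already shown in this thread (savFileName, varNames, varTypes, ioUtf8). The file name, the field names, the 2000-byte width and both helper names are hypothetical choices, and the writing step has not been tested against a particular savReaderWriter version.

```python
# -*- coding: utf-8 -*-
LABEL_FIELD_BYTES = 2000  # hypothetical width, well under the ~32000-byte limit


def build_label_dataset(var_labels, width=LABEL_FIELD_BYTES):
    """Return SavWriter keyword arguments and records for a label lookup file."""
    records = []
    for name, label in sorted(var_labels.items()):
        # String fields are sized in bytes, so measure the encoded label.
        if len(label.encode("utf-8")) > width:
            raise ValueError("label of %r does not fit in %d bytes" % (name, width))
        records.append([name, label])
    kwargs = dict(savFileName="labels.sav",           # hypothetical file name
                  varNames=["varname", "fulllabel"],  # hypothetical field names
                  varTypes={"varname": 64, "fulllabel": width},
                  ioUtf8=True)
    return kwargs, records


def write_label_dataset(var_labels):
    """Write the lookup file; needs savReaderWriter and the SPSS I/O libraries."""
    import savReaderWriter as sav  # imported here so the helper above runs without it
    kwargs, records = build_label_dataset(var_labels)
    with sav.SavWriter(**kwargs) as writer:
        for record in records:
            writer.writerow(record)
```

    The main .sav file would then carry shortened variable labels, while the lookup file keeps the full text, one row per variable.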

    Regards, Albert-Jan
