fomcl / savReaderWriter / issues / #37 - Long labels truncated to 138-142 characters

Issue #37 resolved

George Mironov created an issue 2015-10-05

Long labels (UTF-8, Cyrillic) are truncated after writing. The maximum length is not fixed: some labels are 138 characters long after truncation, some 141, etc.

Comments (9)

George Mironov reporter
Someone advised me to update the SPSSIO library. Where can I get the latest version of spssio for Linux x64 (CentOS 6)?
- 2015-10-05T12:36:44+00:00

George Mironov reporter

After updating spssio to version 23, savReaderWriter throws an error:

Traceback (most recent call last):
  File "/home/forapp/fom/releases/20151001095544/lib/spss/writer.py", line 50, in <module>
    with SavWriter(params['filename'], params['variables'], params['types'], params['values'], params['labels'], params['formats'], params['missings'], params['measures'], params['columns'], params['aligns'], None, params['roles'], None, None, None, multRespDefs, None, True, True) as writer:
  File "/home/forapp/src/savreaderwriter/savReaderWriter/savWriter.py", line 186, in __init__
    super(Header, self).__init__(savFileName, ioUtf8, ioLocale)
  File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 29, in __init__
    self.spssio = self.loadLibrary()
  File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 112, in loadLibrary
    spssio = self._loadLibs("lin64")
  File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 89, in _loadLibs
    return [load(os.path.join(path, lib)) for lib in libs][-1]
  File "/usr/local/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libgsk8iccs_64.so: cannot open shared object file: No such file or directory

2015-10-05T13:43:06+00:00

Albert-Jan Roskam repo owner
Hi,

Could you post minimal code to reproduce this? I am not sure whether you mean Value Labels or Variable Labels. You could write:
- in Codepage mode, with a Russian ioLocale. Possibly you need to generate that locale (ru_RU.cp1250??) on your system with locale gen.
- in Unicode mode, ioUtf8=True. This is probably easiest, but users with older SPSS versions can't use your .sav files.
The newer I/O libraries support encrypted .sav files, which require extra work to load. The current version does not require setting LD_LIBRARY_PATH and similar.

Regards, Albert-Jan
- 2015-10-05T19:01:16+00:00

George Mironov reporter

Hello, Albert-Jan.

In this code, the labels are truncated ("english" to 256 chars, "russian" to 137 chars):

# -*- coding: utf-8 -*-
import json
from savReaderWriter import SavWriter

raw = """
{
  "labels": {
    "english": "Guildhall is a building in the City of London, off Gresham and Basinghall Streets, that has been used as a town hall for several hundred years. It remains the ceremonial and administrative centre of the City of London and its Corporation. This photograph shows the interior of its main room, a medieval great hall dating back to 1411.",
    "russian": "С четырнадцати лет проходил обучение в футбольной академии «Сконто». В восемнадцать лет Александр начал профессиональную карьеру, став игроком рижского «Олимп»."
  },
  "records": [["none","none"]],
  "types": { "english": 2000, "russian": 2000 },
  "variables": ["english","russian"]
}
"""

params = json.loads(raw)


with SavWriter("out.sav", params['variables'], params['types'], None, params['labels'], None, None, None, None, None, None, None, None, None, None, None, None, True, True) as writer:
  for record in params['records']:
    writer.writerow(record)

2015-10-06T10:44:54+00:00

Albert-Jan Roskam repo owner

Hello George,

The length of Variable Labels is limited to 256 bytes. That is a limitation/characteristic of the .sav file format. The limits are stored in sav.MAXLENGTHS (see below). Both the English and the Russian variable labels are truncated to 255 bytes (255, not 256; maybe the last one is a terminating null byte?). In English (which uses just single-byte ASCII chars) the number of bytes equals the number of characters, but not in Russian: each Cyrillic letter takes two bytes in UTF-8, so that text is 160 chars but 298 bytes when encoded. Below is a slightly modified version of your code. I think it would be nice if savReaderWriter issued a warning when labels or other values are truncated.

unicode mode

albertjan@debian:~/nfs/Public/savreaderwriter$ uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-2 (2015-04-13) x86_64 GNU/Linux
albertjan@debian:~/nfs/Public/savreaderwriter$ cat mironov.py 
# -*- coding: utf-8 -*-

from __future__ import print_function 
import json
import savReaderWriter as sav

raw = """
{
  "labels": {
    "english": "Guildhall is a building in the City of London, off Gresham and Basinghall Streets, that has been used as a town hall for several hundred years. It remains the ceremonial and administrative centre of the City of London and its Corporation. This photograph shows the interior of its main room, a medieval great hall dating back to 1411.",
    "russian": "С четырнадцати лет проходил обучение в футбольной академии «Сконто». В восемнадцать лет Александр начал профессиональную карьеру, став игроком рижского «Олимп»."
  },
  "records": [["none","none"]],
  "types": { "english": 2000, "russian": 2000 },
  "variables": ["english","russian"]
}
"""

params = json.loads(raw)

kwargs = dict(savFileName="out.sav",
              varNames=params['variables'],
              varTypes=params['types'], 
              varLabels=params['labels'], 
              ioUtf8=True)   ## Unicode mode

with sav.SavWriter(**kwargs) as writer:
  for record in params['records']:
    writer.writerow(record)

with sav.SavHeaderReader(kwargs["savFileName"]) as header:
    for lang, text in sorted(header.varLabels.items()):
        print("%s: %d bytes, %d chars" % (lang, len(text), len(text.decode("utf-8"))))

print(sav.MAXLENGTHS['SPSS_MAX_VARLABEL'])
albertjan@debian:~/nfs/Public/savreaderwriter$ python2.7 mironov.py 
english: 255 bytes, 255 chars
russian: 255 bytes, 137 chars
(256, 'Variable label')
albertjan@debian:~/nfs/Public/savreaderwriter$ python3.4 mironov.py 
b'english': 255 bytes, 255 chars
b'russian': 255 bytes, 137 chars
(256, 'Variable label')
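
The byte arithmetic can also be checked without writing a .sav file at all. The following is only a sketch for Python 3: `truncate_to_bytes` is a hypothetical helper, not part of savReaderWriter, and the 255-byte limit is taken from the output above. It cuts on a character boundary, which is an assumption; the I/O library may behave differently when the cut falls inside a multi-byte character.

```python
# Observed in the output above: labels come back with at most 255 bytes.
MAX_LABEL_BYTES = 255


def truncate_to_bytes(text, max_bytes=MAX_LABEL_BYTES, encoding="utf-8"):
    """Cut text so its encoded form fits in max_bytes (hypothetical helper)."""
    raw = text.encode(encoding)[:max_bytes]
    # errors="ignore" drops a trailing character that was cut in half
    return raw.decode(encoding, errors="ignore")


russian = (
    "С четырнадцати лет проходил обучение в футбольной академии «Сконто». "
    "В восемнадцать лет Александр начал профессиональную карьеру, "
    "став игроком рижского «Олимп»."
)

kept = truncate_to_bytes(russian)
print(len(russian), len(russian.encode("utf-8")))  # 160 298
print(len(kept), len(kept.encode("utf-8")))        # 137 255
print(russian[len(kept):])                         # роком рижского «Олимп».
```

The number of characters that survive depends on how many of them are ASCII (spaces, punctuation) and how many are two-byte Cyrillic, which would explain why the cut-off point differs from label to label.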

codepage mode

The code is partially omitted because it is largely the same. The point is that with a single-byte encoding such as cp1251 you can stuff more Russian chars in a label.

kwargs = dict(savFileName="out.sav",
              varNames=params['variables'],
              varTypes=params['types'], 
              varLabels=params['labels'], 
              ioLocale="ru_RU.CP1251")  ## codepage mode

with sav.SavWriter(**kwargs) as writer:
  for record in params['records']:
    writer.writerow(record)

with sav.SavHeaderReader(kwargs["savFileName"]) as header:
    for lang, text in sorted(header.varLabels.items()):
        print("%s: %d bytes, %d chars" % (lang, len(text), len(text.decode("cp1251"))))

The output:

english: 255 bytes, 255 chars
russian: 255 bytes, 255 chars

But even in codepage mode you're missing "роком рижского «Олимп».". Alas! :-)
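
Until the library warns about truncation itself, a check before writing could look roughly like this. It is a minimal sketch for Python 3: `find_long_labels` is a hypothetical name, the 255-byte limit is the one observed in this thread, and the check only looks at the encoded length of each label, so what the I/O library actually stores may still differ.

```python
import warnings

MAX_LABEL_BYTES = 255  # assumption: usable label size observed in this thread


def find_long_labels(var_labels, encoding="utf-8", max_bytes=MAX_LABEL_BYTES):
    """Warn about labels whose encoded form exceeds max_bytes; return their names."""
    too_long = []
    for name, label in sorted(var_labels.items()):
        n_bytes = len(label.encode(encoding))
        if n_bytes > max_bytes:
            warnings.warn(
                "label of %r is %d bytes; only %d bytes fit, the rest will be cut"
                % (name, n_bytes, max_bytes)
            )
            too_long.append(name)
    return too_long


labels = {"short": "Возраст респондента", "long": "я" * 200}
print(find_long_labels(labels))  # ['long']  (200 chars, but 400 bytes in UTF-8)
```

Calling something like this on the varLabels dictionary before opening the SavWriter makes the truncation visible up front instead of only after reading the file back.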

Best wishes, Albert-Jan

2015-10-06T20:10:36+00:00

Albert-Jan Roskam repo owner
- marked as enhancement
- 2015-10-06T20:11:46+00:00
George Mironov reporter
Thanks for answering!

Can I increase the sav.MAXLENGTHS value? I tried setting something like the below, but it's not working :)
```
sav.MAXLENGTHS['SPSS_MAX_VARLABEL'] = (2560, 'Variable label')
```
- 2015-10-07T09:40:18+00:00
Albert-Jan Roskam repo owner
Hi, you're welcome! Unfortunately, the MAXLENGTHS are a restriction of the .sav format. It is impossible to change that, even though you can modify the values of the dictionary in which savReaderWriter stores the maxlengths. Variable labels that long usually make the output in SPSS very hard to read anyway. You could try storing the information in a FILE LABEL, though I really do not know the maxlength of that. String fields have a limit of around 32000 bytes, I think, so you could also consider saving the variable labels in a separate little dataset.
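
As a rough illustration of the "separate little dataset" idea, the arguments for such a lookup file could be prepared like this. This is a sketch for Python 3 under stated assumptions: `build_label_dataset` and the column names `varname` and `varlabel` are made up for the example, and the string widths are simply the longest encoded value in each column.

```python
def build_label_dataset(var_labels, encoding="utf-8"):
    """Return (varNames, varTypes, records) for a two-column label lookup dataset."""
    # String widths are counted in bytes, so measure the encoded values.
    name_width = max([len(name.encode(encoding)) for name in var_labels] + [1])
    label_width = max([len(label.encode(encoding)) for label in var_labels.values()] + [1])

    var_names = ["varname", "varlabel"]  # hypothetical column names
    var_types = {"varname": name_width, "varlabel": label_width}
    records = [[name, var_labels[name]] for name in sorted(var_labels)]
    return var_names, var_types, records


names, types, rows = build_label_dataset({"q1": "я" * 200, "q2": "abc"})
print(types)  # {'varname': 2, 'varlabel': 400}
```

The result could then be passed to SavWriter with the same keyword arguments used earlier in this thread (savFileName, varNames, varTypes, ioUtf8=True), followed by writer.writerow(record) for each record. Whether the values need to be passed as str or bytes under Python 3 is not verified here.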

Regards, Albert-Jan
- 2015-10-07T19:50:50+00:00
Albert-Jan Roskam repo owner
- changed status to resolved
Resolved quite some time ago; I just forgot to close the issue :)
- 2016-04-04T19:25:08+00:00

Assignee: –

Type: enhancement

Priority: major

Status: resolved

Votes: 0

Watchers: 2
