Long labels truncated to 138-142 characters
Long labels (UTF-8, Cyrillic) are truncated after writing. The maximum length is not fixed: some labels are 138 characters long after truncation, some 141, etc.
Comments (9)
-
reporter -
reporter After updating spssio to version 23, savReaderWriter throws an error:
Traceback (most recent call last):
  File "/home/forapp/fom/releases/20151001095544/lib/spss/writer.py", line 50, in <module>
    with SavWriter(params['filename'], params['variables'], params['types'], params['values'], params['labels'], params['formats'], params['missings'], params['measures'], params['columns'], params['aligns'], None, params['roles'], None, None, None, multRespDefs, None, True, True) as writer:
  File "/home/forapp/src/savreaderwriter/savReaderWriter/savWriter.py", line 186, in __init__
    super(Header, self).__init__(savFileName, ioUtf8, ioLocale)
  File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 29, in __init__
    self.spssio = self.loadLibrary()
  File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 112, in loadLibrary
    spssio = self._loadLibs("lin64")
  File "/home/forapp/src/savreaderwriter/savReaderWriter/generic.py", line 89, in _loadLibs
    return [load(os.path.join(path, lib)) for lib in libs][-1]
  File "/usr/local/lib/python2.7/ctypes/__init__.py", line 365, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libgsk8iccs_64.so: cannot open shared object file: No such file or directory
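The traceback ends in a plain ctypes load failure, so the missing library can be checked without going through SavWriter at all. Below is a minimal sketch; the helper name find_unloadable and the directory path are hypothetical and not part of savReaderWriter. It simply tries to load each shared object and collects the error messages.

```python
import ctypes
import os


def find_unloadable(lib_dir, lib_names):
    """Return {library name: error message} for every library ctypes cannot load."""
    failures = {}
    for name in lib_names:
        path = os.path.join(lib_dir, name)
        try:
            ctypes.CDLL(path)
        except OSError as exc:
            failures[name] = str(exc)
    return failures


# Hypothetical directory: point it at the folder that holds the spssio shared objects.
lib_dir = "/path/to/spssio/lin64"
for name, error in sorted(find_unloadable(lib_dir, ["libgsk8iccs_64.so"]).items()):
    print("%s: %s" % (name, error))
```

If a library is listed here, it is either absent from that directory or depends on another shared object that the loader cannot find.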
-
repo owner Hi,
Could you post minimal code to reproduce this? I am not sure whether you mean Value Labels or Variable Labels. You could write:
- in Codepage mode, with a Russian ioLocale. Possibly you need to generate that locale (ru_RU.cp1250??) on your system with locale gen.
- in Unicode mode, ioUtf8=True. This is probably easiest, but users with older SPSS versions can't use your .sav files.
The newer I/O libraries support encrypted .sav files, which require extra work to load. The current version does not require setting LD_LIBRARY_PATH and similar.
Regards, Albert-Jan
-
reporter Hello, Albert-Jan.
In this code, the labels are truncated ("english" to 256 chars, "russian" to 137 chars):
# -*- coding: utf-8 -*-
import json
from savReaderWriter import SavWriter

raw = """
{
  "labels": {
    "english": "Guildhall is a building in the City of London, off Gresham and Basinghall Streets, that has been used as a town hall for several hundred years. It remains the ceremonial and administrative centre of the City of London and its Corporation. This photograph shows the interior of its main room, a medieval great hall dating back to 1411.",
    "russian": "С четырнадцати лет проходил обучение в футбольной академии «Сконто». В восемнадцать лет Александр начал профессиональную карьеру, став игроком рижского «Олимп»."
  },
  "records": [["none","none"]],
  "types": { "english": 2000, "russian": 2000 },
  "variables": ["english","russian"]
}
"""
params = json.loads(raw)
with SavWriter("out.sav", params['variables'], params['types'], None, params['labels'], None, None, None, None, None, None, None, None, None, None, None, None, True, True) as writer:
    for record in params['records']:
        writer.writerow(record)
-
repo owner Hello George,
The length of Variable Labels is limited to 256 bytes. That is a limitation/characteristic of the .sav file format. The limits are stored in sav.MAXLENGTHS (see below). Both the English and the Russian variable labels are truncated to 255 bytes (255, not 256; maybe the last one is a terminating null byte?). In English (which uses just the single-byte ASCII chars) the number of bytes equals the number of characters, but not in Russian (that text is 160 chars, 298 bytes when encoded in UTF-8). Below is a slightly modified version of your code. I think it would be nice if savReaderWriter would issue a warning if labels or other values are truncated.
unicode mode
albertjan@debian:~/nfs/Public/savreaderwriter$ uname -a
Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt9-2 (2015-04-13) x86_64 GNU/Linux
albertjan@debian:~/nfs/Public/savreaderwriter$ cat mironov.py
# -*- coding: utf-8 -*-
from __future__ import print_function
import json
import savReaderWriter as sav

raw = """
{
  "labels": {
    "english": "Guildhall is a building in the City of London, off Gresham and Basinghall Streets, that has been used as a town hall for several hundred years. It remains the ceremonial and administrative centre of the City of London and its Corporation. This photograph shows the interior of its main room, a medieval great hall dating back to 1411.",
    "russian": "С четырнадцати лет проходил обучение в футбольной академии «Сконто». В восемнадцать лет Александр начал профессиональную карьеру, став игроком рижского «Олимп»."
  },
  "records": [["none","none"]],
  "types": { "english": 2000, "russian": 2000 },
  "variables": ["english","russian"]
}
"""
params = json.loads(raw)
kwargs = dict(savFileName="out.sav", varNames=params['variables'], varTypes=params['types'], varLabels=params['labels'], ioUtf8=True)  ## Unicode mode
with sav.SavWriter(**kwargs) as writer:
    for record in params['records']:
        writer.writerow(record)
with sav.SavHeaderReader(kwargs["savFileName"]) as header:
    for lang, text in sorted(header.varLabels.items()):
        print("%s: %d bytes, %d chars" % (lang, len(text), len(text.decode("utf-8"))))
print(sav.MAXLENGTHS['SPSS_MAX_VARLABEL'])
albertjan@debian:~/nfs/Public/savreaderwriter$ python2.7 mironov.py
english: 255 bytes, 255 chars
russian: 255 bytes, 137 chars
(256, 'Variable label')
albertjan@debian:~/nfs/Public/savreaderwriter$ python3.4 mironov.py
b'english': 255 bytes, 255 chars
b'russian': 255 bytes, 137 chars
(256, 'Variable label')
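The gap between bytes and characters can be reproduced without SPSS or savReaderWriter. The sketch below only models a byte-based cut; truncate_to_bytes is a hypothetical helper, and how the I/O library itself truncates is not confirmed here.

```python
# -*- coding: utf-8 -*-


def truncate_to_bytes(text, max_bytes, encoding="utf-8"):
    """Cut text so that its encoded form fits in max_bytes without splitting a character."""
    encoded = text.encode(encoding)[:max_bytes]
    # errors="ignore" drops a trailing partial multi-byte sequence, if the cut left one.
    return encoded.decode(encoding, errors="ignore")


for label in (u"Guildhall", u"Олимп"):
    kept = truncate_to_bytes(label, 7)
    print("%s: %d chars, %d bytes" % (kept, len(kept), len(kept.encode("utf-8"))))
```

With the same 7-byte budget, the ASCII label keeps 7 characters while the Cyrillic one keeps only 3, because each Cyrillic letter takes two bytes in UTF-8. The same effect, at a 255-byte budget, explains why the Russian label ends up shorter in characters than the English one.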
codepage mode
The code is partially omitted because it is largely the same. The point is that with a single-byte encoding such as cp1251 you can stuff more Russian chars in a label.
kwargs = dict(savFileName="out.sav", varNames=params['variables'], varTypes=params['types'], varLabels=params['labels'], ioLocale="ru_RU.CP1251")  ## codepage mode
with sav.SavWriter(**kwargs) as writer:
    for record in params['records']:
        writer.writerow(record)
with sav.SavHeaderReader(kwargs["savFileName"]) as header:
    for lang, text in sorted(header.varLabels.items()):
        print("%s: %d bytes, %d chars" % (lang, len(text), len(text.decode("cp1251"))))
The output:
english: 255 bytes, 255 chars
russian: 255 bytes, 255 chars
But even in codepage mode you're missing "роком рижского «Олимп».". Alas! :-)
Best wishes, Albert-Jan
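Until the library warns about truncation itself, a check along these lines could be run on the labels before writing. This is only a sketch: check_var_labels and USABLE_VARLABEL_BYTES are hypothetical names, and the 255-byte budget is taken from the output observed in this thread, not from an official constant.

```python
# -*- coding: utf-8 -*-
import warnings

# Assumption: 255 usable bytes per variable label, as observed in the output above.
USABLE_VARLABEL_BYTES = 255


def check_var_labels(var_labels, encoding="utf-8", max_bytes=USABLE_VARLABEL_BYTES):
    """Warn about labels whose encoded length exceeds max_bytes; return their names."""
    too_long = []
    for name, label in sorted(var_labels.items()):
        n_bytes = len(label.encode(encoding))
        if n_bytes > max_bytes:
            warnings.warn("label of %r is %d bytes in %s; the limit is %d"
                          % (name, n_bytes, encoding, max_bytes))
            too_long.append(name)
    return too_long


labels = {"short": u"ok", "long": u"я" * 200}
print(check_var_labels(labels, encoding="utf-8"))   # 200 Cyrillic chars need 400 bytes
print(check_var_labels(labels, encoding="cp1251"))  # the same label needs 200 bytes
```

Passing the encoding explicitly makes the difference between Unicode mode and codepage mode visible: the same label may fit in cp1251 and overflow in UTF-8.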
-
repo owner - marked as enhancement
-
reporter Thanks for answering!
Can I increase the sav.MAXLENGTHS value? I tried setting something like the line below, but it doesn't work :)
sav.MAXLENGTHS['SPSS_MAX_VARLABEL'] = (2560, 'Variable label')
-
repo owner Hi, you're welcome! Unfortunately, the MAXLENGTHS are a restriction of the .sav format. It is impossible to change that, even though you could modify the values of the dictionary savReaderWriter stores the maxlengths in. It usually makes the output in SPSS very hard to read if the variable labels are that long. You could try storing information in a FILE LABEL, though I really do not know the maxlength of that. String fields have a limit of around 32000 bytes, I think, so you could also consider saving the variable labels in a separate little dataset.
Regards, Albert-Jan
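The "separate little dataset" idea could be prepared roughly as follows. This is a sketch under assumptions: split_labels is a hypothetical helper, the 255-byte budget comes from the output earlier in this thread, and writing the overflow rows to a second .sav file is left out.

```python
# -*- coding: utf-8 -*-


def split_labels(var_labels, max_bytes=255, encoding="utf-8"):
    """Return (short_labels, overflow_rows).

    short_labels maps each variable name to a label that fits in max_bytes.
    overflow_rows holds [variable name, full label] for every label that did not fit.
    """
    short_labels, overflow_rows = {}, []
    for name, label in sorted(var_labels.items()):
        encoded = label.encode(encoding)
        if len(encoded) <= max_bytes:
            short_labels[name] = label
            continue
        # Cut on a byte budget and drop a trailing partial character, if any.
        short_labels[name] = encoded[:max_bytes].decode(encoding, errors="ignore")
        overflow_rows.append([name, label])
    return short_labels, overflow_rows


short, rows = split_labels({"english": u"ok", "russian": u"я" * 200})
print(len(short["russian"]), len(rows))
```

The short labels would go to varLabels as before, and the overflow rows could be written as records of a second file with two string variables, provided the full label fits in the string width chosen for it.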
-
repo owner - changed status to resolved
Resolved quite some time ago; I just forgot to close the issue :)
Someone advised me to update the SPSSIO library. Where can I get the latest version of spssio for Linux x64 (CentOS 6)?