Reading a .sav file and subsequently writing the unmodified data should produce the same file

Issue #29 new
gritschel created an issue

I think that the srw module should produce an identical output file if data (plus metadata) that were read are written again without further modification. This should also be a standard test for srw, in my opinion.

I have tried the following:

import numpy as np
import savReaderWriter as sav

# reading data from SPSS file
data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale="german")
with data:
    allData = data.all()
allData = np.array(allData)  # in the most recent version one can directly read to numpy arrays
variables = allData[0]
records = allData[1:]
# reading metadata from SPSS file
with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale="german") as header:
    metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?
# writing unmodified data
with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale="german",
                   mode=b'wb', refSavFileName=None, **metadata) as writer:
    for i, record in enumerate(records):
        writer.writerow(record)

Currently, when trying to run the above code, I get the following error, which I was not able to trace back. (Note that the values in metadata['alignments'] are, of course, all in ['left', 'right', 'center'].)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-b3fa0cebce1a> in <module>()
      1 spss_file_out = os.path.join(directory, 'out.sav')
      2 with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale="german",
----> 3                    mode=b'wb', refSavFileName=None, **metadata) as writer:
      4     for i, record in enumerate(records):
      5         if i < 5: print(record)

C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in __init__(self, savFileName, varNames, varTypes, valueLabels, varLabels, formats, missingValues, measureLevels, columnWidths, alignments, varSets, varRoles, varAttributes, fileAttributes, fileLabel, multRespDefs, caseWeightVar, overwrite, ioUtf8, ioLocale, mode, refSavFileName)
    186             self.measureLevels = measureLevels
    187             self.columnWidths = columnWidths
--> 188             self.alignments = alignments
    189             self.varSets = varSets
    190             self.varRoles = varRoles

C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\header.py in alignments(self, varAlignments)
    757             if varAlignment.lower() not in alignments:
    758                 ukeys = b", ".join(alignments.keys()).decode()
--> 759                 raise ValueError("Valid alignments are %s" % ukeys)
    760             alignment = alignments.get(varAlignment.lower())
    761             retcode = func(self.fh, c_char_py3k(varName), alignment)

ValueError: Valid alignments are right, center, left
  • I noticed in the 3.3.0 download version (i.e. long before the update in the master channel, yesterday) that my import/export example from above works fine. An issue there was that dates were read and converted from Gregorian seconds to proper dates during the import, but they disappeared in the export. This should, obviously, not be the case.

  • Furthermore, it seems that string variables were three times as long in the output file as in the input file. This is also not desirable.

  • Another issue I noticed is that reading the metadata from an SPSS file (with the code above) seems to take much longer than reading the actual data. Is there an explanation for this behaviour? (I noticed this with a pretty large file with hundreds of variables.)

Cheers, Gerhard

I am currently working with python3 on Windows (64bit).

Comments (13)

  1. Albert-Jan Roskam repo owner

    Hi again Gerhard,

    I noticed in the 3.3.0 download version (i.e. long before the update in the master channel, yesterday) that my import/export example from above works fine. An issue there was that dates were read and converted from Gregorian seconds to proper dates during the import, but they disappeared in the export. This should, obviously, not be the case. --> use rawMode=True to prevent automatic date conversion. This also means that sysmis values are not converted into None, and that string values are still padded ("ceiled") to multiples of 8 bytes (e.g. an A20 variable is 24 bytes long).
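    The "ceiling" can be sketched in a couple of lines of plain Python (not savReaderWriter code; ceil8 is a made-up helper name):

```python
def ceil8(n):
    """Round n up to the next multiple of 8, mimicking how rawMode
    leaves string values padded to the record buffer width."""
    return -(-n // 8) * 8

print(ceil8(20))  # an A20 variable occupies 24 bytes
print(ceil8(16))  # already a multiple of 8: stays 16
```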

    Furthermore, it seems that string variables were three times as long in the output file as in the input file. This is also not desirable. --> I can't reproduce this with savReaderWriter. But with SPSS, if you open a file in unicode mode (SET UNICODE=ON, the default since SPSS v21) that was created in codepage mode (SET UNICODE=OFF), the number of bytes of string variables will be tripled. This is done because the codepage file could theoretically contain accented chars that become 3-byte chars in utf-8 (I guess we're out of luck with east asian codepages then). See: http://www-01.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/faq_unicode.htm
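    The byte arithmetic behind that tripling can be checked directly in Python: every cp1252 character occupies a single byte, while its utf-8 encoding may need up to three bytes, so SPSS reserves three bytes per codepage byte to be safe:

```python
# One byte per char in cp1252; up to three bytes per char in utf-8
for ch in "a\u00e4\u20ac":  # 'a', 'ä', '€'
    print(ch, len(ch.encode("cp1252")), len(ch.encode("utf-8")))
```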

    Another issue I noticed is that reading the metadata from an SPSS file (with the code above) seems to take much longer than reading the actual data. Is there an explanation for this behaviour? (I noticed this with a pretty large file with hundreds of variables.) --> Explanation: (1) there are lots and lots of different kinds of metadata, and many of them are stored in arrays. Better to only fetch what you need, if possible. (2) With ioUtf8=True things get worse, as all bytestrings need to be transcoded into ustrings. (3) Some profiling wouldn't hurt. It was a lot of work to make the code Python 2/3 compatible; premature optimization...
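    As a starting point for such profiling, the stdlib cProfile module can wrap the metadata read; read_metadata below is only a stand-in for the SavHeaderReader call, not library code:

```python
import cProfile
import io
import pstats

def read_metadata():
    # Stand-in for header.dataDictionary(); replace with the real call
    return {key: list(range(1000)) for key in ("varNames", "varLabels", "formats")}

pr = cProfile.Profile()
pr.enable()
read_metadata()
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # shows which calls the time is actually spent in
```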

    Anyway, I modified the code a little so your (slightly modified) code below works. I have yet to write unittests. See commit 4cd6db34d3.

    from __future__ import print_function
    import sys
    import pprint
    import numpy as np
    import savReaderWriter as sav
    
    spss_file = "./test_data/gerhard.sav"
    spss_file_out = "./test_data/gerhard_out.sav"
    
    ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
    
    data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale=ioLocale, rawMode=True)
    with data:
        allData = data.all()
        variables = allData[0]
        records = allData[1:]
        print(data.varTypes["name"] == len(records[0][1]))   # 24 while it should be 20! -->> rawMode: strings are ceiled multiples of 8
    
    allDataArray = np.array(records)  # in the most recent version one can directly read to numpy arrays
    print(records)
    
    # reading metadata from SPSS file
    with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale=ioLocale) as header:
        metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?
    
    pprint.pprint(metadata)
    
    # writing unmodified data
    with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale=ioLocale,
                       mode=b'wb', refSavFileName=None, **metadata) as writer:
        for i, record in enumerate(records):
            writer.writerow(record)
    
  2. gritschel reporter

    Hi AJ,

    thanks again for your quick responses!

    • use rawMode=True to prevent automatic date conversion

      • In the current version, no date conversion is done even with rawMode=False. I don't know why. (Wouldn't it be clever to have an additional keyword argument for the date conversion?)
      • Second, with rawMode=True, reading the file seems to take much longer than with rawMode=False. What's the reason for this?
    • I can't reproduce this with savReaderWriter.

      • You are right, it seems that this doesn't happen with the current version anymore. And thanks for all the other information!
    • There are lots and lots of different kinds of metadata, many of them are stored in arrays. Better to only fetch what you need, if possible.

      • I would rather read all of it, since I want to check whether I can produce an output file with the same content as the input file.
    • Anyway, I modified the code a little so your (slighly modified) code below works. I have yet to write unittests.

      • No, it doesn't. Here is what I get:
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-25-fb88eac8e98e> in <module>()
          4     for i, record in enumerate(records):
          5         #if i < 5: print(record)
    ----> 6         writer.writerow(record)
    
    C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in writerow(self, record)
        332             cWriterow(self, record)
        333             return
    --> 334         self._pyWriterow(record)
        335 
        336     def writerows(self, records):
    
    C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in _pyWriterow(self, record)
        325                     value = value.encode("utf-8")
        326             record[i] = value
    --> 327         self.record = record
        328 
        329     def writerow(self, record):
    
    C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\generic.py in record(self, record)
        522         except struct.error:
        523             msg = "Use ioUtf8=True to write unicode strings [%s]"
    --> 524             raise TypeError(msg % sys.exc_info()[1])
        525         self.wholeCaseOut.argtypes = [c_int, c_char_p]
        526         retcode = self.wholeCaseOut(self.fh, c_char_py3k(self.caseBuffer.raw))
    
    TypeError: Use ioUtf8=True to write unicode strings [required argument is not a float]
    
  3. Albert-Jan Roskam repo owner

    Quick question: can you confirm that records is an object of type 'list' and not of 'numpy.ndarray'?

  4. gritschel reporter

    I'm sorry. Indeed, it was 'np.ndarray'. With 'list' it works. What exactly is the problem with that?

  5. Albert-Jan Roskam repo owner

    When you read the list-of-lists into a numpy array, you used a simple dtype, so all the values got "upcast" to strings (because there were strings in it). This also explains the "required argument is not a float" error. For this case you'd need a structured dtype:

    with data:
        #.... (code omitted) ...
        formats = ["S%d" % data.varTypes[v] if data.varTypes[v] else np.float64 for v in data.varNames]
        dtype = np.dtype({'names': data.varNames, 'formats': formats})
        structured_array = np.array([tuple(record) for record in records], dtype=dtype)
    

    Now you could write it back to .sav. But first creating an array would slow things down, of course. Alternatively, you could use savReaderNp to do this: with rawMode=False it does not return ISO 8601 date strings (like SavReader does) but datetime.datetime objects.
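    The upcasting described above is easy to reproduce with plain NumPy, independent of savReaderWriter; the float ends up stringified, which is exactly why the writer later complains that a required float argument is missing:

```python
import numpy as np

# Mixed float/string rows, as returned by SavReader with ioUtf8=True
records = [[1.0, "John Doe"], [2.0, "Jane Doe"]]
arr = np.array(records)

print(arr.dtype)   # a unicode string dtype: every value was upcast
print(arr[0][0])   # '1.0' -- a string now, no longer a float
```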

  6. gritschel reporter

    Thanks again for your explanations! Did you have time to look at the other questions I asked?

  7. Albert-Jan Roskam repo owner

    Which questions are still relevant? I thought that using list instead of numpy.ndarray also answered the other questions?

  8. gritschel reporter
    • I noticed that even with rawMode=False the date conversion is not always done. (Unfortunately, I can't tell you right now for which combination of OS, Python version, ... I saw this. I can probably tell you tomorrow.)

    • From my perspective it would be desirable if a readable, converted date (in savReaderWriter) were automatically converted back to Gregorian seconds when exporting to a .sav file again. Otherwise the dates are simply lost, or one has to convert them all back manually.

    • Actually, I would even prefer an additional keyword argument just for the date conversion, not only the all-or-nothing rawMode=True. This gives more control and should be easy to implement, I hope.

    Cheers, Gerhard

  9. Albert-Jan Roskam repo owner

    I'd love to hear more about the first point! Are you sure those were not missing date values ($sysmis)? Yeah, rawMode is indeed a bit all-or-nothing. Converting things manually is easy-peasy, though:

    with SavReader(f) as reader:
        with SavWriter(**kwargs) as writer:
            for record in reader:
                record[0] = writer.spssDateTime(record[0], '%Y-%m-%d')  # datetime in first column
                writer.writerow(record)
    
  10. gritschel reporter

    Hi AJ, I recognized the date issue in python3 (64bit) under Windows with ioUtf8=True and rawMode=False. Here's the test code ...

    from __future__ import division, print_function
    #from __future__ import absolute_import, unicode_literals
    
    import numpy as np
    import os
    import sys
    import savReaderWriter as sav
    
    directory = "C:\\Users\\RitschelG\\Projekte\\spss_to_pandas"
    spss_filename = r"test.sav"
    spss_file = os.path.join(directory, spss_filename)
    
    # read SPSS file data
    ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
    data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale=ioLocale, rawMode=False)
    with data:
        allData = data.all()
    
    print(allData)
    

    And here is what it gives:

    [['count', 'name', 'date', 'float', 'ümlaut', 'sex', 'job'], [1.0, 'John Doe            ', 13572748800.0, 1.23, 'ä      ', 1.0, 1.0], [2.0, 'Jane Doe            ', 13572835200.0, 4.56, 'ö      ', 0.0, 3.0], [3.0, 'Jack Doe            ', 13572921600.0, 7.89, 'ß      ', 1.0, None], [4.0, 'Jim Doe             ', 13573008000.0, 0.12, 'é      ', 9.0, 2.0]]
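    For reference, the raw date values in this output are seconds since the SPSS epoch of 1582-10-14, so plain datetime arithmetic converts them in either direction (a sketch independent of savReaderWriter; the helper names are made up):

```python
from datetime import datetime, timedelta

SPSS_EPOCH = datetime(1582, 10, 14)  # midnight, Gregorian calendar

def spss_to_datetime(seconds):
    """Convert SPSS 'Gregorian seconds' to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)

def datetime_to_spss(dt):
    """Convert a datetime back to SPSS 'Gregorian seconds'."""
    return (dt - SPSS_EPOCH).total_seconds()

print(spss_to_datetime(13572748800.0))         # first date value from the output above
print(datetime_to_spss(datetime(2012, 11, 20)))
```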
    

    Cheers, Gerhard

  11. gritschel reporter

    Hi AJ, thanks for your other answers. Do you think an additional kwarg for the date conversion would make sense in the reader part? And is there a possibility to have the date back-conversion run automatically in the writer part -- maybe also with an additional kwarg? This would increase usability for the end user, I think. Cheers, Gerhard
