Reading a .sav file and subsequently writing the unmodified data should produce the same file

Issue #29 new
gritschel created an issue

I think that the srw module should produce an identical output file if data (plus metadata) that were read are written again without further modification. This should also be a standard test for srw, in my opinion.

I have tried the following:

import numpy as np
import savReaderWriter as sav

# reading data from SPSS file
data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale="german")
with data:
    allData = data.all()
allData = np.array(allData)  # in the most recent version one can directly read to numpy arrays
variables = allData[0]
records = allData[1:]
# reading metadata from SPSS file
with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale="german") as header:
    metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?
# writing unmodified data
with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale="german",
                   mode=b'wb', refSavFileName=None, **metadata) as writer:
    for i, record in enumerate(records):
        writer.writerow(record)

Currently, when trying to run the above code, I get the following error, which I was not able to trace back. (Note that the values in metadata['alignments'] are, of course, all in ['left', 'right', 'center'].)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-b3fa0cebce1a> in <module>()
      1 spss_file_out = os.path.join(directory, 'out.sav')
      2 with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale="german",
----> 3                    mode=b'wb', refSavFileName=None, **metadata) as writer:
      4     for i, record in enumerate(records):
      5         if i < 5: print(record)

C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in __init__(self, savFileName, varNames, varTypes, valueLabels, varLabels, formats, missingValues, measureLevels, columnWidths, alignments, varSets, varRoles, varAttributes, fileAttributes, fileLabel, multRespDefs, caseWeightVar, overwrite, ioUtf8, ioLocale, mode, refSavFileName)
    186             self.measureLevels = measureLevels
    187             self.columnWidths = columnWidths
--> 188             self.alignments = alignments
    189             self.varSets = varSets
    190             self.varRoles = varRoles

C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\header.py in alignments(self, varAlignments)
    757             if varAlignment.lower() not in alignments:
    758                 ukeys = b", ".join(alignments.keys()).decode()
--> 759                 raise ValueError("Valid alignments are %s" % ukeys)
    760             alignment = alignments.get(varAlignment.lower())
    761             retcode = func(self.fh, c_char_py3k(varName), alignment)

ValueError: Valid alignments are right, center, left
  • I noticed in the 3.3.0 download version (i.e. long before the update in the master channel, yesterday) that my import/export example from above works fine. An issue there was that dates were read and converted from Gregorian seconds to proper dates during the import, but they disappeared in the export. This should, obviously, not be the case.

  • Furthermore, it seems that string variables were three times as long in the output file as in the input file. This is also not desirable.

  • Another issue I noticed is that reading the metadata from an SPSS file (with the code above) seems to take much longer than reading the actual data. Is there an explanation for this behaviour? (I noticed this with a pretty large file with hundreds of variables.)

Cheers, Gerhard

I am currently working with python3 on Windows (64bit).

Comments (13)

  1. Albert-Jan Roskam repo owner

    Hi again Gerhard,

    I noticed in the 3.3.0 download version (i.e. long before the update in the master channel, yesterday) that my import/export example from above works fine. An issue there was that dates were read and converted from Gregorian seconds to proper dates during the import, but they disappeared in the export. This should, obviously, not be the case. --> use rawMode=True to prevent automatic date conversion. This also means that sysmis values are not converted into None, and that string values are still padded ("ceiled") to multiples of 8 bytes (e.g. an A20 variable is 24 bytes long).
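    The "ceiling" can be sketched in a couple of lines of plain Python (not savReaderWriter code; ceil8 is a made-up helper name):

```python
def ceil8(n):
    """Round n up to the next multiple of 8, mimicking how rawMode
    leaves string values padded to the record buffer width."""
    return -(-n // 8) * 8

print(ceil8(20))  # an A20 variable occupies 24 bytes
print(ceil8(16))  # already a multiple of 8: stays 16
```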

    Furthermore, it seems that string variables were three times as long in the output file as in the input file. This is also not desirable. --> I can't reproduce this with savReaderWriter. But with SPSS, if you open a file in unicode mode (SET UNICODE=ON, the default since SPSS v21) that was created in codepage mode (SET UNICODE=OFF), the number of bytes of string variables will be tripled. This is done because the codepage file could theoretically contain accented chars that become 3-byte chars in utf-8 (I guess we're out of luck with east asian codepages then). See: http://www-01.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/faq_unicode.htm
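    The byte arithmetic behind that tripling can be checked directly in Python: every cp1252 character occupies a single byte, while its utf-8 encoding may need up to three bytes, so SPSS reserves three bytes per codepage byte to be safe:

```python
# One byte per char in cp1252; up to three bytes per char in utf-8
for ch in "a\u00e4\u20ac":  # 'a', 'ä', '€'
    print(ch, len(ch.encode("cp1252")), len(ch.encode("utf-8")))
```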

    Another issue I noticed is that reading the metadata from an SPSS file (with the code above) seems to take much longer than reading the actual data. Is there an explanation for this behaviour? (I noticed this with a pretty large file with hundreds of variables.) --> Explanation: (1) there are lots and lots of different kinds of metadata, and many of them are stored in arrays. Better to only fetch what you need, if possible. (2) With ioUtf8=True things get worse, as all bytestrings need to be transcoded into ustrings. (3) Some profiling wouldn't hurt. It was a lot of work to make the code Python 2/3 compatible; premature optimization...
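    As a starting point for such profiling, the stdlib cProfile module can wrap the metadata read; read_metadata below is only a stand-in for the SavHeaderReader call, not library code:

```python
import cProfile
import io
import pstats

def read_metadata():
    # Stand-in for header.dataDictionary(); replace with the real call
    return {key: list(range(1000)) for key in ("varNames", "varLabels", "formats")}

pr = cProfile.Profile()
pr.enable()
read_metadata()
pr.disable()

out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())  # shows which calls the time is actually spent in
```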

    Anyway, I modified the code a little so your (slightly modified) code below works. I have yet to write unittests. See commit 4cd6db34d3.

    from __future__ import print_function
    import sys
    import pprint
    import numpy as np
    import savReaderWriter as sav
    
    spss_file = "./test_data/gerhard.sav"
    spss_file_out = "./test_data/gerhard_out.sav"
    
    ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
    
    data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale=ioLocale, rawMode=True)
    with data:
        allData = data.all()
        variables = allData[0]
        records = allData[1:]
        print(data.varTypes["name"] == len(records[0][1]))   # 24 while it should be 20! -->> rawMode: strings are ceiled multiples of 8
    
    allDataArray = np.array(records)  # in the most recent version one can directly read to numpy arrays
    print(records)
    
    # reading metadata from SPSS file
    with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale=ioLocale) as header:
        metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?
    
    pprint.pprint(metadata)
    
    # writing unmodified data
    with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale=ioLocale,
                       mode=b'wb', refSavFileName=None, **metadata) as writer:
        for i, record in enumerate(records):
            writer.writerow(record)
    
  2. gritschel reporter

    Hi AJ,

    thanks again for your quick responses!

    • use rawMode=True to prevent automatic date conversion

      • In the current version, no date conversion is done even with rawMode=False. I don't know why. (Wouldn't it be clever to have an additional keyword argument for the date conversion?)
      • Second, with rawMode=True, reading the file seems to take much longer than with rawMode=False. What's the reason for this?
    • I can't reproduce this with savReaderWriter.

      • You are right, it seems that this doesn't happen with the current version anymore. And thanks for all the other information!
    • There are lots and lots of different kinds of metadata, many of them are stored in arrays. Better to only fetch what you need, if possible.

      • I would rather read all of it, since I want to check whether I can produce an output file with the same content as the input file.
    • Anyway, I modified the code a little so your (slighly modified) code below works. I have yet to write unittests.

      • No, it doesn't. Here is what I get:
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-25-fb88eac8e98e> in <module>()
          4     for i, record in enumerate(records):
          5         #if i < 5: print(record)
    ----> 6         writer.writerow(record)
    
    C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in writerow(self, record)
        332             cWriterow(self, record)
        333             return
    --> 334         self._pyWriterow(record)
        335 
        336     def writerows(self, records):
    
    C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in _pyWriterow(self, record)
        325                     value = value.encode("utf-8")
        326             record[i] = value
    --> 327         self.record = record
        328 
        329     def writerow(self, record):
    
    C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\generic.py in record(self, record)
        522         except struct.error:
        523             msg = "Use ioUtf8=True to write unicode strings [%s]"
    --> 524             raise TypeError(msg % sys.exc_info()[1])
        525         self.wholeCaseOut.argtypes = [c_int, c_char_p]
        526         retcode = self.wholeCaseOut(self.fh, c_char_py3k(self.caseBuffer.raw))
    
    TypeError: Use ioUtf8=True to write unicode strings [required argument is not a float]
    
  3. Albert-Jan Roskam repo owner

    Quick question: can you confirm that records is an object of type 'list' and not of 'numpy.ndarray'?

  4. gritschel reporter

    I'm sorry. Indeed, it was 'np.ndarray'. With 'list' it works. What exactly is the problem with that?

  5. Albert-Jan Roskam repo owner

    When you read the list-of-lists into a numpy array, you used a simple dtype, so all the values got "upcast" to strings (because there were strings in it). This also explains the "required argument is not a float" error. For this case you'd need a structured dtype:

    with data:
        #.... (code omitted) ...
        formats = ["S%d" % data.varTypes[v] if data.varTypes[v] else np.float64 for v in data.varNames]
        dtype = np.dtype({'names': data.varNames, 'formats': formats})
        structured_array = np.array([tuple(record) for record in records], dtype=dtype)
    

    Now you could write it back to .sav. But first creating an array would slow things down, of course. Alternatively, you could use savReaderNp to do this: with rawMode=False it does not return ISO 8601 date strings (like SavReader does) but datetime.datetime objects.
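    The upcasting described above is easy to reproduce with plain NumPy, independent of savReaderWriter; the float ends up stringified, which is exactly why the writer later complains that a required float argument is missing:

```python
import numpy as np

# Mixed float/string rows, as returned by SavReader with ioUtf8=True
records = [[1.0, "John Doe"], [2.0, "Jane Doe"]]
arr = np.array(records)

print(arr.dtype)   # a unicode string dtype: every value was upcast
print(arr[0][0])   # '1.0' -- a string now, no longer a float
```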

  6. gritschel reporter

    Thanks again for your explanations! Did you have time to look at the other questions I asked?

  7. Albert-Jan Roskam repo owner

    Which questions are still relevant? I thought that using list instead of numpy.ndarray also answered the other questions?

  8. gritschel reporter
    • I noticed that even with rawMode=False the date conversion is not always done. (Unfortunately, I can't tell you right now for which combination of OS, Python version, ... I saw this. I can probably tell you tomorrow.)

    • From my perspective it would be desirable if a readable, converted date (in savReaderWriter) were automatically converted back to Gregorian seconds when exporting to a .sav file again. Otherwise the dates are simply lost, or one has to convert them all back manually.

    • Actually, I would even prefer an additional keyword argument just for the date conversion, not only the all-or-nothing rawMode=True. This gives more control and should be easy to implement, I hope.

    Cheers, Gerhard

  9. Albert-Jan Roskam repo owner

    I'd love to hear more about the first point! Are you sure those were not missing date values ($sysmis)? Yeah, rawMode is indeed a bit all-or-nothing. Converting things manually is easy-peasy, though:

    with SavReader(f) as reader:
        with SavWriter(**kwargs) as writer:
            for record in reader:
                record[0] = writer.spssDateTime(record[0], '%Y-%m-%d')  # datetime in first column
                writer.writerow(record)
    
  10. gritschel reporter

    Hi AJ, I recognized the date issue in python3 (64bit) under Windows with ioUtf8=True and rawMode=False. Here's the test code ...

    from __future__ import division, print_function
    #from __future__ import absolute_import, unicode_literals
    
    import numpy as np
    import os
    import sys
    import savReaderWriter as sav
    
    directory = "C:\\Users\\RitschelG\\Projekte\\spss_to_pandas"
    spss_filename = r"test.sav"
    spss_file = os.path.join(directory, spss_filename)
    
    # read SPSS file data
    ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
    data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale=ioLocale, rawMode=False)
    with data:
        allData = data.all()
    
    print(allData)
    

    And here is what it gives:

    [['count', 'name', 'date', 'float', 'ümlaut', 'sex', 'job'], [1.0, 'John Doe            ', 13572748800.0, 1.23, 'ä      ', 1.0, 1.0], [2.0, 'Jane Doe            ', 13572835200.0, 4.56, 'ö      ', 0.0, 3.0], [3.0, 'Jack Doe            ', 13572921600.0, 7.89, 'ß      ', 1.0, None], [4.0, 'Jim Doe             ', 13573008000.0, 0.12, 'é      ', 9.0, 2.0]]
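    For reference, the raw date values in this output are seconds since the SPSS epoch of 1582-10-14, so plain datetime arithmetic converts them in either direction (a sketch independent of savReaderWriter; the helper names are made up):

```python
from datetime import datetime, timedelta

SPSS_EPOCH = datetime(1582, 10, 14)  # midnight, Gregorian calendar

def spss_to_datetime(seconds):
    """Convert SPSS 'Gregorian seconds' to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)

def datetime_to_spss(dt):
    """Convert a datetime back to SPSS 'Gregorian seconds'."""
    return (dt - SPSS_EPOCH).total_seconds()

print(spss_to_datetime(13572748800.0))         # first date value from the output above
print(datetime_to_spss(datetime(2012, 11, 20)))
```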
    

    Cheers, Gerhard

  11. gritschel reporter

    Hi AJ, thanks for your other answers. Do you think an additional kwarg for the date conversion would make sense in the reader part? And is there a possibility to have the date back-conversion run automatically in the writer part -- maybe also with an additional kwarg? This would increase usability for the end user, I think. Cheers, Gerhard
