Reading a sav file and subsequently writing the unmodified data should produce the same file
I think that the srw module should produce the same output file if the data (plus metadata) that were read are written again without further modification. This should also be a standard test for srw, in my opinion.
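A minimal building block for such a round-trip test could be a byte-for-byte file comparison. The helper below is my own sketch, not part of savReaderWriter; note that a byte-identical check may even be too strict in practice, since the .sav header can embed things like a creation timestamp, in which case comparing the parsed data and metadata dictionaries would be more robust:

```python
import filecmp
import os
import tempfile

def files_identical(path_a, path_b):
    """Byte-for-byte comparison of two files (shallow=False compares contents)."""
    return filecmp.cmp(path_a, path_b, shallow=False)

# throwaway demonstration: two files with identical bytes compare equal
with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a.sav")
    b = os.path.join(tmp, "b.sav")
    for path in (a, b):
        with open(path, "wb") as f:
            f.write(b"$FL2-demo-bytes")  # placeholder content, not a real .sav
    print(files_identical(a, b))  # True
```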
I have tried the following:
    # reading data from SPSS file
    data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True, ioLocale="german")
    with data:
        allData = data.all()
    allData = np.array(allData)  # in the most recent version one can directly read into numpy arrays
    variables = allData[0]
    records = allData[1:]

    # reading metadata from SPSS file
    with sav.SavHeaderReader(spss_file, ioUtf8=True, ioLocale="german") as header:
        metadata = header.dataDictionary(asNamedtuple=False)  # Why does this take so long?

    # writing unmodified data
    with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale="german",
                       mode=b'wb', refSavFileName=None, **metadata) as writer:
        for i, record in enumerate(records):
            writer.writerow(record)
Currently, when trying to run the above code, I get the following error, which I was not able to trace back. (Note that the values in metadata['alignments'] are, of course, all in ['left', 'right', 'center'].)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-12-b3fa0cebce1a> in <module>()
1 spss_file_out = os.path.join(directory, 'out.sav')
2 with sav.SavWriter(spss_file_out, overwrite=True, ioUtf8=True, ioLocale="german",
----> 3 mode=b'wb', refSavFileName=None, **metadata) as writer:
4 for i, record in enumerate(records):
5 if i < 5: print(record)
C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in __init__(self, savFileName, varNames, varTypes, valueLabels, varLabels, formats, missingValues, measureLevels, columnWidths, alignments, varSets, varRoles, varAttributes, fileAttributes, fileLabel, multRespDefs, caseWeightVar, overwrite, ioUtf8, ioLocale, mode, refSavFileName)
186 self.measureLevels = measureLevels
187 self.columnWidths = columnWidths
--> 188 self.alignments = alignments
189 self.varSets = varSets
190 self.varRoles = varRoles
C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\header.py in alignments(self, varAlignments)
757 if varAlignment.lower() not in alignments:
758 ukeys = b", ".join(alignments.keys()).decode()
--> 759 raise ValueError("Valid alignments are %s" % ukeys)
760 alignment = alignments.get(varAlignment.lower())
761 retcode = func(self.fh, c_char_py3k(varName), alignment)
ValueError: Valid alignments are right, center, left
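The odd part is that the error claims the alignments are invalid even though they are all valid names. That pattern smells like a bytes-vs-str mismatch in Python 3: a str key is never found in a dict keyed by bytes. A minimal sketch of that failure mode (the dict below is hypothetical, not savReaderWriter's actual internals):

```python
# Hypothetical illustration, NOT savReaderWriter's real code: a lookup table
# keyed by bytes (as in Python-2-era code) never matches a Python 3 str key.
alignments = {b"left": 0, b"right": 1, b"center": 2}

value = "left"  # metadata read back with ioUtf8=True yields str, not bytes

print(value in alignments)                  # False: 'left' != b'left'
print(value.encode("utf-8") in alignments)  # True once encoded back to bytes
```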
- I noticed in the 3.3.0 download version (i.e. long before yesterday's update in the master channel) that my import/export example from above works fine. An issue there was that dates were read and converted from Gregorian seconds to 'proper dates' fine during the import, but they disappeared in the export. This should, obviously, not be the case.
- Furthermore, it seems that string variables had three times their length in the input file after exporting to the output file. This is also not desirable.
- Another issue I noticed is that reading the metadata from an SPSS file (with the code above) seems to take much longer than reading the actual data. Is there an explanation for this behaviour? (I noticed this with a pretty large file with hundreds of variables.)
Cheers, Gerhard
I am currently working with Python 3 on Windows (64-bit).
Comments (13)
repo owner Fix for Python 3.4 problem copying metadata in unicode mode, see issue #29
→ <<cset 4cd6db34d3ab>>
reporter Hi AJ,
thanks again for your quick responses!
> use `rawMode=True` to prevent automatic date conversion

- In the current version, no date conversion is done even if `rawMode=False`. I don't know why. (Wouldn't it be clever to have an additional keyword argument for the date conversion?)
- Second, with `rawMode=True` reading the file seems to take much longer than with `rawMode=False`. What's the reason for this?
> I can't reproduce this with savReaderWriter.

You are right, it seems that it doesn't happen with the current version anymore. And thanks for all the other information!
> There are lots and lots of different kinds of metadata, many of them are stored in arrays. Better to only fetch what you need, if possible.

I would rather read all of it, since I want to check whether I can produce an output file with the same content as in the input file.
> Anyway, I modified the code a little so your (slightly modified) code below works. I have yet to write unittests.

No, it doesn't. Here is what I get:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-25-fb88eac8e98e> in <module>()
4 for i, record in enumerate(records):
5 #if i < 5: print(record)
----> 6 writer.writerow(record)
C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in writerow(self, record)
332 cWriterow(self, record)
333 return
--> 334 self._pyWriterow(record)
335
336 def writerows(self, records):
C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\savWriter.py in _pyWriterow(self, record)
325 value = value.encode("utf-8")
326 record[i] = value
--> 327 self.record = record
328
329 def writerow(self, record):
C:\...\lib\site-packages\savreaderwriter-3.3.0-py3.4.egg\savReaderWriter\generic.py in record(self, record)
522 except struct.error:
523 msg = "Use ioUtf8=True to write unicode strings [%s]"
--> 524 raise TypeError(msg % sys.exc_info()[1])
525 self.wholeCaseOut.argtypes = [c_int, c_char_p]
526 retcode = self.wholeCaseOut(self.fh, c_char_py3k(self.caseBuffer.raw))
TypeError: Use ioUtf8=True to write unicode strings [required argument is not a float]
repo owner Quick question: can you confirm that records is an object of type 'list' and not of 'numpy.ndarray'?
reporter I'm sorry, indeed it was 'np.ndarray'. With 'list' it works. What exactly is the problem with that?
repo owner When you read the list-of-lists into a numpy array, you used a simple dtype, so all the values got "upcast" to strings (because there were strings among them). This also explains the "required argument is not a float" error. For this case you'd need a structured dtype:

    with data:
        # ... (code omitted) ...
        formats = ["S%d" % data.varTypes[v] if data.varTypes[v] else np.float64
                   for v in data.varNames]
        dtype = np.dtype({'names': data.varNames, 'formats': formats})
        structured_array = np.array([tuple(record) for record in records], dtype=dtype)

Now you could write it back to .sav. But first creating an array would slow things down, of course. In addition, you could use `savReaderNp` to do this: if `rawMode=False` it does not return ISO 8601 dates (like `SavReader` does) but `datetime.datetime` objects.
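The upcasting described here is easy to demonstrate with plain numpy; a small sketch with made-up values (the column names are illustrative, not from the original file):

```python
import numpy as np

records = [[1.0, "John Doe"], [2.0, "Jane Doe"]]

# simple dtype: numpy upcasts everything to one common type, here strings
simple = np.array(records)
print(simple.dtype.kind)  # 'U': the floats were turned into text

# structured dtype: every column keeps its own type, so floats stay floats
dtype = np.dtype({"names": ["count", "name"], "formats": [np.float64, "U8"]})
structured = np.array([tuple(r) for r in records], dtype=dtype)
print(structured["count"].dtype)  # float64 -- still numeric
```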
reporter Thanks again for your explanations! Did you have time to look at the other questions I asked?
repo owner Which questions are still relevant? I thought that using list instead of numpy.ndarray would also answer the other questions?
reporter
- I noticed that even with `rawMode=False` the date conversion is not always done. (Unfortunately, I can't tell you now for which combination of OS, Python version, ... I saw this. I can probably tell you tomorrow.)
- From my perspective it would be desirable if a readable, converted date (in savReaderWriter) were automatically converted back to Gregorian seconds when exporting again as a sav file. Otherwise the dates are simply lost, or one has to manually convert them all back.
- Actually, I would even prefer to have an additional keyword argument for the date conversion, not only the all-or-nothing `rawMode=True`. This gives more control and should be easy to implement, I hope.

Cheers, Gerhard
repo owner I'd love to hear more about the first point! Are you sure those were not missing date values ($sysmis)? Yeah, rawMode is indeed a bit all-or-nothing. Then again, converting things is easy-peasy:

    with SavReader(f) as reader:
        with SavWriter(**kwargs) as writer:
            for record in reader:
                record[0] = writer.spssDateTime(record[0], '%Y-%m-%d')  # datetime in first column
                writer.writerow(record)
reporter Hi AJ, I recognized the date issue in Python 3 (64-bit) under Windows with `ioUtf8=True` and `rawMode=False`. Here's the test code:

    from __future__ import division, print_function
    #from __future__ import absolute_import, unicode_literals
    import numpy as np
    import os
    import sys
    import savReaderWriter as sav

    directory = "C:\\Users\\RitschelG\\Projekte\\spss_to_pandas"
    spss_filename = r"test.sav"
    spss_file = os.path.join(directory, spss_filename)

    # read SPSS file data
    ioLocale = "german" if sys.platform.startswith("win") else "de_DE.cp1252"
    data = sav.SavReader(spss_file, returnHeader=True, ioUtf8=True,
                         ioLocale=ioLocale, rawMode=False)
    with data:
        allData = data.all()
    print(allData)
And here is what it gives:
[['count', 'name', 'date', 'float', 'ümlaut', 'sex', 'job'], [1.0, 'John Doe ', 13572748800.0, 1.23, 'ä ', 1.0, 1.0], [2.0, 'Jane Doe ', 13572835200.0, 4.56, 'ö ', 0.0, 3.0], [3.0, 'Jack Doe ', 13572921600.0, 7.89, 'ß ', 1.0, None], [4.0, 'Jim Doe ', 13573008000.0, 0.12, 'é ', 9.0, 2.0]]
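The large floats in the 'date' column are raw SPSS date values, i.e. seconds since October 14, 1582 (the start of the Gregorian calendar). Converting them by hand is straightforward; the helper below is my own sketch, not savReaderWriter's API:

```python
import datetime

# SPSS stores dates as seconds since the start of the Gregorian calendar
SPSS_EPOCH = datetime.datetime(1582, 10, 14)

def spss_seconds_to_datetime(seconds):
    """Convert an SPSS date value (Gregorian seconds) to a datetime object."""
    return SPSS_EPOCH + datetime.timedelta(seconds=seconds)

print(spss_seconds_to_datetime(13572748800.0))  # 2012-11-20 00:00:00
```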
Cheers, Gerhard
reporter - attached test.sav
reporter Hi AJ, thanks for your other answers. Do you think an additional kwarg for the date conversion would make sense in the reader part? And is there a possibility to have the date back-conversion run automatically in the writer part -- maybe also with an additional kwarg? This would increase usability for the end user, I think. Cheers, Gerhard
Hi again Gerhard,

> I noticed in the 3.3.0 download version (i.e. long before the update in the master channel, yesterday) that my import/export example from above works fine. An issue there was that dates were read and converted from Gregorian to 'proper dates' fine during the import, but they disappeared in the export. This should, obviously, not be the case.

--> Use `rawMode=True` to prevent automatic date conversion. This also causes sysmis values not to be converted into `None`, and string values are still ceiled multiples of 8 bytes (e.g. an A20 variable is 24 bytes long).

> Furthermore, it seems that string variables had three times the length of the respective length in the input file after exporting to the output file. This is also not desirable.

--> I can't reproduce this with savReaderWriter. But with SPSS, if you open a file in unicode mode (SET UNICODE=ON, the default since SPSS v21) that was created in codepage mode (SET UNICODE=OFF), the number of bytes of string variables will be tripled. This is done because the codepage file could theoretically contain accented chars that become 3-byte chars in utf-8 (I guess we're out of luck with east asian codepages then). See: http://www-01.ibm.com/support/knowledgecenter/SSLVMB_21.0.0/com.ibm.spss.statistics.help/faq_unicode.htm

> Another issue I noticed is that reading the metadata from an SPSS file (with the code above) seems to take much longer than reading the actual data. Is there an explanation for this behaviour? (I recognized this with a pretty large file with hundreds of variables.)

--> Explanation: (1) there are lots and lots of different kinds of metadata, many of them stored in arrays. Better to only fetch what you need, if possible. (2) With `ioUtf8=True` things get worse, as all bytestrings need to be transcoded into ustrings. (3) Some profiling wouldn't hurt. It was a lot of work to make the code Python 2/3 compatible; premature optimization...

Anyway, I modified the code a little so your (slightly modified) code below works. I have yet to write unittests. See commit 4cd6db34d3.