Adding support for numpy.nan as sysmis

Issue #25 resolved
NoRecces created an issue

I think it would be great to support not only None as a missing value but also numpy.nan.

For now I have to replace numpy.nan with None in every record that I write with SavWriter.
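A minimal sketch of that workaround, assuming each record is a plain list of Python values. The helper name `nan_to_none` is hypothetical and not part of savReaderWriter:

```python
import math

def nan_to_none(record):
    """Return a copy of record with float NaN values replaced by None."""
    # np.nan is a Python float, and numpy.float64 subclasses float,
    # so the isinstance check also covers numpy scalars.
    return [None if isinstance(v, float) and math.isnan(v) else v
            for v in record]

records = [[1.0], [2.0], [float("nan")]]
cleaned = [nan_to_none(r) for r in records]
print(cleaned)  # [[1.0], [2.0], [None]]
```

Each cleaned record could then be passed to writer.writerow as in the test case.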

Test case is

from savReaderWriter import SavWriter
import numpy as np

def main():
    test_array = np.array([1,2,3,4,5,6, np.nan])
    with SavWriter(savFileName='/tmp/test_base.sav',
                   varNames=['a'],
                   varTypes={'a': 0},
                   ioUtf8=True) as writer:

        for record in test_array:
            writer.writerow([record])

    return 'done'


if __name__ == '__main__':
    main()

Comments (8)

  1. Albert-Jan Roskam repo owner

    Hi,

    What version of savReaderWriter and Python are you using?

    >>> import sys,  savReaderWriter as rw
    >>> sys.version_info, rw.__version__
    (sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0),
     '3.3.0')
    

    The code you gave runs without errors on Windows 7 32 bit. In SPSS, the np.nan shows up as $sysmis (blank) in the data editor.

    Best wishes, Albert-Jan

  2. NoRecces reporter

    Hi Albert-Jan,

    >>> import sys,  savReaderWriter as rw
    >>> sys.version_info, rw.__version__
    (sys.version_info(major=2, minor=7, micro=6, releaselevel='final', serial=0), '3.3.0')
    
    >>> import platform
    >>> platform.platform()
    'Linux-3.13.0-32-generic-x86_64-with-Ubuntu-14.04-trusty'
    

    Could you please show the descriptive statistics (frequencies) of variable 'a'? I also see np.nan as sysmis when I look at the data editor: 09.png. (I run the Python code on the platform described above, but I have SPSS installed only on Windows (win-x64, SPSS v22), so I have to copy test_base.sav to Windows and open it there.) In fact, however, it is not $sysmis, because

    fre a.
    

    in the syntax editor outputs 06.png

    but I expect: 08.png

    As far as I can see, None and np.nan are processed differently.

  3. Albert-Jan Roskam repo owner

    Hi,

    Ah, now I see what you mean. That's annoying indeed. It might be nice to have a parameter similar to recodeSysmisTo in SavReader. Converting np.nan to sysmis is simple but quite expensive, because you would need to check every value. I will keep this issue open. Meanwhile (you probably are doing this already) you could try something like:

    In [1]: import numpy as np, savReaderWriter as rw
    In [2]: arr = np.array([np.nan, 1, np.nan, 666]).reshape(4, 1)
    In [3]: with rw.SavWriter("somefile.sav", ["v1"], {"v1": 0}) as writer:
       .....:     arr[:] = np.where(np.isnan(arr), writer.sysmis, arr)
       .....:     writer.writerows(arr.tolist())
       .....:
    

    Best wishes, Albert-Jan

  4. Albert-Jan Roskam repo owner

    hmmm, come to think of it: it would be a very small effort to add a method writearray which --yes-- writes an array, with nan values converted into SPSS $sysmis.

        def writearray(self, array):
            """Write a numpy array to a .sav, converting NaN to $sysmis"""
            # keep the result of np.where; the input array is left untouched
            converted = np.where(np.isnan(array), self.sysmis, array)
            for record in converted.tolist():
                self._pyWriterow(record)
    
  5. NoRecces reporter

    Hi @fomcl, I think a new method for one particular data type makes the user API more complicated. What if writerows were smarter and type-aware?

    def writerows(self, records):
        """Write all records; NaN values in numpy arrays become sysmis."""
        if not isinstance(records, (tuple, list, np.ndarray)):
            raise TypeError('records instance type must be one of list, tuple, numpy.ndarray but got %s' % (type(records), ))
        if isinstance(records, np.ndarray):
            # replace NaN with sysmis, then convert to nested lists
            records = np.where(np.isnan(records), self.sysmis, records).tolist()
        for record in records:
            self.writerow(record)
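The type-aware idea can also be tried outside the writer class. The following is only a sketch under stated assumptions: `normalize_records` is a hypothetical helper, `SYSMIS` is a placeholder standing in for writer.sysmis, and array inputs are detected by duck typing on `tolist()` so that the example does not need numpy installed.

```python
import math

# Placeholder for writer.sysmis; in real use, pass the writer's own value.
SYSMIS = -1.7976931348623157e308

def normalize_records(records, sysmis=SYSMIS):
    """Yield each record as a list with NaN values replaced by sysmis."""
    # numpy arrays expose tolist(); duck typing keeps numpy optional here
    if hasattr(records, "tolist"):
        records = records.tolist()
    if not isinstance(records, (list, tuple)):
        raise TypeError("records must be a list, tuple or array, got %s"
                        % type(records).__name__)
    for record in records:
        yield [sysmis if isinstance(v, float) and math.isnan(v) else v
               for v in record]

rows = list(normalize_records([[1.0, float("nan")], [2.0, 3.0]]))
```

Because this is a generator, the type check only runs once iteration starts. A writerows implementation that wants to fail early would need to validate before looping.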
    
  6. NoRecces reporter

    @fomcl Nice commit!

    I've also noticed another bug (or maybe it is a feature?).

    This test passes:

    import savReaderWriter as srw

    args = ( ["v1", "v2"], dict(v1=0, v2=0) )
    desired = [[1.0, 1.0], [1.0, 1.0]]
    def test_writerows_str():
        records = ['11', '11']
        savFileName = "output_regular.sav"
        with srw.SavWriter(savFileName, *args) as writer:
            writer.writerows(records)
        with srw.SavReader(savFileName) as reader:
            actual = reader.all()
        assert actual == desired, actual
    

    Maybe it's better for writerows to check that isinstance(records, collections.Iterable) holds and that the first record is iterable too?
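A rough sketch of such a check, written for Python 3 (where the ABC lives in collections.abc) and using a hypothetical helper name, `check_records`. Note that a plain Iterable test would not be enough on its own: strings are iterable, so '11' would still be accepted as a record unless strings are excluded explicitly.

```python
from collections.abc import Iterable

def check_records(records):
    """Return records as a list, or raise TypeError if the shape is wrong."""
    def is_row_like(obj):
        # strings and bytes are iterable, but are not valid rows here
        return isinstance(obj, Iterable) and not isinstance(obj, (str, bytes))

    if not is_row_like(records):
        raise TypeError("records must be an iterable of records")
    records = list(records)
    # only the first record is inspected, as suggested above
    if records and not is_row_like(records[0]):
        raise TypeError("each record must be an iterable of values, got %r"
                        % (records[0],))
    return records
```

With this check, `check_records(['11', '11'])` raises TypeError instead of silently writing two rows of 1.0 values.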
