N rows skipped, and then the last row is repeated N times

Issue #27 new
dumbmatter created an issue

I noticed a problem with your parser (and also my JS port) in some sas7bdat files. Sadly, I can't share the files because they contain private data that I don't own. haven does seem to work correctly, but I really hate R, so I don't want to mess around with the internals of a compiled R package to figure out what the difference is. I figured I'd ask if you've seen anything like this before or have any insight...

The file parses with no errors and returns the correct number of rows. It's roughly 8000 rows, and the first ~6300 rows are exactly correct. The problem is that starting about 6300 rows in, most (but not all) rows get skipped. So by row 6800 it has hit the last row, and then it just repeats that same row 1200 times until it reaches the correct number of rows.

I think part of the problem comes from https://bitbucket.org/jaredhobbs/sas7bdat/src/da1faa90d0b15c2c97a2a8eb86c91c58081bdd86/sas7bdat.py?fileviewer=file-view-default#sas7bdat.py-608, because that allows the previous value of self.current_row to be yielded repeatedly at the end of this function. That's exactly what's happening: an IndexError occurs on every iteration after the last row is reached, and since self.current_row never changes, the same row is yielded 1200 times. I have a bunch of SAS files I'm using for testing, and none of the ones that parse correctly trigger that exception, so maybe that code path is not well tested? Do you remember what it is supposed to be doing? Maybe it would be better not to ignore that exception, so at least this failure doesn't happen silently.
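To make that concrete, here is a minimal, hypothetical sketch of the failure mode (the real readlines() in sas7bdat.py is more involved; the names and structure below are illustrative only):

    # Hypothetical sketch, not the actual sas7bdat.py code: a swallowed
    # IndexError leaves current_row unchanged, so the generator yields
    # the stale row until row_count is satisfied.
    def readlines(rows, row_count):
        current_row = None
        for _ in range(row_count):
            try:
                current_row = rows.pop(0)  # IndexError once rows run out
            except IndexError:
                pass                       # silently keeps the previous row
            yield current_row

    # With 3 real rows but row_count=5, the last row appears 3 times:
    print(list(readlines([["a"], ["b"], ["c"]], 5)))
    # [['a'], ['b'], ['c'], ['c'], ['c']]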

But that still doesn't answer the question of why rows are being skipped/lost, which is the root of the problem: if no rows were skipped/lost, the repeated-row scenario above wouldn't be triggered in the first place.

As I said, I can't share the file, but I can probably share the metadata logged by your program, which is hopefully useful:

[sv.sas7bdat]
Header:
    col_count_p1: 12
    col_count_p2: 0
    column_count: 12
    compression: SASYZCRL
    creator: None
    creator_proc: SORT (2)
    date_created: 2015-04-21 23:09:49.837232
    date_modified: 2015-04-21 23:09:49.837232
    endianess: little
    file_type: DATA
    filename: sv.sas7bdat
    header_length: 8192
    lcp: 8
    lcs: 14
    mix_page_row_count: 486
    name: SV
    os_name: x86_64
    os_type: 2.6.18-238.1.1.e
    page_count: 8
    page_length: 131072
    platform: unix
    row_count: 7919
    row_length: 262
    sas_release: 9.0301M1
    server_type: Linux
    u64: True

Contents of dataset "SV":
Num Name     Type   Length Format Label                       
--- -------- ------ ------ ------ ------------------------------
  1 STUDYID  string     15        Study Identifier            
  2 DOMAIN   string      2        Domain Abbreviation         
  3 USUBJID  string     25        Unique Subject Identifier   
  4 VISITNUM number      8        Visit Number                
  5 VISIT    string     50        Visit Name                  
  6 VISITDY  number      8        Planned Study Day of Visit  
  7 EPOCH    string     50        Epoch                       
  8 SVSTDTC  string     19        Start Date/Time of Visit    
  9 SVENDTC  string     19        End Date/Time of Visit      
 10 SVSTDY   number      8        Study Day of Start of Visit 
 11 SVENDY   number      8        Study Day of End of Visit   
 12 SVUPDES  string     50        Description of Unplanned Visit

Comments (2)

  1. Matt

    Not sure if this is helpful, but I have a similar issue. In my case I found the SAS file itself was corrupted; when opening it in SAS to export, it says:

    ERROR: Expecting page 2992, got page -1 instead.

    ERROR: Page validation error while reading XXX.XXXX.DATA.

    ERROR: File XXXX.XXXX.DATA is damaged. I/O processing did not complete.

    In my local fork I am just re-raising the IndexError when I encounter a file like this.
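    A hedged sketch of that change (the real try/except in sas7bdat.py wraps more code; this just shows the except clause being tightened):

        # Hypothetical sketch: re-raise instead of swallowing the
        # IndexError, so a damaged file fails loudly rather than
        # silently repeating its last row.
        def next_row(rows):
            try:
                return rows.pop(0)
            except IndexError:
                raise  # was: pass, which left the previous row in place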

  2. Matt

    I think I may have a solution. I found that some pages of data were being skipped in the file I was loading; looking further, they were self.header.PAGE_METC_TYPE pages. When readlines got to the end of the file, it just kept repeating the last row until it reached row_count. The fix I applied locally was to treat pages of self.header.PAGE_METC_TYPE like self.header.PAGE_META_TYPE. After making that change, the pages were no longer skipped and I was able to read the rows from them.
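
    For illustration, a hedged sketch of that change (the real page dispatch and constant values in sas7bdat.py differ; the names below just mirror the ones mentioned above):

        # Hypothetical sketch of the fix: let PAGE_METC_TYPE take the
        # same branch as PAGE_META_TYPE instead of falling through and
        # being skipped, which is what lost the rows.
        PAGE_META_TYPE = "meta"  # placeholder values; the library uses numeric codes
        PAGE_METC_TYPE = "metc"
        PAGE_DATA_TYPE = "data"

        def rows_from_page(page_type, rows):
            if page_type in (PAGE_META_TYPE, PAGE_METC_TYPE, PAGE_DATA_TYPE):
                return rows
            return []  # genuinely unknown page types are still skipped

        print(rows_from_page(PAGE_METC_TYPE, [("STUDY1", "SV")]))  # rows now returned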
