N rows skipped, and then the last row is repeated N times
I noticed a problem with your parser (and also with my JS port) on some sas7bdat files. Sadly I can't share the files, because they're private data that I don't own. haven does seem to read them correctly, but I really hate R, so I don't want to dig through the internals of a compiled R package to figure out what it does differently. I figured I'd ask whether you've seen anything like this before or have any insight...
The file parses with no errors and returns the correct number of rows (about 8000). The first ~6300 rows are exactly correct. The problem is that starting around row 6300, most (but not all) rows get skipped, so by roughly row 6800 the parser has reached the last row of real data, and it then repeats that same row about 1200 times until it hits the expected row count.
I think part of the problem comes from https://bitbucket.org/jaredhobbs/sas7bdat/src/da1faa90d0b15c2c97a2a8eb86c91c58081bdd86/sas7bdat.py?fileviewer=file-view-default#sas7bdat.py-608 because that except clause silently swallows the IndexError, which allows the previous value of self.current_row to be yielded again at the end of the function. That's exactly what's happening: once the last real row is reached, an IndexError occurs on every subsequent iteration, self.current_row never changes, and the same row is yielded 1200 times. I have a bunch of SAS files I'm using for testing, and none of the ones that parse correctly trigger that exception, so maybe this path is not well tested. Do you remember what it's supposed to be doing? Maybe it would be better not to ignore that exception, so at least this error doesn't happen silently.
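To make the repeating-row behavior concrete, here's a minimal, self-contained sketch of the failure mode. RowReader, readlines, and the row layout are illustrative stand-ins, not the library's actual internals; the point is just how a silently swallowed IndexError turns a stale current_row into repeated output:

```python
class RowReader:
    def __init__(self, rows, row_count):
        self.rows = rows            # rows actually present in the pages
        self.row_count = row_count  # row count claimed by the header
        self.current_row = None

    def readlines(self):
        for i in range(self.row_count):
            try:
                self.current_row = self.rows[i]
            except IndexError:
                pass  # silently keeps the stale current_row
            yield self.current_row

reader = RowReader(rows=[["a"], ["b"], ["c"]], row_count=6)
print(list(reader.readlines()))
# [['a'], ['b'], ['c'], ['c'], ['c'], ['c']]  <- last row repeated
```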
But that still doesn't answer the question of why rows are being skipped or lost in the first place, which is the root of the problem: if no rows were skipped, the repeating-row scenario above would never be triggered.
As I said, I can't share the file, but I can share the metadata logged by your program, which is hopefully useful:
[sv.sas7bdat]
Header:
col_count_p1: 12
col_count_p2: 0
column_count: 12
compression: SASYZCRL
creator: None
creator_proc: SORT (2)
date_created: 2015-04-21 23:09:49.837232
date_modified: 2015-04-21 23:09:49.837232
endianess: little
file_type: DATA
filename: sv.sas7bdat
header_length: 8192
lcp: 8
lcs: 14
mix_page_row_count: 486
name: SV
os_name: x86_64
os_type: 2.6.18-238.1.1.e
page_count: 8
page_length: 131072
platform: unix
row_count: 7919
row_length: 262
sas_release: 9.0301M1
server_type: Linux
u64: True
Contents of dataset "SV":
Num Name     Type   Length Format Label
--- -------- ------ ------ ------ ------------------------------
  1 STUDYID  string     15        Study Identifier
  2 DOMAIN   string      2        Domain Abbreviation
  3 USUBJID  string     25        Unique Subject Identifier
  4 VISITNUM number      8        Visit Number
  5 VISIT    string     50        Visit Name
  6 VISITDY  number      8        Planned Study Day of Visit
  7 EPOCH    string     50        Epoch
  8 SVSTDTC  string     19        Start Date/Time of Visit
  9 SVENDTC  string     19        End Date/Time of Visit
 10 SVSTDY   number      8        Study Day of Start of Visit
 11 SVENDY   number      8        Study Day of End of Visit
 12 SVUPDES  string     50        Description of Unplanned Visit
Comments (2)
I think I may have a solution. I found that some pages of data were being skipped in the file I was loading; looking further, those pages were of type self.header.PAGE_METC_TYPE, and when readlines got to the end of the file it just kept repeating the last row until it reached row_count. The fix I applied locally was to treat pages of self.header.PAGE_METC_TYPE like self.header.PAGE_META_TYPE. After making that change the pages were no longer skipped and I was able to read the rows from them.
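For reference, here is a rough sketch of that change; the constants below are simplified stand-ins for self.header.PAGE_META_TYPE and self.header.PAGE_METC_TYPE, and the numeric values are illustrative, not necessarily the real on-disk page-type codes:

```python
PAGE_META_TYPE = 0      # plain metadata page (illustrative value)
PAGE_METC_TYPE = 16384  # compressed-metadata page (illustrative value)

def should_process_page_metadata(page_type):
    # Before the fix only PAGE_META_TYPE matched here, so METC pages
    # were skipped entirely and the reader ran out of real rows early,
    # triggering the repeated-last-row behavior described above.
    return page_type in (PAGE_META_TYPE, PAGE_METC_TYPE)

assert should_process_page_metadata(PAGE_METC_TYPE)
```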
Not sure if this is helpful, but I have a similar issue. In my case I found that the SAS file was corrupted; when I open it in SAS to export it, SAS says:
ERROR: Expecting page 2992, got page -1 instead.
ERROR: Page validation error while reading XXX.XXXX.DATA.
ERROR: File XXXX.XXXX.DATA is damaged. I/O processing did not complete.
In my local fork I just re-raise the IndexError when I encounter a file like this.
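In case it's useful to anyone else, a minimal sketch of that approach; the function and its arguments are illustrative stand-ins for the parser's real row-reading code:

```python
def next_row(rows, index, row_count):
    # Instead of silently swallowing the IndexError (which repeats the
    # stale last row), re-raise it with a message explaining that the
    # file holds fewer rows than its header claims.
    try:
        return rows[index]
    except IndexError:
        raise IndexError(
            "page data ended at row %d but the header claims %d rows; "
            "the file may be damaged" % (index, row_count)
        )
```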