Only first 13,567 records extracted, then everything's zero/null/empty
I have a *.sas7bdat file from a customer that's pretty big: 29 GB in its original form, containing 2,528,692 records. When I process it, I get good data for the first 13,567 rows, but all values in the remaining 2,515,125 rows are zeros / nulls / empty strings. The column types do look correct, though. My processing script is as follows:
from sas7bdat import SAS7BDAT
import sys
import csv

# Stream every row of the SAS file to stdout as CSV
csvwriter = csv.writer(sys.stdout)
with SAS7BDAT(sys.argv[1]) as f:
    for row in f:
        csvwriter.writerow(row)
The only warning/error I get when processing is a single instance of this:
[bmsdata_2015_06_25.sas7bdat] column count mismatch
Does this ring any bells? Anything significant about 13,567? It's not close to a power of 2 or anything I can think of. I asked the customer to re-confirm there's data all the way through the file and they say it looks fine.
Comments (5)
-
Account Deleted -
Old issue, but if you're still there, is the file compressed? What is the value of f.properties.compression?
We just fixed a bug that affected RLE compressed files (compression = SASYZCRL), and there are known issues with RDC compression (compression = SASYZCR2) that I think I can fix in a few weeks.
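To make the commenter's suggestion concrete, here is a minimal sketch of that check, using the filename from the warning above and the f.properties.compression attribute the commenter names; the import and path are guarded so the snippet can be pasted anywhere without failing:

```python
import os

try:
    from sas7bdat import SAS7BDAT
except ImportError:
    SAS7BDAT = None  # sas7bdat package not installed

# Filename taken from the warning in the issue; adjust to your path.
path = "bmsdata_2015_06_25.sas7bdat"

if SAS7BDAT is not None and os.path.exists(path):
    with SAS7BDAT(path) as f:
        # SASYZCRL indicates RLE compression; SASYZCR2 indicates RDC.
        print(f.properties.compression)
```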
-
Hi @kshedden - did you ever get around to looking at the issues with RDC compression? I'm using version 2.0.7, and the decompress_row method is raising errors on my file about unknown markers (6, 7, 9, 10, 11, etc.).
-
I ported this over to pandas ... have you tried the pandas version (pandas.read_sas)? I added a few RDC codes there, but I think more are still missing.
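For reference, a hedged sketch of what the pandas route might look like for the original script, assuming the filename from the warning above; chunksize keeps memory bounded so the 29 GB file never has to fit in RAM at once:

```python
import os
import sys

try:
    import pandas as pd
except ImportError:
    pd = None  # pandas not installed

# Filename taken from the warning in the issue; adjust to your path.
path = "bmsdata_2015_06_25.sas7bdat"

if pd is not None and os.path.exists(path):
    # read_sas with chunksize yields DataFrames of at most 100000 rows each
    for chunk in pd.read_sas(path, format="sas7bdat", chunksize=100000):
        chunk.to_csv(sys.stdout, header=False, index=False)
```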
Can you share the data?
Other options are wizard and https://github.com/kshedden/datareader
Kerby
-
Kerby - thanks much!! pandas.read_sas worked well.
The data was here:
Just realized that https://pyhacker.com/pages/sas7bdat.html says Python 2.7+ is required, but https://bitbucket.org/jaredhobbs/sas7bdat says only 2.6+ is required. I'm using 2.6. Might that explain it?