Need to determine original string encoding

Hi,

I'm having string encoding problems when loading and then saving SAS data, like so:

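        import json
        from sas7bdat import SAS7BDAT
        with SAS7BDAT(fname) as sas:
            # collect the name and label of every column
            fields = [(col.name, col.label) for col in sas.columns]
        json.dumps(fields, encoding='utf8')  # Python 2 json; this is the call that fails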

At this point I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 19: invalid start byte

Using:

        import chardet
        encoding = chardet.detect(theString)

I find that SAS7BDAT() gave me a 'windows-1252' encoded Python str (confidence 0.5). The string was something like str("BLAHBLAHBLAH\xa9BLAHBLAHBLAH"), i.e. a byte string rather than a unicode object.

So, I suspect that the extracted column names or labels have a non-ASCII, non-UTF-8 encoding. (My assumption is that all ASCII is also valid UTF-8, so since decoding as UTF-8 failed, the input was neither ASCII nor UTF-8.)
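To illustrate that reasoning, here is a minimal sketch using only the standard library. The byte string is the made-up placeholder from above, not the real column label, and the windows-1252 decode only reflects chardet's guess.

```python
raw = b"BLAHBLAHBLAH\xa9BLAHBLAHBLAH"  # placeholder label from above, as raw bytes

# Pure ASCII bytes decode identically as ASCII and as UTF-8,
# so ASCII-only input could not have triggered the error.
ascii_part = b"BLAHBLAHBLAH"
same = ascii_part.decode("ascii") == ascii_part.decode("utf-8")

# 0xa9 on its own is not a valid UTF-8 start byte, so decoding fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as err:
    utf8_ok = False
    print(err)

# Under windows-1252 (chardet's guess) the same byte is the copyright sign.
text = raw.decode("windows-1252")
print(repr(text))
```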

Looking at sas.header, I don't see an encoding listed in the properties, but it seems likely that SAS7BDAT supports outputting various encodings and that the encoding is held in the header somewhere: http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002601944.htm (?)

Generally, it's not safe to guess an original string encoding.
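As a small sketch of why guessing is risky: the byte that UTF-8 rejected decodes without error under several single-byte codecs, but to different characters. The three codecs below are arbitrary picks from the Python standard library, chosen only for illustration.

```python
raw = b"\xa9"  # the byte that UTF-8 rejected

# Each of these single-byte codecs accepts the byte without complaint.
candidates = ["windows-1252", "cp437", "iso8859_5"]
decoded = {name: raw.decode(name) for name in candidates}

for name, char in decoded.items():
    print(name, hex(ord(char)))

# No error is raised, yet the results disagree,
# so a successful decode does not prove the guess was right.
ambiguous = len(set(decoded.values())) > 1
```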

So, is it possible to:

find the string encodings that I'll receive
or (better) supply a destination encoding to: SAS7BDAT(fname, encode_as='utf8') so that all of the strings it extracts are recoded as specified? If the SAS7BDAT header doesn't indicate an input encoding it would need to be specified: SAS7BDAT(fname, encode_as="utf8", expect_encoding="windows-1252")
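A rough sketch of what the second option would do to each extracted string. The helper recode() is hypothetical: its parameter names just mirror the keyword arguments proposed above, and nothing here is part of the current sas7bdat API.

```python
def recode(raw, expect_encoding, encode_as="utf8"):
    """Decode raw bytes with the expected source encoding, then re-encode.

    Hypothetical helper, not part of sas7bdat; the argument names
    mirror the proposed encode_as / expect_encoding keywords.
    """
    return raw.decode(expect_encoding).encode(encode_as)


label = b"BLAHBLAHBLAH\xa9BLAHBLAHBLAH"  # placeholder label bytes
fixed = recode(label, expect_encoding="windows-1252", encode_as="utf8")

# The recoded bytes are now valid UTF-8, so this no longer raises.
print(repr(fixed.decode("utf8")))
```

If the header does turn out to carry the encoding, expect_encoding could default to that value and only be supplied as an override.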

If I'm on the right track, I don't mind looking into patching this.

I'm using today's version of the master: Jan 6 2016, @ da1faa90d0b15c2c97a2a8eb86c91c58081bdd86.

(Also, thanks for maintaining this library -- it is extremely useful.)

Comments (5)