Need to determine original string encoding

Issue #22 new
Stuart Reynolds created an issue

Hi,

I'm having string encoding problems when loading and then saving sas data, like so:

  1. with SAS7BDAT(fname) as sas:
  2. Extract sas.columns[*].label/name fields
  3. save these with json.dumps( ... , encoding='utf8')

At this point I get: UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 19: invalid start byte

Using:

        import chardet
        encoding = chardet.detect(theString)

I find that SAS7BDAT() gave me a 'windows-1252' encoded python str (confidence 0.5). The string was something like: str("BLAHBLAHBLAH\xa9BLAHBLAHBLAH") (as non-unicode object).

So, I suspect that the extracted column names or labels have a non-UTF8, non-ASCII encoding (my assumption is that all ASCII is always encodable as UTF8, so since encoding to UTF8 failed, the input cannot have been ASCII or UTF8).
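
The reasoning above can be checked directly. The sketch below assumes Python 3 (the report itself appears to use Python 2 byte strings) and a made-up byte string shaped like the one quoted. The helper `try_decode` is hypothetical. Byte 0xa9 cannot start a UTF-8 sequence, but in windows-1252 it is the copyright sign.

```python
# Hypothetical label bytes, modelled on the string quoted in the report.
raw = b"BLAHBLAHBLAH\xa9BLAHBLAHBLAH"

def try_decode(data, encoding):
    """Return the decoded text, or None if `data` is not valid in `encoding`."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# 0xa9 is a continuation byte in UTF-8, so it cannot start a sequence.
as_utf8 = try_decode(raw, "utf-8")

# In windows-1252, 0xa9 maps to U+00A9 (the copyright sign).
as_cp1252 = try_decode(raw, "windows-1252")

# Pure ASCII bytes are always valid UTF-8.
as_ascii = try_decode(b"plain ascii", "utf-8")
```

A successful windows-1252 decode does not prove that windows-1252 was the original encoding; several single-byte code pages accept the same bytes.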

Looking at the sas.header, I don't see an encoding listed in the properties, but it seems likely that SAS7BDAT supports outputting various encodings and that the encodings are held in the header somewhere: http://support.sas.com/documentation/cdl/en/nlsref/61893/HTML/default/viewer.htm#a002601944.htm (?)

Generally, it's not safe to guess an original string encoding.

So, is it possible to:

  • find the string encodings that I'll receive

  • or (better) supply a destination encoding to: SAS7BDAT(fname, encode_as='utf8') so that all of the strings it extracts are recoded as specified? If the SAS7BDAT header doesn't indicate an input encoding it would need to be specified: SAS7BDAT(fname, encode_as="utf8", expect_encoding="windows-1252")
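
A minimal sketch of what the second option might do internally, assuming Python 3 and assuming the raw bytes of each name or label are available. The function `recode` and its parameters `expect_encoding` and `encode_as` are hypothetical, named after the proposal above. They are not part of the sas7bdat API.

```python
def recode(raw, expect_encoding, encode_as="utf-8", errors="strict"):
    """Decode `raw` bytes from the source encoding, then re-encode them.

    Hypothetical helper: `expect_encoding` is the encoding the file was
    written in, `encode_as` is the encoding the caller wants back.
    """
    text = raw.decode(expect_encoding, errors)   # bytes -> text
    return text.encode(encode_as, errors)        # text -> bytes

# Made-up label containing the byte 0xa9 from the error message.
raw_label = b"Copyright \xa9 2016"
utf8_label = recode(raw_label, expect_encoding="windows-1252")
```

The two parameters are kept separate because the source encoding is a property of the file, while the destination encoding is a choice made by the caller.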

If I'm on the right track, I don't mind looking into patching this.


I'm using today's version of the master: Jan 6 2016, @ da1faa90d0b15c2c97a2a8eb86c91c58081bdd86.

(Also, thanks for maintaining this library -- it is extremely useful).

Comments (5)

  1. Stuart Reynolds reporter

    Hi Jared. Thank you - I missed that. I tried:

    SAS7BDAT(input, encoding="utf8", encoding_errors='strict')

    which moves the error forward -- it now fails with the same error during loading (encoding_errors='ignore' was the default).

    I found that the bad string I ran across also was not correctly detected by chardet:

    theString.encode("windows-1252")

    also fails, as do all of the encodings in Python's encodings.aliases.values() -- none of them seem to work. It suggests that either:

      • SAS originally wrote out a bad string

      • or SAS7BDAT isn't loading the strings correctly

    ... so I'm not sure where the mistake is.
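
    One possible explanation, assuming theString is a Python 2 byte str: calling .encode() with a text encoding on a byte string makes Python 2 first decode it implicitly as ASCII. That implicit step fails on 0xa9 whichever encoding is named. Turning bytes into text needs .decode() instead. The sketch below is Python 3, where the two steps are explicit; the byte string and the helper python2_style_encode are made up for illustration.

```python
raw = b"BLAHBLAHBLAH\xa9BLAHBLAHBLAH"  # hypothetical bytes, as in the report

def python2_style_encode(data, encoding):
    """Mimic Python 2's str.encode(): implicit ASCII decode, then encode."""
    return data.decode("ascii").encode(encoding)

failures = []
for name in ("windows-1252", "latin-1", "utf-8"):
    try:
        python2_style_encode(raw, name)
    except UnicodeDecodeError:
        # Raised by the ASCII step, before `name` is ever used.
        failures.append(name)

# Decoding, not encoding, is what turns the bytes into text.
text = raw.decode("windows-1252")
```

    If this is what happened, the failure across every encoding says nothing about whether SAS or SAS7BDAT produced a bad string.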

    I'm curious, the encoding parameter to SAS7BDAT sets the destination encoding for loaded strings. But to correctly load strings, don't we need to know the source encoding (which we'd need to assume is fixed, or it must be specified in the file)?
