useful regex code for unicode read errors?

Issue #4 resolved
John Qiu created an issue

So at work, I have to deal with crappy text data that has hex randomly leaked throughout. in my case, it looks like the following:

pretend this is normal text data but then\\X0D\X0A\this happens

I wrote the following in python to replace it with white space it on the read-in. Can you guys verify that it works or suggest improvements?

In my particular use case, I am really limited to using regex to solve the problem and I need to try to preserve as much of the original data as possible, so removing the \r\n prior is not an option, and there are other hex issues anyways in my data.

import re

def cleanCrap(inText)
    inText =re.sub(r'[\\]{0,2}[xX][a-fA-F0-9][a-fA-F0-9][\\]{0,2}', ' ', inText)
    return(inText)

Another thing you can try is changing the default encoding for python:

import sys
reload(sys)
sys.setdefaultencoding('latin-1')
#or alternatively:
#sys.setdefaultencoding("utf-8")

Comments (4)

  1. Log in to comment