IllegalCharacterError when processing some workbooks

Issue #280 resolved
Alex Stapleton created an issue

(I know very little about OO-XML so this is mostly guess work, sorry.)

Some of our workbooks raise IllegalCharacterError when load_workbook()ing under 1.9. This appears to be because they contain string values that start with a space and the ILLEGAL_CHARACTERS_RE pattern is both incorrect and that it is used with match instead of search.

In 1.9 we have

ILLEGAL_CHARACTERS_RE = re.compile('|'.join(chr(x) for x in range(33)))

i.e. the pattern is 0x00|0x01|...|0x20.

I suspect this pattern is wrong because 0x20 is space. It seems unlikely that strings should never contain spaces. I suggest [\000-\037] as a replacement.

Secondly this problem only appears on strings that start with things that match this pattern because the pattern is checked like this

if ILLEGAL_CHARACTERS_RE.match(value):

and the Python docs say this about match

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance.

It seems likely the behaviour that was intended is that of search.

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance.

Or perhaps next(re.finditer, None) since only care if there are greater than zero matches?

Comments (10)

  1. Alex Stapleton reporter

    Space is not an XML control character? It does effect existing files that were created by Excel.

    Excel is escaping control characters correctly but it does not escape space as it has no reason to. Normally leading spaces are trimmed by either Excel or OpenPyXl it's self however these particular cells have the xml:space="preserve" flag set which disables this behaviour.

  2. Log in to comment