Invalid characters can be output but will result with corrupted file

Issue #244 duplicate
Ying-Jui Chen created an issue

Some characters cause corrupted xlsx file. I'm not sure what characters will cause the corrupted file.(I'm sure at least \x11 will.)

So I did a replace before set the value. for i in xrange(0, 32): ok_str = ok_str.replace(chr(i), '?')

I'm thinking it will be better if the library do this.

Comments (9)

  1. CharlieC

    This sounds like it might be an encoding issue, but yes there are illegal characters. Could you please provide a more extensive example together with the exception that is raised.

  2. Ying-Jui Chen reporter

    The python part finished without error. But the file cannot be opened in Excel correctly.

    wb = Workbook()
    ws = wb.get_active_sheet()
    ws.cell("A1").value = u'\x11'"a.xlsx")
  3. Ying-Jui Chen reporter

    I've tested on 0~31. Only 9, 10, 13 worked and other failed.

    for i in xrange(0, 32):
        wb = Workbook()
        ws = wb.get_active_sheet()
        ws.cell("A1").value = chr(i)
  4. CharlieC

    The problem seems to be related to the fact that 0-31 are control characters and not valid in XML. See for more information

    As a result the SharedStrings.xml part of the archive is invalid XML.

    I'm not sure if I'd like to see openpyxl check (could be a big performance hit) for these characters as opposed to making it a responsibility of client code. Could you provide some more information as to where the characters are coming from? Excel itself does not support these characters natively but you can use =CHAR(i) to get them and see that they are serialised as escaped. That might also be a workaround in client code: insert the values as formulae.

    Given a strong enough use case I'm open to suggestions on how to handle these characters.

  5. Ying-Jui Chen reporter

    I tried to output some data collected from out side of my system. So there might be some unexpected character mixed in. Of course, I can do the filter. But it took me quiet some time to figure out what the problem is.

  6. CharlieC

    I'm going to close this. The key problem is that the characters need to be escaped to be valid XML. However, as we are serialising UTF-8 to XML we do not have to do any kind of escaping: XML without an explicit encoding is UTF-8 but control characters are illegal. It would be a huge performance hit to check all strings character by character.

    One thing you might need to consider is that the characters you are receiving are incorrectly encoded and, therefore, mapping to escape characters.

  7. CharlieC

    XML cannot contain control characters. It is the responsibility of client code to identify and escape or otherwise process them.

  8. Log in to comment