XLIFF filter should handle invalid XML characters better

Issue #551 resolved
Chase Tingley created an issue

It is an unfortunate fact of life that invalid numeric entities (such as &#x03; or &#x1F;) sometimes show up in XLIFF files, particularly (in my experience) SDLXLIFF. Nobody knows where they come from, but they break many XML parsers, including the one we use.

A better behavior would be to strip these unparsable characters as we encounter them. This would need to be done at the I/O level, before the content is handed to the XML parser.
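A minimal sketch of the idea (the class and method names here are hypothetical, not Okapi API): wrap the input Reader, find numeric character references, and drop any whose code point falls outside the XML 1.0 Char production before the parser ever sees them. For simplicity this buffers the whole input; a real implementation would stream.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InvalidEntityStrippingReader {

    // Matches decimal (&#3;) and hex (&#x03;) numeric character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(x[0-9a-fA-F]+|[0-9]+);");

    // Returns a Reader whose content no longer contains references to
    // characters that are invalid in XML 1.0.
    public static Reader wrap(Reader in) throws IOException {
        StringBuilder content = new StringBuilder();
        char[] buf = new char[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            content.append(buf, 0, n);
        }
        Matcher m = NUMERIC_REF.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            int cp = (body.charAt(0) == 'x')
                    ? Integer.parseInt(body.substring(1), 16)
                    : Integer.parseInt(body);
            // Keep valid references, silently drop invalid ones.
            m.appendReplacement(out,
                    isValidXml10(cp) ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return new StringReader(out.toString());
    }

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```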

Sample attached. You can see the failure by running

tikal.sh -fc okf_xliff -x invalid_xml_entity.xlf

We currently crash with this stack:

Illegal character entity: expansion character (code 0x3
 at [row,col,system-id]: [7,33,"file:/Users/chase/Downloads/invalid_xml_entity.xlf"]
    at net.sf.okapi.filters.xliff.its.ITSStandoffManager.parseXLIFF(ITSStandoffManager.java:112)
    at net.sf.okapi.filters.xliff.XLIFFITSFilterExtension.parseInDocumentITSStandoff(XLIFFITSFilterExtension.java:79)
    at net.sf.okapi.filters.xliff.XLIFFFilter.open(XLIFFFilter.java:396)
    at net.sf.okapi.filters.xliff.XLIFFFilter.open(XLIFFFilter.java:316)
    at net.sf.okapi.filters.xliff.XLIFFFilter.open(XLIFFFilter.java:309)

Comments (10)

  1. Chase Tingley reporter
    • edited description

    (Getting the description to look right is hard, because Bitbucket's handling of entities is itself pretty wonky!)

  2. Yakov

    In fact it's not a mistake in the XML parsers: the file itself is not valid, so this exception is entirely correct. &#x03; is not a valid character in XML 1.0, but it is valid in XML 1.1: https://www.w3.org/TR/2008/REC-xml-20081126/#charsets https://www.w3.org/TR/xml11/#charsets

    If we set xml version="1.1", our parser works fine. See com.ctc.wstx.sr.StreamScanner, line 2400.

    Do we really want to remove these characters, or do we want to set the correct version on the XML file? (Just replace 1.0 with 1.1 without analyzing the file, or only after we find a character that is valid only in XML 1.1?)

  3. YvesS

    One note on this issue: we have a step, XML Character Fixing, that can replace the invalid characters with an expression that can later be converted back to the non-valid XML character.

    It does not resolve this issue, but it may be a workaround in some cases.
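The round-trip behavior of such a step can be sketched as follows. This is not the XML Character Fixing step's actual code or marker syntax; the `_#xN;` format here is purely illustrative of a replacement that remains reversible.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharFixing {

    private static final Pattern MARKER = Pattern.compile("_#x([0-9A-Fa-f]+);");

    // XML 1.0 Char production check.
    static boolean isValidXml10(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replace each invalid character with a textual marker.
    static String fix(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (isValidXml10(cp)) sb.appendCodePoint(cp);
            else sb.append(String.format("_#x%X;", cp));
        });
        return sb.toString();
    }

    // Convert markers back into the original characters.
    static String unfix(String text) {
        Matcher m = MARKER.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1), 16);
            m.appendReplacement(sb,
                    Matcher.quoteReplacement(new String(Character.toChars(cp))));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```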

  4. Chase Tingley reporter

    Yakov, your analysis is correct, this is a problem with the XLIFF, not the XML parser. However, we still have the problem that SDL tools generate XML 1.0 documents that contain these invalid entities.

    It is possible to fix these by hand (either by removing the entity, or by changing the XML header as you suggested), but in our experience regular users don't know to do this.

    Yves, thank you for the reference to the code (in okapi/steps/xmlvalidation/src/main/java/net/sf/okapi/steps/xmlcharfixing/XMLCharFixingStep.java). I think we can make use of this code.

    However I do have a slight preference for building a limited version of this directly into the filter (as an option that would strip the entities), just to make this case as simple as possible. Do you think that's an overreach?

  5. YvesS

    That would be fine I think.

    The only concern I can think some users might have is with the stripping itself: maybe those control characters are important, and it may be useful not to lose the invalid character completely. Maybe a replacement with a special text marker would be better? (Just thinking aloud; others might have better ideas.)

  6. Chase Tingley reporter

    The sample file I attached (which I made by hand) is a bit misleading in that the character is translatable. In the real files where I see this problem, it usually (always?) occurs in skeleton. So we could substitute the character, but I don't know if that's useful. (Our workaround has always been to just delete those characters, and it doesn't seem to break anything on the SDL end.)
