OpenXml filter fails on Windows...

Issue #1162 resolved
jhargrave-straker created an issue

We have found that some OpenXML documents fail on Windows (no issues on Linux) because Java still returns windows-1252 as the default charset. This causes issues if the OpenXML document has a UTF-8 BOM or there are other encoding oddities.

The OpenXml filter depends on the default charset rather than specifying utf-8:

private Namespaces2 namespacesOf(final ZipEntry entry) throws IOException, XMLStreamException {
    try (final Reader reader = new InputStreamReader(this.generalDocument.inputStreamFor(entry))) {
        final Namespaces2 namespaces = new Namespaces2.Default(
            this.generalDocument.inputFactory()
        );
        namespaces.readWith(reader);
        return namespaces;
    }
}

I think we should specify utf-8 in all cases as this is by far the most common encoding across platforms.
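To illustrate the proposal, here is a minimal, self-contained sketch of the difference between decoding with an explicit UTF-8 charset and decoding with windows-1252 (the default reported on the affected Windows machines). The class `ExplicitCharsetSketch` and the helper `readAll` are hypothetical names used only for this example; they are not part of Okapi.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetSketch {

    // Hypothetical helper (not Okapi code): decodes the whole stream
    // with the charset given explicitly, never the platform default.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // A small XML fragment containing one non-ASCII character (U+00E9).
        String original = "<w:t>\u00e9</w:t>";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        String asUtf8 = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        String as1252 = readAll(new ByteArrayInputStream(bytes), Charset.forName("windows-1252"));

        // Explicit UTF-8 round-trips the text on every platform.
        System.out.println(asUtf8.equals(original));
        // Decoding the same bytes as windows-1252 turns the two-byte
        // sequence into two separate characters, so the text differs.
        System.out.println(as1252.equals(original));
    }
}
```

Applied to the snippet above, the equivalent change would presumably be to pass `StandardCharsets.UTF_8` as the second argument to `new InputStreamReader(...)`.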

Comments (12)

  1. Denis Konovalyenko

    @jhargrave-straker I would like to mention the intermediate results of my investigation on this issue.

    1. The proposed adjustments are ready and I can commit the changes and open a PR.
    2. It looks like platform encoding influences the way the Woodstox library handles XML. It throws the following exception when Tikal tries to extract XLIFF from a document (I will attach it to the issue soon) with a BOM in its document part on Windows OS with CP-1251 default encoding (the UTF-8 encoding is passed for creating an XML reader at this time):

  2. jhargrave-straker reporter

    @Denis Konovalyenko That would be great! I was just going to update two places to specify UTF-8 vs the default platform encoding (I think we should avoid defaulting to CP-1252 going forward). But perhaps the problem is more complicated.

  3. Denis Konovalyenko

    @jhargrave-straker Woodstox bootstraps input and throws the following exception:

    net.sf.okapi.common.exceptions.OkapiException: An error occurred during extraction
        at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:269)
        at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:263)
        at net.sf.okapi.steps.common.RawDocumentToFilterEventsStep.handleEvent(RawDocumentToFilterEventsStep.java:166)
        at net.sf.okapi.common.pipeline.Pipeline.execute(Pipeline.java:117)
        at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:227)
        at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:199)
        at net.sf.okapi.common.pipelinedriver.PipelineDriver.processBatch(PipelineDriver.java:182)
        at net.sf.okapi.applications.tikal.Main.extractFile(Main.java:1585)
        at net.sf.okapi.applications.tikal.Main.process(Main.java:975)
        at net.sf.okapi.applications.tikal.Main.main(Main.java:563)
    Caused by: com.ctc.wstx.exc.WstxIOException: Unexpected first character (char code 0xEF), not valid in xml document: could be mangled UTF-8 BOM marker. Make sure that the Reader uses correct encoding or pass an InputStream instead
        at com.ctc.wstx.io.ReaderBootstrapper.bootstrapInput(ReaderBootstrapper.java:175)
        at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:577)
        at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:637)
        at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:691)
        at com.ctc.wstx.stax.WstxInputFactory.createXMLEventReader(WstxInputFactory.java:295)
        at net.sf.okapi.filters.openxml.Namespaces2$Default.readWith(Namespaces2.java:45)
        at net.sf.okapi.filters.openxml.WordDocument.namespacesOf(WordDocument.java:261)
        at net.sf.okapi.filters.openxml.WordDocument.styleOptimisationsFor(WordDocument.java:239)
        at net.sf.okapi.filters.openxml.WordDocument.nextPart(WordDocument.java:174)
        at net.sf.okapi.filters.openxml.Document$General.nextPart(Document.java:255)
        at net.sf.okapi.filters.openxml.OpenXMLFilter.nextInDocument(OpenXMLFilter.java:444)
        at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:254)
    

    Related source code comments:

    /* We may also get something that would be invalid XML
     * ("garbage" char; neither '<' nor space). If so, and
     * it's one of "well-known" cases, we can not only throw
     * an exception but also indicate a clue as to what is likely
     * to be wrong.
     */
    /* Specifically, UTF-8 read via, say, ISO-8859-1 reader, can
     * "leak" marker (0xEF, 0xBB, 0xBF). While we could just eat
     * it, there's bound to be other problems cropping up, so let's
     * inform about the problem right away.
     */
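    The "leak" described in these comments can be reproduced without Okapi or Woodstox. The sketch below is only an illustration under stated assumptions: `BomLeakSketch`, `firstCharCode` and `rootNameFromStream` are hypothetical names, and the parser is whatever StAX implementation `XMLInputFactory.newInstance()` returns (the JDK's built-in one unless another is on the classpath). It shows that a single-byte reader surfaces 0xEF as the first character, and that handing the parser an InputStream, as the exception message suggests, lets it detect the BOM itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class BomLeakSketch {

    // UTF-8 BOM (0xEF, 0xBB, 0xBF) followed by a minimal XML document.
    static final byte[] WITH_BOM = {
        (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>'
    };

    // Hypothetical helper: code of the first character after decoding
    // the bytes with the given charset.
    static int firstCharCode(byte[] bytes, Charset charset) {
        return new String(bytes, charset).charAt(0);
    }

    // Hypothetical helper: parses from an InputStream so that the parser,
    // not a Reader, is responsible for encoding and BOM detection.
    static String rootNameFromStream(byte[] bytes) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(bytes));
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        // Single-byte decoding exposes the first BOM byte as a character.
        System.out.println(Integer.toHexString(
                firstCharCode(WITH_BOM, StandardCharsets.ISO_8859_1))); // ef
        // Java's UTF-8 decoder keeps the BOM as U+FEFF rather than dropping it.
        System.out.println(Integer.toHexString(
                firstCharCode(WITH_BOM, StandardCharsets.UTF_8)));      // feff
        // Parsing from the stream reaches the root element.
        System.out.println(rootNameFromStream(WITH_BOM));
    }
}
```

    Whether switching `Namespaces2.readWith` to an InputStream is practical in the filter is a separate question; the sketch only demonstrates the mechanism named in the Woodstox message.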
