OpenXml filter fails on Windows...
We have found that some OpenXML documents fail on Windows (no issues on Linux) because Java still returns windows-1252 as the default charset. This causes issues if the OpenXML document has a UTF-8 BOM or there are other encoding oddities.
The OpenXml filter depends on the default charset rather than specifying utf-8:
    private Namespaces2 namespacesOf(final ZipEntry entry) throws IOException, XMLStreamException {
        try (final Reader reader = new InputStreamReader(this.generalDocument.inputStreamFor(entry))) {
            final Namespaces2 namespaces = new Namespaces2.Default(
                this.generalDocument.inputFactory()
            );
            namespaces.readWith(reader);
            return namespaces;
        }
    }
I think we should specify utf-8 in all cases as this is by far the most common encoding across platforms.
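For illustration, here is a minimal sketch of the difference between decoding with an explicit charset and decoding with a single-byte legacy charset. It is not the filter's code: the class `ExplicitCharsetSketch` and the method `readAll` are hypothetical names, and ISO-8859-1 is used only as a stand-in for a Windows default such as windows-1252.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetSketch {

    /** Reads the whole stream as text, decoding with the charset passed in (hypothetical helper). */
    static String readAll(InputStream in, Charset charset) {
        // The two-argument constructor never consults the platform default charset.
        try (Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute; it takes two bytes (0xC3 0xA9) in UTF-8.
        byte[] utf8 = "<w:t>\u00e9</w:t>".getBytes(StandardCharsets.UTF_8);

        String explicit = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        // Stand-in for a single-byte platform default: each byte becomes one char.
        String legacy = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println("explicit UTF-8 length: " + explicit.length());
        System.out.println("single-byte length:    " + legacy.length());
    }
}
```

With the explicit charset the result is the same on every platform; with the single-byte stand-in the two UTF-8 bytes of the accented character are decoded as two separate characters.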
Comments (12)
reporter Seems Java 18 will finally switch Windows default charset to utf-8
-
+1 for being paranoid and always specifying
-
@jhargrave-straker I would like to mention the intermediate results of my investigation on this issue.
- The proposed adjustments are ready and I can commit the changes and open a PR.
- It looks like the platform encoding influences the way the Woodstox library handles XML. On Windows with CP-1251 as the default encoding, it throws the following exception when Tikal tries to extract XLIFF from a document (I will attach it to the issue soon) that has a BOM in its document part (the UTF-8 encoding is passed for creating an XML reader at this time):
-
- attached 1162.docx
-
reporter @Denis Konovalyenko That would be great! I was just going to update two places to specify UTF-8 vs the default platform encoding (I think we should avoid defaulting to CP-1252 going forward). But perhaps the problem is more complicated.
-
reporter Here’s my branch just to make sure we don’t miss these: https://bitbucket.org/okapiframework/okapi/branch/Issue%231162
-
@jhargrave-straker Woodstox bootstraps input and throws the following exception:
net.sf.okapi.common.exceptions.OkapiException: An error occurred during extraction
    at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:269)
    at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:263)
    at net.sf.okapi.steps.common.RawDocumentToFilterEventsStep.handleEvent(RawDocumentToFilterEventsStep.java:166)
    at net.sf.okapi.common.pipeline.Pipeline.execute(Pipeline.java:117)
    at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:227)
    at net.sf.okapi.common.pipeline.Pipeline.process(Pipeline.java:199)
    at net.sf.okapi.common.pipelinedriver.PipelineDriver.processBatch(PipelineDriver.java:182)
    at net.sf.okapi.applications.tikal.Main.extractFile(Main.java:1585)
    at net.sf.okapi.applications.tikal.Main.process(Main.java:975)
    at net.sf.okapi.applications.tikal.Main.main(Main.java:563)
Caused by: com.ctc.wstx.exc.WstxIOException: Unexpected first character (char code 0xEF), not valid in xml document: could be mangled UTF-8 BOM marker. Make sure that the Reader uses correct encoding or pass an InputStream instead
    at com.ctc.wstx.io.ReaderBootstrapper.bootstrapInput(ReaderBootstrapper.java:175)
    at com.ctc.wstx.stax.WstxInputFactory.doCreateSR(WstxInputFactory.java:577)
    at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:637)
    at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:691)
    at com.ctc.wstx.stax.WstxInputFactory.createXMLEventReader(WstxInputFactory.java:295)
    at net.sf.okapi.filters.openxml.Namespaces2$Default.readWith(Namespaces2.java:45)
    at net.sf.okapi.filters.openxml.WordDocument.namespacesOf(WordDocument.java:261)
    at net.sf.okapi.filters.openxml.WordDocument.styleOptimisationsFor(WordDocument.java:239)
    at net.sf.okapi.filters.openxml.WordDocument.nextPart(WordDocument.java:174)
    at net.sf.okapi.filters.openxml.Document$General.nextPart(Document.java:255)
    at net.sf.okapi.filters.openxml.OpenXMLFilter.nextInDocument(OpenXMLFilter.java:444)
    at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:254)
Related source code comments:
/* We may also get something that would be invalid XML
* ("garbage" char; neither '<' nor space). If so, and
* it's one of "well-known" cases, we can not only throw
* an exception but also indicate a clue as to what is likely
* to be wrong.*/
/* Specifically, UTF-8 read via, say, ISO-8859-1 reader, can
* "leak" marker (0xEF, 0xBB, 0xBF). While we could just eat
* it, there's bound to be other problems cropping up, so let's
* inform about the problem right away. */
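The "leak" described in these comments can be reproduced with a small sketch. This is an illustration under assumptions, not the filter's or Woodstox's code: the class `BomLeakSketch` and its methods are hypothetical, the parser is whatever StAX implementation `XMLInputFactory.newInstance()` returns (the JDK's built-in one unless another is on the classpath), and ISO-8859-1 stands in for a single-byte platform default. It shows the BOM's first byte surfacing as character 0xEF when decoded with a single-byte charset, and the alternative the exception message suggests, namely handing the parser an `InputStream` so it can detect the encoding itself.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class BomLeakSketch {

    /** Prefixes the UTF-8 bytes of the given XML with the UTF-8 BOM (0xEF, 0xBB, 0xBF). */
    static byte[] withUtf8Bom(String xml) {
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[body.length + 3];
        out[0] = (byte) 0xEF;
        out[1] = (byte) 0xBB;
        out[2] = (byte) 0xBF;
        System.arraycopy(body, 0, out, 3, body.length);
        return out;
    }

    /** Decodes with a single-byte charset and returns the first character's code. */
    static int firstCharAsSingleByte(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1).charAt(0);
    }

    /** Parses directly from bytes, letting the parser handle BOM and encoding detection. */
    static String rootNameFromStream(byte[] bytes) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(new ByteArrayInputStream(bytes));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        return reader.getLocalName();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = withUtf8Bom("<document/>");
        // The BOM's first byte shows up as a "garbage" character before '<'.
        System.out.println("first char code: 0x" + Integer.toHexString(firstCharAsSingleByte(doc)));
        // Given the raw stream, the parser skips the BOM on its own.
        System.out.println("root element: " + rootNameFromStream(doc));
    }
}
```

Whether passing the stream or passing a reader with an explicit charset fits the filter better depends on how its document parts are opened; the sketch only demonstrates the two behaviours.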
-
A related pull request #653 was opened.
-
- changed status to resolved
Pull request #653 was merged.