Error parsing XML content (with OpenXML filter in OmegaT)
Preconditions
- Windows 10 and Microsoft Office 365
- Install the Okapi filter plugin for OmegaT, i.e. put the okapiFiltersForOmegaT-1.9-1.41.0.jar file in the plugins folder of OmegaT.
Steps to reproduce
- Create an OmegaT project and disable the default OpenXML filter so that the Okapi OpenXML filter will be used instead.
- Go to the source folder and create a new spreadsheet (e.g. right-click > New > Microsoft Excel Worksheet). See sample files attached.
- Reload the OmegaT project (F5).
Expected results
OmegaT loads the file and extracts its contents, if any (none, in this case).
Actual results
Error parsing XML content: OmegaT cannot load the specified project. See the attached screenshot.
Same results if the file has some content (see sample files attached).
Comments (8)
-
reporter -
Additional details:
- This can be reproduced with the New_Microsoft_Excel_Worksheet.xlsx document only.
- It can be added to the project first and the error will appear on loading OmegaT.
- Error stack trace:
11099: Error: java.io.IOException: Issues/42/otp-1/source/New_Microsoft_Excel_Worksheet.xlsx
11099: Error: net.sf.okapi.common.exceptions.OkapiIOException: Error parsing XML content
11099: Error: at org.omegat.filters2.master.FilterMaster.loadFile(FilterMaster.java:206)
11099: Error: at org.omegat.core.data.RealProject.loadSourceFiles(RealProject.java:1151)
11099: Error: at org.omegat.core.data.RealProject.loadProject(RealProject.java:358)
11099: Error: at org.omegat.core.data.ProjectFactory.loadProject(ProjectFactory.java:72)
11099: Error: at org.omegat.gui.main.ProjectUICommands$7.lambda$doInBackground$0(ProjectUICommands.java:618)
11099: Error: at org.omegat.core.Core.executeExclusively(Core.java:385)
11099: Error: at org.omegat.gui.main.ProjectUICommands$7.doInBackground(ProjectUICommands.java:614)
11099: Error: at org.omegat.gui.main.ProjectUICommands$7.doInBackground(ProjectUICommands.java:605)
11099: Error: at javax.swing.SwingWorker$1.call(SwingWorker.java:295)
11099: Error: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
11099: Error: at javax.swing.SwingWorker.run(SwingWorker.java:334)
11099: Error: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
11099: Error: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
11099: Error: at java.lang.Thread.run(Thread.java:750)
11099: Error: Caused by: net.sf.okapi.common.exceptions.OkapiIOException: Error parsing XML content
11099: Error: at net.sf.okapi.filters.openxml.OpenXMLFilter.openDocument(OpenXMLFilter.java:438)
11099: Error: at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:252)
11099: Error: at net.sf.okapi.lib.omegat.AbstractOkapiFilter.processFile(AbstractOkapiFilter.java:356)
11099: Error: at net.sf.okapi.lib.omegat.AbstractOkapiFilter.parseFile(AbstractOkapiFilter.java:197)
11099: Error: at net.sf.okapi.lib.omegat.OpenXMLFilter.parseFile(OpenXMLFilter.java:24)
11099: Error: at org.omegat.filters2.master.FilterMaster.loadFile(FilterMaster.java:204)
11099: Error: ... 13 more
11099: Error: Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1]
11099: Error: Message: Content is not allowed in prolog.
11099: Error: at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604)
11099: Error: at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83)
11099: Error: at net.sf.okapi.filters.openxml.WorkbookFragments$Default.readWith(WorkbookFragments.java:153)
11099: Error: at net.sf.okapi.filters.openxml.ExcelDocument.workbookFragments(ExcelDocument.java:118)
11099: Error: at net.sf.okapi.filters.openxml.ExcelDocument.open(ExcelDocument.java:87)
11099: Error: at net.sf.okapi.filters.openxml.Document$General.open(Document.java:133)
11099: Error: at net.sf.okapi.filters.openxml.OpenXMLFilter.openDocument(OpenXMLFilter.java:426)
11099: Error: ... 18 more
This is caused by the presence of a BOM in some parts of the Excel document. The StAX2 implementation (woodstox-core:6.2.7 at the time of writing) handles BOMs correctly when the filter is executed from Okapi or from the Okapi OmegaT plugin context. However, this is not the case when the processing is started from OmegaT, and the reason for this difference still needs to be found.
-
@Manuel Souto Pico the origin of this issue is the StAX implementation that OmegaT uses at run time, com.sun.org.apache.xerces. It does not handle BOMs, and the aforementioned error appears when a document part (workbook.xml in the current case) that contains a BOM is read.
I can see two possible solutions:
- The OmegaT project starts using another XML parser, com.fasterxml.woodstox for instance. I assume it would be enough to add the following dependency to build.gradle - runtimeOnly 'com.fasterxml.woodstox:woodstox-core:6.2.8' … At least, it does not trigger the error when the New_Microsoft_Excel_Worksheet.xlsx document is filtered.
- The Okapi OpenXML filter gets an additional programmatic layer that deals with BOMs before the processing is handed over to a StAX implementation.
Do you think someone knowledgeable about OmegaT internals could advise whether the first approach is feasible and can be implemented without concerns?
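For illustration, the second approach (a layer that removes the BOM before the StAX parser sees the stream) could look roughly like the sketch below. This is not Okapi code: the class name BomSkipSketch and its helper methods are hypothetical, and only a UTF-8 BOM is handled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Okapi or OmegaT.
class BomSkipSketch {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if there is one. */
    static InputStream skipUtf8Bom(InputStream in) {
        PushbackInputStream pushback = new PushbackInputStream(in, 3);
        try {
            byte[] head = new byte[3];
            int read = 0;
            while (read < 3) {
                int n = pushback.read(head, read, 3 - read);
                if (n < 0) {
                    break; // stream shorter than three bytes
                }
                read += n;
            }
            boolean bom = read == 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF;
            if (!bom && read > 0) {
                pushback.unread(head, 0, read); // not a BOM: put the bytes back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pushback;
    }

    /** Test helper: encodes the XML as UTF-8 and prepends a BOM. */
    static byte[] withBom(String xml) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(0xEF);
        out.write(0xBB);
        out.write(0xBF);
        byte[] body = xml.getBytes(StandardCharsets.UTF_8);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    /** Parses the bytes with whatever StAX implementation is active and returns the root element name. */
    static String rootElementName(byte[] document) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(skipUtf8Bom(new ByteArrayInputStream(document)));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] part = withBom("<?xml version=\"1.0\" encoding=\"UTF-8\"?><workbook/>");
        System.out.println(rootElementName(part));
    }
}
```

A real implementation would probably also need to cover UTF-16 BOMs and parts that are read through a Reader rather than an InputStream.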
-
“I assume it would be enough to add the following dependency to build.gradle - runtimeOnly 'com.fasterxml.woodstox:woodstox-core:6.2.8' …”
Not sure it would be enough to add an alternative parser. The fact that it is present does not mean that it takes priority over other parsers.
If I am not wrong, the whole Woodstox parser is copied into Okapi’s JAR file, so why does OmegaT not use it? Simply because the JVM gives priority to the Xerces parser, which is part of the JVM itself.
What could be done:
- start OmegaT with -Djavax.xml.stream.XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory (should be added to all OmegaT launchers: OmegaT.bat, OmegaT.sh, etc.). But yes, if we do it in the main Git repository of OmegaT, then the dependency must be added to the Gradle file
- or change the code of the Okapi framework (not only the plugin, unfortunately) so that it does not call XMLInputFactory.newInstance() but directly “new com.ctc.wstx.stax.WstxInputFactory”
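To check whether such a launcher option actually takes effect, a small probe like the following could be run with the same JVM arguments and classpath as OmegaT. It is only a diagnostic sketch; the class name StaxFactoryProbe is made up for this example.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic class, not part of OmegaT or Okapi.
class StaxFactoryProbe {

    static final String PROPERTY = "javax.xml.stream.XMLInputFactory";

    /** Reports the factory requested via the system property and the one the JVM really returns. */
    static String describe() {
        String requested = System.getProperty(PROPERTY);
        String actual = XMLInputFactory.newInstance().getClass().getName();
        return "requested=" + requested + ", actual=" + actual;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the property names a class that the class loader in use cannot find, XMLInputFactory.newInstance() throws a FactoryConfigurationError instead of silently falling back, so the probe also shows whether Woodstox is visible from where the factory is requested.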
-
@t_cordonnier thank you for your feedback on this! Let me clarify some points.
Firstly, I agree that all required dependencies, and Woodstox in particular, are included in the Okapi jar archive. According to javax.xml.stream.XMLInputFactory#newFactory(), a new factory instance is looked up in the following order:
1. The javax.xml.stream.XMLInputFactory system property.
2. The configuration file "stax.properties".
3. The jaxp configuration file "jaxp.properties".
4. The service-provider loading facility, defined by the java.util.ServiceLoader class, to attempt to locate and load an implementation of the service using the default loading mechanism: the service-provider loading facility will use the current thread's context class loader to attempt to load the service. If the context class loader is null, the system class loader will be used.
5. The system-default implementation.
I suppose our main interest is the service-provider loading mechanism (#4), as the META-INF/services path in the archive contains the necessary configuration for initialising the Woodstox factories (XMLInputFactory, XMLEventFactory, XMLOutputFactory).
Also, the Woodstox project documentation gives the following class loading details:
- javax.xml.stream.XMLInputFactory.newInstance() is called by client code
- The file META-INF/services/javax.xml.stream.XMLInputFactory is searched in the current classpath
- As the current classpath contains woodstox-core-asl-4.2.0.jar, the file META-INF/services/javax.xml.stream.XMLInputFactory is found as shown below
- The contents of META-INF/services/javax.xml.stream.XMLInputFactory (as shown below) are read.
- The name of the Woodstox class, WstxInputFactory, mentioned in the above step is read
- WstxInputFactory class is loaded using Java Reflection and returned
In all the sample programs, WstxInputFactory class is loaded using the above steps. Other Woodstox classes corresponding to schema validation (DTD, RelaxNG, W3C), XMLEventFactory and XMLOutputFactory are loaded in a similar manner as mentioned in the steps above.
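The service-provider step (#4) can be inspected directly with java.util.ServiceLoader. The sketch below lists the XMLInputFactory providers that a given class loader can see through META-INF/services; the class name StaxProviderLister is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic class, not part of OmegaT or Okapi.
class StaxProviderLister {

    /** Returns the class names of all XMLInputFactory providers registered for the given loader. */
    static List<String> providersVisibleTo(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (XMLInputFactory factory : ServiceLoader.load(XMLInputFactory.class, loader)) {
            names.add(factory.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        // The lookup described in step 4 uses the thread context class loader.
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        System.out.println(providersVisibleTo(context));
    }
}
```

On a plain JDK without additional StAX jars the list may well be empty, because the built-in parser is only selected in step 5. Running the same method against the class loader that loaded the Okapi plugin would show whether the Woodstox registration is visible from there.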
Secondly, as far as I understand the org.omegat.filters2.master.PluginUtils#loadPlugins method and related code, class loading is performed for the classes (filters) listed in the manifest of the provided Okapi jar; however, the service-provider loading of the META-INF/services contents is not implemented.
Thirdly, I appreciate your suggestion on adding implementation 'com.fasterxml.woodstox:woodstox-core:6.2.8' to build.gradle, which looks fairly simple and will load the required classes when the javax.xml.stream.XMLInputFactory#newFactory method is called, but it seems to me that the best architectural solution for OmegaT would be to add dynamic loading of the implementation classes listed in a plugin jar under the META-INF/services/ path. What do you think?
By the way, I have found an example of plugin loading on StackOverflow.
CC: @Manuel Souto Pico
-
@t_cordonnier the root cause of this issue is that the XML factories (XMLInputFactory, XMLEventFactory and XMLOutputFactory) are instantiated with the application class loader rather than with the class loader used to instantiate the plugin filters. So, a solution would be to pass the relevant class loader to the factory instantiation method, e.g.
XMLInputFactory.newFactory("javax.xml.stream.XMLInputFactory", getClass().getClassLoader());
This should be done on the Okapi side. So, please disregard my proposal for loading the implementation classes listed in a plugin jar under the META-INF/services/ path.
To be precise, there is one thing to improve in OmegaT: the plugin loading (org.omegat.filters2.master.PluginUtils#loadPlugins) could initialise one class loader (URLClassLoader at the moment) per jar archive. That way, possible classpath collisions between plugins would be avoided. Do you think this can at least be documented as an issue in OmegaT?
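Both ideas in this comment can be sketched together: one URLClassLoader per plugin jar, and a factory lookup that is handed that loader explicitly. This is only an illustration under assumptions; the class PluginScopedStax and the jar path are hypothetical, and the real PluginUtils code is organised differently.

```java
import java.io.File;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import javax.xml.stream.XMLInputFactory;

// Hypothetical sketch, not the actual OmegaT or Okapi implementation.
class PluginScopedStax {

    /** Creates a dedicated class loader for a single plugin jar. */
    static URLClassLoader loaderFor(File pluginJar, ClassLoader parent) {
        try {
            return new URLClassLoader(new URL[] { pluginJar.toURI().toURL() }, parent);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Invalid plugin path: " + pluginJar, e);
        }
    }

    /** Looks up the StAX input factory through the given loader instead of the context loader. */
    static XMLInputFactory factoryFrom(ClassLoader pluginLoader) {
        return XMLInputFactory.newFactory("javax.xml.stream.XMLInputFactory", pluginLoader);
    }

    public static void main(String[] args) {
        // Hypothetical location; a real plugin jar bundling Woodstox would go here.
        File jar = new File("plugins/hypothetical-okapi-plugin.jar");
        ClassLoader loader = loaderFor(jar, PluginScopedStax.class.getClassLoader());
        System.out.println(factoryFrom(loader).getClass().getName());
    }
}
```

When the jar contains a META-INF/services/javax.xml.stream.XMLInputFactory entry, the lookup should find it through the supplied loader; when it does not (or the jar is missing, as with the placeholder path above), the call falls back to the JDK default.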
First point to avoid any confusion: I did not suggest modifying build.gradle; this suggestion came from you, and I pointed out that it would change nothing (it will add Woodstox to OmegaT’s lib directory but without asking for it to be used). My suggestion was to add -Djavax.xml.stream.XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory to the launcher (bat, sh or another): this solution can be implemented even by a non-developer, in the short term. So if anybody has the problem, we can use this solution until a real one is implemented.
Now, before searching for the best solution for OmegaT, one question: if Okapi applications (Rainbow, for example) are also using Woodstox, how do they ensure that XMLInputFactory uses Woodstox rather than the Java default? Maybe you should have a look at it, and then we can decide whether we want to use the same mechanism or not.
Finally, I understand the proposal in your second message; it looks good. But I am not sure I understand what you want to document in OmegaT, since it seems that your solution can be implemented entirely on the Okapi side. Can you try to implement it, and then tell me what exactly you want us to document?
-
Pull request #21 was opened to pick up Okapi changes.
Transferred from Okapi issue #1054