Error parsing XML content (with OpenXML filter in OmegaT)

Issue #42 new
Manuel Souto Pico created an issue

Preconditions

  • Windows 10 and Microsoft Office 365
  • Install the Okapi filter plugin for OmegaT, i.e. put the okapiFiltersForOmegaT-1.9-1.41.0.jar file in the plugins folder of OmegaT.

Steps to reproduce

  1. Create an OmegaT project and disable the default OpenXML filter so that the Okapi OpenXML filter will be used instead.
  2. Go to the source folder and create a new spreadsheet (e.g. right-click > New > Microsoft Excel Worksheet). See sample files attached.
  3. Reload the OmegaT project (F5)

Expected results

OmegaT loads the file and extracts its contents, if any (none, in this case).

Actual results

Error parsing XML content: OmegaT cannot load the specified project. See the attached screenshot.

Same results if the file has some content (see sample files attached).

Comments (8)

  1. Denis Konovalyenko

    Additional details:

    1. This can be reproduced with the New_Microsoft_Excel_Worksheet.xlsx document only.
    2. It can be added to the project first and the error will appear on loading OmegaT.
    3. Error stack trace:
    11099: Error: java.io.IOException: Issues/42/otp-1/source/New_Microsoft_Excel_Worksheet.xlsx 
    11099: Error: net.sf.okapi.common.exceptions.OkapiIOException: Error parsing XML content 
    11099: Error:   at org.omegat.filters2.master.FilterMaster.loadFile(FilterMaster.java:206) 
    11099: Error:   at org.omegat.core.data.RealProject.loadSourceFiles(RealProject.java:1151) 
    11099: Error:   at org.omegat.core.data.RealProject.loadProject(RealProject.java:358) 
    11099: Error:   at org.omegat.core.data.ProjectFactory.loadProject(ProjectFactory.java:72) 
    11099: Error:   at org.omegat.gui.main.ProjectUICommands$7.lambda$doInBackground$0(ProjectUICommands.java:618) 
    11099: Error:   at org.omegat.core.Core.executeExclusively(Core.java:385) 
    11099: Error:   at org.omegat.gui.main.ProjectUICommands$7.doInBackground(ProjectUICommands.java:614) 
    11099: Error:   at org.omegat.gui.main.ProjectUICommands$7.doInBackground(ProjectUICommands.java:605) 
    11099: Error:   at javax.swing.SwingWorker$1.call(SwingWorker.java:295) 
    11099: Error:   at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
    11099: Error:   at javax.swing.SwingWorker.run(SwingWorker.java:334) 
    11099: Error:   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
    11099: Error:   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
    11099: Error:   at java.lang.Thread.run(Thread.java:750) 
    11099: Error: Caused by: net.sf.okapi.common.exceptions.OkapiIOException: Error parsing XML content 
    11099: Error:   at net.sf.okapi.filters.openxml.OpenXMLFilter.openDocument(OpenXMLFilter.java:438) 
    11099: Error:   at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:252) 
    11099: Error:   at net.sf.okapi.lib.omegat.AbstractOkapiFilter.processFile(AbstractOkapiFilter.java:356) 
    11099: Error:   at net.sf.okapi.lib.omegat.AbstractOkapiFilter.parseFile(AbstractOkapiFilter.java:197) 
    11099: Error:   at net.sf.okapi.lib.omegat.OpenXMLFilter.parseFile(OpenXMLFilter.java:24) 
    11099: Error:   at org.omegat.filters2.master.FilterMaster.loadFile(FilterMaster.java:204) 
    11099: Error:   ... 13 more 
    11099: Error: Caused by: javax.xml.stream.XMLStreamException: ParseError at [row,col]:[1,1] 
    11099: Error: Message: Content is not allowed in prolog. 
    11099: Error:   at com.sun.org.apache.xerces.internal.impl.XMLStreamReaderImpl.next(XMLStreamReaderImpl.java:604) 
    11099: Error:   at com.sun.xml.internal.stream.XMLEventReaderImpl.nextEvent(XMLEventReaderImpl.java:83) 
    11099: Error:   at net.sf.okapi.filters.openxml.WorkbookFragments$Default.readWith(WorkbookFragments.java:153) 
    11099: Error:   at net.sf.okapi.filters.openxml.ExcelDocument.workbookFragments(ExcelDocument.java:118) 
    11099: Error:   at net.sf.okapi.filters.openxml.ExcelDocument.open(ExcelDocument.java:87) 
    11099: Error:   at net.sf.okapi.filters.openxml.Document$General.open(Document.java:133) 
    11099: Error:   at net.sf.okapi.filters.openxml.OpenXMLFilter.openDocument(OpenXMLFilter.java:426) 
    11099: Error:   ... 18 more
    

    This happens because some parts of the Excel document start with a BOM. The StAX2 implementation (woodstox-core:6.2.7 at the time of writing) handles BOMs well when the processing runs in the Okapi or Okapi OmegaT Plugin contexts. However, this is not the case when the processing runs from OmegaT, so the reason for the difference still needs to be found.
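
    To check which parts of a given document carry a BOM, a small scan like the following could be used. It is only a sketch: it assumes the BOM in question is the UTF-8 one (bytes EF BB BF), and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Okapi or OmegaT.
final class BomScan {

    /** True if the first bytes of the buffer are the UTF-8 byte order mark EF BB BF. */
    static boolean startsWithUtf8Bom(byte[] head, int length) {
        return length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
    }

    /** Names of the .xml and .rels parts of an OOXML package that start with a UTF-8 BOM. */
    static List<String> partsWithBom(Path ooxmlFile) {
        List<String> hits = new ArrayList<>();
        try (ZipFile zip = new ZipFile(ooxmlFile.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                if (entry.isDirectory() || !(name.endsWith(".xml") || name.endsWith(".rels"))) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // Read up to three bytes; a single read() call may return fewer.
                    byte[] head = new byte[3];
                    int n = 0;
                    while (n < head.length) {
                        int r = in.read(head, n, head.length - n);
                        if (r < 0) {
                            break;
                        }
                        n += r;
                    }
                    if (startsWithUtf8Bom(head, n)) {
                        hits.add(name);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Pass one or more .xlsx paths on the command line.
        for (String arg : args) {
            System.out.println(arg + " -> " + partsWithBom(Paths.get(arg)));
        }
    }
}
```

    Running it against the attached sample file should list the affected parts, if the UTF-8 assumption holds.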

  2. Denis Konovalyenko

    @Manuel Souto Pico the origin of this issue is the StAX implementation that OmegaT uses at run time: com.sun.org.apache.xerces. It does not handle BOMs, so the aforementioned error appears when a document part that contains a BOM (workbook.xml in the current case) is read.

    I can see possible solutions:

    1. The OmegaT project starts using another XML parser, com.fasterxml.woodstox for instance. I assume it would be enough to add the following dependency to build.gradle - runtimeOnly 'com.fasterxml.woodstox:woodstox-core:6.2.8'… At least, it does not trigger the error when the New_Microsoft_Excel_Worksheet.xlsx document is filtered.
    2. The Okapi OpenXML Filter gains an additional programmatic layer that deals with BOMs before the content is handed to a StAX implementation.
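
    The second option could look roughly like this. It is a minimal sketch and not Okapi code: the BomSkipping class and its methods are hypothetical, and it assumes the part is available as a character stream whose first character may be U+FEFF.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper illustrating a BOM-stripping layer in front of StAX.
final class BomSkipping {

    /** Wraps a reader so that a single leading U+FEFF is consumed and dropped. */
    static Reader withoutBom(Reader source) {
        PushbackReader reader = new PushbackReader(source, 1);
        try {
            int first = reader.read();
            if (first != -1 && first != 0xFEFF) {
                reader.unread(first); // not a BOM: put the character back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return reader;
    }

    /** Parses with whichever StAX implementation is resolved and returns the root element name. */
    static String rootElementName(Reader source) {
        try {
            XMLStreamReader xml =
                    XMLInputFactory.newFactory().createXMLStreamReader(withoutBom(source));
            try {
                while (xml.hasNext()) {
                    if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                        return xml.getLocalName();
                    }
                }
                return null;
            } finally {
                xml.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Simplified stand-in for workbook.xml with a leading BOM.
        String part = "\uFEFF<?xml version=\"1.0\"?><workbook><sheets/></workbook>";
        System.out.println(rootElementName(new StringReader(part)));
    }
}
```

    Because the BOM is removed before the parser sees the stream, the result no longer depends on how the resolved StAX implementation treats it.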

    Do you think someone knowledgeable about OmegaT internals could advise whether the 1st approach is feasible and can be put into practice without concerns?

  3. t_cordonnier

    “I assume it would be enough to add the following dependency to build.gradle - runtimeOnly 'com.fasterxml.woodstox:woodstox-core:6.2.8'…”

    Not sure it would be enough to add an alternative parser. The fact that it is present does not mean that it has priority over other parsers.

    If I am not wrong, all the contents of the Woodstox parser are copied into Okapi’s JAR file, so why does OmegaT not use it? Simply because the JVM gives priority to the Xerces parser, which is in the JVM itself.

    What could be done:

    • start OmegaT with -Djavax.xml.stream.XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory
      (should be added in all OmegaT launchers: OmegaT.bat, OmegaT.sh, etc.)
      But yes, if we do it in the main Git repository of OmegaT, then the dependency must be added to the Gradle file
    • or change the code of the Okapi framework (not only the plugin, unfortunately) so that it does not call XMLInputFactory.newInstance() but directly “new com.ctc.wstx.stax.WstxInputFactory”
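
    To see whether the system property has any effect in a given setup, a small probe can print which implementation the JVM actually resolves. This is only a diagnostic sketch; the StaxProbe class name is made up.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic: reports which StAX input factory the JVM resolves.
final class StaxProbe {

    static final String PROPERTY = "javax.xml.stream.XMLInputFactory";

    /** Class name of the factory returned by the standard lookup. */
    static String resolvedFactory() {
        return XMLInputFactory.newFactory().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + System.getProperty(PROPERTY));
        System.out.println("resolved: " + resolvedFactory());
    }
}
```

    Started with -Djavax.xml.stream.XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory and with Woodstox on the classpath, the second line should name the Woodstox class. If Woodstox is not visible to the class loader, the lookup fails with a FactoryConfigurationError instead.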

  4. Denis Konovalyenko

    @t_cordonnier thank you for your feedback on this! Let me clarify some points.

    Firstly, I agree that all required dependencies, and Woodstox in particular, are included in the Okapi jar archive. According to the documentation of javax.xml.stream.XMLInputFactory#newFactory(), a new factory instance is located through the following ordered lookup:
    1. The javax.xml.stream.XMLInputFactory system property.
    2. The configuration file "stax.properties".
    3. The jaxp configuration file "jaxp.properties".
    4. The service-provider loading facility, defined by the java.util.ServiceLoader class, to attempt to locate and load an implementation of the service using the default loading mechanism: the service-provider loading facility will use the current thread's context class loader to attempt to load the service. If the context class loader is null, the system class loader will be used.
    5. The system-default implementation.

    I suppose our main interest can be the service-provider loading mechanism (#4) as the META-INF/services path in the archive contains necessary configurations for the initialisation of Woodstox factories (XMLInputFactory, XMLEventFactory, XMLOutputFactory).
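
    A quick way to inspect step #4 is to ask ServiceLoader which XMLInputFactory providers a given class loader can see. The sketch below uses a hypothetical StaxProviders class; on a plain JVM without Woodstox on the classpath the list may well be empty.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic: lists the StAX input factories registered as service providers.
final class StaxProviders {

    /** Class names of the XMLInputFactory providers visible to the given class loader. */
    static List<String> visibleTo(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        for (XMLInputFactory factory : ServiceLoader.load(XMLInputFactory.class, loader)) {
            names.add(factory.getClass().getName());
        }
        return names;
    }

    public static void main(String[] args) {
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        System.out.println("context class loader: " + visibleTo(context));
        System.out.println("own class loader:     " + visibleTo(StaxProviders.class.getClassLoader()));
    }
}
```

    Comparing the output for the application class loader with the output for the class loader that loaded a plugin filter would show whether the Woodstox registration is visible from each side.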

    Also, there is a reference Woodstox project documentation with the following class loading details:

    1. javax.xml.stream.XMLInputFactory.newInstance() is called by client code
    2. The file META-INF/services/javax.xml.stream.XMLInputFactory is searched in the current classpath
    3. As the current classpath contains woodstox-core-asl-4.2.0.jar, the
      file META-INF/services/javax.xml.stream.XMLInputFactory is found as shown below
    4. The contents of META-INF/services/javax.xml.stream.XMLInputFactory (as shown below) are read.
    5. The name of the Woodstox class, WstxInputFactory, mentioned in the above step is read
    6. WstxInputFactory class is loaded using Java Reflection and returned
    7. In all the sample programs,
      WstxInputFactory class is loaded using the above steps.

      Other Woodstox classes for schema validation (DTD, RelaxNG, W3C),
      XMLEventFactory and
      XMLOutputFactory are loaded in a similar manner as mentioned in the steps above.
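
    Steps 2 to 5 above can be reproduced with plain class loader calls. The sketch below is an illustration only: it simulates a plugin archive with a temporary directory that contains nothing but the service file, and the ServiceFileLookup name is made up. The final step (loading WstxInputFactory reflectively) is left out because it needs the real Woodstox jar.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper: reads provider names from META-INF/services files.
final class ServiceFileLookup {

    static final String SERVICE = "META-INF/services/javax.xml.stream.XMLInputFactory";

    /** Provider class names declared in every matching service file the loader can see. */
    static List<String> declaredProviders(ClassLoader loader) {
        List<String> names = new ArrayList<>();
        try {
            Enumeration<URL> files = loader.getResources(SERVICE);
            while (files.hasMoreElements()) {
                try (BufferedReader in = new BufferedReader(new InputStreamReader(
                        files.nextElement().openStream(), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        int hash = line.indexOf('#'); // '#' starts a comment
                        if (hash >= 0) {
                            line = line.substring(0, hash);
                        }
                        line = line.trim();
                        if (!line.isEmpty()) {
                            names.add(line);
                        }
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Simulates a plugin archive with a temporary directory (left behind for brevity). */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("plugin");
            Path file = root.resolve(SERVICE);
            Files.createDirectories(file.getParent());
            Files.write(file, "com.ctc.wstx.stax.WstxInputFactory\n".getBytes(StandardCharsets.UTF_8));
            // Parent is null so that only the simulated archive (and the bootstrap loader) is searched.
            try (URLClassLoader loader = new URLClassLoader(new URL[] {root.toUri().toURL()}, null)) {
                return declaredProviders(loader);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    The point of the demo is that the result depends entirely on which class loader performs the search: a loader that does not include the plugin archive will not find the entry.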

    Secondly, as far as I understand the org.omegat.filters2.master.PluginUtils#loadPlugins method and related code, class loading is performed for the classes (filters) listed in the provided Okapi jar manifest; however, the service-provider loading of the META-INF/services contents is not implemented.

    Thirdly, I appreciate your suggestion on adding implementation 'com.fasterxml.woodstox:woodstox-core:6.2.8' to build.gradle, which looks fairly simple and will load the required classes when the javax.xml.stream.XMLInputFactory#newFactory method is called. Still, it seems to me that the best architectural solution for OmegaT would be to dynamically load the implementation classes listed in a plugin jar under the META-INF/services/ path. What do you think?

    By the way, I have found an example of plugin loading on StackOverflow.

    CC: @Manuel Souto Pico

  5. Denis Konovalyenko

    @t_cordonnier the root cause of this issue is that the XML factories (XMLInputFactory, XMLEventFactory and XMLOutputFactory) are instantiated with the application class loader rather than with the class loader used to instantiate the plugin filters. So, a solution would be to pass the relevant class loader to the factory instantiation method, e.g.

    XMLInputFactory.newFactory("javax.xml.stream.XMLInputFactory", getClass().getClassLoader());
    

    This should be performed on the Okapi side. So, please disregard my proposal for loading the implementation classes listed in a plugin jar under the META-INF/services/ path.
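
    A slightly fuller sketch of that call follows. It is not the actual Okapi change: PluginAwareFactories is a hypothetical helper, and the fallback is an assumption added here because the two-argument newFactory overload is documented to throw a FactoryConfigurationError when the given class loader sees no provider.

```java
import javax.xml.stream.FactoryConfigurationError;
import javax.xml.stream.XMLInputFactory;

// Hypothetical helper: resolves the StAX input factory through a specific class loader.
final class PluginAwareFactories {

    static final String FACTORY_ID = "javax.xml.stream.XMLInputFactory";

    /**
     * Looks the factory up through the class loader that loaded the given class,
     * so that providers packaged in the same archive become visible.
     */
    static XMLInputFactory inputFactoryFor(Class<?> pluginClass) {
        try {
            return XMLInputFactory.newFactory(FACTORY_ID, pluginClass.getClassLoader());
        } catch (FactoryConfigurationError noProviderVisible) {
            // Assumption: falling back to the default lookup is acceptable here.
            return XMLInputFactory.newFactory();
        }
    }

    public static void main(String[] args) {
        XMLInputFactory factory = inputFactoryFor(PluginAwareFactories.class);
        System.out.println(factory.getClass().getName());
    }
}
```

    When the class passed in was loaded from an archive that bundles Woodstox and its service files, the first branch should return the Woodstox factory regardless of what the application class loader sees.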

    To be precise, there is one thing to improve in OmegaT: the plugin loading (org.omegat.filters2.master.PluginUtils#loadPlugins) could initialise one class loader (URLClassLoader at the moment) per jar archive. That way, possible classpath collisions between plugins would be avoided. Do you think this can at least be documented as an issue in OmegaT?
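
    To illustrate what one class loader per jar archive could mean, here is a rough sketch. It is not how PluginUtils works today; the PerJarLoaders class and its method are hypothetical, and real plugin loading would also need to read manifests and manage the loaders' life cycle.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: creates an isolated class loader for each plugin jar.
final class PerJarLoaders {

    /** Maps each jar file name in the directory to its own URLClassLoader. */
    static Map<String, URLClassLoader> loadersFor(Path pluginsDir, ClassLoader parent) {
        Map<String, URLClassLoader> loaders = new LinkedHashMap<>();
        if (!Files.isDirectory(pluginsDir)) {
            return loaders; // nothing to load
        }
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(pluginsDir, "*.jar")) {
            for (Path jar : jars) {
                URL[] urls = {jar.toUri().toURL()};
                // Each jar only sees its own classes plus whatever the parent exposes.
                loaders.put(jar.getFileName().toString(), new URLClassLoader(urls, parent));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return loaders;
    }

    public static void main(String[] args) {
        Path dir = Paths.get(args.length > 0 ? args[0] : "plugins");
        System.out.println(loadersFor(dir, PerJarLoaders.class.getClassLoader()).keySet());
    }
}
```

    With this layout, two plugins that bundle different versions of the same library would each resolve their own copy, as long as the shared parent does not already provide it.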

  6. t_cordonnier

    First point to avoid any confusion: I did not suggest modifying build.gradle. That suggestion came from you, and I warned that it would change nothing (it would add Woodstox to OmegaT’s lib directory without asking for it to be used). My suggestion was to add -Djavax.xml.stream.XMLInputFactory=com.ctc.wstx.stax.WstxInputFactory in the launcher (bat, sh or another): this solution can be implemented in the short term, even by a non-developer. So if anybody has the problem, we can use this solution until a real one is implemented.

    Now, before searching for the best solution for OmegaT, one question: if Okapi applications (Rainbow, for example) are also using Woodstox, how do they ensure that XMLInputFactory uses Woodstox rather than the Java default? Maybe you should have a look at it, and then we can decide whether we want to use the same mechanism or not.

    Finally, I understand the proposal in your second message, and it looks good. But I am not sure I understand what you want to document in OmegaT, since it seems that your solution can be implemented fully on the Okapi side. Can you try to implement it, and then tell me what exactly you want us to document?
