OpenXML filter for OmegaT gives error when opening Excel file with embedded HTML

Issue #38 new
Manuel Souto Pico created an issue

I have an OmegaT project to translate an Excel file containing some HTML content and I have tried to use the Okapi OpenXML filter to open it.

When I enable the Okapi OpenXML filter and reload the project, I get an error and the project is closed. If I try to open it again, I get the same error and the project closes again. In other words, the project is now unusable. The only way I can fix it is to disable the Okapi OpenXML filter by editing the `filters.xml` file manually.

This is the error OmegaT gives me before the project is closed: 

This is the log: 

17318: Info: Omtv= 4.2.0 flag2_5=true flag3_plus=true
17318: Info: Loading: 'C:\Users\Valentina\Downloads\JLFW_Test3\source\br.xlsx' with okf_openxml
17318: Error: Failed to load specified project! (TF_LOAD_ERROR)
17318: Error: java.io.IOException: C:/Users/Valentina/Downloads/JLFW_Test3/source/br.xlsx
17318: Error: java.lang.ClassCastException: com.sun.xml.internal.stream.events.CharacterEvent cannot be cast to javax.xml.stream.events.EndElement
17318: Error: at org.omegat.filters2.master.FilterMaster.loadFile(FilterMaster.java:206)
17318: Error: at org.omegat.core.data.RealProject.loadSourceFiles(RealProject.java:1150)
17318: Error: at org.omegat.core.data.RealProject.loadProject(RealProject.java:369)
17318: Error: at org.omegat.core.data.ProjectFactory.loadProject(ProjectFactory.java:72)
17318: Error: at org.omegat.gui.main.ProjectUICommands$6.lambda$doInBackground$0(ProjectUICommands.java:523)
17318: Error: at org.omegat.core.Core.executeExclusively(Core.java:379)
17318: Error: at org.omegat.gui.main.ProjectUICommands$6.doInBackground(ProjectUICommands.java:523)
17318: Error: at org.omegat.gui.main.ProjectUICommands$6.doInBackground(ProjectUICommands.java:442)
17318: Error: at javax.swing.SwingWorker$1.call(SwingWorker.java:295)
17318: Error: at java.util.concurrent.FutureTask.run(FutureTask.java:266)
17318: Error: at javax.swing.SwingWorker.run(SwingWorker.java:334)
17318: Error: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
17318: Error: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
17318: Error: at java.lang.Thread.run(Thread.java:748)
17318: Error: Caused by: java.lang.ClassCastException: com.sun.xml.internal.stream.events.CharacterEvent cannot be cast to javax.xml.stream.events.EndElement
17318: Error: at com.sun.xml.internal.stream.events.DummyEvent.asEndElement(DummyEvent.java:122)
17318: Error: at net.sf.okapi.filters.openxml.StringItemParser.processText(StringItemParser.java:128)
17318: Error: at net.sf.okapi.filters.openxml.StringItemParser.parse(StringItemParser.java:78)
17318: Error: at net.sf.okapi.filters.openxml.SharedStringsPart.process(SharedStringsPart.java:113)
17318: Error: at net.sf.okapi.filters.openxml.SharedStringsPart.open(SharedStringsPart.java:91)
17318: Error: at net.sf.okapi.filters.openxml.OpenXMLFilter.nextInDocument(OpenXMLFilter.java:446)
17318: Error: at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:256)
17318: Error: at net.sf.okapi.filters.openxml.OpenXMLFilter.next(OpenXMLFilter.java:265)
17318: Error: at net.sf.okapi.lib.omegat.AbstractOkapiFilter.processFile(AbstractOkapiFilter.java:374)
17318: Error: at net.sf.okapi.lib.omegat.AbstractOkapiFilter.parseFile(AbstractOkapiFilter.java:209)
17318: Error: at net.sf.okapi.lib.omegat.OpenXMLFilter.parseFile(OpenXMLFilter.java:21)
17318: Error: at org.omegat.filters2.master.FilterMaster.loadFile(FilterMaster.java:204)
17318: Error: ... 13 more
17318: Info: Project loading end (LOG_DATAENGINE_LOAD_END)

The sample OmegaT project is attached, including a source sample file, also attached separately.

I am using:

  • OmegaT 4.2.0
  • okapiFiltersForOmegaT-1.8-1.40.0-dist.zip
  • Windows 10

Comments (7)

  1. Kuro Kurosaka (BH Lab)

    I don’t know if this is related to this bug in any way but I’d like to note that the private method net.sf.okapi.filters.openxml.SharedStringsPart.process(SharedStringsPart.java:113) that is part of the stack trace no longer exists. This was removed in commit 9cf885f73e094d4059a4b9bb8eb4e99fe9c2f5e9 in order to fix issue #1051.
    A call to XMLEventReader.asEndElement() still exists in net.sf.okapi.filters.openxml.SharedStrings.Default#readWith(XMLEventReader). So replacing the plugin with the latest version probably wouldn’t fix this issue.

    Also, I’d like to note that an OmegaT project that includes .docx files but no .xlsx files opens fine. It seems this only happens when the project includes an .xlsx file.
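
The stack trace shows `DummyEvent.asEndElement` being invoked on a `CharacterEvent`, which is what raises the `ClassCastException`. As a rough illustration only (this is not the Okapi code, and the class and method names below are hypothetical), a StAX loop can check the event type before casting and tolerate text that arrives as more than one `Characters` event:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.EndElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of Okapi or OmegaT.
public class SafeEndElementDemo {

    /** Returns the text of the first <t> element, however many Characters events it spans. */
    public static String textOfFirstT(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        boolean inside = false;
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()
                    && "t".equals(event.asStartElement().getName().getLocalPart())) {
                inside = true;
            } else if (inside && event.isCharacters()) {
                // A parser may report one run of text as several events; append them all.
                text.append(event.asCharacters().getData());
            } else if (inside && event.isEndElement()) {
                // The type is checked before the cast, so asEndElement() cannot fail here.
                EndElement end = event.asEndElement();
                if ("t".equals(end.getName().getLocalPart())) {
                    break;
                }
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(textOfFirstT("<si><t>a=b&amp;z=y</t></si>"));
    }
}
```

Whether the filter actually receives split text events in this scenario is an assumption; the sketch only shows that guarding the cast makes the loop independent of how the parser chunks the text.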

  2. Manuel Souto Pico reporter

    I can still reproduce this with plugin okapiFiltersForOmegaT-1.12-1.44.0.jar (custom build based on commit 24a23ea)

    Adding file 22b35.xlsx to an OmegaT project using the Okapi OpenXML filter gives me this error:

    and this configuration:

    • Version: OmegaT-5.7.1_0_c3206253
    • Platform: Linux 6.0.2-arch1-1
    • Java: 1.8.0_312 amd64
    • Memory: 296MiB total / 169MiB free / 3520MiB max

    I have been able to isolate the cause of the issue: an ampersand character, whether it appears in a URL (e.g. https://foo.org/index?a=b&z=y), by itself, or in an HTML entity.

  3. Manuel Souto Pico reporter

    A silly question from a non-developer: how does the filter handle characters found in the source text that are reserved in XML, such as <, > or &? Are they parsed in any way so that they don’t break the XML?

  4. Kuro Kurosaka (BH Lab)

    Hi @Manuel Souto Pico, usually these characters are represented between the XML tags by entity references such as &amp; for the ampersand, &lt; for the less-than sign, etc. They could also be placed in a CDATA section, but that is not often used.
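
To make the entity mechanism concrete, here is a minimal sketch using only the JDK's StAX API. The class and method names are hypothetical, and it does not reproduce how Excel or Okapi write the shared strings part; it only shows a reserved character being escaped on the way in and decoded by the parser on the way out:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class, not part of Okapi or OmegaT.
public class EntityDemo {

    /** Escapes &, < and > for use as XML text content. */
    public static String escapeXmlText(String raw) {
        // '&' is replaced first so the entities produced below are not escaped again.
        return raw.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Parses a small XML document and returns all of its text content, decoded. */
    public static String readText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Ask the parser to merge adjacent character data into single events.
        factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
        XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
        StringBuilder text = new StringBuilder();
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isCharacters()) {
                text.append(event.asCharacters().getData());
            }
        }
        reader.close();
        return text.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String cell = "https://foo.org/index?a=b&z=y";
        String xml = "<t>" + escapeXmlText(cell) + "</t>";
        System.out.println(xml);            // the escaped form stored in the XML
        System.out.println(readText(xml));  // the decoded form seen by the application
    }
}
```

The round trip should return the original cell text, which is why a literal ampersand in a cell is normally harmless once the file has been written as well-formed XML.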

    What is weird about this bug is that the ampersand character in an Excel cell appears to be interpreted as part of XML or HTML data. That could happen if HTML or XML were specified as a subfilter, but that is not the case here. The same file can be processed without an issue by Okapi in isolation.

    Another possible cause is that, when Okapi is run from OmegaT, an XML parser implementation different from the one Okapi usually uses gets picked up.
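
One way to check the parser hypothesis is to print which StAX implementation the runtime resolves. The sketch below uses only standard JDK calls; the class name is hypothetical, and it would have to run inside the same JVM and classpath as OmegaT to be conclusive. For comparison, the stack trace in the report shows event classes from `com.sun.xml.internal.stream`, the JDK's built-in implementation.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical diagnostic class, not part of Okapi or OmegaT.
public class StaxImplementationCheck {

    /** Returns the class name of the XMLInputFactory the runtime resolves by default. */
    public static String inputFactoryClass() {
        return XMLInputFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        System.out.println("XMLInputFactory implementation: " + inputFactoryClass());
        // A system property with this name, if set, overrides the default lookup.
        System.out.println("javax.xml.stream.XMLInputFactory = "
                + System.getProperty("javax.xml.stream.XMLInputFactory"));
    }
}
```

If the class name printed under OmegaT differs from the one printed when Okapi runs on its own, that would support the idea that the two setups parse the same file with different StAX implementations.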
