Crash merging XLIFF 2.0 - code alignment failing due to type set to "null"

Issue #1165 resolved
Chase Tingley created an issue

Clean testcase attached. The testcase.xliff is XLIFF 2.0 generated by Memsource; the testcase.xliff.xlf was generated by Okapi and has a target translation provided.

Running the merge via tikal causes interesting output:

$ tikal.sh -fc okf_xliff2 -m testcase.xliff.xlf -trace
[.. some output snippet ..]
Merging
Source language: en
Target language: ja
Default input encoding: UTF-8
Output encoding: UTF-8
Filter configuration: okf_xliff2
XLIFF: testcase.xliff.xlf
Output: testcase.out.xliff
Input: /home/tingley/Downloads/ENG-5028/testcase/testcase.xliff.xlf
BOM not found. Now trying to guess document encoding.
thread [pool-1-thread-1] started.
XMLInputFactory: com.ctc.wstx.stax.WstxInputFactory
BOM not found. Now trying to guess document encoding.
Error in <file> id='doc1', <unit> id='5'
Last element read: '{urn:oasis:names:tc:xliff:document:2.0}pc':
Both 'dataRefStart' and 'dataRefEnd' should be present or absent.
Can't find matching target Code(s) id='1624' originalId='1i' data='<x-fmt id=1624>', Closing data='</fmt>' in TU 5
 Segment: "Blah blah blah. blah blah blah blah blah blah blah..."
Can't find matching target Code(s) id='1624' originalId='1i' data='</x-fmt>' in TU 5
 Segment: "Blah blah blah. blah blah blah blah blah blah blah..."
thread [pool-1-thread-1] closed.
net.sf.okapi.common.exceptions.OkapiMergeException: Error merging from original file
    at net.sf.okapi.lib.merge.step.OriginalDocumentXliffMergerStep$1.produce(OriginalDocumentXliffMergerStep.java:159)
    at net.sf.okapi.lib.merge.step.OriginalDocumentXliffMergerStep$1.produce(OriginalDocumentXliffMergerStep.java:147)
    at net.sf.okapi.common.io.InputStreamFromOutputStream$DataProducer.call(InputStreamFromOutputStream.java:139)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: net.sf.okapi.common.exceptions.OkapiMergeException: Tried writing placeholder to XLIFF 2 with the same ID as another placeholder in the same text unit. Previous ID: 1624 | XLIFF 2 ID: 1i | Placeholder: 
    at net.sf.okapi.filters.xliff2.OkpToX2Converter.copyOver(OkpToX2Converter.java:350)
    at net.sf.okapi.filters.xliff2.OkpToX2Converter.textUnit(OkpToX2Converter.java:206)
    at net.sf.okapi.filters.xliff2.OkpToX2Converter.handleEvent(OkpToX2Converter.java:112)
    at net.sf.okapi.filters.xliff2.XLIFF2FilterWriter.handleEvent(XLIFF2FilterWriter.java:109)
    at net.sf.okapi.lib.merge.merge.SkeletonMergerWriter.processTextUnit(SkeletonMergerWriter.java:266)
    at net.sf.okapi.lib.merge.merge.SkeletonMergerWriter.handleEvent(SkeletonMergerWriter.java:137)
    at net.sf.okapi.lib.merge.step.OriginalDocumentXliffMergerStep$1.produce(OriginalDocumentXliffMergerStep.java:156)

I have debugged far enough to discover that something is going wrong in TextFragmentUtil.alignAndCopyCodeMetadata() (which is called with strategy STRICT by the TextUnitMerger). The problem is that the type value of the codes in the "from" code set has been reset to the string "null" rather than the string "g", as it should be. This causes the alignment to fail, after which one set of codes is cloned onto the other for some reason.
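As a minimal standalone sketch of the symptom (this is a hypothetical illustration, not the actual Okapi code path): if a null type were stringified at some point during the round-trip, the resulting literal string "null" would no longer compare equal to the original type "g", which is enough to make a strict comparison fail:

```java
public class NullTypeSketch {
    public static void main(String[] args) {
        // Type on the code parsed from the original XLIFF 2.0.
        String originalType = "g";
        // If a null type is stringified somewhere (e.g. via String.valueOf),
        // it becomes the literal string "null".
        String roundTrippedType = String.valueOf((Object) null);
        System.out.println(roundTrippedType);                      // prints "null"
        System.out.println(originalType.equals(roundTrippedType)); // prints "false"
    }
}
```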

I have not yet dug far enough to find out why the type value is wrong coming out of the skeleton.

Comments (21)

  1. jhargrave-straker

    @Chase Tingley

    I ran testcase.xliff through the integration tests and didn’t get any error. However, there are some differences:

    1. memsource custom tags are stripped out (this is a known problem)
    2. The data id was converted to d1 - though I did think it strange, maybe illegal, that the pc id and data id were both source1 in the original.

    I’ve gone over the tikal code several times to try to find whether its pipeline is different from our standard integration tests. Possibly it’s the output of the XLIFF 1.2. Anyway, there must be some difference in tikal that is causing trouble. We need to track that down.

    Check the tkitMerged file - xliff_extracted is the xliff 1.2 file. Would be interesting to compare with what tikal spit out.

  2. jhargrave-straker

    I see the difference: tikal is outputting the XLIFF 1.2 with simplified codes (g, x, etc.). That’s not going to work for the new merger code. I thought we had updated that so that tikal would output standard XLIFF codes (bpt, ept, etc.); if not, I will change that now. As I’ve mentioned in the past, the XLIFF 1.2 format generally is not going to be sufficient for merging in the future. There is simply too much loss of information, even with the full inline codes. The new serialized format should be the standard going forward, as it outputs all information without loss.

    Maybe this release we should hook up the serialized merge as an option for tikal and Rainbow? Currently it’s only being output in the integration tests.

  3. Chase Tingley reporter

    @jhargrave-straker Thanks. The actual issue was discovered using Longhorn - I just used tikal in the report since it made it easy to reproduce. I assume we need a similar fix there?

  4. jhargrave-straker

    Hmm, let me check. But I think Longhorn uses the Rainbow settings.

    UPDATE: I didn’t see anything specific in Longhorn that sets the placeholder mode. I’m guessing the default Rainbow options are used, but I don’t use Longhorn and am not familiar with how it works. In any case, I’m assuming the current build issues with Longhorn would have to be fixed so it pulls in Okapi 1.45.0.

  5. jhargrave-straker

    This should now be resolved with the changes to tikal and Rainbow to use full (non-simplified: bpt/ept, ph, etc.) inline codes.

  6. Chase Tingley reporter
    • changed status to open

    Tentatively reopening this - Alec tested the fix and we were still having trouble with this case (or at least, the original file that it was distilled from). I'm going to dig in deeper. Assigning to myself for the followup.

  7. Chase Tingley reporter

    @jhargrave-straker I did some more debugging here, and I think my original analysis about the type field is incorrect. What appears to be going on is that the XLIFF 2.0 <-> 1.2 mapping is not symmetric, in a way that is causing confusion when aligning the codes.

    • In the source XLIFF 2.0, the segment has <data> like this.
    <originalData>
    <data id="source1">&lt;w:hyperlink r:id="rId7"&gt;&lt;/w:hyperlink&gt;</data>
    </originalData>
    

    This data is referenced as the data for a <pc> pair:

    <pc id="source1" dataRefStart="source1">Blah blah blah.</pc>
    

    (There are various other things wrong here, such as the missing dataRefEnd and the repeated id values that you mentioned, but I don’t think either of them is causing our problems.)

    • The XLIFF 1.2 interchange extracts this to a <g> tag, but embeds the data value as an equiv-text value:
    <g equiv-text="&lt;w:hyperlink r:id=&quot;rId7&quot;&gt;&lt;/w:hyperlink&gt;" id="source1">Blah blah blah.</g>
    

    This seems to be the start of the problem.

    • On merge, the XLIFF 1.2 that is parsed back in has this w:hyperlink markup as the displayText field of the parsed code. But the XLIFF 2.0 parsed from source still has it as part of the data field of its code.
    • This mismatch causes TextFragmentUtil#strictSearch() to fail all the rules that compare CMP_DATA. The rule comparing CMP_ID and CMP_TAG_TYPE should still work, but for some reason the XLIFF 1.2 code has the numerical id -2021876874 while the XLIFF 2.0 code has the numerical id 2021876874. (These are hash code values of the string source1, which means that the XLIFF 1.2 code appears correct.)
    • So the alignment of these codes gets confused (in fact, the first pair of codes from one format is matched with the second pair of codes in the other, because both of them have a null value for the data field).

    This is still the case regardless of the gmode value that you changed in your recent PR.
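    As an aside on the "Both 'dataRefStart' and 'dataRefEnd' should be present or absent" warning in the log above: a spec-conforming variant of the markup quoted earlier would carry both attributes (the ids below are illustrative, not from the testcase):

```xml
<originalData>
  <!-- illustrative id; the testcase uses id="source1" -->
  <data id="d1">&lt;w:hyperlink r:id="rId7"&gt;&lt;/w:hyperlink&gt;</data>
</originalData>
...
<!-- both refs present, satisfying the "present or absent together" check -->
<pc id="1" dataRefStart="d1" dataRefEnd="d1">Blah blah blah.</pc>
```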

  8. Chase Tingley reporter

    The id mismatch looks like it’s caused by the XLIFF 2.0 code calling StringUtil.generateIntId(), which includes a call to Math.abs() at the end. I suspect this is the culprit. If I look at what’s happening in the XLIFF 1.2 case, I bet we’re just calling the JVM native hashCode().
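    The sign flip is easy to reproduce in isolation. This standalone snippet (not Okapi code, just the arithmetic described above) shows how applying Math.abs() to the JVM hash of "source1" yields exactly the two ids observed in the debugger:

```java
public class IdSignDemo {
    public static void main(String[] args) {
        String xliffId = "source1";
        // Plain JVM String hash, as the XLIFF 1.2 side appears to use.
        int jvmHash = xliffId.hashCode();
        // Hash with Math.abs() applied at the end, as described for
        // StringUtil.generateIntId() on the XLIFF 2.0 side.
        int absHash = Math.abs(jvmHash);
        System.out.println(jvmHash); // -2021876874
        System.out.println(absHash); // 2021876874
    }
}
```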

  9. Chase Tingley reporter

    Ah, except this still doesn’t work, because of a false positive match against the wrong tag: CMP_DATA+CMP_TAG_TYPE is still prioritized over CMP_ID+CMP_TAG_TYPE.
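    A toy model of that prioritization problem (not the actual TextFragmentUtil rules, just the shape of the failure): when a data comparison runs before an id comparison, two codes that both carry null data match each other even though their ids disagree:

```java
import java.util.Objects;

public class MatchOrderSketch {
    // Toy stand-in for an inline code with a numeric id and raw data.
    record Code(int id, String data) {}

    // Data-first rule: null data on both sides counts as equal data.
    static boolean dataFirstMatch(Code a, Code b) {
        return Objects.equals(a.data(), b.data());
    }

    // Id-first rule.
    static boolean idFirstMatch(Code a, Code b) {
        return a.id() == b.id();
    }

    public static void main(String[] args) {
        Code from = new Code(-2021876874, null);
        Code wrongTo = new Code(42, null);
        // Data-first pairing produces a false positive on the null data...
        System.out.println(dataFirstMatch(from, wrongTo)); // prints "true"
        // ...while id-first pairing correctly rejects it.
        System.out.println(idFirstMatch(from, wrongTo));   // prints "false"
    }
}
```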

  10. jhargrave-straker

    Ah yes, that code is deprecated. See the FIXME comment. I created a JVM-independent method in StringUtils.generateIntId() that we should be using.

    I hate touching that xliff 1.2 code - there are always a few tests that fail

  11. jhargrave-straker

    @Chase Tingley I'll take a look at this. I've gone back and forth with ID vs. DATA being the priority. Theoretically ID should be the final word for matching, but the XLIFF 1.2 filter doesn't always create the proper ids. I had all kinds of trouble getting this to work generally.

    It seems that no matter what I did, some test would fail.

  12. Chase Tingley reporter

    I have a local branch that seems to work where I updated the XLIFFFilter to use generateIntId() and then inverted the order of the data vs id comparisons. Switching the rule ordering does cause some tests to break, though.

  13. jhargrave-straker

    @Chase Tingley If we try to fix this, it may be a good time to remove XliffFilter.retrieveId(). The other problem, beyond the JVM hashcode, is that over time we have added code that independently creates code ids for source and target codes. These become desynchronized and need to be matched up and given the same id - the main reason we match on data + type. It’s possible other bilingual filters have the same bug.

    The goal would be to make the Code.id the primary way to synchronize codes. Not sure this is possible because we do tricky things to codes in a number of places like renumbering, balancing etc.. I tried this in the past but never got it to work. It’s like herding cats - you fix one test then another one breaks.

  14. Chase Tingley reporter

    I think my debugging into the data/displayText part of the problem was using a stale interchange file from before you switched Tikal over to using extended code syntax. With that change applied, and your current PR changes to code ID generation, I think this is now fixed. (🎉) I’ll continue to do some testing.

  15. Chase Tingley reporter

    I’m a huge idiot: I was still having problems, but it’s because the import bconf I was sending to Longhorn still had writerOptions.placeholderMode.b=true set in it.

  16. Chase Tingley reporter

    I think this is resolved as far as Okapi is concerned. Thanks for all your help, Jim!
