Crash merging XLIFF 2.0 - code alignment failing due to type set to "null"
Clean testcase attached. The testcase.xliff is XLIFF 2.0 generated by Memsource; the testcase.xliff.xlf was generated by Okapi and has a target translation provided.
Running the merge via tikal causes interesting output:
$ tikal.sh -fc okf_xliff2 -m testcase.xliff.xlf -trace
[.. some output snippet ..]
Merging
Source language: en
Target language: ja
Default input encoding: UTF-8
Output encoding: UTF-8
Filter configuration: okf_xliff2
XLIFF: testcase.xliff.xlf
Output: testcase.out.xliff
Input: /home/tingley/Downloads/ENG-5028/testcase/testcase.xliff.xlf
BOM not found. Now trying to guess document encoding.
thread [pool-1-thread-1] started.
XMLInputFactory: com.ctc.wstx.stax.WstxInputFactory
BOM not found. Now trying to guess document encoding.
Error in <file> id='doc1', <unit> id='5'
Last element read: '{urn:oasis:names:tc:xliff:document:2.0}pc':
Both 'dataRefStart' and 'dataRefEnd' should be present or absent.
Can't find matching target Code(s) id='1624' originalId='1i' data='<x-fmt id=1624>', Closing data='</fmt>' in TU 5
Segment: "Blah blah blah. blah blah blah blah blah blah blah..."
Can't find matching target Code(s) id='1624' originalId='1i' data='</x-fmt>' in TU 5
Segment: "Blah blah blah. blah blah blah blah blah blah blah..."
thread [pool-1-thread-1] closed.
net.sf.okapi.common.exceptions.OkapiMergeException: Error merging from original file
at net.sf.okapi.lib.merge.step.OriginalDocumentXliffMergerStep$1.produce(OriginalDocumentXliffMergerStep.java:159)
at net.sf.okapi.lib.merge.step.OriginalDocumentXliffMergerStep$1.produce(OriginalDocumentXliffMergerStep.java:147)
at net.sf.okapi.common.io.InputStreamFromOutputStream$DataProducer.call(InputStreamFromOutputStream.java:139)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: net.sf.okapi.common.exceptions.OkapiMergeException: Tried writing placeholder to XLIFF 2 with the same ID as another placeholder in the same text unit. Previous ID: 1624 | XLIFF 2 ID: 1i | Placeholder:
at net.sf.okapi.filters.xliff2.OkpToX2Converter.copyOver(OkpToX2Converter.java:350)
at net.sf.okapi.filters.xliff2.OkpToX2Converter.textUnit(OkpToX2Converter.java:206)
at net.sf.okapi.filters.xliff2.OkpToX2Converter.handleEvent(OkpToX2Converter.java:112)
at net.sf.okapi.filters.xliff2.XLIFF2FilterWriter.handleEvent(XLIFF2FilterWriter.java:109)
at net.sf.okapi.lib.merge.merge.SkeletonMergerWriter.processTextUnit(SkeletonMergerWriter.java:266)
at net.sf.okapi.lib.merge.merge.SkeletonMergerWriter.handleEvent(SkeletonMergerWriter.java:137)
at net.sf.okapi.lib.merge.step.OriginalDocumentXliffMergerStep$1.produce(OriginalDocumentXliffMergerStep.java:156)
I have debugged far enough in to discover that something is going wrong in TextFragmentUtil.alignAndCopyCodeMetadata() (which is called with strategy STRICT by the TextUnitMerger). The problem is that the type value of the codes in the from code set has been reset to the string "null", rather than the string "g" as it should be. This causes the alignment to fail, after which one set of codes gets cloned onto the other for some reason.
I have not yet dug far enough to find out why the type value is wrong coming out of the skeleton.
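A quick illustration of how a null type can end up as the literal string "null" (my guess at the mechanism, not a confirmed trace through the Okapi skeleton code): in Java, string concatenation and String.valueOf both silently render a null reference as the four-character string "null".

```java
public class NullTypeDemo {
    public static void main(String[] args) {
        String type = null; // e.g. a code whose type was never set

        // Neither of these fails; both produce the string "null":
        String viaConcat = "" + type;
        String viaValueOf = String.valueOf(type);

        System.out.println(viaConcat);                 // null
        System.out.println("null".equals(viaConcat));  // true
        System.out.println("null".equals(viaValueOf)); // true
    }
}
```

If the skeleton writer stringifies the type field this way, a missing type would round-trip as the string "null" and break any comparison against "g".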
Comments (21)
-
I see the difference. tikal is outputting the XLIFF 1.2 with simplified codes (g, x, etc.). That’s not going to work for the new merger code. I thought we updated that so that tikal would output standard XLIFF codes (bpt, ept, etc.). If not, I will change that now. As I’ve mentioned in the past, the XLIFF 1.2 format generally is not going to be sufficient for merging in the future. There is simply too much loss of information, even with the full inline codes. The new serialized format should be the standard going forward, as it outputs all information without loss.
Maybe this release we should hook up the serialized merge as an option for tikal and Rainbow? Currently it’s only being output in the integration tests.
-
I created a PR that should fix this with tikal and rainbow: https://bitbucket.org/okapiframework/okapi/pull-requests/643/tikal-and-rainbow-now-output-xliff-12-with
-
reporter @jhargrave-straker Thanks. The actual issue was discovered using Longhorn - I just used tikal in the report since it made it easy to reproduce. I assume we need a similar fix there?
-
Hmm, let me check. But I think Longhorn uses the Rainbow settings.
UPDATE: I didn’t see anything Longhorn-specific that sets the placeholder mode. I’m guessing the default Rainbow options are used. But I don’t use Longhorn and am not familiar with how it works. In any case, I’m assuming the current build issues with Longhorn would have to be fixed so it pulls in Okapi 1.45.0.
-
- changed status to resolved
This should now be resolved with the changes to tikal and Rainbow to use full (non-simplified) inline codes (bpt/ept, ph, etc.).
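For context, here is a hand-written sketch (not taken from the attached testcase) of the difference: the simplified representation collapses a paired code into a bare g element, while the full inline-code representation keeps bpt/ept carrying the escaped original data.

```xml
<!-- Simplified codes: the original markup is gone -->
<source>Blah <g id="1">blah</g> blah.</source>

<!-- Full inline codes: the original data travels with the code -->
<source>Blah <bpt id="1">&lt;w:hyperlink r:id="rId7"&gt;</bpt>blah<ept id="1">&lt;/w:hyperlink&gt;</ept> blah.</source>
```

With the full form, the merger can reconstruct the original data from the interchange file instead of having to re-derive it from the skeleton.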
-
reporter - changed status to open
Tentatively reopening this - Alec tested the fix and we were still having trouble with this case (or at least, the original file that it was distilled from). I'm going to dig in deeper. Assigning to myself for the followup.
-
reporter -
assigned issue to
-
assigned issue to
-
reporter @jhargrave-straker I did some more debugging here, and my original analysis about the type field is incorrect, I think. What appears to be going on is that the XLIFF 2.0<-->1.2 mapping is not symmetric, in a way that is causing confusion when aligning the codes.
- In the source XLIFF 2.0, the segment has <data> like this:
  <originalData> <data id="source1"><w:hyperlink r:id="rId7"></w:hyperlink></data> </originalData>
  This data is referenced as the data for a <pc> pair: <pc id="source1" dataRefStart="source1">Blah blah blah.</pc>
  (There are various other things wrong here: no dataRefEnd, and the repeated id values that you mentioned; but I don’t think either of these is causing us problems.)
- The XLIFF 1.2 interchange extracts this to a <g> tag, but embeds the data value as an equiv-text value:
  <g equiv-text="<w:hyperlink r:id="rId7"></w:hyperlink>" id="source1">Blah blah blah.</g>
  This seems to be the start of the problem.
- On merge, the XLIFF 1.2 that is parsed back in has this w:hyperlink markup as the displayText field of the parsed code, but the XLIFF 2.0 parsed from the source still has it as part of the data field of its code.
- This mismatch causes TextFragmentUtil#strictSearch() to fail all the rules that compare CMP_DATA. The rule that compares CMP_ID and CMP_TAG_TYPE should still work, but for some reason the XLIFF 1.2 code has the numerical id -2021876874 while the XLIFF 2.0 code has the numerical id 2021876874. (These are hashcode values of the string source1, which means that the XLIFF 1.2 code appears correct.)
- So the alignment of these codes gets confused (in fact, the first pair of codes from one format is matched with the second pair of codes in the other, because both of them have a null value for the data field).
This is still the case regardless of the gmode value that you changed in your recent PR.
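The sign discrepancy above is easy to reproduce in isolation: the two numeric ids quoted are exactly String.hashCode() of "source1" with and without Math.abs().

```java
public class HashIdDemo {
    public static void main(String[] args) {
        int raw = "source1".hashCode();   // what the XLIFF 1.2 side appears to use
        int wrapped = Math.abs(raw);      // the sign flip Math.abs() introduces

        System.out.println(raw);     // -2021876874
        System.out.println(wrapped); // 2021876874
    }
}
```

So a comparison on the raw int ids can never match unless both sides go through the same id-generation path.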
-
reporter The id mismatch looks like it’s caused by the XLIFF 2.0 code calling StringUtil.generateIntId(), which includes a call to Math.abs() at the end. I suspect this is the culprit. If I look at what’s happening in the XLIFF 1.2 case, I bet we’re just calling the JVM-native hashCode()…
-
reporter Ah, except this still doesn’t work, because of the false positive match against the wrong tag: CMP_DATA+CMP_TAG_TYPE is still prioritized over CMP_ID+CMP_TAG_TYPE.
-
Ah yes, that code is deprecated; see the FIXME comment. I created a JVM-independent method in StringUtils.generateIntId() that we should be using.
I hate touching that XLIFF 1.2 code; there are always a few tests that fail.
-
@Chase Tingley I'll take a look at this. I've gone back and forth with ID vs. DATA being the priority. Theoretically ID should be the final word for matching, but the XLIFF 1.2 filter doesn't always create the proper IDs. I had all kinds of trouble getting this to work generally.
It seems that no matter what I did, some test would fail.
-
reporter I have a local branch that seems to work, where I updated the XLIFFFilter to use generateIntId() and then inverted the order of the data vs. id comparisons. Switching the rule ordering does cause some tests to break, though.
-
@Chase Tingley If we try to fix this, it may be a good time to remove XliffFilter.retrieveId(). The other problem, beyond the JVM hashcode, is that over time we have added code that independently creates code IDs for source and target codes. These become desynchronized and need to be matched up and given the same id; that is the main reason we match on Data+type. It’s possible other bilingual filters have the same bug.
The goal would be to make Code.id the primary way to synchronize codes. I'm not sure this is possible, because we do tricky things to codes in a number of places, like renumbering, balancing, etc. I tried this in the past but never got it to work. It’s like herding cats: you fix one test, then another one breaks.
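To make the priority question concrete, here is a toy sketch (my own model, not Okapi's actual Code or TextFragmentUtil API) of why data-first matching goes wrong when both sides carry null data, while id-first matching stays exact:

```java
import java.util.*;

public class CodeAlignDemo {
    // Hypothetical stand-in for an inline code; not Okapi's Code class.
    record ToyCode(int id, String data) {}

    // Greedy data-first match: each source code takes the first unused
    // target code with equal data (null == null counts as a match).
    static int[] matchByData(List<ToyCode> src, List<ToyCode> trg) {
        int[] match = new int[src.size()];
        boolean[] used = new boolean[trg.size()];
        for (int i = 0; i < src.size(); i++) {
            match[i] = -1;
            for (int j = 0; j < trg.size(); j++) {
                if (!used[j] && Objects.equals(src.get(i).data(), trg.get(j).data())) {
                    match[i] = j;
                    used[j] = true;
                    break;
                }
            }
        }
        return match;
    }

    // Id-first match: exact and order-independent.
    static int[] matchById(List<ToyCode> src, List<ToyCode> trg) {
        int[] match = new int[src.size()];
        for (int i = 0; i < src.size(); i++) {
            match[i] = -1;
            for (int j = 0; j < trg.size(); j++) {
                if (src.get(i).id() == trg.get(j).id()) {
                    match[i] = j;
                    break;
                }
            }
        }
        return match;
    }

    public static void main(String[] args) {
        // Two codes whose data fields were lost (null) on one side,
        // listed in a different order on the other side:
        List<ToyCode> src = List.of(new ToyCode(1624, null), new ToyCode(1625, null));
        List<ToyCode> trg = List.of(new ToyCode(1625, null), new ToyCode(1624, null));

        System.out.println(Arrays.toString(matchByData(src, trg))); // [0, 1] -- pairs the wrong codes
        System.out.println(Arrays.toString(matchById(src, trg)));   // [1, 0] -- pairs by id correctly
    }
}
```

This only works out, of course, if the ids are trustworthy on both sides, which is exactly the problem with the filter-generated ids above.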
-
reporter I think my debugging into the data/displayText part of the problem was using a stale interchange file from before you switched Tikal over to using extended code syntax. With that change applied, and your current PR changes to code ID generation, I think this is now fixed. I’ll continue to do some testing.
-
reporter I’m a huge idiot: I was still having problems, but it’s because the import bconf I was sending to Longhorn still had writerOptions.placeholderMode.b=true set in it.
-
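For anyone hitting the same thing, the offending writer-options line in the import bconf looked like the first line below; if I understand the option correctly (the key name comes straight from the comment above), disabling it is what lets the output keep full inline codes:

```
# broken: forces simplified placeholder codes in the XLIFF 1.2 output
writerOptions.placeholderMode.b=true
# fixed
writerOptions.placeholderMode.b=false
```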
reporter I think this is resolved as far as Okapi is concerned. Thanks for all your help, Jim!
-
reporter - changed milestone to 1.45.0
-
reporter - changed status to resolved
-
@Chase Tingley
I ran testcase.xliff through the integration tests and didn’t get any error. However, there are some differences: the data id was converted to d1. I did think it strange, maybe illegal, that the pc id and the data id were both source1 in the original.
I’ve gone over the tikal code several times to try to find whether its pipeline is different from our standard integration tests. Possibly it’s the output of the XLIFF 1.2. Anyway, there must be some difference in tikal that is causing trouble. We need to track that down.
Check the tkitMerged file; xliff_extracted is the XLIFF 1.2 file. It would be interesting to compare it with what tikal spat out.