Merging XLIFF2 file results in some target segments being left out

Issue #989 resolved
Joseph Hovik created an issue

File sent to Okapi to merge:

<?xml version="1.0" encoding="UTF-8"?><xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsxlf="http://www.w3.org/ns/its-xliff/" xmlns:okp="okapi-framework:xliff-extensions" its:version="2.0" version="1.2">
<file datatype="x-undefined" okp:configId="/filterconfiguration.fprm" okp:inputEncoding="UTF-8" original="unknown" source-language="en-US" target-language="de-DE">
  <body>
    <trans-unit id="3">
      <source>Want a quiet mind? Move your body.</source>
      <seg-source><mrk mid="0" mtype="seg">Want a quiet mind?</mrk> <mrk mid="2" mtype="seg">Move your body.</mrk></seg-source>
      <target><mrk mid="0" mtype="seg">Möchten Sie einen ruhigen Geist?</mrk> <mrk mid="2" mtype="seg">Bewegen Sie Ihren Körper.</mrk></target>
      <note annotates="general" priority="1">3.0</note>
      <note annotates="general" priority="1"/>
    </trans-unit>
  </body>
</file>
</xliff>

Expected file from merging:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="de-DE">
  <file id="f1">
    <unit id="3">
      <notes>
        <note category="key">3.0</note>
        <note category="description"></note>
      </notes>
      <segment>
        <source xml:space="preserve">Want a quiet mind? Move your body.</source>
        <target xml:space="preserve">Möchten Sie einen ruhigen Geist? Bewegen Sie Ihren Körper.</target>
      </segment>
    </unit>
  </file>
</xliff>

Actual file from merging:

<?xml version="1.0"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="en-US" trgLang="de-DE">
  <file id="f1">
    <unit id="3">
      <notes>
        <note category="key">3.0</note>
        <note category="description"></note>
      </notes>
      <segment>
        <source xml:space="preserve">Want a quiet mind? Move your body.</source>
        <target xml:space="preserve">Möchten Sie einen ruhigen Geist?</target>
      </segment>
    </unit>
  </file>
</xliff>

Notice that “Bewegen Sie Ihren Körper.” is missing from the actual file.

I think I may have found the problem. It's in the XLIFF2OkpToX2Converter.java class.

In the “private List<Event> textUnit(ITextUnit okapiTextUnit, LocaleId targetLocale)” method, the “okapiTextUnit” variable has the correct targets:
{LocaleId@5944} "de-DE" -> {TextContainer@5965} "Möchten Sie einen ruhigen Geist? Bewegen Sie Ihren Körper."

But once “textUnit()” returns to XLIFF2FilterWriter.java, the target held in the “parts” array of the “xliff2Event” object is missing the second segment:

“Möchten Sie einen ruhigen Geist?”

Then, when the “writeUnit()” method in XLIFFWriter.java writes the unit, only the first target segment is written.

I’ve attached the following files: original-file.xlf, to-merge, from-merge, and okf_xliff2@resegment_xliff2.fprm.

Comments (7)

  1. Jim Hargrave (OLD)

    The issue here is that the XLIFF 2 filter applied “segmentation deepening”. During the merge the segmentation is not applied, because it was part of the pipeline, so we end up with a different number of segments. We do log an error but continue processing, and the output is effectively truncated.
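To make the truncation concrete, here is a minimal sketch of what happens when segments are paired by index and the two sides disagree on the count. It is not Okapi's code: the class `TruncatedMergeSketch` and the method `pairByIndex` are hypothetical, and the assumption that the writer iterates over the unsegmented side is only an illustration of the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Okapi.
class TruncatedMergeSketch {

    // Pairs targets with sources by position, driven by the source count.
    // Any target segment beyond the last source index is silently dropped.
    static List<String> pairByIndex(List<String> sourceSegments, List<String> targetSegments) {
        List<String> written = new ArrayList<>();
        for (int i = 0; i < sourceSegments.size(); i++) {
            // Fall back to an empty target if the translation has fewer segments.
            written.add(i < targetSegments.size() ? targetSegments.get(i) : "");
        }
        return written;
    }

    public static void main(String[] args) {
        // At merge time the original unit is not resegmented: one source segment.
        List<String> sources = List.of("Want a quiet mind? Move your body.");
        // The translated file was segmented at extraction: two target segments.
        // \u00f6 is the letter o with umlaut, escaped so the file compiles under any source encoding.
        List<String> targets = List.of(
                "M\u00f6chten Sie einen ruhigen Geist?",
                "Bewegen Sie Ihren K\u00f6rper.");
        System.out.println(pairByIndex(sources, targets));
    }
}
```

With one source segment and two target segments, only the first target sentence survives, which matches the actual merge output in the report.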

    The XLIFF 2 fprm looks like this:

    v1
    maxValidation.b=true
    mergeAsParagraph.b=false
    needsSegmentation.b=true
    

    Any XLIFF 2 unit with canResegment="yes" is sent to the segmenter, and the TextUnit is adjusted with the new segments.

    Bilingual XLIFF 2 is not going to work in all cases, because we have no way to align the source and target segments after segmentation. We should add checks for this case and not proceed if the segment counts differ.
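The proposed check could look roughly like the sketch below. It assumes segments are available as plain lists of strings; `SegmentCountGuard`, `requireSameCount`, and `SegmentMismatchException` are hypothetical names for illustration and are not part of the Okapi API.

```java
import java.util.List;

// Hypothetical guard; a sketch of the suggested check, not Okapi code.
class SegmentCountGuard {

    // Raised instead of logging an error and writing a truncated target.
    static class SegmentMismatchException extends RuntimeException {
        SegmentMismatchException(String message) {
            super(message);
        }
    }

    // Refuses to continue when source and target segment counts differ.
    static void requireSameCount(String unitId, List<String> sourceSegments, List<String> targetSegments) {
        if (sourceSegments.size() != targetSegments.size()) {
            throw new SegmentMismatchException(
                    "Unit " + unitId + ": " + sourceSegments.size() + " source segment(s) but "
                            + targetSegments.size() + " target segment(s); cannot align for merge");
        }
    }

    public static void main(String[] args) {
        try {
            requireSameCount("3",
                    List.of("Want a quiet mind? Move your body."),
                    List.of("first target segment", "second target segment"));
        } catch (SegmentMismatchException e) {
            // The merge stops here rather than dropping the second target segment.
            System.out.println(e.getMessage());
        }
    }
}
```

Failing fast is one option. Whether to abort the whole merge or only skip the affected unit is a design decision this sketch does not settle.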
