XLIFF2 Filter/Rainbowkit: Target does not get merged correctly if it's missing from the source file.

Issue #1117 resolved
Tyler Angelo created an issue

We have a scenario set up that merges a translated xliff file back into the source it came from using the Rainbowkit merge step. It seems that if the original segment is missing a target, or doesn’t have any codes, it will not be merged correctly. I’ve narrowed it down to the new XLIFF2 filter matching strategy in OkpToX2Converter#handleUnit(). This used to just iterate over the previously aligned segments and attempt to match by index. Now that it uses more heuristic analysis, our segments can’t be matched properly.

Also, possibly related: in 1.40 several segments we have were not being returned as ignorables. Now I believe they are being incorrectly returned as ignorable, even though they should be merged into a single origin segment. I’ve attached our source, translated, expected merge, and actual merge. Please let me know if you have any questions. Thanks!


Comments (24)

  1. Jim Hargrave (OLD)

    @{5b16d423c2fc1b1bc37bb2a7} @ysavourel Thank you very much for the submission. The example files are very much appreciated and will be added to our tests.

    One unusual thing about your files is that the original is basically “unsegmented”, i.e. paragraphs. One of the weaknesses of xliff 2.x is that it doesn’t have a strong notion of paragraph vs. segments like xliff 1.2, where this distinction is clear. We normally expect the original file to be already segmented (like your translation). However, there is one workaround with the “canResegment” attribute. My understanding is that if a unit has canResegment=yes then the system may further segment the content, but only if this attribute is present. Also, once segmented I would expect the merged output to also be segmented (unlike your content).

    The fact that this worked before may have been an accident 🙂 But we will investigate further with the xliff 2.x experts and try to determine the correct behavior.

    Can you also attach the FPRM or filter config you are using? I know the xliff2 filter has needsSegmentation and mergeAsParagraph options - but we consider these dangerous and they were scheduled to be removed. Are you using these options?

  2. Jim Hargrave (OLD)

    Looking closer at the xliff 2.1 spec, it looks like canResegment defaults to “yes” on the file element, which means segmentation is allowed by default (via inheritance) across all units. But I still think it’s clearer to make this explicit by setting canResegment=”yes”.
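
    To make that inheritance explicit, a minimal fragment (hypothetical ids, languages, and content) might look like this:

    ```xml
    <xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.1"
           srcLang="en" trgLang="fr">
      <!-- canResegment="yes" is the default here, but stating it removes any ambiguity -->
      <file id="f1" canResegment="yes">
        <unit id="tu1">
          <segment id="s1">
            <source>First sentence. Second sentence.</source>
          </segment>
        </unit>
      </file>
    </xliff>
    ```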

    @ysavourel Did I misread that? If so what should be the behavior of merge if a translated xliff 2.x file has been further segmented? Do we preserve the segmentation of the original or the translated? IMHO, it should be the latter.

  3. Tyler Angelo Account Deactivated reporter

  4. Tyler Angelo Account Deactivated reporter

    Hey Jim, thanks for responding so quickly. I believe we use the default options during our merge process (maybe this is the issue?). I’ve pulled the loaded config from the xliff2 filter during a breakpoint on my debugger.

  5. Tyler Angelo Account Deactivated reporter

    Also, regarding segment alignment (for example how f1/tu1/s1 is missing its target in the merged version), would this be a segmentation issue as well? From my basic investigation this feels more like a bug, IMO.

  6. Jim Hargrave (OLD)

    @{5b16d423c2fc1b1bc37bb2a7} Our support of xliff 2 is still evolving. Generally I agree that linguistic segmentation should be allowed with xliff 2 (but see the warnings below) - so your original and translated documents are perfectly fine. What I am not sure about is what the merged file should look like. My opinion is that it should look like the translated file (with multiple segments per unit) - not like the expected merged file. That is, if your workflow modifies the xliff 2 with segmentation - that segmentation should be preserved in the merged document. Would that be a problem?

    As I said, xliff 2 doesn’t provide a simultaneous paragraph and segmented view like xliff 1.2 (source and seg-source). I find this a bit confusing, as there is no way to tell from the input xliff 2 whether segmentation is required/expected or whether it has been pre-segmented.

    So, yes there are probably bugs in matching source/target segments. But fixing those may not produce the output you expect (non-segmented units). Let me see if I can get a consensus from the team on what should be the proper output and I will test with our dev version that already has some fixes and changes.
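
    To make the two possible merged shapes concrete, here is a sketch (hypothetical ids and content). The first unit is the paragraph view your expected merge shows; the second preserves the translated segmentation, which is what I would lean toward:

    ```xml
    <!-- Paragraph view: one non-segmented unit, as in the original file -->
    <unit id="tu1">
      <segment id="s1">
        <source>First sentence. Second sentence.</source>
        <target>Première phrase. Deuxième phrase.</target>
      </segment>
    </unit>

    <!-- Segmented view: the translated file's segmentation is kept on merge -->
    <unit id="tu1">
      <segment id="s1">
        <source>First sentence. </source>
        <target>Première phrase. </target>
      </segment>
      <segment id="s2">
        <source>Second sentence.</source>
        <target>Deuxième phrase.</target>
      </segment>
    </unit>
    ```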

    XLIFF 2 Segmentation Warnings

    This wasn’t written by our group - but they are experienced filter writers and I trust their judgment.

    https://galaglobal.github.io/TAPICC/T1/WG3/XLIFF-EM-BP-V1.0-prd01.xhtml

    2.1.3. Controlling Segmentation

    Depending on Extraction rules for mapping of original document structures into XLIFF Documents, individual sentences within a paragraph; verses within a stanza; items or entries of a list; rows or cells of a table; items of a dialog window; and so on might be Extracted as segments of a single unit. While it is generally not advisable to perform segmentation at the time of Extraction, Extractors that Extracted multiple sentences, verses, entries, rows, and so on into a single non-segmented unit (a single <segment> element within each <unit>) and their corresponding Mergers need to expect that the Modifiers will need to transform them into individual segments within the same unit (multiple <segment> elements representing individual sentences, verses, and so on within each <unit>) during the roundtrip.

    In cases where subsequent Modifiers cannot be reasonably expected to detect the segmentation logic, for instance due to the lack of knowledge of the original format logic, the content owner is advised to perform the segmentation and protection of that segmentation before sending their XLIFF Documents for the service roundtrip.

    While it's generally desirable to be able to Modify segmentation within a unit during the roundtrip, doing so in some of the above cases might prevent Merging, cause build issues, or have negative impact on target product user experience.

    Attribute canResegment can be used with care to control segmentation Modification behavior. Its values need to be controlled by rules derived from the structural and inline logic of the native format. For instance, more often than not it will make sense to set canResegment to no for:

    • lists
    • tables or table rows
    • UI elements

    Extracted as segments of a unit.

    In UI elements and tables, it is likely that the available segmentation needs to be protected, on the other hand, it is advisable not to change the default canResegment="yes" for normal paragraph text and similar, see Role of the <unit> Element.

    Importantly, preventing Modification of segmentation using the attribute canResegment (set to no when necessary) will not prevent reordering of segments within a unit using the order attribute on the <target> elements within the same unit. So in case an ordered list needs to be, for instance, alphabetically collated, translators can do so even in case the canResegment attribute is set to no. The segmentation logic of the native format remains protected without preventing collation. This would all be hampered if the Extractor decided to Extract each segment as a separate unit, which is the most evil practice that cannot be discouraged enough.
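
    A sketch of that last point (hypothetical ids and translations): segmentation is locked with canResegment="no", yet the translated list can still be collated alphabetically via the order attribute on the targets:

    ```xml
    <unit id="u1" canResegment="no">
      <segment id="s1">
        <source>Apples</source>
        <target order="3">Pommes</target>
      </segment>
      <segment id="s2">
        <source>Bananas</source>
        <target order="1">Bananes</target>
      </segment>
      <segment id="s3">
        <source>Cherries</source>
        <target order="2">Cerises</target>
      </segment>
    </unit>
    ```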

  7. Jim Hargrave (OLD)

    I should add that the reason the old segment matching logic won’t work (one-to-one matching in order) is that reordering or split/merge may happen in the workflow/workbench. The only way we have to match source/target segments is with ids - but these are optional in xliff 2 (I recommend they always be added). Also, some workflows transform the content, and that transformation may lose information.

    I will take a close look at this logic to see if there is more we can do in cases where id’s are missing or the number of source/target segments differ.

  8. ysavourel

    Did I misread that? If so what should be the behavior of merge if a translated xliff 2.x file has been further segmented? Do we preserve the segmentation of the original or the translated? IMHO, it should be the latter.

    Yes, I think we keep the translated version.

  9. Tyler Angelo Account Deactivated reporter

    @jhargrave-straker Thanks so much for the detailed explanation. The excerpt from GALA helps me understand a lot more about segmentation. I think in this case preserving the original segmentation probably isn’t totally necessary. This was just one example I pulled out of our test suite, so it’s likely our real-world data would be different. We’d likely be passing generated xliff from some other program like Xcode or WalkMe. So we’d really just be after source/target matching at this point. Again, I really appreciate your response and you guys looking into this.

  10. Jim Hargrave (OLD)

    @{5b16d423c2fc1b1bc37bb2a7} No worries - and sorry for the information overload. The truth is we are very excited to see xliff 2 being used “in the wild”. I am keen on getting everything working correctly, testing our assumptions, and helping promote best practices as the adoption curve of xliff 2 progresses. We plan on getting this worked out for the 1.43.0 release.

    Tasks:

    1. Resolve all source/target segment matching bugs.
    2. Keep the needsSegmentation option and add more tests using a segmentation workflow.
    3. Fix outstanding bugs where xliff 2 attributes are lost in the merge.
    4. Fix the issue with internal mrk elements (they are currently treated as inline codes, but interfere with source/target code matching).

    Future tasks will add more module support beyond the core and metadata.
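
    For reference on task 4, this is the kind of inline mrk markup involved (hypothetical ids and content) - an annotation that should survive the roundtrip rather than being handled as an inline code:

    ```xml
    <unit id="tu1">
      <segment id="s1">
        <source>Click <mrk id="m1" type="term">Save</mrk> to continue.</source>
        <target>Cliquez sur <mrk id="m1" type="term">Enregistrer</mrk> pour continuer.</target>
      </segment>
    </unit>
    ```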

  11. Jim Hargrave (OLD)

    @{5b16d423c2fc1b1bc37bb2a7} There were a lot more failed tests than I anticipated. I am slowly fixing them in the test code, but it is a tedious and long process. This might have to wait till the 1.44.0 release, as I have a few other branches ahead of yours. 1.44.0 may be a quick release, as it will also be our official Java 11 migration and we don’t want to mix many deep code changes with that. So I am guessing an April time frame.

  12. Tyler Angelo Account Deactivated reporter

    @jhargrave-straker No worries Jim! Thanks again for looking into this.

  13. Jim Hargrave (OLD)

    @{5b16d423c2fc1b1bc37bb2a7} @ysavourel We reprioritized this - I’m working on it now and will make it into the upcoming 1.43.0. I have also added full mrk tag support, so these do not get stripped on output.

    Any other features you need you can send them to okapi-dev or DM me. While I’m digging in the code they might be easy enough to include. But the main focus will be near perfect roundtrip.

    BTW: You don’t have preserve-whitespace enabled in your sample file - it seems that for some segments that’s what you want, or the whitespace will be normalized.
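
    A sketch of what I mean (hypothetical ids and content) - if I recall the spec correctly, xml:space="preserve" on the unit keeps the run of spaces through the roundtrip instead of letting it be collapsed:

    ```xml
    <unit id="tu1" xml:space="preserve">
      <segment id="s1">
        <source>Value:    aligned    columns</source>
      </segment>
    </unit>
    ```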

    cheers

  14. Tyler Angelo Account Deactivated reporter

    Great, thanks @jhargrave-straker – When is the 1.43.0 release expected? Also, thanks for the tip. We have this enabled in the files we run through our processing in production, so maybe it would be good to enable it in these test files as well for consistency.

  15. Jim Hargrave (OLD)

    @{5b16d423c2fc1b1bc37bb2a7} As soon as I can get these fixes and the current PRs in, and we finish our testing. I am hoping for the end of March. Note that 1.44.0 will come out fairly quickly after that (1-2 months?) and will be upgraded to Java 11.

  16. jhargrave-straker
    • changed status to open

    Actively working on this.

    Note: I have deleted my previous account, Jim Hargrave (OLD).

  17. jhargrave-straker

    @{5b16d423c2fc1b1bc37bb2a7} @Mihai Nita

    What we are considering is adding extra metadata to the internal Okapi TextUnit, which would mirror your okp:paragraph, okp:title, etc. We would only segment okp:paragraph (if you had a complete list, that would help). We can make it a configurable option for the segmenter as an ENUM of standard Okapi content types.

    May I ask where these come from? I know some Okapi filters provide them if configured, like HTML and XmlStream.

    With the changes I currently have (which *will* be part of the release in a few weeks) I think we may have this covered.

    We will solve the “segmentation problem” by preprocessing the file (segmenting it) with the provided meta in the xliff2.

    How does that sound?

  18. jhargrave-straker

    @{5b16d423c2fc1b1bc37bb2a7} @ysavourel @Chase Tingley

    Here is the latest output, assuming you send the original or the translated variant. Both have gone through additional segmentation. The only known remaining issue is loss of some xliff attributes (segment id) on output, but this is normal in cases where new segments are created.

    Let me know if we are getting close. Once I merge my PR this week there will be a SNAPSHOT in maven to test further.

    1.43_output.zip

  19. Tyler Angelo Account Deactivated reporter

    May I ask where these come from? I know some OKapi filters provide them if configured like HTML and XmlStream

    Yeah, I believe these come from the okapi xliff2 filter - not 100% confident though. I’m not too concerned about these, and I think your proposed strategy makes sense. Also, these test output files look great. 👍 Thanks a ton, Jim!
