TextFragmentUtil.alignAndCopyCodeMetadata inserts only an open tag, leaving the TextFragment unbalanced

Issue #1043 resolved
Kuro Kurosaka (BH Lab) created an issue

This is against dev commit 45ecaf5 after PR#503 has been merged.

When the BaseConnector finds a fuzzy match where a TM entry almost matches except that the TM entry lacks a code pair, it was found that in the target, just the open tag is inserted. This results in an invalid XLIFF file.

It seems TextFragmentUtil.alignAndCopyCodeMetadata is not working as expected. This can be demonstrated by applying the following patch to TextFragmentUtilTest.java.

diff --git a/okapi/core/src/test/java/net/sf/okapi/common/resource/TextFragmentUtilTest.java b/okapi/core/src/test/java/net/sf/okapi/common/resource/TextFragmentUtilTest.java
index 38773d3317..ebe82d6878 100644
--- a/okapi/core/src/test/java/net/sf/okapi/common/resource/TextFragmentUtilTest.java
+++ b/okapi/core/src/test/java/net/sf/okapi/common/resource/TextFragmentUtilTest.java
@@ -281,6 +281,11 @@ public class TextFragmentUtilTest {
                TextFragment oriFrag;
                TextFragment trgFrag;

+               oriFrag = GenericContent.fromLetterCodedToFragment("xxx <g1>yyy</g1> zzz", null, false, true);
+               trgFrag = GenericContent.fromLetterCodedToFragment("xxx yyy zzz", null, false, true);
+               alignAndCopyCodeMetadata(oriFrag, trgFrag, true, false);
+               assertEquals("xxx yyy zzz<1></1>", fmt.setContent(trgFrag).toString());
+
                oriFrag = GenericContent.fromLetterCodedToFragment("src<x1/>", null, false, true);
                trgFrag = GenericContent.fromLetterCodedToFragment("trg", null, false, true);
                alignAndCopyCodeMetadata(oriFrag, trgFrag, true, true);

Comments (10)

  1. Jim Hargrave (OLD)

    @Kuro Kurosaka The original functionality of addMissingCodes was to only add codes that were guaranteed to have accurate placement. This meant only leading or trailing codes for the segment. We could enhance the functionality to add all missing codes - with the understanding that the placement will be incorrect and require post-editing.

    I don’t see a problem adding this as it would allow you to merge leveraged documents which would be useful for context preview etc.

  2. Jim Hargrave (OLD)

    @Kuro Kurosaka But you are right I did find a problem when the 'to” frgmant has zero codes. I will fix this.

    What do you think about the enhancement for adding all codes?

  3. Kuro Kurosaka (BH Lab) reporter

    @Jim Hargrave (OLD) thank you for looking into it on Sunday! I thought this option was meant to add all extra codes found in the source to the end of of the target so that the human editor can reposition them later. If that is not the intention of this flag, I misunderstood its purpose. I am not sure what you mean by “the enhancement for adding all codes”. Could you have some example that shows what you the result of your fix might be without this enhancement, and what it would be with this enhancement?

  4. Jim Hargrave (OLD)

    @Kuro Kurosaka Actually you are right. I misread the original code. All missing codes are added- except the bug case which I will fix.

    public static void addMissingCodes(TextFragment from, TextFragment to, CodeMatches codeMatches) {
            List<Code> fromCodes = from.codes;
            TextFragment leadingCodes = new TextFragment();
    
            // iterate over all the mismatches in from and add if possible
            for (Integer index : codeMatches.getFromMismatchIndexIterator()) {
                Code fc = fromCodes.get(index);
                if (isLeadingCode(fc, from)) {
                    leadingCodes.append(fc.clone());
                } else {
                    to.append(fc.clone());
                }
            }
    
            // add missing leading codes
            to.insert(0, leadingCodes, true);
        }
    

  5. Log in to comment