Invalid return from "QueryUtil.fromCodedHTML"

Issue #1305 open
Johan Warnar Lando Boekkooi created an issue

Hey all,

I’m running into an issue when using QueryUtil.fromCodedHTML (from net.sf.okapi.lib.translation.QueryUtil) on both dev and v1.45.0.

I have been able to create the following test in net.sf.okapi.lib.translation.QueryUtilTest but I'm unable to figure out what exactly is going wrong. So maybe someone can give me a hand?

    @Test
    public void testFromSameHTMLMisMatchingCodes () {
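        // Codes in segment order: an unmatched closing tag (215-2), a matched
        // opening/closing pair (215-1), then an unmatched opening tag (215-2).
        // The two unmatched codes share the same id (47688187).
        var codes = new ArrayList<Code>();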

        var c = new Code(TagType.CLOSING, "x-cstyle");
        c.setOriginalId("215-2");
        c.setId(47688187);
        c.appendOuterData("</g>");
        codes.add(c);

        c = new Code(TagType.OPENING, "x-cstyle");
        c.setOriginalId("215-1");
        c.setId(47688186);
        c.appendOuterData("<g ctype=\"x-cstyle\" id=\"215-1\">");
        codes.add(c);

        c = new Code(TagType.CLOSING, "x-cstyle");
        c.setOriginalId("215-1");
        c.setId(47688186);
        c.appendOuterData("</g>");
        codes.add(c);

        c = new Code(TagType.OPENING, "x-cstyle");
        c.setOriginalId("215-2");
        c.setId(47688187);
        c.appendOuterData("<g ctype=\"x-cstyle\" id=\"215-2\">");
        codes.add(c);

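        // Each code is a marker character followed by an index character
        // (U+E110 + index into codes). U+E103 marks the isolated codes 0 and 3,
        // U+E101 the opening code 1, U+E102 the closing code 2.
        String srcText = "Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE113:";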
        TextFragment tf = new TextFragment(srcText, codes);

        String htmlText = qu.toCodedHTML(tf);
        String codedText = qu.fromCodedHTML(htmlText, tf, true);
        TextFragment resFrag = new TextFragment(codedText, tf.getClonedCodes());

        assertEquals(srcText, codedText);
        assertEquals(0, resFrag.compareTo(tf, TextFragment.CompareMode.IGNORE_CODE));
        assertEquals(0, resFrag.compareTo(tf, TextFragment.CompareMode.CODE_DATA_ONLY));
    }

In any case thanks for taking the time to look at this issue.

Cheers,
Warnar

Comments (5)

  1. jhargrave-straker

    I believe TextFragment.balanceMarkers is being called. It would convert your first-level CLOSING/CLOSING codes to ISOLATED, which uses a different PUA Unicode character. Your codes are not well-formed; was that intended?

  2. Johan Warnar Lando Boekkooi reporter

    The codes were output like this by the SegmentationStep, so I’m not sure why they are invalid.
    Let me try and create a Pipeline example to show the issue.

  3. jhargrave-straker

    Oh, ok. So what you have is a segment with isolated tags. That’s fine. I thought this was your input segment. Personally I wish we only recorded the code index marker as a PUA. All the other info should be in the Code object, which would be much easier to debug. Let me look closer at your PUA chars.

  4. jhargrave-straker

    toCodedHTML: Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. <br id='e47688187'/> <u id='47688186'>a</u> <br id='b47688187'/>:

    fromCodedHTML: Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE110:

    expected: Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE113:

    I’ll need to walk through this, but I’m posting it in case something pops up.
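
To make the difference between the fromCodedHTML and expected strings easier to read, here is a minimal standalone sketch (not Okapi code) that decodes each marker pair into a marker type and a code index. It assumes the marker values used by TextFragment are U+E101 (opening), U+E102 (closing), U+E103 (isolated) and that the index character is U+E110 plus the position in the code list. The names MarkerDump, decode and firstIndexOfId are made up for illustration. firstIndexOfId is only a guess at how the mismatch could arise (a lookup by id that stops at the first match); it has not been checked against what fromCodedHTML actually does.

```java
import java.util.ArrayList;
import java.util.List;

final class MarkerDump {
    // Assumed to mirror the TextFragment constants; not imported from Okapi.
    static final char MARKER_OPENING = 0xE101;
    static final char MARKER_CLOSING = 0xE102;
    static final char MARKER_ISOLATED = 0xE103;
    static final int CHARBASE = 0xE110;

    /** Returns the marker type name, or null if ch is not a marker character. */
    static String typeOf(char ch) {
        if (ch == MARKER_OPENING) return "OPENING";
        if (ch == MARKER_CLOSING) return "CLOSING";
        if (ch == MARKER_ISOLATED) return "ISOLATED";
        return null;
    }

    /** Lists every marker pair in the coded text as "TYPE:index". */
    static List<String> decode(String codedText) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < codedText.length(); i++) {
            String type = typeOf(codedText.charAt(i));
            if (type == null) continue; // ordinary text character
            // The character after the marker encodes the index into the code list.
            int index = codedText.charAt(i + 1) - CHARBASE;
            out.add(type + ":" + index);
            i++; // skip the index character
        }
        return out;
    }

    /** Hypothetical lookup by code id that returns the first match only. */
    static int firstIndexOfId(int[] ids, int id) {
        for (int i = 0; i < ids.length; i++) {
            if (ids[i] == id) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Tails of the expected and actual strings from the comment above.
        String expected = "\uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE113:";
        String actual = "\uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE110:";
        System.out.println("expected: " + decode(expected));
        System.out.println("actual:   " + decode(actual));

        // Code ids in list order, taken from the test in the issue description.
        int[] ids = {47688187, 47688186, 47688186, 47688187};
        System.out.println("first index for id 47688187: " + firstIndexOfId(ids, 47688187));
    }
}
```

Decoded this way, the expected string ends with ISOLATED:3 while the returned one ends with ISOLATED:0. Codes 0 and 3 in the test share the id 47688187, so a first-match lookup by id would land on index 0 for both, which is consistent with the output but remains a hypothesis until someone steps through fromCodedHTML.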