- changed status to open
Invalid return from "QueryUtil.fromCodedHTML"
Hey all,
At the moment I’m running into an issue when using QueryUtil.fromCodedHTML from net.sf.okapi.lib.translation.QueryUtil with both dev and v1.45.0.
I have been able to create the following test in net.sf.okapi.lib.translation.QueryUtilTest, but I'm unable to figure out what exactly is going wrong. Maybe someone can give me a hand?
@Test
public void testFromSameHTMLMisMatchingCodes () {
    var codes = new ArrayList<Code>();

    // Code 0: closing code for 215-2 whose opening is not in this segment
    var c = new Code(TagType.CLOSING, "x-cstyle");
    c.setOriginalId("215-2");
    c.setId(47688187);
    c.appendOuterData("</g>");
    codes.add(c);

    // Codes 1 and 2: balanced opening/closing pair for 215-1
    c = new Code(TagType.OPENING, "x-cstyle");
    c.setOriginalId("215-1");
    c.setId(47688186);
    c.appendOuterData("<g ctype=\"x-cstyle\" id=\"215-1\">");
    codes.add(c);
    c = new Code(TagType.CLOSING, "x-cstyle");
    c.setOriginalId("215-1");
    c.setId(47688186);
    c.appendOuterData("</g>");
    codes.add(c);

    // Code 3: opening code for 215-2 whose closing is not in this segment (same id as code 0)
    c = new Code(TagType.OPENING, "x-cstyle");
    c.setOriginalId("215-2");
    c.setId(47688187);
    c.appendOuterData("<g ctype=\"x-cstyle\" id=\"215-2\">");
    codes.add(c);

    String srcText = "Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE113:";
    TextFragment tf = new TextFragment(srcText, codes);

    // Round trip: coded text -> coded HTML -> coded text
    String htmlText = qu.toCodedHTML(tf);
    String codedText = qu.fromCodedHTML(htmlText, tf, true);
    TextFragment resFrag = new TextFragment(codedText, tf.getClonedCodes());

    assertEquals(srcText, codedText);
    assertTrue(resFrag.compareTo(tf, TextFragment.CompareMode.IGNORE_CODE) == 0);
    assertTrue(resFrag.compareTo(tf, TextFragment.CompareMode.CODE_DATA_ONLY) == 0);
}
In any case thanks for taking the time to look at this issue.
Cheers,
Warnar
Comments (5)
- I believe that TextFragment.balanceMarkers is being called. It would convert your first level CLOSING/CLOSING codes to ISOLATED, which is a different PUA Unicode character. Your codes are not well-formed. Was that intended?
- reporter: The codes were output like this by the SegmentationStep, so I’m not sure why they are invalid. Let me try to create a Pipeline example to show the issue.
- Oh, ok. So what you have is a segment with isolated tags. That’s fine. I thought this was your input segment. Personally I wish we only recorded the code index marker as a PUA. All the other info should be in the Code object, which is much easier to debug. Let me look closer at your PUA chars.
- toCodedHTML:
Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. <br id='e47688187'/> <u id='47688186'>a</u> <br id='b47688187'/>:
fromCodedHTML:
Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE110:
expected:
Evaluation of checkpoints trained on 1-7 segment tasks on varying input lengths. \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE113:
I’ll need to walk through this but posting this in case something pops up.
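To make the difference between the two coded strings easier to read, here is a minimal standalone sketch that decodes the PUA marker pairs into readable labels. It does not call Okapi. The class name MarkerDump and the method describe are hypothetical, and the four constants are assumed to mirror TextFragment.MARKER_OPENING, MARKER_CLOSING, MARKER_ISOLATED and CHARBASE.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkerDump {
    // Assumed to match the marker constants defined in Okapi's TextFragment.
    static final int MARKER_OPENING = 0xE101;
    static final int MARKER_CLOSING = 0xE102;
    static final int MARKER_ISOLATED = 0xE103;
    static final int CHARBASE = 0xE110;

    // Returns one "TYPE#index" label per marker pair found in the coded text.
    static List<String> describe(String codedText) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < codedText.length() - 1; i++) {
            int ch = codedText.charAt(i);
            String type;
            if (ch == MARKER_OPENING) {
                type = "OPENING";
            } else if (ch == MARKER_CLOSING) {
                type = "CLOSING";
            } else if (ch == MARKER_ISOLATED) {
                type = "ISOLATED";
            } else {
                continue; // ordinary text character
            }
            // The character after the marker encodes the index into the code list.
            int index = codedText.charAt(++i) - CHARBASE;
            out.add(type + "#" + index);
        }
        return out;
    }

    public static void main(String[] args) {
        // Tails of the "expected" and "fromCodedHTML" strings from this ticket.
        String expected = " \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE113:";
        String actual = " \uE103\uE110 \uE101\uE111a\uE102\uE112 \uE103\uE110:";
        System.out.println("expected: " + describe(expected));
        System.out.println("actual:   " + describe(actual));
    }
}
```

Under those assumptions, the expected string decodes to ISOLATED#0, OPENING#1, CLOSING#2, ISOLATED#3, while the fromCodedHTML result ends in ISOLATED#0. So the marker type survives the round trip and only the index of the last code changes: it points at code 0, which shares id 47688187 with code 3.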