TextFragment.balanceMarkers() does not handle overlapping Code pairs properly
This was observed during development of the subfiltering step. The subfiltering step applies a subfilter (currently only the HTML Filter) to TextUnits from the main filter (OpenXmlFilter, for example). As a result, code pairs from the main filter may overlap with code pairs from the subfilter, producing what we call an overlapping-runs situation.
For example, imagine there is a Word file that contains this line:
abc <i>xxx def</i>
The OpenXmlFilter puts an opening code before "abc" and a closing code after "xxx" to mark the run of text that should be rendered bold.
The HTML filter turns <i> and </i> into a code pair.
As a result, we have two overlapping runs of text.
They would be expressed using Codes in TextFragment as:
[0]abc [1]xxx[2] def[3]
where [i] means a marker + code index pair that points to the i'th Code in the TextFragment. We have a choice of 3 markers, OPENING, CLOSING and ISOLATE.
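To make the marker + index encoding concrete, here is a minimal sketch that rebuilds the coded-text strings used in the test case below. The marker values (U+E101, U+E102, U+E103) and the index base (U+E110) are read off those test strings; the class, constant, and method names are hypothetical and are not Okapi's API.

```java
class CodedTextSketch {
    // Marker characters as they appear in the test's coded-text strings.
    static final char OPENING = '\uE101';
    static final char CLOSING = '\uE102';
    static final char ISOLATED = '\uE103';
    // Assumption inferred from the test strings: code index i is stored as U+E110 + i.
    static final int INDEX_BASE = 0xE110;

    /**
     * Builds coded text: for each code i, emits its marker, then the index
     * character, then the plain text that follows that code.
     */
    static String encode(char[] markers, String[] textAfter) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < markers.length; i++) {
            sb.append(markers[i]);
            sb.append((char) (INDEX_BASE + i));
            sb.append(textAfter[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] text = {"abc", "xxx", "def", ""};
        // Markers the test expects vs. markers reported as the actual result.
        String expected = encode(new char[] {OPENING, ISOLATED, CLOSING, ISOLATED}, text);
        String actual = encode(new char[] {OPENING, OPENING, CLOSING, CLOSING}, text);
        System.out.println(expected.equals(actual)); // the two encodings differ
    }
}
```

Only the marker character changes between the two strings; the index characters and the text stay the same, which is why the choice of marker alone decides how a writer pairs the codes.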
A unit test case listed below was added to TextFragmentTest to examine the behavior of TextFragment.balanceMarkers() handling overlapping code pairs.
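The most desirable result of balanceMarkers() might be to use these markers: OPENING, ISOLATE, CLOSING, ISOLATE for codes 0, 1, 2 and 3 respectively (this is what the expected string in the test case below encodes).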
This is desirable because XLIFF Writer can write this like: <g id="1">abc <x id="2">xxx</g> def<x id="3">
Another acceptable result might be to use ISOLATE markers for all codes.
But the result of running the following code is that these markers are used: OPENING, OPENING, CLOSING, CLOSING (see the "Actual" comment in the test case below).
This would cause the XLIFF Writer to write: <g id="1">abc <g id="2">xxx</g> def</g>
This is not acceptable because it changes the meaning of the text, as if it were:
abc <i>xxx</i> def
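As an illustration of the behavior asked for here, the following is a rough sketch of a stack-based balancing pass that pairs codes only when the pair does not cross another open code, and leaves everything else isolated. It is not Okapi's balanceMarkers() implementation; SimpleCode, Marker, and balance() are hypothetical names introduced only for this example, and matching on type plus id is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class OverlapBalancerSketch {
    enum Marker { OPENING, CLOSING, ISOLATED }

    /** Hypothetical stand-in for an inline code: open/close flag, type, id. */
    static final class SimpleCode {
        final boolean opening;
        final String type;
        final int id;
        SimpleCode(boolean opening, String type, int id) {
            this.opening = opening;
            this.type = type;
            this.id = id;
        }
    }

    /** Assigns one marker per code; codes that cannot be paired stay ISOLATED. */
    static Marker[] balance(List<SimpleCode> codes) {
        Marker[] result = new Marker[codes.size()];
        Arrays.fill(result, Marker.ISOLATED);
        // Indices of opening codes not yet closed; the head is the most recent one.
        Deque<Integer> open = new ArrayDeque<>();
        for (int i = 0; i < codes.size(); i++) {
            SimpleCode c = codes.get(i);
            if (c.opening) {
                open.push(i);
                continue;
            }
            // Look for the most recent open code with the same type and id.
            int match = -1;
            for (int candidate : open) {
                SimpleCode o = codes.get(candidate);
                if (o.id == c.id && o.type.equals(c.type)) {
                    match = candidate;
                    break;
                }
            }
            if (match < 0) {
                continue; // no partner left: this closing code stays ISOLATED
            }
            // Open codes above the match cross this pair; discard them so that
            // they (and their later closing codes) remain ISOLATED.
            while (open.peek() != match) {
                open.pop();
            }
            open.pop();
            result[match] = Marker.OPENING;
            result[i] = Marker.CLOSING;
        }
        return result;
    }

    public static void main(String[] args) {
        // "<t>abc<u>xxx</t>def</u>" from the test case
        List<SimpleCode> crossed = Arrays.asList(
            new SimpleCode(true, "ttag", 1),
            new SimpleCode(true, "utag", 2),
            new SimpleCode(false, "ttag", 1),
            new SimpleCode(false, "utag", 2));
        System.out.println(Arrays.toString(balance(crossed)));
        // prints [OPENING, ISOLATED, CLOSING, ISOLATED]
    }
}
```

With this policy the pair that closes first wins and the crossing pair is demoted to isolated codes. A different policy (for example, preferring the outermost or the longest pair) would be equally valid as long as the result stays well-formed.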
Reference: https://groups.google.com/forum/#!topic/okapi-devel/UWFLMjhbc3g
@Test
public void testCrossedCodePairsWithCodeIdAssignments () {
    // "<t>abc<u>xxx</t>def</u>"
    // <u> and </u> should be marked isolated so XLIFF is well-formed.
    TextFragment tf = new TextFragment();
    Code code = new Code(TagType.OPENING, "ttag", "<t>");
    code.setId(1);
    tf.append(code);
    tf.append("abc");
    code = new Code(TagType.OPENING, "utag", "<u>");
    code.setId(2);
    tf.append(code);
    tf.append("xxx");
    code = new Code(TagType.CLOSING, "ttag", "</t>");
    code.setId(1); // Same id as its opening tag
    tf.append(code);
    tf.append("def");
    code = new Code(TagType.CLOSING, "utag", "</u>");
    code.setId(2); // Same id as its opening tag
    tf.append(code);
    assertThat(tf.getCodes().size(), equalTo(4));
    assertThat(tf.getCodes().get(0).getId(), equalTo(1));
    assertThat(tf.getCodes().get(1).getId(), equalTo(2));
    assertThat(tf.getCodes().get(2).getId(), equalTo(1));
    assertThat(tf.getCodes().get(3).getId(), equalTo(2));
    String t1 = tf.getCodedText();
    // Expected markers: opening, isolated, closing, isolated
    assertThat(t1, equalTo("\uE101\uE110abc\uE103\uE111xxx\uE102\uE112def\uE103\uE113"));
    // Actual: "\uE101\uE110abc\uE101\uE111xxx\uE102\uE112def\uE102\uE113"
}
See also: issue #792
Comments (7)
-
reporter -
reporter - edited description
-
reporter - edited description
Made the first call explicit.
-
reporter - edited description
Edited to separate the idempotent issue and the improper handling of overlapping code pairs.
-
reporter - edited description
- changed title to TextFragment.balanceMarkers() does not handle overlapping Code pairs properly
-
reporter - edited description
- changed status to open
add Kuro's test case to dev (Ignored for now)
Updating the unit test case.