TextFragment.balanceMarkers() does not handle overlapping Code pairs properly

This was observed during development of the subfiltering step. The subfiltering step applies a subfilter (currently only the HTML Filter) to TextUnits produced by the main filter (OpenXmlFilter, for example). As a result, code pairs from the main filter may overlap with code pairs from the subfilter, a situation we call overlapping runs.

For example, imagine there is a Word file that contains this line:

abc xxx def

The OpenXmlFilter puts an opening code before "abc" and a closing code after "xxx" to mark the run of text that should be rendered bold.

The HTML filter turns the opening HTML tag before "xxx" and the closing HTML tag after "def" into a code pair.

As a result, we have two overlapping runs of text.

They would be expressed using Codes in TextFragment as:

[0]abc [1]xxx[2] def[3]

where [i] means a marker + code index pair that points to the i'th Code in the TextFragment. There is a choice of three markers: OPENING, CLOSING and ISOLATE.
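
To make the notation concrete, here is a minimal sketch of how a marker + code index pair could be packed into coded text. The marker values (0xE101, 0xE102, 0xE103) and the index base (0xE110) are read off the coded-text strings in the test case at the end of this report and should be treated as assumptions; the class and method names are hypothetical and are not Okapi API.

```java
final class MarkerCodec {
    // Values inferred from the coded-text strings in the test case (assumptions).
    static final char OPENING = '\uE101';
    static final char CLOSING = '\uE102';
    static final char ISOLATE = '\uE103';
    static final int INDEX_BASE = 0xE110;

    /** Encodes one marker followed by the index of the Code it points to. */
    static String encode(char marker, int codeIndex) {
        return "" + marker + (char) (INDEX_BASE + codeIndex);
    }

    /** Recovers the code index from the character that follows a marker. */
    static int decodeIndex(char indexChar) {
        return indexChar - INDEX_BASE;
    }

    /** Tells whether a character of the coded text is one of the three markers. */
    static boolean isMarker(char c) {
        return c == OPENING || c == CLOSING || c == ISOLATE;
    }

    public static void main(String[] args) {
        // [0]abc [1]xxx[2] def[3], with one possible choice of markers
        String coded = encode(OPENING, 0) + "abc " + encode(ISOLATE, 1) + "xxx"
                + encode(CLOSING, 2) + " def" + encode(ISOLATE, 3);
        for (int i = 0; i < coded.length(); i++) {
            if (isMarker(coded.charAt(i))) {
                // The character right after a marker carries the code index.
                System.out.println("code index " + decodeIndex(coded.charAt(i + 1)));
                i++;
            }
        }
    }
}
```

Each code therefore occupies exactly two characters of the coded text, which is why only the marker character changes when a code is rebalanced.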

The unit test case listed below was added to TextFragmentTest to examine how TextFragment.balanceMarkers() handles overlapping code pairs.

The most desirable result of balanceMarkers() might be to use these markers:

[0] OPENING, [1] ISOLATE, [2] CLOSING, [3] ISOLATE

This is desirable because the XLIFF Writer can then write it as: <g id="1">abc <x id="2"/>xxx</g> def<x id="3"/>

Another acceptable result would be to use ISOLATE markers for all codes.
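
One way to get a crossing-safe result is to pair codes with a stack and leave anything that crosses a matched pair as ISOLATE. The sketch below is a hypothetical, self-contained illustration and not the actual balanceMarkers() implementation: the class, enum and method names are invented, and it assumes that an opening code and its closing code share the same id.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

final class CrossingAwareBalancer {
    enum Tag { OPENING, CLOSING, PLACEHOLDER }
    enum Marker { OPENING, CLOSING, ISOLATE }

    /** Returns one marker per code; codes that cannot be paired safely stay ISOLATE. */
    static Marker[] balance(Tag[] tags, int[] ids) {
        Marker[] out = new Marker[tags.length];
        Arrays.fill(out, Marker.ISOLATE);
        Deque<Integer> open = new ArrayDeque<>(); // indices of still-unmatched openings
        for (int i = 0; i < tags.length; i++) {
            if (tags[i] == Tag.OPENING) {
                open.push(i);
            } else if (tags[i] == Tag.CLOSING) {
                if (!hasOpen(open, ids, ids[i])) {
                    continue; // no matching opening: the closer stays ISOLATE
                }
                // Openings above the match cross this pair; drop them as ISOLATE.
                while (ids[open.peek()] != ids[i]) {
                    open.pop();
                }
                out[open.pop()] = Marker.OPENING;
                out[i] = Marker.CLOSING;
            }
        }
        return out; // openings left on the stack were never closed and stay ISOLATE
    }

    private static boolean hasOpen(Deque<Integer> open, int[] ids, int id) {
        for (int index : open) {
            if (ids[index] == id) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Tag[] crossed = { Tag.OPENING, Tag.OPENING, Tag.CLOSING, Tag.CLOSING };
        // prints [OPENING, ISOLATE, CLOSING, ISOLATE]
        System.out.println(Arrays.toString(balance(crossed, new int[] { 1, 2, 1, 2 })));
    }
}
```

With properly nested input (ids 1, 2, 2, 1) the same routine keeps both pairs as OPENING/CLOSING, so only crossing codes are demoted.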

But running the test code below actually produces these markers:

[0] OPENING, [1] OPENING, [2] CLOSING, [3] CLOSING

This would cause the XLIFF Writer to write: <g id="1">abc <g id="2">xxx</g> def</g>

This is not acceptable because it changes the meaning of the text, as if the second run were nested entirely inside the first:

abc xxx def
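
The effect of the marker choice on the written output can be reproduced with a toy renderer. This is only a sketch: it assumes that an OPENING marker becomes `<g id="N">`, a CLOSING marker becomes `</g>`, and an ISOLATE marker becomes an empty `<x id="N"/>` element, following the examples in this report. The class and method names are hypothetical and this is not the real XLIFF Writer.

```java
final class InlineXliffSketch {
    enum Marker { OPENING, CLOSING, ISOLATE }

    /** Renders codes as inline elements; texts[i] is the text that follows code i. */
    static String render(Marker[] markers, int[] ids, String[] texts) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < markers.length; i++) {
            switch (markers[i]) {
                case OPENING:
                    sb.append("<g id=\"").append(ids[i]).append("\">");
                    break;
                case CLOSING:
                    sb.append("</g>"); // XML closes the innermost open <g>, whatever its id
                    break;
                default:
                    sb.append("<x id=\"").append(ids[i]).append("\"/>");
                    break;
            }
            sb.append(texts[i]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] texts = { "abc ", "xxx", " def", "" };
        Marker[] desired = { Marker.OPENING, Marker.ISOLATE, Marker.CLOSING, Marker.ISOLATE };
        Marker[] actual = { Marker.OPENING, Marker.OPENING, Marker.CLOSING, Marker.CLOSING };
        // prints <g id="1">abc <x id="2"/>xxx</g> def<x id="3"/>
        System.out.println(render(desired, new int[] { 1, 2, 1, 3 }, texts));
        // prints <g id="1">abc <g id="2">xxx</g> def</g>
        System.out.println(render(actual, new int[] { 1, 2, 1, 2 }, texts));
    }
}
```

In the second output the first `</g>` necessarily closes the inner element, even though the code it came from was the closing code of the first run. That is how the crossed runs silently turn into nested ones.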

Reference: https://groups.google.com/forum/#!topic/okapi-devel/UWFLMjhbc3g

    @Test
    public void testCrossedCodePairsWithCodeIdAssignments () {
        // "<t>abc<u>xxx</t>def</u>"
        // <u> and </u> should be marked isolated so XLIFF is well-formed.
        TextFragment tf = new TextFragment();
        Code code = new Code(TagType.OPENING, "ttag", "<t>");
        code.setId(1);
        tf.append(code);
        tf.append("abc");
        code = new Code(TagType.OPENING, "utag", "<u>");
        code.setId(2);
        tf.append(code);
        tf.append("xxx");
        code = new Code(TagType.CLOSING, "ttag", "</t>");
        code.setId(1); // Same id as its opening tag
        tf.append(code);
        tf.append("def");
        code = new Code(TagType.CLOSING, "utag", "</u>");
        code.setId(2); // Same id as its opening tag
        tf.append(code);

        // The ids must survive unchanged: 1, 2, 1, 2
        assertThat(tf.getCodes().size(), equalTo(4));
        assertThat(tf.getCodes().get(0).getId(), equalTo(1));
        assertThat(tf.getCodes().get(1).getId(), equalTo(2));
        assertThat(tf.getCodes().get(2).getId(), equalTo(1));
        assertThat(tf.getCodes().get(3).getId(), equalTo(2));
        String t = tf.getCodedText();
        // Expected markers: opening, isolated, closing, isolated
        assertThat(t, equalTo("\uE101\uE110abc\uE103\uE111xxx\uE102\uE112def\uE103\uE113"));
        // Actual: "\uE101\uE110abc\uE101\uE111xxx\uE102\uE112def\uE102\uE113"
    }
