TextFragment.balanceMarkers() does not handle overlapping Code pairs properly

Issue #791 open
Kuro Kurosaka created an issue

This was observed during development of the subfiltering step. The subfiltering step applies a subfilter (HTML Filter only at this time) to TextUnits from the main filter (OpenXmlFilter, for example). As a result, code pairs from the main filter may overlap with code pairs from the subfilter, resulting in what we call the overlapping runs situation.

For example, imagine there is a Word file that contains this line:

abc <i>xxx def</i>

The OpenXmlFilter puts an opening code before "abc" and a closing code after "xxx" to mark the run of text that should be rendered bold.

The HTML filter turns <i> and </i> into a code pair.

As a result, we have two overlapping runs of text (image: issue-791-overlaping-runs.png).

They would be expressed using Codes in TextFragment as:

[0]abc [1]xxx[2] def[3]

where [i] stands for a marker plus code-index pair that points to the i-th Code in the TextFragment. Each marker can be one of three kinds: OPENING, CLOSING, or ISOLATE.
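To make the marker plus index notation concrete, here is a minimal, self-contained sketch that turns a coded-text string into a readable listing. The character values are assumptions read off the strings in the unit test in this issue (U+E101, U+E102, U+E103 for the three marker kinds, and U+E110 as the base for the code index); the class and method names are hypothetical and are not part of Okapi.

```java
// Hypothetical helper, not part of Okapi: prints coded text in a readable form.
class CodedTextDump {
    // Assumed values, taken from the coded-text strings in this issue's unit test.
    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';
    static final char CHARBASE = '\uE110'; // index char = CHARBASE + code index

    static String describe(String codedText) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < codedText.length(); i++) {
            char ch = codedText.charAt(i);
            String name;
            switch (ch) {
                case MARKER_OPENING: name = "OPENING"; break;
                case MARKER_CLOSING: name = "CLOSING"; break;
                case MARKER_ISOLATED: name = "ISOLATED"; break;
                default: name = null;
            }
            if (name != null && i + 1 < codedText.length()) {
                // The character after a marker carries the index of the Code.
                int index = codedText.charAt(++i) - CHARBASE;
                sb.append('[').append(name).append(' ').append(index).append(']');
            } else {
                sb.append(ch); // ordinary text
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The "Actual" string reported in this issue.
        System.out.println(describe(
            "\uE101\uE110abc\uE101\uE111xxx\uE102\uE112def\uE102\uE113"));
        // prints [OPENING 0]abc[OPENING 1]xxx[CLOSING 2]def[CLOSING 3]
    }
}
```

Read this way, the reported result nests the second pair inside the first, even though the two runs actually cross.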

A unit test case listed below was added to TextFragmentTest to examine the behavior of TextFragment.balanceMarkers() handling overlapping code pairs.

The most desirable result of balanceMarkers() might be to use these markers: [0] OPENING, [1] ISOLATE, [2] CLOSING, [3] ISOLATE.

This is desirable because the XLIFF writer can then write well-formed output: <g id="1">abc <x id="2"/>xxx</g> def<x id="3"/>

Another acceptable result might be to use ISOLATE markers for all codes.

But the result of running the following code is that these markers are used: [0] OPENING, [1] OPENING, [2] CLOSING, [3] CLOSING.

This would cause the XLIFF writer to write: <g id="1">abc <g id="2">xxx</g> def</g>

This is not acceptable because it changes the meaning of the text, as if it were:

abc <i>xxx</i> def
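One possible way to reach the desirable assignment is a stack-based pass that pairs a closing code only with the nearest still-open code of the same id, and demotes anything that crosses that pair to isolated. The sketch below is an illustration under that assumption, not Okapi's balanceMarkers() implementation; the class, enum, and method names are hypothetical, and codes are reduced to a declared type plus an id.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

// Hypothetical sketch, not Okapi code: assigns markers to possibly crossed code pairs.
class CrossedPairBalancer {
    enum Marker { OPENING, CLOSING, ISOLATED }

    // declared[i] is the tag type of code i as created; ids[i] is its pairing id.
    static Marker[] balance(Marker[] declared, int[] ids) {
        Marker[] result = new Marker[declared.length];
        Deque<Integer> open = new ArrayDeque<>(); // indices of openings not yet closed
        for (int i = 0; i < declared.length; i++) {
            if (declared[i] == Marker.OPENING) {
                result[i] = Marker.OPENING; // tentative, may be demoted later
                open.push(i);
            } else if (declared[i] == Marker.CLOSING && hasOpener(open, ids, ids[i])) {
                // Openings above the matching one cross this pair: isolate them.
                while (ids[open.peek()] != ids[i]) {
                    result[open.pop()] = Marker.ISOLATED;
                }
                open.pop(); // the matching opening keeps its OPENING marker
                result[i] = Marker.CLOSING;
            } else {
                // A closing code whose opening is missing or was already isolated.
                result[i] = Marker.ISOLATED;
            }
        }
        // Openings that never found a closing code are isolated as well.
        while (!open.isEmpty()) {
            result[open.pop()] = Marker.ISOLATED;
        }
        return result;
    }

    private static boolean hasOpener(Deque<Integer> open, int[] ids, int id) {
        for (int idx : open) {
            if (ids[idx] == id) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // <t>abc<u>xxx</t>def</u> with ids 1, 2, 1, 2
        Marker[] declared = {Marker.OPENING, Marker.OPENING, Marker.CLOSING, Marker.CLOSING};
        System.out.println(Arrays.toString(balance(declared, new int[] {1, 2, 1, 2})));
        // prints [OPENING, ISOLATED, CLOSING, ISOLATED]
    }
}
```

With this choice the first pair survives as a real pair and the crossing pair is isolated. Isolating every code instead would also be acceptable, as noted above, but it would lose the pairing information.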

Reference: https://groups.google.com/forum/#!topic/okapi-devel/UWFLMjhbc3g

    @Test
    public void testCrossedCodePairsWithCodeIdAssignments () {
        // "<t>abc<u>xxx</t>def</u>"
        // <u> and </u> should be marked isolated so XLIFF is well-formed.
        TextFragment tf = new TextFragment();
        Code code = new Code(TagType.OPENING, "ttag", "<t>");
        code.setId(1);
        tf.append(code);
        tf.append("abc");
        code = new Code(TagType.OPENING, "utag", "<u>");
        code.setId(2);
        tf.append(code);
        tf.append("xxx");
        code = new Code(TagType.CLOSING, "ttag", "</t>");
        code.setId(1); // Same id as its opening tag
        tf.append(code);
        tf.append("def");
        code = new Code(TagType.CLOSING, "utag", "</u>");
        code.setId(2); // Same id as its opening tag
        tf.append(code);

        // All four codes are present and keep the ids assigned above.
        assertThat(tf.getCodes().size(), equalTo(4));
        assertThat(tf.getCodes().get(0).getId(), equalTo(1));
        assertThat(tf.getCodes().get(1).getId(), equalTo(2));
        assertThat(tf.getCodes().get(2).getId(), equalTo(1));
        assertThat(tf.getCodes().get(3).getId(), equalTo(2));
        String t1 = tf.getCodedText();
        assertThat(t1, equalTo("\uE101\uE110abc\uE103\uE111xxx\uE102\uE112def\uE103\uE113"));
        // Actual: "\uE101\uE110abc\uE101\uE111xxx\uE102\uE112def\uE102\uE113"
    }

See also: issue #792

Comments (7)

  1. Kuro Kurosaka reporter
    • edited description

    Edited to separate the idempotent issue and the improper handling of overlapping code pairs.

  2. Kuro Kurosaka reporter
    • edited description

