OpenXML filter not stripping change tracking markup in some cases

Issue #768 resolved
Chase Tingley created an issue

I have an example of a DOCX file in which some <w:ins> markup is exposed as an inline code. The filter is supposed to strip this markup automatically (assuming "automatically accept revisions" is enabled; otherwise we throw an exception).

The file is a customer file, so this will need a clean testcase.

Comments (8)

  1. Denis Konovalyenko

    @tingley, it seems that this happens in the "complex field implementations" processing. For example (the full version of this snippet will be attached as a file):

                <w:r>
                    <w:t xml:space="preserve">The first run.</w:t>
                </w:r>
                <w:r w:rsidR="000C2A81" w:rsidRPr="000C2A81">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:fldChar w:fldCharType="begin"/>
                </w:r>
                <w:r w:rsidR="000C2A81" w:rsidRPr="000C2A81">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:instrText xml:space="preserve"> HYPERLINK "https://www.hyperlink.com" </w:instrText>
                </w:r>
                <w:r w:rsidR="000C2A81" w:rsidRPr="000C2A81">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:fldChar w:fldCharType="separate"/>
                </w:r>
                <w:ins w:id="1" w:author="Denis Konovalyenko" w:date="2019-01-07T20:11:00Z">
                    <w:r w:rsidRPr="000C2A81">
                        <w:rPr>
                            <w:b/>
                        </w:rPr>
                        <w:t>inside the ins revision</w:t>
                    </w:r>
                </w:ins>
                <w:r w:rsidR="000C2A81" w:rsidRPr="000C2A81">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:fldChar w:fldCharType="end"/>
                </w:r>
                <w:r w:rsidRPr="000C2A81">
                    <w:rPr>
                        <w:b/>
                    </w:rPr>
                    <w:t>The last run.</w:t>
                </w:r>
    

    The above part is extracted as

    The first run.<g id="1"><x id="2"/>inside the ins revision<x id="3"/>The last run.</g>
    

    The <x id="2"/> code contains not only the <w:ins> tag but all events from <w:fldChar w:fldCharType="begin"/> to the </w:rPr> just before <w:t>inside the ins revision</w:t>, while the <x id="3"/> code contains the rest, starting from </w:r> and ending with <w:fldChar w:fldCharType="end"/>.
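
    To illustrate why the revision ends up inside the field codes, here is a rough Python sketch (not the filter's actual code) that walks a paragraph and reports any <w:ins> element sitting between a fldChar "begin" and its matching "end". The function name ins_inside_fields and the trimmed SAMPLE paragraph are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Trimmed version of the snippet above (run properties omitted).
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>The first run.</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> HYPERLINK "https://www.hyperlink.com" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:ins w:id="1"><w:r><w:t>inside the ins revision</w:t></w:r></w:ins>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:t>The last run.</w:t></w:r>
</w:p>"""


def ins_inside_fields(paragraph):
    """Return the w:ins children that start while a complex field is open."""
    depth = 0  # number of complex fields currently open (fields can nest)
    found = []
    for child in paragraph:
        if child.tag == f"{{{W}}}ins" and depth > 0:
            found.append(child)
        # Update the field state from any fldChar in this child.
        for fld in child.iter(f"{{{W}}}fldChar"):
            kind = fld.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth -= 1
    return found


if __name__ == "__main__":
    for ins in ins_inside_fields(ET.fromstring(SAMPLE)):
        print("revision inside a complex field, w:id =", ins.get(f"{{{W}}}id"))
```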

    I assume that even if we make the filter recognise revisions inside such complex fields, it will not change the overall picture (the x codes will remain). Also, if we really want to make the filter smarter about complex fields (covering the runs, revisions, etc. that they contain), it would probably take a good amount of time and significant effort, and in my opinion this case would not come up that often.

    Could you please let me know whether this is still a priority for you?

  2. Denis Konovalyenko

    @tingley, I would like to clarify that there are roughly two possible ways of dealing with revisions inside complex fields.

    A lighter one. The revisions are accepted and are not shown after the merge. However, the same sort of x codes would still be present after extraction.

    A harder one. The revisions are accepted as part of a new way of representing a complex field after extraction (i.e. it becomes a kind of run container, though this needs to be thought through carefully). Here we can have some variations on whether fldSimple tags are recognised or not, and whether all possible fields are supported or not (the full scope of this can be found in the "Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference" under the "17.16.1 Syntax" section).

    When I said that the implementation would require a lot of time, I was thinking about the harder way. I will try to move on with the lighter variant and make the changes.
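
    For reference, the lighter variant amounts to something like the Python sketch below: unwrap every <w:ins> so that its runs stay in place, and leave the field characters alone. This is only an illustration, not the filter's implementation. The accept_revisions helper and the SAMPLE fragment are hypothetical, and dropping <w:del> is my assumption about what accepting revisions should also cover.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace and the tags this sketch cares about.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
INS = f"{{{W}}}ins"
DEL = f"{{{W}}}del"

# Hypothetical fragment: the result part of a complex field with revisions.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:ins w:id="1"><w:r><w:t>inside the ins revision</w:t></w:r></w:ins>
  <w:del w:id="2"><w:r><w:delText>dropped</w:delText></w:r></w:del>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""


def accept_revisions(parent):
    """Unwrap w:ins (keeping its children) and drop w:del, in place."""
    for child in list(parent):
        accept_revisions(child)  # handle nested revisions first
        if child.tag == INS:
            index = list(parent).index(child)
            parent.remove(child)
            # Put the inserted runs where the w:ins wrapper used to be.
            for offset, grandchild in enumerate(list(child)):
                parent.insert(index + offset, grandchild)
        elif child.tag == DEL:
            parent.remove(child)  # assumption: accepted deletions vanish


if __name__ == "__main__":
    paragraph = ET.fromstring(SAMPLE)
    accept_revisions(paragraph)
    print(ET.tostring(paragraph, encoding="unicode"))
```

    The field characters are untouched here, so the extraction would still produce x codes around the accepted text.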

  3. Chase Tingley reporter

    Simply accepting the insertions and leaving the field codes intact seems like it should be fine. Maybe I am not understanding something about how the bug is happening, but I don't see why we need to change the field code behavior.

    Let's try the "light" approach, and see if that solves the corruption. We can also assess the resulting segment quality.
