OpenXML filter not stripping change tracking markup in some cases
I have an example of a DOCX file in which some <w:ins> markup is exposed as an inline code. The behavior of the filter is supposed to be that we strip this stuff automatically (assuming "automatically accept revisions" is enabled, otherwise we throw an exception).
The file is a customer file, so this will need a clean testcase.
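As a rough illustration of the expected behavior (not the filter's actual code), the sketch below accepts revisions on a hand-written paragraph: <w:ins> wrappers are unwrapped, <w:del> content is dropped, and an exception is raised when automatic acceptance is off. The names accept_revisions, RevisionsPresentError and SAMPLE are hypothetical, and real revision markup covers more elements than these two.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
INS = f"{{{W}}}ins"
DEL = f"{{{W}}}del"

# Hand-written sample, not taken from the customer file.
SAMPLE = (
    f'<w:p xmlns:w="{W}">'
    '<w:r><w:t>kept </w:t></w:r>'
    '<w:ins w:id="1"><w:r><w:t>inserted</w:t></w:r></w:ins>'
    '<w:del w:id="2"><w:r><w:delText>deleted</w:delText></w:r></w:del>'
    '</w:p>'
)


class RevisionsPresentError(Exception):
    """Hypothetical error: revisions exist but auto-accept is disabled."""


def _accept(parent):
    """Rebuild the child list: drop w:del, replace w:ins by its children."""
    kept = []
    for child in list(parent):
        if child.tag == DEL:
            continue  # accepting a deletion removes the deleted content
        _accept(child)  # handle nested revisions first
        if child.tag == INS:
            kept.extend(list(child))  # accepting an insertion keeps its runs
        else:
            kept.append(child)
    parent[:] = kept


def accept_revisions(root, auto_accept=True):
    """Accept w:ins / w:del in place, or raise if auto_accept is False."""
    has_revisions = any(el.tag in (INS, DEL) for el in root.iter())
    if has_revisions and not auto_accept:
        raise RevisionsPresentError("document contains tracked changes")
    _accept(root)
    return root
```

Calling accept_revisions(ET.fromstring(SAMPLE)) leaves two plain runs with the texts "kept " and "inserted" and no revision elements.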
Comments (8)
-
-
- attached 768.docx
The mentioned markup snippet in its full version.
-
- attached 768-2.docx
A more complicated example of what might happen with the revisions inside the complex fields.
-
@tingley, I would like to clarify that there are roughly 2 possible ways of dealing with the revisions inside the complex fields.
A lighter one. The revisions are accepted and they are not shown after the merge. However, the same sort of x codes would still be present after the extraction.
A harder one. The revisions are accepted in the scope of a new way a complex field is represented after the extraction (i.e. it becomes a kind of run container, though this needs to be thought out well). Here we can have some variations on whether fldSimple tags are recognised or not, and whether all possible fields are supported or not (the full scope of this can be found in the "Ecma Office Open XML Part 1 - Fundamentals And Markup Language Reference" under the "17.16.1 Syntax" section).
When I said that the implementation would require a lot of time, I was thinking about the harder way. I will try to move on with the lighter variant and make the changes.
-
reporter Simply accepting the insertions and leaving the field codes intact seems like it should be fine. Maybe I am not understanding something about how the bug is happening, but I don't see why we need to change the field code behavior.
Let's try the "light" approach, and see if that solves the corruption. We can also assess the resulting segment quality.
-
@tingley, the related pull request #282 has been opened. This is the "light" approach.
-
The pull request #282 has been merged.
-
- changed status to resolved
@tingley , it seems that this is the case with the "complex field implementations" processing. E.g. (the full version of this snippet is going to be attached as a file):
The above part is extracted as
The <x id="2"/> code contains not only the <ins> tag but all events from <w:fldChar w:fldCharType="begin"/> to </w:rPr> just before <w:t>inside the ins revision</w:t>, while the <x id="3"/> code contains the rest, starting from </w:r> and ending with <w:fldChar w:fldCharType="end"/>.
I assume that even if we make the filter recognise the revisions inside such complex fields, it will not change the overall picture (the x code will remain). Also, if we really want to make the filter smarter in processing complex fields (covering the entailed runs, revisions, etc.), it would probably require a good amount of time and significant effort, but in my opinion this would not be needed that frequently.
Could you please let me know whether this is still a priority for you?
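For building a clean testcase, a small check like the following might help confirm that a document really has revisions nested inside a complex field. This is only a sketch: the SAMPLE paragraph is invented for illustration (it is not the customer snippet), the helper name revisions_inside_fields is hypothetical, and a revision that wraps the whole field from outside is not detected.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def q(name):
    """Return the namespace-qualified tag or attribute name."""
    return f"{{{W}}}{name}"


# Invented complex field whose result run sits inside a w:ins revision.
SAMPLE = f"""
<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText> HYPERLINK "http://example.com" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:ins w:id="1" w:author="someone">
    <w:r><w:rPr><w:b/></w:rPr><w:t>inside the ins revision</w:t></w:r>
  </w:ins>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>
""".strip()


def revisions_inside_fields(root):
    """Collect w:ins / w:del elements that open between fldChar begin and end."""
    depth = 0  # nesting level of complex fields at the current position
    found = []
    for el in root.iter():  # iter() walks elements in document order
        if el.tag == q("fldChar"):
            kind = el.get(q("fldCharType"))
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth -= 1
        elif el.tag in (q("ins"), q("del")) and depth > 0:
            found.append(el)
    return found
```

Running revisions_inside_fields(ET.fromstring(SAMPLE)) returns a list containing the single w:ins element, which marks the paragraph as a candidate for the testcase.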