OpenXML: Allow for aggressive tag cleaning to streamline converted PDFs

This is a bit esoteric, but it has come out of the testing that the MateCat team has been doing. A common practice for them is to translate a PDF by converting it to DOCX and then returning a translated DOCX. This conversion produces a lot of spurious formatting markup in order to precisely reproduce the positioning seen in the PDF. In particular, we are seeing two types of problem:

Overuse of the <w:spacing> property, frequently with very small values, to precisely reproduce character spacing from the PDF. This can occur multiple times within a single word, and it damages the resulting segments because of the proliferation of inline tags.
Somewhat more rarely, the use of the <w:vertAlign> property to micromanage the vertical alignment of text. In particular, in PDF conversion it is frequently applied to whitespace, which breaks up segments for no reason.

Both of these properties have valid uses, so stripping them all the time would be over-zealous. Instead, we propose adding an option to clean up this formatting information more aggressively. If the option (disabled by default) is selected, we will:

Strip <w:spacing> information from text runs.
Strip <w:vertAlign> information that is applied to runs consisting only of whitespace.
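
The two rules above can be sketched roughly as follows. This is a minimal illustration in Python using the standard library's xml.etree.ElementTree, not the filter's actual implementation. The function name clean_runs and the aggressive flag are hypothetical, and the sketch assumes that a "whitespace-only" run is one that has at least one <w:t> element and whose combined text is empty after stripping. Note that only <w:spacing> inside run properties (<w:rPr>) is touched; paragraph-level <w:spacing> in <w:pPr> is left alone.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in Clark notation for ElementTree lookups.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def clean_runs(root, aggressive=False):
    """Hypothetical helper: apply the two proposed cleaning rules in place.

    With aggressive=False (the proposed default) the tree is left unchanged.
    """
    if not aggressive:
        return root
    # Materialize the run list first so the tree is not mutated mid-iteration.
    for run in list(root.iter(W + "r")):
        rpr = run.find(W + "rPr")
        if rpr is None:
            continue
        # Rule 1: strip character spacing from the run properties.
        for spacing in rpr.findall(W + "spacing"):
            rpr.remove(spacing)
        # Rule 2: strip vertAlign only when the run holds nothing but whitespace.
        texts = run.findall(W + "t")
        combined = "".join(t.text or "" for t in texts)
        if texts and combined.strip() == "":
            for vert_align in rpr.findall(W + "vertAlign"):
                rpr.remove(vert_align)
    return root


# Made-up sample paragraph: one run with character spacing, one
# whitespace-only superscript run, and one genuine superscript run.
SAMPLE = (
    '<w:p xmlns:w="' + W_NS + '">'
    '<w:pPr><w:spacing w:after="120"/></w:pPr>'
    '<w:r><w:rPr><w:spacing w:val="-2"/></w:rPr><w:t>Hel</w:t></w:r>'
    '<w:r><w:rPr><w:vertAlign w:val="superscript"/></w:rPr>'
    '<w:t xml:space="preserve"> </w:t></w:r>'
    '<w:r><w:rPr><w:vertAlign w:val="superscript"/></w:rPr><w:t>2</w:t></w:r>'
    '</w:p>'
)

if __name__ == "__main__":
    paragraph = clean_runs(ET.fromstring(SAMPLE), aggressive=True)
    print(ET.tostring(paragraph, encoding="unicode"))
```

In the sample, the run-level spacing and the superscript on the lone space are removed, while the paragraph spacing and the superscript on the "2" survive, which matches the intent that valid uses of both properties are preserved.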
