OpenXML: Allow for aggressive tag cleaning to streamline converted PDFs

Issue #484 resolved
Chase Tingley created an issue

This is a bit esoteric, but it's come out of the testing that the MateCat team has been doing. It's a common practice for them to translate a PDF by converting it to DOCX and then returning a translated DOCX. This conversion produces a lot of spurious formatting markup in order to precisely reproduce the positioning seen in the PDF. In particular, we are seeing two types of problem:

  • Overuse of the <w:spacing> property, frequently with very small values, to precisely reproduce character spacing from the PDF. This can occur multiple times within a single word, and the resulting proliferation of inline tags badly fragments the segments produced.
  • Somewhat more rarely, the use of the <w:vertAlign> property to micromanage the vertical alignment of text. In particular, in PDF conversion it is frequently applied to whitespace, which breaks up segments for no reason.

Both of these properties have valid uses, so stripping them all the time would be over-zealous. Instead, we propose adding an option to clean up this formatting information more aggressively. If the option (disabled by default) is selected, we will:

  • Strip <w:spacing> information from text runs.
  • Strip <w:vertAlign> information that is applied to runs consisting only of whitespace.
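
As a rough illustration of the two rules above, here is a minimal sketch in Python using the standard xml.etree.ElementTree module. It is not the filter's actual code: the function name clean_runs and the aggressive flag are hypothetical, and it assumes that a "whitespace-only" run is one whose content, apart from <w:rPr>, consists solely of <w:t> elements holding whitespace. Only <w:spacing> elements inside a run's <w:rPr> are touched, so paragraph-level spacing in <w:pPr> is left alone.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}tag" form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def clean_runs(root, aggressive=False):
    """Apply the proposed cleanup in place and return root.

    Hypothetical helper: 'aggressive' stands in for the proposed
    option, which is disabled by default.
    """
    if not aggressive:
        return root

    # Snapshot the runs first so the tree is not modified mid-iteration.
    for run in list(root.iter(W + "r")):
        rpr = run.find(W + "rPr")
        if rpr is None:
            continue

        # Rule 1: strip character spacing from every text run.
        for spacing in rpr.findall(W + "spacing"):
            rpr.remove(spacing)

        # Rule 2: strip vertAlign only when the run is whitespace-only.
        # Assumption: every non-rPr child must be a w:t with blank text.
        content = [child for child in run if child.tag != W + "rPr"]
        whitespace_only = bool(content) and all(
            child.tag == W + "t" and (child.text or "").strip() == ""
            for child in content
        )
        if whitespace_only:
            for vert_align in rpr.findall(W + "vertAlign"):
                rpr.remove(vert_align)

    return root
```

Under these assumptions, a run containing real text keeps its <w:vertAlign> (so genuine superscripts and subscripts survive), while a run holding only a space loses it and can then be merged with its neighbours.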

Comments (5)

  1. Jim Hargrave (OLD)

    I don't see a problem adding an option to more aggressively remove certain formatting. It may be a common use case (beyond PDF) to convert formats to DOCX. In these cases, some of the absolute formatting introduced by the conversion process needs to be stripped to aid translation.

    You could use the "PreprocessingFilter" (see lib-preprocessing) if the above filter changes make the normal extraction too complex.

  2. Chase Tingley reporter

    Thanks Jim. Since the issue is specific tags in specific contexts, I think it's easier for the filter to deal with these directly, rather than to expose them as events and then dig back into them a second time via lib-preprocessing. We already do a bunch of stuff like this in the ParagraphSimplifier code (which is an internal preprocessing stage).
