OpenXML: Some tabs get lost in DOCX and are not always represented by tabs in extracted text

Issue #441 resolved
Former user created an issue

Original issue 441 created by @ysavourel on 2015-02-03T12:12:49.000Z:

a) In the attached example, some tabs get lost in extraction.
b) The tab characters are also not always represented by an actual tab in the extracted file.

See also original report:
https://groups.yahoo.com/neo/groups/okapitools/conversations/messages/4526

Comments (16)

  1. Former user Account Deleted

    Comment 1. originally posted by s.kar...@24technology.de on 2015-02-03T13:49:32.000Z:

    b) also applies to soft returns: they are represented by x-tags rather than by soft returns.

  2. Chase Tingley

    The issue of tabs/returns being lost is fixed by the fixes to issue 458 and issue 467. The option to expose them as literal text is a valid option. I am going to try to sort out the changes in the pull request.

  3. Chase Tingley

    Fix Issue #441: rework to handle changes to run merging

    The changes in M28 to the run merging and related code
    (ParagraphSimplifier, etc) have changed the way this feature needs
    to be implemented.  When the appropriate option is set, tab and br
    elements are now converted directly to the corresponding character
    during paragraph simplification, and the element dropped.  The
    writer will regenerate br elements, and leave tabs inline as
    characters.
    
    This also updates the unittests for changes that have happened as
    a result of the paragraph simplification updates, as well as adding
    one more testcase (tabstyles.docx).
    

    → <<cset 4f3ad5b19daa>>
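
The conversion described in this commit message can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not the actual ParagraphSimplifier code: the helper names `run_to_text` and `text_to_run` are hypothetical, and the sketch ignores run properties entirely.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping; the real filter's option name and internals are not shown here.
CHAR_FOR = {f"{{{W}}}tab": "\t", f"{{{W}}}br": "\n"}


def run_to_text(run):
    """Flatten one w:r into text, turning w:tab / w:br children into characters."""
    parts = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag in CHAR_FOR:
            parts.append(CHAR_FOR[child.tag])
    return "".join(parts)


def text_to_run(text):
    """Rebuild a w:r: newlines become w:br again, tabs stay inline in w:t."""
    run = ET.Element(f"{{{W}}}r")
    for i, line in enumerate(text.split("\n")):
        if i > 0:
            ET.SubElement(run, f"{{{W}}}br")
        if line:
            t = ET.SubElement(run, f"{{{W}}}t")
            t.text = line
    return run


SAMPLE = (
    f'<w:r xmlns:w="{W}"><w:t>Name</w:t><w:tab/><w:t>Value</w:t>'
    "<w:br/><w:t>Next line</w:t></w:r>"
)

extracted = run_to_text(ET.fromstring(SAMPLE))
rebuilt = text_to_run(extracted)
```

Here `extracted` holds the tab and the line break as literal characters, and `rebuilt` contains a regenerated `w:br` while the tab remains a character inside the first `w:t`.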

  4. Christopher Cudennec Account Deactivated

    Hi Chase!

    I tested the latest version of your changes (as you might have seen on the dev mailing list). I came across a failing test on our side that tests the new options. In my case, a "line separator" was not inserted for a PPTX document.

    I took a look at your code changes to try to find the reason, and I think I found something. Take a look at the following snippet from my PPTX file:

    <a:p><a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Text mit einem Punkt.</a:t></a:r><a:br><a:rPr lang="de-DE" dirty="0" smtClean="0"/></a:br><a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Und einem SLB.</a:t></a:r></a:p>
    

    It consists of a paragraph with two text runs. The linebreak is located between the two runs.

    If I understand the code correctly, ParagraphSimplifier will replace "br" elements with line breaks only if the "br" element is a child of the "r" element. That's why I don't see the line separator for the document.

    Can you re-check that piece of code for PPTX documents?
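
The structural point can be checked directly against the snippet above. The following is a small Python sketch using the standard library; it only inspects the XML and makes no claim about how the filter itself walks the tree.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"

# The paragraph from the PPTX file quoted above, with the namespace declared.
PARAGRAPH = (
    f'<a:p xmlns:a="{A}">'
    '<a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Text mit einem Punkt.</a:t></a:r>'
    '<a:br><a:rPr lang="de-DE" dirty="0" smtClean="0"/></a:br>'
    '<a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Und einem SLB.</a:t></a:r>'
    "</a:p>"
)

p = ET.fromstring(PARAGRAPH)

# A lookup that only searches inside runs finds nothing ...
br_inside_runs = p.findall(f"{{{A}}}r/{{{A}}}br")
# ... because the line break is a direct child of the paragraph.
br_between_runs = p.findall(f"{{{A}}}br")
child_order = [child.tag.split("}")[1] for child in p]
```

Any logic that looks for `br` only among the children of `r` would therefore skip this line break.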

  5. Chase Tingley

    Hi Christopher,

    I do think there's a bug here -- I think the filter may be losing these linebreaks at least some of the time.

    However, I think there's a problem with converting these <a:br> elements outside of a run into linebreak characters. According to the OpenXML reference (page 3185 / section 21.1.2.2.1), <a:br> can contain run properties information that will be applied to any text that is subsequently typed on that line. It won't be easy to convert the element to a line break character and back while preserving that metadata.

  6. Christopher Cudennec Account Deactivated

    I'm afraid I don't understand your answer completely.

    Do you think the bug is located in the filter itself or in ParagraphSimplifier?

    Can I help you solve the problem? I think I have to spend some more time with your new code to get a better understanding of what the filter now does. We take great interest in getting the new feature with the next release :-).

    Cheers,

    Christopher

  7. Chase Tingley

    I think it's probably in the ParagraphSimplifier, but I'm not sure.

    If you'd like to take a look, go ahead. I may find time in the next day or two, but I may not get to it until next week.

    The issue I was trying to explain is that in some cases, treating those <a:br> elements as a literal '\n' can actually cause data loss. It's because <a:br> can contain child properties like this:

    <a:br>
      <a:rPr><!-- ... run properties ... --></a:rPr>
    </a:br>
    

    This isn't true of <w:br/>, it's only for DrawingML. Preserving these properties probably requires treating the br as a tag, or else being a little bit sneaky.
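
The data loss can be illustrated with a short Python sketch. The two helper functions are hypothetical stand-ins for the extraction and writing steps, and the `b="1"` attribute is only an example property.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
ET.register_namespace("a", A)

original = ET.fromstring(f'<a:br xmlns:a="{A}"><a:rPr lang="de-DE" b="1"/></a:br>')


def br_to_char(br):
    """Hypothetical extraction step: the whole element collapses to one character."""
    return "\n"


def char_to_br(ch):
    """Hypothetical writing step: only the character is known, so a bare element comes back."""
    if ch != "\n":
        raise ValueError("expected a line break character")
    return ET.Element(f"{{{A}}}br")


round_tripped = char_to_br(br_to_char(original))
before = ET.tostring(original, encoding="unicode")
after = ET.tostring(round_tripped, encoding="unicode")
```

After the round trip the `a:rPr` child is gone, because a single `'\n'` has nowhere to carry it.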

  8. Christopher Cudennec Account Deactivated

    Does ParagraphSimplifier replace a "br" by the string literal '\n'? "Our" version of the filter just added the literal after the tag that represents the "br".

  9. Chase Tingley

    Yes, currently it substitutes \n for the tag and then replaces \n with the tag when writing the target back out. I know this is not exactly the same behavior you submitted, but having both seemed very strange to me -- a translator could move them independently of each other.

  10. Christopher Cudennec Account Deactivated

    Hi Chase,

    I took a look at the specification and I must say that the explanation is quite odd:

    This sets the formatting of text for the line break so that if text is later inserted there that a new run can be generated with the correct formatting.

    Do you know a good use case for that feature?

    I don't think it hurts to lose the "rPr" of the "br". When replacing the linebreak in PowerPoint with text, it will be formatted like the previous run.

  11. Christopher Cudennec Account Deactivated

    Hi Chase,

    I made some code changes in another branch (https://bitbucket.org/24t/okapi/branch/openxml-441-2). We will create a pull request after some more testing.

    Basically I changed two things:

    • handle "br" elements between runs
    • strip "dirty" and "smtClean" attributes with a value of "0" when "cleanupAggressively" is enabled
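
The second change could look roughly like the Python sketch below. It is not the code from the branch: the function name `strip_noise` is hypothetical, and restricting the cleanup to `a:rPr` elements is an assumption made for the illustration.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NOISE_ATTRS = ("dirty", "smtClean")


def strip_noise(root, cleanup_aggressively):
    """Remove dirty="0" / smtClean="0" from every a:rPr when the flag is on."""
    if not cleanup_aggressively:
        return root
    for rpr in root.iter(f"{{{A}}}rPr"):
        for name in NOISE_ATTRS:
            # Only a value of "0" is dropped; any other value is left alone.
            if rpr.get(name) == "0":
                del rpr.attrib[name]
    return root


SAMPLE = (
    f'<a:p xmlns:a="{A}">'
    '<a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Text</a:t></a:r>'
    '<a:r><a:rPr lang="de-DE" dirty="1"/><a:t>mehr</a:t></a:r>'
    "</a:p>"
)

cleaned = strip_noise(ET.fromstring(SAMPLE), cleanup_aggressively=True)
attrs = [dict(rpr.attrib) for rpr in cleaned.iter(f"{{{A}}}rPr")]
```

In this sample the first run keeps only `lang`, while the second run keeps `dirty="1"` because its value is not "0".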

    Cheers,

    Christopher
