OpenXML: Some tabs get lost in DOCX and are not always represented by tabs in extracted text

Issue #441 resolved

Former user created an issue 2015-02-03

Original issue 441 created by @ysavourel on 2015-02-03T12:12:49.000Z:

a) In the attached example, some tabs get lost in extraction.
b) Tab characters are also not always represented by an actual tab in the extracted file.

Comments (16)

Former user Account Deleted
Comment 1. originally posted by s.kar...@24technology.de on 2015-02-03T13:49:32.000Z:

b) also applies to soft returns. They are represented by x-tags rather than by soft returns.
- 2015-02-03T13:49:32+00:00
Chase Tingley
- changed title to OpenXML: Some tabs get lost in DOCX and are not always represented by tabs in extracted text
- edited description
- 2015-05-02T17:26:01+00:00
Chase Tingley
- assigned issue to Chase Tingley
- 2015-06-03T12:19:03+00:00
Chase Tingley
The issue of tabs/returns being lost is fixed by the fixes to issue 458 and issue 467. Exposing them as literal text is a valid option. I am going to try to sort out the changes in the pull request.
- 2015-06-27T23:03:05+00:00

Chase Tingley

changed status to resolved

Fix Issue ~~#441~~: rework to handle changes to run merging

The changes in M28 to the run merging and related code
(ParagraphSimplifier, etc) have changed the way this feature needs
to be implemented.  When the appropriate option is set, tab and br
elements are now converted directly to the corresponding character
during paragraph simplification, and the element dropped.  The
writer will regenerate br elements, and leave tabs inline as
characters.

This also updates the unittests for changes that have happened as
a result of the paragraph simplification updates, as well as adding
one more testcase (tabstyles.docx).

→ <<cset 4f3ad5b19daa>>

2015-07-28T06:27:18+00:00

Chase Tingley
I've merged the pull request by hand with some additional changes. Thanks, Christopher!
- 2015-07-28T06:29:41+00:00
Christopher Cudennec Account Deactivated
Hi Chase!

I tested the latest version of your changes (as you might have seen on the dev mailing list). I came across a failing test on our side that tests the new options. In my case a "line separator" was not inserted for a PPTX document.

I took a look at your code changes to try to find the reason and I think I found something. Take a look at the following snippet from my PPTX file:
```
<a:p><a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Text mit einem Punkt.</a:t></a:r><a:br><a:rPr lang="de-DE" dirty="0" smtClean="0"/></a:br><a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>Und einem SLB.</a:t></a:r></a:p>
```
It consists of a paragraph with two text runs. The linebreak is located between the two runs.

If I understand the code correctly, ParagraphSimplifier replaces "br" elements with line breaks only if the "br" element is a child of the "r" element. That's why I don't see the line separator for the document.

Can you re-check that piece of code for PPTX documents?
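
The difference between the two lookups can be reproduced outside the filter. The following is a minimal sketch in Python using only the standard library, not Okapi's actual code: the helper names `breaks_inside_runs` and `breaks_between_runs` are hypothetical, and the namespace declaration is added so the snippet above parses on its own.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A}

# The paragraph quoted above, with the DrawingML namespace declared
# so that it can be parsed standalone.
PARA = (
    f'<a:p xmlns:a="{A}">'
    '<a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/>'
    '<a:t>Text mit einem Punkt.</a:t></a:r>'
    '<a:br><a:rPr lang="de-DE" dirty="0" smtClean="0"/></a:br>'
    '<a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/>'
    '<a:t>Und einem SLB.</a:t></a:r>'
    '</a:p>'
)

def breaks_inside_runs(p):
    # Hypothetical helper: only finds <a:br> that is a child of <a:r>.
    return p.findall("a:r/a:br", NS)

def breaks_between_runs(p):
    # Hypothetical helper: finds <a:br> that is a direct child of <a:p>.
    return p.findall("a:br", NS)

p = ET.fromstring(PARA)
print(len(breaks_inside_runs(p)), len(breaks_between_runs(p)))
```

A lookup restricted to children of the run finds nothing in this paragraph, while the sibling-level lookup finds the break, which would be consistent with the missing line separator.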
- 2015-08-03T14:08:08+00:00
Chase Tingley
Thanks Christopher, I'll take a look.
- 2015-08-04T05:50:09+00:00
Chase Tingley
Hi Christopher,

I do think there's a bug here -- I think the filter may be losing these linebreaks at least some of the time.

However, I think there's a problem with converting these <a:br> elements outside of a run into linebreak characters. According to the OpenXML reference (page 3185 / section 21.1.2.2.1), <a:br> can contain run properties information that will be applied to any text that is subsequently typed on that line. It won't be easy to convert the element to a line break character and back while preserving that metadata.
- 2015-08-04T22:12:45+00:00
Christopher Cudennec Account Deactivated
I'm afraid I don't understand your answer completely.

Do you think the bug is located in the filter itself or in ParagraphSimplifier?

Can I help you solve the problem? I think I have to spend some more time with your new code to get a better understanding of what the filter now does. We are very interested in getting the new feature into the next release :-).

Cheers,

Christopher
- 2015-08-05T10:08:17+00:00
Chase Tingley
I think it's probably in the ParagraphSimplifier, but I'm not sure.

If you'd like to take a look, go ahead. I may find time in the next day or two, but I may not get to it until next week.

The issue I was trying to explain is that in some cases, treating those <a:br> elements as a literal '\n' can actually cause data loss. It's because <a:br> can contain child properties like this:
```
<a:br>
  <a:rPr></a:rPr>
</a:br>
```
This isn't true of <w:br/>, it's only for DrawingML. Preserving these properties probably requires treating the br as a tag, or else being a little bit sneaky.
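
To make the data-loss concern concrete, here is a small Python sketch (standard library only, not the filter's real code). The functions `paragraph_to_text` and `text_to_paragraph` are hypothetical stand-ins for "flatten br to a newline" and "write the newline back as br"; run-level properties are deliberately left out of the sample to keep the focus on the br.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"
ET.register_namespace("a", A)

def q(name):
    # Build a namespace-qualified tag name for ElementTree.
    return f"{{{A}}}{name}"

def paragraph_to_text(p):
    # Hypothetical extraction step: keep run text, turn each <a:br>
    # into a newline character. Any <a:rPr> under the br is ignored.
    parts = []
    for child in p:
        if child.tag == q("r"):
            parts.append(child.findtext(q("t"), default=""))
        elif child.tag == q("br"):
            parts.append("\n")
    return "".join(parts)

def text_to_paragraph(text):
    # Hypothetical writer step: regenerate a bare <a:br> per newline.
    p = ET.Element(q("p"))
    for i, line in enumerate(text.split("\n")):
        if i > 0:
            ET.SubElement(p, q("br"))
        r = ET.SubElement(p, q("r"))
        ET.SubElement(r, q("t")).text = line
    return p

source = ET.fromstring(
    f'<a:p xmlns:a="{A}">'
    '<a:r><a:t>one</a:t></a:r>'
    '<a:br><a:rPr lang="de-DE"/></a:br>'
    '<a:r><a:t>two</a:t></a:r>'
    '</a:p>'
)
text = paragraph_to_text(source)
rebuilt = text_to_paragraph(text)
print(ET.tostring(rebuilt, encoding="unicode"))
```

The rebuilt paragraph still has a br between the two runs, but the `<a:rPr>` that the original br carried is gone, because a single newline character has nowhere to store it.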
- 2015-08-05T21:53:33+00:00
Christopher Cudennec Account Deactivated
Does ParagraphSimplifier replace a "br" by the string literal '\n'? "Our" version of the filter just added the literal after the tag that represents the "br".
- 2015-08-06T05:22:25+00:00
Chase Tingley
Yes, currently it substitutes \n for the tag and then replaces \n with the tag when writing the target back out. I know this is not exactly the same behavior you submitted, but having both seemed very strange to me -- they could be moved independently of each other by a translator.
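
As an illustration of the substitution described here and in the commit message above, the sketch below uses Python's standard library on a WordprocessingML run. It is an assumption-laden toy, not ParagraphSimplifier itself: `run_to_text` and `text_to_run` are hypothetical names, and only `w:t`, `w:tab` and `w:br` are handled.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W)

def q(name):
    # Build a namespace-qualified tag name for ElementTree.
    return f"{{{W}}}{name}"

def run_to_text(run):
    # Hypothetical simplification step: <w:tab/> and <w:br/> become the
    # corresponding characters and the elements themselves are dropped.
    chars = {q("tab"): "\t", q("br"): "\n"}
    parts = []
    for child in run:
        if child.tag == q("t"):
            parts.append(child.text or "")
        elif child.tag in chars:
            parts.append(chars[child.tag])
    return "".join(parts)

def text_to_run(text):
    # Hypothetical writer step: newlines are regenerated as <w:br/>,
    # tab characters are left inline in the text.
    run = ET.Element(q("r"))
    for i, line in enumerate(text.split("\n")):
        if i > 0:
            ET.SubElement(run, q("br"))
        if line:
            ET.SubElement(run, q("t")).text = line
    return run

run = ET.fromstring(
    f'<w:r xmlns:w="{W}">'
    '<w:t>Name:</w:t><w:tab/><w:t>Okapi</w:t><w:br/><w:t>second line</w:t>'
    '</w:r>'
)
text = run_to_text(run)
out = text_to_run(text)
tags = [child.tag.split("}")[1] for child in out]
print(repr(text), tags)
```

Because the break exists only as a character in the extracted text, there is a single thing for a translator to move, rather than a tag and a literal that could drift apart.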
- 2015-08-06T05:24:33+00:00
ysavourel
From: https://groups.yahoo.com/neo/groups/okapitools/conversations/messages/4702

I was testing issue ~~#441~~ with m28-Snapshot. I took the document from ~~#441~~ and made a roundtrip. Then I tried to open the merged Word file and I got this error (see Word file attached). After having a look at the xlf file I noticed that some text is missing (see xlf attached). This happened while converting the Word file into xlf. Do you also have this problem?
- 2015-08-06T12:09:24+00:00
Christopher Cudennec Account Deactivated
Hi Chase,

I took a look at the specification and I must say that the explanation is quite odd:

"This sets the formatting of text for the line break so that if text is later inserted there that a new run can be generated with the correct formatting."

Do you know a good use case for that feature?

I don't think it hurts to lose the "rPr" of the "br". When replacing the linebreak in PowerPoint with text it will be formatted like the previous run.
- 2015-08-11T07:45:53+00:00
Christopher Cudennec Account Deactivated
Hi Chase,

I made some code changes in another branch: https://bitbucket.org/24t/okapi/branch/openxml-441-2 We will create a pull request after some more testing.

Basically I changed two things:
- handle "br" elements between runs
- strip "dirty" and "smtClean" attributes with a value of "0" when "cleanupAggressively" is enabled

Cheers,

Christopher
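
The second change can be pictured with a short Python sketch (standard library only). This is not the code in the branch: the function `strip_zero_flags` and its `cleanup_aggressively` parameter are hypothetical, and limiting the cleanup to `<a:rPr>` elements is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

A = "http://schemas.openxmlformats.org/drawingml/2006/main"

def strip_zero_flags(root, cleanup_aggressively=True):
    # Hypothetical cleanup: drop dirty="0" and smtClean="0" from run
    # properties. Other values and other attributes are left untouched.
    if not cleanup_aggressively:
        return root
    for rpr in root.iter(f"{{{A}}}rPr"):
        for name in ("dirty", "smtClean"):
            if rpr.get(name) == "0":
                del rpr.attrib[name]
    return root

para = ET.fromstring(
    f'<a:p xmlns:a="{A}">'
    '<a:r><a:rPr lang="de-DE" dirty="0" smtClean="0"/><a:t>eins</a:t></a:r>'
    '<a:r><a:rPr lang="de-DE" dirty="1"/><a:t>zwei</a:t></a:r>'
    '</a:p>'
)
strip_zero_flags(para)
props = [dict(rpr.attrib) for rpr in para.iter(f"{{{A}}}rPr")]
print(props)
```

Only attributes whose value is exactly "0" are removed, so a run marked `dirty="1"` keeps its flag.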
- 2015-08-13T09:45:17+00:00

Assignee: Chase Tingley

Type: bug

Priority: minor

Status: resolved

Milestone: –

Version: –

Votes: 0

Watchers: 3