OpenXML Filter doesn't merge PPTX runs that differ only by lang attribute

Issue #775 new
Xu Lihang created an issue

When translating a PPTX file, I found that the extracted segments contain a lot of inline tags, and that removing them from the target text has almost no effect on the target file.

Like this sentence:

4、术语不统一,如点击用Touch、Tap还是Click

Okapi produces this:

<source><g id="1">4</g><g id="2">、术语不统一,如点击用</g><g id="3">Touch</g><g id="4">、</g><g id="5">Tap</g><g id="6">还是</g><g id="7">Click</g></source>

while SDL Trados produces this:

<source>4、术语不统一,如点击用Touch、Tap还是Click</source>

Looking at the internal XML file, I found that PowerPoint writes separate runs for English text and Chinese text, and that their run properties differ only in the lang and altLang attributes.

<a:p><a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>4</a:t></a:r>
<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>、术语不统一,如点击用</a:t></a:r>
<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Touch</a:t></a:r>
<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>、</a:t></a:r>
<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Tap</a:t></a:r>
<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>还是</a:t></a:r>
<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Click</a:t></a:r><a:endParaRPr lang="zh-CN" altLang="en-US" dirty="0"/></a:p>

So I suggest that tags which differ only in attributes such as lang be removed, i.e. that such runs be merged.
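
To make the suggestion concrete, here is a rough sketch in Python. It is not Okapi's code, only an illustration of the idea. It assumes that lang and altLang carry no visible formatting and can be ignored when comparing run properties. The names IGNORED_ATTRS, style_key and merge_runs are made up for this example.

```python
import xml.etree.ElementTree as ET

# DrawingML main namespace, bound to the "a:" prefix in PPTX slide XML.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Assumption: these attributes do not change how the text looks.
IGNORED_ATTRS = {"lang", "altLang"}


def q(tag):
    """Return the namespace-qualified tag name for a DrawingML element."""
    return f"{{{A_NS}}}{tag}"


def style_key(run):
    """Build a comparable key from a run's rPr, skipping ignored attributes."""
    rpr = run.find(q("rPr"))
    if rpr is None:
        return ((), ())
    attrs = tuple(sorted(
        (name, value) for name, value in rpr.attrib.items()
        if name not in IGNORED_ATTRS
    ))
    # Child elements (fills, fonts, ...) must match exactly to allow a merge.
    children = tuple(ET.tostring(child) for child in rpr)
    return (attrs, children)


def merge_runs(paragraph):
    """Merge adjacent a:r runs whose properties differ only in ignored attributes."""
    previous = None
    for child in list(paragraph):
        if child.tag != q("r"):
            previous = None  # any non-run element breaks the sequence
            continue
        prev_t = previous.find(q("t")) if previous is not None else None
        cur_t = child.find(q("t"))
        if (prev_t is not None and cur_t is not None
                and style_key(previous) == style_key(child)):
            prev_t.text = (prev_t.text or "") + (cur_t.text or "")
            paragraph.remove(child)
        else:
            previous = child
    return paragraph


# The first three runs from the paragraph above, plus its endParaRPr.
SAMPLE = (
    f'<a:p xmlns:a="{A_NS}">'
    '<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>4</a:t></a:r>'
    '<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>、术语不统一,如点击用</a:t></a:r>'
    '<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Touch</a:t></a:r>'
    '<a:endParaRPr lang="zh-CN" altLang="en-US" dirty="0"/>'
    '</a:p>'
)

paragraph = merge_runs(ET.fromstring(SAMPLE))
runs = paragraph.findall(q("r"))
print(len(runs), runs[0].find(q("t")).text)
```

Note that in this sketch the merged run simply keeps the lang and altLang of the first run, so the per-run language information is lost. A real filter would have to decide whether that is acceptable.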

Comments (9)

  1. Chase Tingley

    Ah interesting, we do merge similar tags like this in some other cases, but I don't think we consider lang.

  2. Denis Konovalyenko

    @xulihang, @tingley, as far as I remember, the DOCX format differs from PPTX mostly in the way styles are exposed: DOCX relies more on elements, while PPTX relies more on attributes. So the DOCX lang mentioned above is handled as part of the rPr properties, but we might not take care of the PPTX attributes...

  3. Denis Konovalyenko

    While this is fresh in my mind, I would like to share some thoughts.

    1. It would probably be nice to take care of the altLang attribute as well.
    2. It turns out that the endParaRPr element, which represents the end-of-paragraph run properties within a paragraph, has the same attributes to tackle (lang, altLang and others).