- changed title to OpenXML Filter generates too many tags
OpenXML Filter doesn't merge PPTX runs that differ only by lang attribute
When translating a PPTX file, I found that the extracted text contains a large number of inline tags, and that removing them from the target text has almost no effect on the target file.
Take this sentence, for example:
4、术语不统一,如点击用Touch、Tap还是Click
Okapi produces this:
<source><g id="1">4</g><g id="2">、术语不统一,如点击用</g><g id="3">Touch</g><g id="4">、</g><g id="5">Tap</g><g id="6">还是</g><g id="7">Click</g></source>
while SDL Trados produces this:
<source>4、术语不统一,如点击用Touch、Tap还是Click</source>
Looking at the internal XML file, I found that PowerPoint writes separate runs, each with its own run properties, for English text and Chinese text:
<a:p><a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>4</a:t></a:r>
<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>、术语不统一,如点击用</a:t></a:r>
<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Touch</a:t></a:r>
<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>、</a:t></a:r>
<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Tap</a:t></a:r>
<a:r><a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/><a:t>还是</a:t></a:r>
<a:r><a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/><a:t>Click</a:t></a:r><a:endParaRPr lang="zh-CN" altLang="en-US" dirty="0"/></a:p>
So I suggest that tags which differ only in attributes like lang should be removed.
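To make the suggestion concrete, here is a minimal standalone sketch in Python. It is not Okapi code and does not reflect how the OpenXML Filter is implemented. It assumes that lang and altLang are the only attributes safe to ignore (the IGNORED_ATTRS set and the helper names style_key and merge_runs are hypothetical), and it merges adjacent a:r runs whose remaining run properties are identical.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"

# Assumption: these attributes do not affect visible formatting.
IGNORED_ATTRS = {"lang", "altLang"}


def q(tag):
    """Qualify a local name with the DrawingML namespace."""
    return f"{{{A_NS}}}{tag}"


def style_key(run):
    """Fingerprint a run's <a:rPr>, skipping the ignored attributes."""
    rpr = run.find(q("rPr"))
    if rpr is None:
        return ((), ())
    attrs = tuple(sorted(
        (k, v) for k, v in rpr.attrib.items() if k not in IGNORED_ATTRS
    ))
    children = tuple(ET.tostring(child) for child in rpr)
    return (attrs, children)


def merge_runs(paragraph):
    """Merge adjacent <a:r> siblings that share the same style_key."""
    previous = None
    for child in list(paragraph):
        if child.tag != q("r"):
            previous = None  # any non-run element breaks the sequence
            continue
        prev_t = previous.find(q("t")) if previous is not None else None
        cur_t = child.find(q("t"))
        if (prev_t is not None and cur_t is not None
                and style_key(previous) == style_key(child)):
            prev_t.text = (prev_t.text or "") + (cur_t.text or "")
            paragraph.remove(child)
        else:
            previous = child
    return paragraph


# The paragraph from this report, without the line breaks between runs.
EN = '<a:rPr lang="en-US" altLang="zh-CN" dirty="0" smtClean="0"/>'
ZH = '<a:rPr lang="zh-CN" altLang="en-US" dirty="0" smtClean="0"/>'
SAMPLE = (
    f'<a:p xmlns:a="{A_NS}">'
    f"<a:r>{EN}<a:t>4</a:t></a:r>"
    f"<a:r>{ZH}<a:t>、术语不统一,如点击用</a:t></a:r>"
    f"<a:r>{EN}<a:t>Touch</a:t></a:r>"
    f"<a:r>{ZH}<a:t>、</a:t></a:r>"
    f"<a:r>{EN}<a:t>Tap</a:t></a:r>"
    f"<a:r>{ZH}<a:t>还是</a:t></a:r>"
    f"<a:r>{EN}<a:t>Click</a:t></a:r>"
    '<a:endParaRPr lang="zh-CN" altLang="en-US" dirty="0"/>'
    "</a:p>"
)

paragraph = merge_runs(ET.fromstring(SAMPLE))
runs = paragraph.findall(q("r"))
merged_text = "".join(t.text for t in paragraph.iter(q("t")))
print(len(runs), merged_text)
```

In this sketch the merged run keeps the properties of the first run, so the whole sentence ends up under lang="en-US". Whether that is acceptable, or whether the lang value should be dropped or chosen differently, is a design decision this example does not settle.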
Comments (9)
-
reporter - edited description
-
Ah, interesting. We do merge similar tags like this in some other cases, but I don't think we consider lang.
- marked as bug
-
- changed title to OpenXML Filter doesn't merge runs that differ only by lang attribute
-
@xulihang, @tingley, as far as I remember, the DOCX format differs from PPTX mostly in the way styles are exposed: DOCX relies more on elements, while PPTX relies more on attributes. So the DOCX lang mentioned above is handled within the scope of the rPr properties; however, we might not take care of the PPTX attributes...
-
Good point @DenisKonovalyenko; I updated the title
-
While this is fresh in my mind, I would like to share some thoughts.
- It would probably be nice to take care of the altLang attribute as well.
- It turns out that the endParaRPr element, which represents the end-of-paragraph run properties within a paragraph, has the same attributes to tackle (lang, altLang and others).