Open XML (PPTX) does not always respect font properties (all caps)
Occasionally, the “All Caps” font property has been propagated to other areas that do not originally have that property marked. Yves believes that this might be happening because of tag simplification in the filter.
In the attached files, the source has 17 instances of the “cap” attribute (the attribute that marks whether the text within its scope should be in “All Caps”) in slide1.xml. The (machine translated) Spanish has 8. In addition to having fewer “cap” attribute mentions, they are sometimes wrong - see the lists in the slide, where they are not in all caps in the source, but they are in the target.
Comments (10)
- changed milestone to 1.41.0
-
assigned issue to
-
@Dale Eggett, @Chase Tingley, it looks like we do try to simplify/minify styles on parsing in the net.sf.okapi.filters.openxml.RunParser#parseRunProperties method:
    runBuilder.setRunProperties( direct.minified( this.styleDefinitions.combinedRunProperties( this.paragraphStyle, runStyle, new RunProperties.Empty() ) ) );
While it works great for WordprocessingML, it may fail for DrawingML due to incomplete style hierarchy processing. Namely, the combined run properties appear to be empty in most cases (except notes), which consequently allows the minification to remove significant properties (cap="none" in this issue's case).
Furthermore, in order to properly gather the styles, the following parts have to be considered:
- slideX.xml (the xth slide)
- slideLayoutX.xml (the layout corresponding to the slide in question)
- slideMasterX.xml (the master corresponding to the layout in question)
- themeX.xml (and any theme overrides if the slide has one)
- presentation.xml (this is the default)
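To illustrate why an empty set of combined properties is a problem for the minification mentioned above, here is a minimal sketch. It is not the Okapi implementation: the class `MinifySketch`, the constant `ASSUMED_DEFAULT_CAP`, and the rule "drop a direct property when it equals the inherited (or assumed default) value" are all assumptions made for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the class and its minification rule are assumptions, not the Okapi API.
class MinifySketch {
    // Assumed value of "cap" when nothing in the hierarchy sets it
    static final String ASSUMED_DEFAULT_CAP = "none";

    // Keeps only the direct properties that differ from what is believed to be inherited
    static Map<String, String> minified(Map<String, String> direct, Map<String, String> combined) {
        Map<String, String> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : direct.entrySet()) {
            String inherited = combined.get(e.getKey());
            if (inherited == null && "cap".equals(e.getKey())) {
                inherited = ASSUMED_DEFAULT_CAP;
            }
            if (!e.getValue().equals(inherited)) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> direct = Collections.singletonMap("cap", "none");
        // Hierarchy gathered: an ancestor asks for all caps, so cap="none" is significant and kept
        System.out.println(minified(direct, Collections.singletonMap("cap", "all")));
        // Hierarchy not gathered: cap="none" looks redundant and is dropped
        System.out.println(minified(direct, Collections.<String, String>emptyMap()));
    }
}
```

Under these assumptions, the second call returns an empty map: the run loses its explicit cap="none" and would pick up all caps from the slide master when the document is rendered.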
In particular, the styling information for text can be in one or more of the following:
- a:r/a:rPr (currently, these are programmatically moved to the next level if possible)
- a:pPr/a:defRPr in slideX.xml and slideLayoutX.xml
- a:lstStyle/(a:defPPr | a:lvlXpPr)/a:defRPr in slideX.xml and slideLayoutX.xml
- p:txStyles/(p:titleStyle | p:bodyStyle | p:otherStyle)/a:lvlXpPr/a:defRPr in slideMasterX.xml
- a:objectDefaults/(a:spDef | a:lnDef | a:txDef)/a:lstStyle/(a:defPPr | a:lvlXpPr)/a:defRPr in themeX.xml
- p:defaultTextStyle/(a:defPPr | a:lvlXpPr)/a:defRPr in presentation.xml
Besides the fact that there is no clear information on the style lookup order, it turns out that the MS PowerPoint implementation deviates "surprisingly" from the spec for the a:lstStyle/(a:defPPr | a:lvlXpPr)/a:defRPr case. The a:lstStyle/a:defPPr is not taken into consideration at all, and the a:lstStyle/a:lvl1pPr is used instead! LibreOffice tries to comply with the spec (a:lstStyle/a:defPPr), but only when there is no a:lstStyle/a:lvl1pPr available… I guess we have no option other than to follow either MS PowerPoint or LibreOffice here. Could you please let me know your preferences?
-
Ugh. I guess we need to align with the implementation rather than the standard, then. @Denis Konovalyenko - how much of that style hierarchy parsing do we currently do for PPTX?
-
@Chase Tingley, I think we are parsing every aforementioned style: defaultTextStyle, lstStyle, bodyStyle, otherStyle, titleStyle, notesStyle. However, they are only read for immediate conversion to Markup, which is used later for the font mapping (issue #958)… So I guess we would need more control over the order in which the styles are read and chosen for lookup at the time of minification. And that is going to be something of a trial-and-error experiment.
-
A deeper dive into the style resolution order has given the following results.
Firstly, the presentation default text style (p:defaultTextStyle) is only considered by LibreOffice (MS Office shows a document repair dialog). It has to be used if a slide is not associated with a slide master or if no styling information has been otherwise specified for the text within the presentation slide. I was not able to find a way of creating such a document structure with the applications themselves (PowerPoint or Impress), so the layouts and slide masters were manually removed from default-text-style.pptx. A related default-text-style.pptx document is attached.
presentation.xml:
<p:defaultTextStyle>
  <a:defPPr>
    <a:defRPr lang="ru-RU" cap="all"/>
  </a:defPPr>
  <a:lvl1pPr marL="0" algn="l" defTabSz="914400" rtl="0" eaLnBrk="1" latinLnBrk="0" hangingPunct="1">
    <a:defRPr sz="1800" kern="1200">
      <a:solidFill>
        <a:schemeClr val="tx1"/>
      </a:solidFill>
      <a:latin typeface="+mn-lt"/>
      <a:ea typeface="+mn-ea"/>
      <a:cs typeface="+mn-cs"/>
    </a:defRPr>
  </a:lvl1pPr>
</p:defaultTextStyle>
slide1.xml:
<a:p>
  <a:pPr algn="ctr">
    <a:lnSpc>
      <a:spcPct val="90000"/>
    </a:lnSpc>
  </a:pPr>
  <a:r>
    <a:rPr b="0" lang="ru-RU" sz="6000" spc="-1" strike="noStrike">
      <a:solidFill>
        <a:srgbClr val="000000"/>
      </a:solidFill>
      <a:latin typeface="Calibri Light"/>
    </a:rPr>
    <a:t>Title 1</a:t>
  </a:r>
  <a:endParaRPr b="0" lang="ru-RU" sz="6000" spc="-1" strike="noStrike">
    <a:solidFill>
      <a:srgbClr val="000000"/>
    </a:solidFill>
    <a:latin typeface="Calibri"/>
  </a:endParaRPr>
</a:p>
LibreOffice rendering:
Secondly, the theme a:objectDefaults/(a:spDef|a:lnDef|a:txDef)/a:lstStyle/a:lvlXpPr/a:defRPr styles provide default information which can only be used to format new insertions of shapes or text into a document. Thus, they do not affect the way the styles are resolved when a document is being read.
Thirdly, the remaining slide/slideLayout/slideMaster parts of the texts.pptx document have been examined and the following style resolution order has been revealed (the paragraph style level (lvl) is not specified):
1. Slide run properties (a:rPr)
2. Slide paragraph properties (a:pPr/a:defRPr)
3. Slide shape paragraph properties (a:lstStyle/a:lvl1pPr/a:defRPr). Implementation note: consider the a:lstStyle/a:defPPr/a:defRPr when they are absent.
4. Slide layout shape paragraph properties (see #3 for more information)
5. Slide master shape paragraph properties (see #3 for more information)
6. Slide master p:txStyles/(p:titleStyle|p:bodyStyle|p:otherStyle)/a:lvl1pPr/a:defRPr. Implementation note: consider the a:defPPr/a:defRPr when there is no a:lvl1pPr/a:defRPr.
7. Presentation defaults: p:defaultTextStyle/a:lvlXpPr/a:defRPr
Last but not least, the way the a:lstStyle formatting from a slide layout or a slide master is applied to a particular text in a slide still has to be worked out (may be complicated).
As always, for more information please refer to the additional documents attached.
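The lookup order listed above can be sketched as a simple "first level that defines the property wins" walk. This is only an illustration under stated assumptions: `StyleChainSketch`, `levelOrDefault` and `resolve` are hypothetical names, each level is reduced to a plain map of attribute names to values, and the fallback from a:lvl1pPr to a:defPPr follows the implementation notes rather than any confirmed Okapi code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the lookup order; not Okapi code.
class StyleChainSketch {

    // Picks the a:lvl1pPr properties when that element is present, otherwise the
    // a:defPPr ones (per the implementation notes); either argument may be null.
    static Map<String, String> levelOrDefault(Map<String, String> lvl1pPr,
                                              Map<String, String> defPPr) {
        if (lvl1pPr != null) {
            return lvl1pPr;
        }
        if (defPPr != null) {
            return defPPr;
        }
        return Collections.emptyMap();
    }

    // Walks the chain from the most specific level to the least specific one and
    // returns the first value found, or null when no level defines the property.
    static String resolve(List<Map<String, String>> chain, String property) {
        for (Map<String, String> level : chain) {
            String value = level.get(property);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, String> empty = Collections.emptyMap();
        Map<String, String> slideRun = Collections.singletonMap("cap", "none");
        Map<String, String> masterTitleStyle = Collections.singletonMap("cap", "all");

        List<Map<String, String>> chain = Arrays.asList(
            slideRun,                       // slide a:rPr
            empty,                          // slide a:pPr/a:defRPr
            levelOrDefault(null, null),     // slide shape a:lstStyle
            levelOrDefault(null, null),     // slide layout shape a:lstStyle
            levelOrDefault(null, null),     // slide master shape a:lstStyle
            levelOrDefault(masterTitleStyle, null), // slide master p:txStyles
            empty                           // presentation p:defaultTextStyle
        );
        System.out.println(resolve(chain, "cap"));
    }
}
```

With the run-level cap="none" in place, the walk stops at the first level; if that level were empty, the value would come from the slide master title style instead, which matches the symptom reported in this issue.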
-
- attached slide-master-title-style.png
- attached slide-master-other-style.pptx
- attached slide-master-other-style.png
- attached slide-master-lst-style-0-lvl.pptx
- attached slide-master-lst-style-0-lvl.png
- attached slide-master-lst-style.png
- attached slide-master-body-style.png
- attached slide-master-title-style.pptx
- attached slide-master-lst-style.pptx
- attached slide-master-body-style.pptx
- attached slide-layout-title-lst-style.png
- attached slide-layout-sub-title-lst-style.pptx
- attached slide-layout-title-lst-style.pptx
- attached slide-layout-sub-title-lst-style.png
- attached default-text-style.pptx
-
The a:lstStyle formatting in slide, slide layout and slide master shapes is connected via the idx and type attribute values of the p:sp/p:nvSpPr/p:nvPr/p:ph element. Matching placeholders in slide and slide layout, and in slide layout and slide master, allow the corresponding p:sp/p:txBody/a:lstStyle "merging".
For reference:
idx - specifies the index of the placeholder
type - specifies what content type the placeholder is to contain
Available type values:
<simpleType name="ST_PlaceholderType">
  <restriction base="xsd:token">
    <enumeration value="title"/>
    <enumeration value="body"/>
    <enumeration value="ctrTitle"/>
    <enumeration value="subTitle"/>
    <enumeration value="dt"/>
    <enumeration value="sldNum"/>
    <enumeration value="ftr"/>
    <enumeration value="hdr"/>
    <enumeration value="obj"/>
    <enumeration value="chart"/>
    <enumeration value="tbl"/>
    <enumeration value="clipArt"/>
    <enumeration value="dgm"/>
    <enumeration value="media"/>
    <enumeration value="sldImg"/>
    <enumeration value="pic"/>
  </restriction>
</simpleType>
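A rough sketch of the placeholder matching and a:lstStyle "merging" described above follows. It assumes a plain equality match on both the type and idx attributes, which is probably simpler than what PowerPoint actually does, and it reduces each a:lstStyle to a flat map. `PlaceholderSketch`, `Shape`, `samePlaceholder` and `merged` are hypothetical names, not Okapi classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch; the matching rule is an assumption, not confirmed behavior.
class PlaceholderSketch {

    // A shape reduced to its p:ph attributes and a flattened a:lstStyle
    static final class Shape {
        final String type;
        final String idx;
        final Map<String, String> lstStyle;

        Shape(String type, String idx, Map<String, String> lstStyle) {
            this.type = type;
            this.idx = idx;
            this.lstStyle = lstStyle;
        }
    }

    // Assumed rule: two placeholders match when both type and idx are equal
    static boolean samePlaceholder(Shape a, Shape b) {
        return Objects.equals(a.type, b.type) && Objects.equals(a.idx, b.idx);
    }

    // Merges master, layout and slide list styles; the more specific part wins
    static Map<String, String> merged(Shape slide, Shape layout, Shape master) {
        Map<String, String> result = new LinkedHashMap<>();
        if (samePlaceholder(slide, layout)) {
            if (samePlaceholder(layout, master)) {
                result.putAll(master.lstStyle);
            }
            result.putAll(layout.lstStyle);
        }
        result.putAll(slide.lstStyle);
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> masterStyle = new LinkedHashMap<>();
        masterStyle.put("cap", "all");
        masterStyle.put("sz", "1800");
        Map<String, String> layoutStyle = new LinkedHashMap<>();
        layoutStyle.put("sz", "2400");

        Shape master = new Shape("body", "1", masterStyle);
        Shape layout = new Shape("body", "1", layoutStyle);
        Shape slide = new Shape("body", "1", new LinkedHashMap<String, String>());

        System.out.println(merged(slide, layout, master));
    }
}
```

In this example the slide shape ends up with the master's cap value and the layout's sz value, because the layout overrides the master only where it sets something itself.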
-
A related pull request #464 was opened.
-
- changed status to resolved
The pull request #464 was merged.
We do condense properties during parsing, so the count mismatches aren't a cause for concern on their own. However, it does sound like we may be incorrectly merging the properties in this case… @Denis Konovalyenko