Open XML (PPTX) does not always respect font properties (all caps)

Issue #999 resolved
Dale Eggett created an issue

Occasionally, the “All Caps” font property is propagated to text that does not have that property set in the source. Yves believes that this might be happening because of tag simplification in the filter.

In the attached files, the source has 17 instances of the “cap” attribute (the attribute that marks whether the text within its scope should be in “All Caps”) in slide1.xml. The (machine translated) Spanish has 8. In addition to being fewer, the “cap” attributes in the target are sometimes wrong: see the lists in the slide, which are not in all caps in the source but are in the target.

Comments (10)

  1. Chase Tingley

    We do condense properties during parsing, so the count mismatches aren’t a cause for concern on their own. However, it does sound like we may be incorrectly merging the properties in this case…. @Denis Konovalyenko

  2. Denis Konovalyenko

    @Dale Eggett , @Chase Tingley , it looks like we do try to simplify/minify styles on parsing in the net.sf.okapi.filters.openxml.RunParser#parseRunProperties method:

            runBuilder.setRunProperties(
                direct.minified(
                    this.styleDefinitions.combinedRunProperties(
                        this.paragraphStyle,
                        runStyle,
                        new RunProperties.Empty()
                    )
                )
            );
    

    While it works great for WordprocessingML, it may fail for DrawingML due to incomplete style hierarchy processing. Namely, the combined run properties appear empty in most cases (except notes), which consequently allows the minification to remove significant properties (cap="none" in the case of this issue).
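
    To illustrate the failure mode, here is a minimal sketch. It is not the Okapi API: MinificationSketch, minified and the map-based property model are hypothetical, and it assumes that minification drops a direct property when it equals the inherited value, or the effective default (assumed here to be cap="none") when nothing is inherited.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of run-property minification; names are illustrative only.
final class MinificationSketch {

    // Keeps only the direct properties whose value differs from the effective
    // inherited value (or from the default when nothing is inherited).
    static Map<String, String> minified(Map<String, String> direct,
                                        Map<String, String> inherited,
                                        Map<String, String> defaults) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : direct.entrySet()) {
            String effective = inherited.containsKey(e.getKey())
                ? inherited.get(e.getKey())
                : defaults.get(e.getKey());
            if (!e.getValue().equals(effective)) {
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of("cap", "none"); // assumed default
        Map<String, String> direct = Map.of("cap", "none");

        // Complete hierarchy: cap="all" is inherited, so cap="none" is significant and kept.
        System.out.println(minified(direct, Map.of("cap", "all"), defaults));

        // Incomplete hierarchy: nothing is inherited, so cap="none" looks redundant and is dropped.
        System.out.println(minified(direct, Map.of(), defaults));
    }
}
```

    In the second call the override is lost, so on rendering the text would pick up cap="all" from a style level that the filter never looked at.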

    Furthermore, in order to properly gather the styles, the following parts have to be considered:

    • slideX.xml (the xth slide)
    • slideLayoutX.xml (the layout corresponding to the slide in question)
    • slideMasterX.xml (the master corresponding to the layout in question)
    • themeX.xml (and any theme overrides if the slide has one)
    • presentation.xml (this is the default)

    In particular, the styling information for text can be in one or more of the following:

    • a:r/a:rPr (currently, these are programmatically moved to the next level if possible)
    • a:pPr/a:defRPr in slideX.xml and slideLayoutX.xml
    • a:lstStyle/(a:defPPr | a:lvlXpPr)/a:defRPr in slideX.xml and slideLayoutX.xml
    • p:txStyles/(p:titleStyle | p:bodyStyle | p:otherStyle)/a:lvlXpPr/a:defRPr in slideMasterX.xml
    • a:objectDefaults/(a:spDef | a:lnDef | a:txDef)/a:lstStyle/(a:defPPr | a:lvlXpPr)/a:defRPr in themeX.xml
    • p:defaultTextStyle/(a:defPPr | a:lvlXpPr)/a:defRPr in presentation.xml

    Besides the fact that there is no clear information on the style lookup order, it turns out that the MS PowerPoint implementation “surprisingly” deviates from the spec for the a:lstStyle/(a:defPPr | a:lvlXpPr)/a:defRPr case. The a:lstStyle/a:defPPr is not taken into consideration at all and the a:lstStyle/a:lvl1pPr is used instead! LibreOffice tries to comply with the spec (a:lstStyle/a:defPPr), but only when there is no a:lstStyle/a:lvl1pPr available… I guess we have no option other than to follow either MS PowerPoint or LibreOffice here. Could you please let me know your preference?
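
    For discussion, a rough sketch of the LibreOffice-like lookup described above: take a:lvl1pPr/a:defRPr when it exists and fall back to a:defPPr/a:defRPr otherwise. ListStyleSketch and capOf are hypothetical names, only the cap attribute is read, and the JDK DOM parser is used purely for illustration (this is not how the filter reads parts).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper; not part of the Okapi code base.
final class ListStyleSketch {
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";

    // Returns the cap value of the chosen a:defRPr, or null when it is not set.
    static String capOf(String lstStyleXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Element lstStyle = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(lstStyleXml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();
            // a:lvl1pPr wins; a:defPPr is consulted only when a:lvl1pPr/a:defRPr is absent.
            Element defRPr = defRPrIn(lstStyle, "lvl1pPr");
            if (defRPr == null) {
                defRPr = defRPrIn(lstStyle, "defPPr");
            }
            return defRPr != null && defRPr.hasAttribute("cap")
                ? defRPr.getAttribute("cap")
                : null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Cannot parse the list style", e);
        }
    }

    // Finds a:defRPr under the first paragraph properties element with the given local name.
    private static Element defRPrIn(Element lstStyle, String localName) {
        NodeList paragraphProperties = lstStyle.getElementsByTagNameNS(A, localName);
        if (paragraphProperties.getLength() == 0) {
            return null;
        }
        NodeList runProperties =
            ((Element) paragraphProperties.item(0)).getElementsByTagNameNS(A, "defRPr");
        return runProperties.getLength() == 0 ? null : (Element) runProperties.item(0);
    }
}
```

    Following MS PowerPoint strictly would mean dropping the a:defPPr fallback altogether.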

  3. Chase Tingley

    Ugh. I guess we need to align with the implementation rather than the standard, then. @Denis Konovalyenko - how much of that style hierarchy parsing do we currently do for PPTX?

  4. Denis Konovalyenko

    @Chase Tingley , I think we are parsing every aforementioned style - defaultTextStyle, lstStyle, bodyStyle, otherStyle, titleStyle, notesStyle. However, they are just read for the immediate conversion to Markup, which is used later for the font mapping (issue #958)… So, I guess we would need more control over the order in which the styles are read and chosen for lookup at the time of the minification. And that is going to be more of a trial-and-error experiment.

  5. Denis Konovalyenko

    A deeper dive into the style resolution order has given the following results.

    Firstly, the presentation default text style (p:defaultTextStyle) is only considered by LibreOffice (MS Office shows a document repair dialog). It has to be used if a slide is not associated with a master slide or if no styling information has been otherwise specified for the text within the presentation slide. I did not manage to find a way of creating such a document structure with the help of an application (PowerPoint or Impress), thus the layouts and slide masters were manually removed from default-text-style.pptx.

    A related default-text-style.pptx document is attached.

    presentation.xml:

        <p:defaultTextStyle>
            <a:defPPr>
                <a:defRPr lang="ru-RU" cap="all"/>
            </a:defPPr>
            <a:lvl1pPr marL="0" algn="l" defTabSz="914400" rtl="0" eaLnBrk="1" latinLnBrk="0"
                       hangingPunct="1">
                <a:defRPr sz="1800" kern="1200">
                    <a:solidFill>
                        <a:schemeClr val="tx1"/>
                    </a:solidFill>
                    <a:latin typeface="+mn-lt"/>
                    <a:ea typeface="+mn-ea"/>
                    <a:cs typeface="+mn-cs"/>
                </a:defRPr>
            </a:lvl1pPr>
        </p:defaultTextStyle>
    

    slide1.xml:

                        <a:p>
                            <a:pPr algn="ctr">
                                <a:lnSpc>
                                    <a:spcPct val="90000"/>
                                </a:lnSpc>
                            </a:pPr>
                            <a:r>
                                <a:rPr b="0" lang="ru-RU" sz="6000" spc="-1" strike="noStrike">
                                    <a:solidFill>
                                        <a:srgbClr val="000000"/>
                                    </a:solidFill>
                                    <a:latin typeface="Calibri Light"/>
                                </a:rPr>
                                <a:t>Title 1</a:t>
                            </a:r>
                            <a:endParaRPr b="0" lang="ru-RU" sz="6000" spc="-1" strike="noStrike">
                                <a:solidFill>
                                    <a:srgbClr val="000000"/>
                                </a:solidFill>
                                <a:latin typeface="Calibri"/>
                            </a:endParaRPr>
                        </a:p>
    

    LibreOffice rendering:

    Secondly, theme a:objectDefaults/(a:spDef|a:lnDef|a:txDef)/a:lstStyle/a:lvlXpPr/a:defRPr styles provide default information which can only be used to format new insertions of shapes or texts into a document. Thus, it does not affect the way the styles are resolved when a document is being read.

    Thirdly, the remaining slide/slideLayout/slideMaster parts of the texts.pptx document have been examined and the following style resolution order has been revealed (the paragraph style level (lvl) is not specified):

    1. Slide run properties (a:rPr)
    2. Slide paragraph properties (a:pPr/a:defRPr)
    3. Slide shape paragraph properties (a:lstStyle/a:lvl1pPr/a:defRPr). Implementation note: consider a:lstStyle/a:defPPr/a:defRPr when a:lstStyle/a:lvl1pPr/a:defRPr is absent.
    4. Slide layout shape paragraph properties (see #3 for more information)
    5. Slide master shape paragraph properties (see #3 for more information)
    6. Slide master p:txStyles/(p:titleStyle|p:bodyStyle|p:otherStyle)/a:lvl1pPr/a:defRPr. Implementation note: consider a:defPPr/a:defRPr when there is no a:lvl1pPr/a:defRPr.
    7. Presentation defaults - p:defaultTextStyle/a:lvlXpPr/a:defRPr
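
    The order above can be read as a first-match lookup from the most specific level to the least specific one. A minimal sketch, assuming each level has already been reduced to a name/value map (StyleChainSketch and resolve are hypothetical names, not filter classes):

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup chain; each map stands for one level of the list above.
final class StyleChainSketch {

    // Returns the value from the first level that specifies the property.
    static Optional<String> resolve(String property, List<Map<String, String>> levels) {
        for (Map<String, String> level : levels) {
            String value = level.get(property);
            if (value != null) {
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Map<String, String>> levels = List.of(
            Map.of("sz", "6000"),   // 1. slide a:rPr
            Map.of(),               // 2. slide a:pPr/a:defRPr
            Map.of(),               // 3. slide shape a:lstStyle
            Map.of("cap", "all"),   // 4. slide layout shape a:lstStyle
            Map.of(),               // 5. slide master shape a:lstStyle
            Map.of("cap", "none"),  // 6. slide master p:txStyles
            Map.of("sz", "1800")    // 7. p:defaultTextStyle
        );
        System.out.println(resolve("cap", levels)); // taken from level 4
        System.out.println(resolve("sz", levels));  // taken from level 1
    }
}
```

    With such a chain, the minification could compare a direct property against the resolved value of levels 2 to 7 instead of against an empty set.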

    Last but not least, how the a:lstStyle formatting from a slide layout or a slide master is applied to a particular text in a slide still has to be found out (may be complicated).

    As always, for more information please refer to the additional documents attached.

  6. Denis Konovalyenko
  7. Denis Konovalyenko

    The a:lstStyle formatting in slide, slide layout and master slide shapes is connected via idx and type attribute values of p:sp/p:nvSpPr/p:nvPr/p:ph element. Matching placeholders in slide and slide layout, and in slide layout and slide master allow the corresponding p:sp/p:txBody/a:lstStyle "merging".

    For reference:

    idx - specifies the index of the placeholder
    type - specifies what content type the placeholder is to contain

    Available type values:

    <simpleType name="ST_PlaceholderType">
        <restriction base="xsd:token">
            <enumeration value="title"/>
            <enumeration value="body"/>
            <enumeration value="ctrTitle"/>
            <enumeration value="subTitle"/>
            <enumeration value="dt"/>
            <enumeration value="sldNum"/>
            <enumeration value="ftr"/>
            <enumeration value="hdr"/>
            <enumeration value="obj"/>
            <enumeration value="chart"/>
            <enumeration value="tbl"/>
            <enumeration value="clipArt"/>
            <enumeration value="dgm"/>
            <enumeration value="media"/>
            <enumeration value="sldImg"/>
            <enumeration value="pic"/>
        </restriction>
    </simpleType>
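
    A small sketch of how the placeholder matching could look. The exact rule is an assumption that still needs checking against PowerPoint: here a parent placeholder with the same idx is preferred, and the type is used only when no idx matches. PlaceholderSketch, Placeholder and matching are hypothetical names.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical model of p:sp/p:nvSpPr/p:nvPr/p:ph matching between parts.
final class PlaceholderSketch {

    // Mirrors the type and idx attributes of p:ph; null means the attribute is absent.
    static final class Placeholder {
        final String type;
        final Long idx;

        Placeholder(String type, Long idx) {
            this.type = type;
            this.idx = idx;
        }
    }

    // Finds the parent (layout or master) placeholder whose a:lstStyle should be merged.
    // Assumed rule: match by idx first, then by type.
    static Optional<Placeholder> matching(Placeholder child, List<Placeholder> parents) {
        for (Placeholder parent : parents) {
            if (parent.idx != null && parent.idx.equals(child.idx)) {
                return Optional.of(parent);
            }
        }
        for (Placeholder parent : parents) {
            if (parent.type != null && parent.type.equals(child.type)) {
                return Optional.of(parent);
            }
        }
        return Optional.empty();
    }
}
```

    The same method would be applied twice: once from the slide to its layout, and once from the matched layout placeholder to the master.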
    
