Inability to parse sheets with pivot tables using external data sources

Issue #857 resolved
Gregory Bronner
created an issue

I'm testing 2.5-a2, and trying to get the pivot tables to work. I ran

load_workbook

on a spreadsheet (sorry -- can't supply a copy!) that uses external data sources for pivot caches.

I modified the code to print out the source of the offending pivot cache, as well as the exception chain.

The xml definition of the pivot cache was [snipped data excluded]:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http:// chemas.openxmlformats.org/officeDocument/2006/relationships" saveData="0" refreshedBy="Ruffin, Dare M (F D)" refreshedDate="42852.493215046299" createdVersion="4" refreshedVersion="4" minRefreshableVersion="3" recordCount="0" supportSubquery="1" supportAdvancedDrill="1"><cacheSource type="external" connectionId=" "/><cacheFields count="46">



<map measureGroup="0" dimension="11"/><map measureGroup="0" dimension="12"/><map measureGroup="0" dimension="13"/></maps><extLst><ext uri="{725AE2AE-9491-48be-B2B4-4EB974FC3084}" xmlns:x14="http://schemas.microsoft.com/office/spreadsheetml/2009/9/main"><x14:pivotCacheDefinition supportSubqueryNonVisual="1" supportSubqueryCalcMem="1" supportAddCalcMems="1"/></ext></extLst></pivotCacheDefinition>

The stack trace was:

File "test_generate_spreadsheet2.py", line 20, in <module>
    openpyxl.load_workbook(os.getcwd() + "/../templates/packet_demo.xlsm")
  File ".../openpyxl-2.5.0a2/openpyxl/reader/excel.py", line 223, in load_workbook
    pivot_caches = parser.pivot_caches
  File ".../openpyxl-2.5.0a2/openpyxl/packaging/workbook.py", line 123, in pivot_caches
    cache = get_rel(self.archive, self.rels, id=c.id, cls=CacheDefinition)
  File ".../openpyxl-2.5.0a2/openpyxl/packaging/relationship.py", line 159, in get_rel
    obj = cls.from_tree(tree)
  File ".../openpyxl-2.5.0a2/openpyxl/descriptors/serialisable.py", line 83, in from_tree
    obj = desc.expected_type.from_tree(el)
  File ".../openpyxl-2.5.0a2/openpyxl/descriptors/serialisable.py", line 99, in from_tree
    return cls(**attrib)
  File ".../openpyxl-2.5.0a2/openpyxl/pivot/cache.py", line 919, in __init__
    self.worksheetSource = worksheetSource
  File ".../openpyxl-2.5.0a2/openpyxl/descriptors/base.py", line 44, in __set__
    raise TypeError('expected ' + str(self.expected_type) + " got " + str(type(value)))
TypeError: expected <class 'openpyxl.pivot.cache.WorksheetSource'> got <type 'NoneType'>

The issue seems to be that the expectation is that the data from the pivot cache comes from an Worksheet. Sometimes it doesn't, but the parser is both unable to parse the actual source of the data, and to ignore it.

This represents a regression from the current release, as previous versions (2.4.5) simply ignored these pivot tables, and parsed the file successfully.

Comments (13)

  1. CharlieC

    Thanks for the report. It's not a regression because the stuff was ignored before, whereas now it is being parsed. I think I may have an example with an external data source so I can probably workaround the exception but won't help much as external data sources are not covered by the pivot table specification.

    Extension lists are an attempt to provide forward compatibility in OOXML when MS comes up with new features. They're usually silently ignored in openpyxl so I'll check the pivot code to see where this isn't happening at the moment.

  2. CharlieC

    I can avoid the problem with not having a worksheet source. Not sure what Excel will make of the resulting file. But with the source you provided (slightly modified to be valid XML) I don't get an exception relating to the extension lists. Be great if you can provide some sample source for this to help me track it down (on a separate bug report).

  3. Gregory Bronner reporter

    I think that there's still a bit more to get this working properly.

    I found a number of elements that could be None/missing besides the one you fixed:

    class CacheHierarchy(Serialisable):
    
        uniqueName = String()
        caption = String(allow_none=True)
        measure = Bool()
        set = Bool()
        parentSet = Integer(allow_none=True)
        iconSet = Integer(allow_none=True)
        attribute = Bool()
        time = Bool()
        keyAttribute = Bool()
        defaultMemberUniqueName = String(allow_none=True)
        allUniqueName = String(allow_none=True)
        allCaption = String(allow_none=True)
        dimensionUniqueName = String(allow_none=True)
        displayFolder = String(allow_none=True)
        measureGroup = String(allow_none=True)
        measures = Bool()
        count = Integer()
        oneField = Bool()
        memberValueDatatype = Integer(allow_none=True)
        unbalanced = Bool(allow_none=True)
        unbalancedGroup = Bool(allow_none=True)
        hidden = Bool()
        fieldsUsage = Typed(expected_type=FieldsUsage, allow_none=True)
        groupLevels = Typed(expected_type=GroupLevels, allow_none=True)
        extLst = Typed(expected_type=ExtensionList, allow_none=True)
    

    It appears since the caches don't have records, trying to get relationships for them fails (though the failure is an unhandled exception in get_rel: I think that this modification to packaging/workbook.py helps, though I ran into the other issue with the extLst when parsing my spreadsheet.

        @property
        def pivot_caches(self):
            """
            Get PivotCache objects
            """
            d = {}
            for c in self.caches:
                cache = get_rel(self.archive, self.rels, id=c.id, cls=CacheDefinition)
                if cache.deps==[]:
                    records=[]
                else:
                    records = get_rel(self.archive, cache.deps, cache.id, RecordList)
                cache.records = records
                d[c.cacheId]  = cache
            return d
    
  4. Gregory Bronner reporter

    I figured it out -- with luck my firm will let me submit the code.

    Basically, for external pivot caches, there is no _rels relationship. that links the pivotCacheDefinition to the pivotCacheRecord. However, they have extremely similar names, so one can simply load the records by directly loading the pivotCacheRecord file corresponding to the pivotCacheDefinition target.

  5. CharlieC

    Sounds interesting but it would be important to find the explicit links. Although I'm not sure quite how much of the pivotCacheRecord we need to keep around.

  6. Gregory Bronner reporter

    I exploded the entire file and didn't find links. In that file, there are 35 pivot tables. 31 are internally sourced, and have entries in _rels, while the other 4 are externally sourced and don't have links.

    Excel appears to keep a cache of records in the pivotCacheRecords entries -- presumably this is to avoid having to load external data at startup, and there appears to be a 1-1 mapping between pivotCacheDefinition and pivotCacheRecord items.

    Also, there's a bug when writing these out -- it hits the NotImplementedError in the @Property tagname in serializable.py

  7. CharlieC

    Thanks for the investigation. It's possible that the external sources are handled by the external links part of the workbook, which also use caching for the same reason. However, as caching can consume a lot of memory it would be useful to be able to disable it. I think that in most situations, repopulating the cache is going to be required anyway. The specification might contain some pointers.

    The exception basically means that the tagname for a class needs adding and that I haven't come across it in any of the files thus far. Tests obviously required.

  8. Gregory Bronner reporter

    Thanks! I set a breakpoint and figured out the mapping

    The tagname for CacheHierarchy is cacheHierarchy

    Same thing with FieldUsage and FieldsUsage and MeasureGroup and CalculatedItem

    PivotDimension has a tagname of dimension

    MeasureDimensionMap has a tagname of map

    In table.py, PivotHierarchy has a tagname of pivotHierarchy

    In RowHierarchiesUsage, the count function references colHierarchiesUsage

    After opening and saving, I got a corrupt file.

    Some of the differences were that drawings in excel use tags of the form <c:v>, while openpyxl uses <v>, and that openpyxl doesn't seem to be writing external pivot table cache definitions properly (they seem to come out as internal), and most of the pivot cache rels files reference pivotCacheRecords1.xml, which suggests the indexing is wrong

  9. CharlieC

    I should probably patch the code generator to include the tagnames.

    <c:v> versues <v> is just an XML namespace issue, I use defaults where I can. Drawings are PITA because the namespace changes from charts to the more abstract DrawingML with significantly different semantics. DrawingML is vast and will probably never be complete.

  10. Log in to comment