Get value from filter-modified header cell when iterating read-only

Issue #388 duplicate
Chris Lott
created an issue

Using openpyxl 2.1.2 on win7/64bit, python 2.7.8

I am working with a large XLSX workbook with a sheet on which the first row (column names) is modified to be a "Filter" for every column. Forgive if I'm stating the obvious, but this means there's a drop-down button to the right of the header-cell text and when you click it, Excel offers a dialog showing the values in the column with checkboxes so you can filter values in or out. Enable this feature on a column by selecting the column, and on the Data tab, click on the Filter button.

If I load the file using the read-write version of the API (use_iterators = False), iterate over the rows using the workbook object's rows attribute, and iterate over cells in the first row, each header-cell text value is made available to me in the "value" attribute of the cell object. That works fine.

If I load the file using the read-only version of the API (use_iterators = True) to get an IterableWorksheet, iterate over the rows using the iter_rows() method, and then iterate over the cells, for the header row the value attribute is always None.

The spreadsheet is so large that I really need the speedy, small-footprint reader feature -- reading it takes simply enormous amount of memory using the read/write API.

I attached a trivial file just to show the filter feature on the first column. I'm really sorry to be that difficult user, but this sheet does NOT reproduce the problem :( Of course the sheet with the problem is large, proprietary and generally too secret even to discuss :) However! if I turn off the "Filter" feature on the first column of that large sheet (leaving the other heads as Filters), the column names ALL become available in the cell.value attribute. Go figure!

Please investigate. Thanks for listening.

Comments (8)

  1. CharlieC

    Please provide some sample code (for both modes showing the difference and the precise problem is) as this will help narrow down the problem. read_only is optimised partly by leaving stuff out. I don't know off-hand if filters are supported at all or whether it's another problem, such as the filter criteria being defined elsewhere.

  2. Chris Lott reporter

    Glad to post this simple iterate-and-print code. The problem is simply that cell.value yields None (empty string - '') on the first row of some workbooks.

    Very sorry that I can't post the troublesome spreadsheet :(

  3. CharlieC

    Can you look at the source of the spreadsheet? It would be interesting to see the information for a single cell. For example, in the file you submitted the value for A1 which has a filter is

        <x:c r="A1" t="s">
          <x:v>0</x:v>
        </x:c>
    

    Does the value for your worksheet look similar to this? Has the file been created in Excel or does it come from a library?

  4. Chris Lott reporter

    I think you're asking the right question about the file creator.

    Here's the content of the troublesome cell in the original sheet, original file (there is no "v"):

    <c r="A1" t="inlineStr"><is><t>Heading</t></is></c>

    After the workbook is saved by Excel (after making a change on a different sheet), here's the content of the cell:

    <c r="A1" t="s"><v>202</v></c>

    And of course after the round-trip thru excel, the iterate-and-print solution works perfectly.

    So what's an inline string? I'm very willing to believe the sheet was created by something that doesn't use Excel's string table feature.

  5. CharlieC

    Yeah, there's an, ahem, optimisation available for writers that improve streaming performance (dumping data) by writing so-called "inline strings". It's a different optimisation to the standard attempt to nearly remove (but not quite) remove duplicate strings complicated by the fact that it also allows formatting to be embedded (yay what fun!) in among the text. It's also used in a non-optimised way to apply different formats within a single cell. We currently don't support it in read-only mode because its add complexity to something which minimises memory use by JIT and currently has a hard dependency upon the "shared strings" when accessing values. If this sounds a little crazy then that's because it is! Welcome to the wonderful world of Office OpenXML. See also #386

    It's probably pretty easy to copy the relevant code between the implementations but the behaviour of ReadOnlyCells will have to change as well.

    Hoping someone will submit a PR on this because at the moment I'd rather shove matches under my toenails! ;-)

  6. Log in to comment