33% time spent in IndexedList.add()

Issue #1208 resolved
Ian Ward created an issue

Hello and thank you for all your work on this library! I've recently upgraded to 2.5.14 and I'm running some code that generates a 10k-line excel template for data entry and I've found that 33% of my run time is spent in IndexedList.add. I wonder if there is a problem with IndexedList or the way I'm using openpyxl.

Profiling I see most of my calls that trigger IndexedList.add come from an inner loop that styles each cell used to enter data:

https://github.com/open-data/ckanext-recombinant/blob/faster/ckanext/recombinant/write_excel.py#L359-L365

        xl_format = datastore_type[field['datastore_type']].xl_format
        alignment = openpyxl.styles.Alignment(wrap_text=True)
        protection = openpyxl.styles.Protection(locked=False)
        for (c,) in sheet[validation_range]:
            c.number_format = xl_format
            c.alignment = alignment
            c.protection = protection

where validation_range is a whole column of 10000 cells for entering data.

Is there a better way to accomplish the same thing, or is there something we can do to reduce the amount of work required by the IndexedList.add operation?

Comments (4)

  1. CharlieC

    Styles are implemented using pointers (to reduce memory use, presumably) so every time you assign a format there is a check to see whether it is already in the workbook. As you're noting this can get quite slow as you're seeing.

    The way around this is to use NamedStyles which use a much simpler assignment.

  2. Ian Ward reporter

    I switched to NamedSyle and was surprised when it was even slower. This change made my time jump from 21s in add to 114s in the __set__ method (add no longer in the profile)

    https://github.com/open-data/ckanext-recombinant/commit/706b48ea8f0ba72e915610f21fe2dbd21914b261

            xl_format = datastore_type[field['datastore_type']].xl_format
            alignment = openpyxl.styles.Alignment(wrap_text=True)
            col_style = NamedStyle(
                name='reco_{0}{1}'.format(resource_num, col_letter),
                number_format=xl_format,
                alignment=alignment,
                protection=openpyxl.styles.Protection(locked=False))
            book.add_named_style(col_style)
            for (c,) in sheet[validation_range]:
                c.style = col_style
    

    But once I changed c.style = col_style to c.style = col_style.name the time dropped down to just 5s in __set__. This is down to 10% of my overall run time so that's a pretty big improvement.

    Thank you for your help.

  3. Log in to comment