I have an existing workbook that has a number of pivot tables based on data being imported from a SQLite database. Each table goes into a sheet in the workbook and then the pivot tables are refreshed based on the new data. Thus, I'm not able to use (as far as I understand it) write_only=True since the workbook is not being created from scratch; sheets are being updated in the workbook. I've tried a couple of approaches:
- Just write the rows and save at the end - above about 100,000 rows this causes a memory error and failure. Note this is on a 16 GB Win10 machine; see attached error log.
- Do incremental saves every 100,0000 rows. Basically do a workbook.save(xxx) and then do a load_workbook to get it back and add the next 100,000 rows. This works (no errors/exceptions) but generates a corrupted XLSX file.
The next thing I'm considering doing is dividing this up into incremental runs of the Python script, i.e. have each invocation of the script add 100,000 rows and look in the SQLite table to determine how many runs are needed. Pass the starting row number as a parameter so if the starting row number > 0 it will append rows to the sheet, then exit. The advantage here is each run is separate, but it seems like a pretty kludgy workaround.
Attached are four files:
- A simple demo_openpyxl_bug.py script that is based on my actual script but eliminates the dependencies on SQL Alchemy/SQLite and just adds a static row over and over, 200,000 times.
- The output of running this
- The starting XLSX file to pass in to repro the bug (SurveyorTest-Original.xlsx); copy this to SurveyorTest.xlsx and run the command shown
- The resulting (corrupted) XLSX created by running the script (SurveyorTest.xlsx)