Close file when done

Issue #673 resolved
Matt Katz
created an issue

We love openpyxl where I work, particularly when using petl with it.

However, it seems there is no way to close an openpyxl workbook and destroy it.

This means that if later I want to move a file that I've read and dealt with, I can't - the file is in use. I have to kill an excel process to do it.

It would be better to close the file on __exit__

That way if I use it in a with block it auto shuts the file at the end of the block.

import openpyxl
from pathlib import Path
inputFile = Path(r'\\server\path\to\file.xlsx').resolve()

wb = openpyxl.load_workbook(str(inputFile), use_iterators=True)
ws = wb.get_sheet_by_name(wb.get_sheet_names()[0])
for row in ws.iter_rows():
    print('doing important work')
del(ws)
del(wb)
inputFile.rename( inputFile.parent / 'processed' / inputFile.name)

This throws a PermissionError: [WinError 32] The process cannot access the file because it is being used by another process:

I would expect this to work though, and it would be a really nice pattern for use:

import openpyxl
from pathlib import Path
inputFile = Path(r'\\server\path\to\file.xlsx').resolve()

with openpyxl.load_workbook(str(inputFile), use_iterators=True) as wb:
  with wb.get_sheet_by_name(wb.get_sheet_names()[0]) as ws:
    for row in ws.iter_rows():
        print('doing important work')
inputFile.rename( inputFile.parent / 'processed' / inputFile.name)

Comments (16)

  1. CharlieC

    @Eric Gazoni Do you want to have a look at this? I seem to remember something about the file handler getting registered. I've no real problem with a .close() method being made available (for all three modes) though we should try and work out what it should do.

    @Matt Katz not sure if we can just add an __exit__ method. Best thing is to write a PR (with tests) for us to look at.

  2. Eric Gazoni

    We have 3 situations where we lock files

    1. When writing a write-only workbook This is handled when using save() and we have an atexit hook to clean up all temp files

    2. When using load_workbook() in "memory" mode but at the end of the load, we close the archive, so it shouldn't keep any reference to an a filesystem resource

    3. When When using load_workbook() using the use_iterators mode , we keep a reference to the archive for lazy-loading, and this requires explicit closing, which is not implemented yet.

    Adding a context manager would make sense in the 3rd case, not much in the two others, as it's already "clean", but it would give a symmetric API to all cases, so I'm good with that. The __exit__ would just do nothing (or make extra checks) when the situation is already under control, and do something in optimized mode.

    +1 for a PR with tests :-)

  3. Emilio Silva

    I'm opening multiple files using load_workbook() in read only mode, and I get these warnings:

    /usr/lib/python3.5/xml/etree/ElementTree.py:1237: ResourceWarning: unclosed file <_io.BufferedReader name='test.xlsx'>
      self._parser.feed(data)
    

    It concerns me especially because I'm trying to reduce the resource consumption of my script. Is there anything I can do so far? Thank you.

  4. CharlieC

    @Emilio Silva that's just the way it works: we have to keep a file handle on the zip archive open in order to be able to read the contents on demand. The warning is new but shouldn't cause problems unless you have hundreds of files open at once (I think the limit on Windows is around 1000).

  5. Matt Katz reporter

    Thanks Charlie! Does this mean that I should attempt to use the code when merged like so?

    import openpyxl
    from pathlib import Path
    inputFile = Path(r'\\server\path\to\file.xlsx').resolve()
    
    with openpyxl.load_workbook(str(inputFile), use_iterators=True) as wb:
      with wb.get_sheet_by_name(wb.get_sheet_names()[0]) as ws:
        for row in ws.iter_rows():
            print('doing important work')
      wb.close()
    inputFile.rename( inputFile.parent / 'processed' / inputFile.name)
    
  6. CharlieC

    Something like that, yes: still no context manager and FFS update your use of the API! ;-) You should try it with a checkout.

    Just noticed that tests are failing on Windows but this is unrelated (zip archives need special handling on Windows and not just the file locks you've been seeing).

  7. Herr Bengtsson

    Hi @CharlieC

    Could you please expand on how to handle the .zip archives on Windows? Because I can't get the following example to work on Windows:

    import openpyxl
    from pathlib import Path
    inputFile = Path(r'\\server\path\to\file.xlsx').resolve()
    
    wb = openpyxl.load_workbook(str(inputFile), read_only=True)
    ws = wb.get_sheet_by_name(wb.get_sheet_names()[0])
    for row in ws.iter_rows():
        print('doing important work')
    wb.close()
    del(ws)
    del(wb)
    inputFile.rename( inputFile.parent / 'processed' / inputFile.name)
    

    With Python 3.6.0 and openpyxl 2.4.5 I still get the same issue, PermissionError: [WinError 32] The process cannot access the file because it is being used by another process:

    Thank you

  8. Benoit Pierre

    I don't understand why this was marked as resolved as the issue is still present, even when wb.close() is called:

    import os
    import sys
    
    import openpyxl
    
    wb = openpyxl.load_workbook(sys.argv[1], read_only=True)
    ws = wb.get_sheet_by_name(wb.get_sheet_names()[0])
    for row in ws.iter_rows():
        print('doing important work')
    wb.close()
    
    fd_dir = '/proc/%u/fd' % os.getpid()
    for fd_name in os.listdir(fd_dir):
      print(fd_name, '-> ', end='')
      try:
        print(os.readlink('%s/%s' % (fd_dir, fd_name)))
      except FileNotFoundError:
        print()
    

    The above example will show there's still an open file descriptor on the document.

    The problem is ZipFile tracks call to open with a reference counter, so a call to close must be done for each open call.

    I fixed the issue (in that case, there may be other calls to open needing a corresponding call to close) with this patch:

    diff --git a/openpyxl/worksheet/read_only.py b/openpyxl/worksheet/read_only.py
    index fd533569..9f12588a 100644
    --- a/openpyxl/worksheet/read_only.py
    +++ b/openpyxl/worksheet/read_only.py
    @@ -3,6 +3,8 @@ from __future__ import absolute_import
    
     """ Read worksheets on-demand
     """
    +from contextlib import contextmanager
    +from io import BytesIO
    
     # compatibility
     from openpyxl.compat import (
    @@ -26,6 +28,14 @@ from openpyxl.worksheet.dimensions import SheetDimension
     from openpyxl.cell.read_only import ReadOnlyCell, EMPTY_CELL
    
    
    +@contextmanager
    +def autoclose(thing):
    +    try:
    +        yield thing
    +    finally:
    +        if hasattr(thing, 'close'):
    +            thing.close()
    +
     def read_dimension(source):
         if hasattr(source, "encode"):
             return
    @@ -70,7 +80,8 @@ class ReadOnlyWorksheet(object):
             self.shared_strings = shared_strings
             self.base_date = parent_workbook.excel_base_date
             self.xml_source = xml_source
    -        dimensions = read_dimension(self.xml_source)
    +        with autoclose(self.xml_source) as xml:
    +            dimensions = read_dimension(xml)
             if dimensions is not None:
                 self.min_column, self.min_row, self.max_column, self.max_row = dimensions
    
    @@ -114,26 +125,27 @@ class ReadOnlyWorksheet(object):
                 empty_row = []
             row_counter = min_row
    
    -        p = iterparse(self.xml_source, tag=[ROW_TAG], remove_blank_text=True)
    -        for _event, element in p:
    -            if element.tag == ROW_TAG:
    -                row_id = int(element.get("r", row_counter))
    +        with autoclose(self.xml_source) as xml:
    +            p = iterparse(xml, tag=[ROW_TAG], remove_blank_text=True)
    +            for _event, element in p:
    +                if element.tag == ROW_TAG:
    +                    row_id = int(element.get("r", row_counter))
    
    -                # got all the rows we need
    -                if max_row is not None and row_id > max_row:
    -                    break
    +                    # got all the rows we need
    +                    if max_row is not None and row_id > max_row:
    +                        break
    
    -                # some rows are missing
    -                for row_counter in range(row_counter, row_id):
    -                    row_counter += 1
    -                    yield empty_row
    +                    # some rows are missing
    +                    for row_counter in range(row_counter, row_id):
    +                        row_counter += 1
    +                        yield empty_row
    
    -                # return cells from a row
    -                if min_row <= row_id:
    -                    yield tuple(self._get_row(element, min_col, max_col, row_counter=row_counter))
    -                    row_counter += 1
    +                    # return cells from a row
    +                    if min_row <= row_id:
    +                        yield tuple(self._get_row(element, min_col, max_col, row_counter=row_counter))
    +                        row_counter += 1
    
    -                element.clear()
    +                    element.clear()
    
    
         def _get_row(self, element, min_col=1, max_col=None, row_counter=None):
    
  9. Log in to comment