openpyxl leaks file handle in py py implementation

Issue #832 resolved
C. W. created an issue

Here is the code to reproduce:

import openpyxl
import psutil

proc = psutil.Process()
open_files_l1 = proc.open_files()
print("number of open file handles: %s" % len(open_files_l1))
b = openpyxl.load_workbook(
    filename="test.xlsx", data_only=True, read_only=True)
open_files_l2 = proc.open_files()
print("number of open file handles: %s" % len(open_files_l2))
assert len(open_files_l1) == len(open_files_l2)

And here is the result

$ python
number of open file handles: 0
number of open file handles: 2
[popenfile(path='/Users//github/pyexcel-xlsx/test.xlsx', fd=5), popenfile(path='/Users//github/pyexcel-xlsx/test.xlsx', fd=6)]
Traceback (most recent call last):
  File "", line 14, in <module>
    assert len(open_files_l1) == len(open_files_l2)
$ python
Python 2.7.13 (1aa2d8e03cdf, Mar 31 2017, 10:20:53)
[PyPy 5.7.1 with GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Comments (17)

  1. C. W. reporter

    Here is a related issue:

    And the following build found the bug described in this issue:

    where two failing test cases try to verify these two points:

    1. without calling workbook.close() function, one file handle is expected to be open
    2. after calling workbook.close() function, no file handle is open.

    Interestingly, openpyxl had a different behaviour in python 3.6 as well, where the file handle is not leaked as it is expected in item 1.

  2. CharlieC

    Thanks for the report but if this seems related to PyPy then I'm not sure if there's much we can do.

  3. CharlieC

    Revisiting this I can confirm the problem with Python 3.6 but I'm stumped as to what's actually going on.

    The code for closing an archive is test here:

    If I extend this test using psutil then it confirms that the process no longer has any open files.

    But if I add your virtually identical test to tests/ then it reliably reports the file as still being open.

    FWIW this is the test:

    def test_file_handles(datadir):
        import psutil
        proc = psutil.Process()
        open_files = proc.open_files()
        wb = load_workbook("sample.xlsx", read_only=True)
        ws =
        for row in ws.iter_rows():
        assert proc.open_files() == open_files
  4. Stefan

    I ran into the same problem and noticed that

    import gc; gc.collect()

    gets rid of the file descriptor. This also explains why the issue first arose in pypy, because pypy does garbage collection differently. Pypy actually documents a similar case here. It would also explain why it is so difficult to test, because you cannot know when garbage collection runs.

    From the above symptoms, it looks like some function lets a variable go out of scope, implicitly expecting that garbage collection will clean up before anyone can notice.

  5. Stefan

    The culprit is the ReadOnlyWorksheet class, see here.

    In the __init__ function, it uses self.xml_source, which is a property that opens a ZipExtFile unless it was already set. I fixed it locally like this:

    --- a/openpyxl/worksheet/   Sat Aug 04 17:48:15 2018 +0200
    +++ b/openpyxl/worksheet/   Wed Aug 08 16:58:29 2018 +0200
    @@ -70,7 +70,13 @@
             self.shared_strings = shared_strings
             self.base_date = parent_workbook.epoch
             self.xml_source = xml_source
    -        dimensions = read_dimension(self.xml_source)
    +        try:
    +            source = self.xml_source
    +            dimensions = read_dimension(source)
    +        finally:
    +            from zipfile import ZipExtFile
    +            if isinstance(source, ZipExtFile):
    +                source.close()
             if dimensions is not None:
                 self.min_column, self.min_row, self.max_column, self.max_row = dimensions

    Certainly not the most elegant way of fixing it, but at least it does not break the other unit tests.

    A real fix should probably take care of the issue that the xml_source property leaks a file descriptor every single time it is used.

  6. CharlieC

    @snord thanks very much for investigating this and proposing a relatively simple solution. Would it also work simply passing xml_source to the dimensions reader?

  7. CharlieC

    Right, I forgot about the hack we use to allow testing with bits of XML. I can probably fix this so that we always access an archive and this should allow us to clean things up. Any chance of a unit test to demonstrate the problem or is the above test the simplest assuming we're running on PyPy (which will be dog slow when it comes to writing)

  8. CharlieC

    I've pushed the suggested change. Would be great if someone could check that it works as it should on their environment.

  9. C. W. reporter

    First of all, @Charlie, nice to have met you in the FAQ session in EuroPython2018.

    I have been reading the progress you two have been making. I will verify your changes and post my results here.

  10. CharlieC

    @simoscofield I'd wished you mentioned this at Europython! Would have been good to look at this together any maybe come up with a solution.

  11. Stefan

    With this test I can reproduce the issue on Python 3.6 and test the fix. Curiously, the gc.collect() is necessary, without it I ended up having less open files after the test than before...

    --- a/openpyxl/tests/       Sat Aug 04 17:48:15 2018 +0200
    +++ b/openpyxl/tests/       Fri Aug 10 14:02:37 2018 +0200
    @@ -85,6 +85,39 @@
         sheet2 = wb['Sheet2 - Numbers']
         assert sheet2.calculate_dimension() == 'D1:AA30'
    +import gc
    +import os
    +def count_open_fds():
    +    count = 0
    +    for i in range(10000):
    +        try:
    +            os.fstat(i)
    +        except Exception:
    +            pass
    +        else:
    +            count += 1
    +    return count
    +def test_file_descriptor_leak(datadir):
    +    datadir.join("genuine").chdir()
    +    try:
    +        gc.disable()
    +        gc.collect()
    +        num_fds_before = count_open_fds()
    +        wb = load_workbook(filename="sample.xlsx", read_only=True)
    +        wb.close()
    +        num_fds_after = count_open_fds()
    +    finally:
    +        gc.enable()
    +    assert num_fds_after == num_fds_before
     def test_nonstandard_name(datadir):

    It makes a couple of assumptions (no multi-threading, all FDs are below 10000). On the other hand, it adds no new dependencies and should(!) also work on MacOS and Windows. Alternatively, we could use psutil to implement count_open_fds as shown by C. W.

  12. Stefan

    I installed the fixed version with pip install hg+ and this got rid of the ResourceWarnings that Python was throwing at me.

  13. Log in to comment