openpyxl leaks file handle in py py implementation

Issue #832 resolved
C. W.
created an issue

Here is the code to reproduce:

import openpyxl
import psutil


proc = psutil.Process()
open_files_l1 = proc.open_files()
print("number of open file handles: %s" % len(open_files_l1))
b = openpyxl.load_workbook(
    filename="test.xlsx", data_only=True, read_only=True)
b.close()
open_files_l2 = proc.open_files()
print("number of open file handles: %s" % len(open_files_l2))
print(open_files_l2)
assert len(open_files_l1) == len(open_files_l2)

And here is the result

$ python test.py
number of open file handles: 0
number of open file handles: 2
[popenfile(path='/Users//github/pyexcel-xlsx/test.xlsx', fd=5), popenfile(path='/Users//github/pyexcel-xlsx/test.xlsx', fd=6)]
Traceback (most recent call last):
  File "test.py", line 14, in <module>
    assert len(open_files_l1) == len(open_files_l2)
AssertionError
$ python
Python 2.7.13 (1aa2d8e03cdf, Mar 31 2017, 10:20:53)
[PyPy 5.7.1 with GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Comments (16)

  1. C. W. reporter

    Here is a related issue: https://bitbucket.org/openpyxl/openpyxl/issues/673/workbook-context-manager#comment-36565768 https://github.com/pyexcel/pyexcel-xlsx/issues/14

    And the following build found the bug described in this issue:

    https://travis-ci.org/pyexcel/pyexcel-xlsx/jobs/238518570

    where two failing test cases try to verify these two points:

    1. without calling workbook.close() function, one file handle is expected to be open
    2. after calling workbook.close() function, no file handle is open.

    Interestingly, openpyxl had a different behaviour in python 3.6 as well, where the file handle is not leaked as it is expected in item 1.

  2. CharlieC

    Revisiting this I can confirm the problem with Python 3.6 but I'm stumped as to what's actually going on.

    The code for closing an archive is test here: https://bitbucket.org/openpyxl/openpyxl/src/0f8537998f95c9d8d44913c2edb367516b799d4a/openpyxl/reader/tests/test_excel.py?at=default&fileviewer=file-view-default#test_excel.py-120

    If I extend this test using psutil then it confirms that the process no longer has any open files.

    But if I add your virtually identical test to tests/test_iter.py then it reliably reports the file as still being open.

    FWIW this is the test:

    def test_file_handles(datadir):
        datadir.join("genuine").chdir()
        import psutil
    
        proc = psutil.Process()
        open_files = proc.open_files()
    
        wb = load_workbook("sample.xlsx", read_only=True)
        ws = wb.active
    
        for row in ws.iter_rows():
            pass
    
        wb.close()
    
        assert proc.open_files() == open_files
    
  3. Stefan

    I ran into the same problem and noticed that

    import gc; gc.collect()
    

    gets rid of the file descriptor. This also explains why the issue first arose in pypy, because pypy does garbage collection differently. Pypy actually documents a similar case here. It would also explain why it is so difficult to test, because you cannot know when garbage collection runs.

    From the above symptoms, it looks like some function lets a variable go out of scope, implicitly expecting that garbage collection will clean up before anyone can notice.

  4. Stefan

    The culprit is the ReadOnlyWorksheet class, see here.

    In the __init__ function, it uses self.xml_source, which is a property that opens a ZipExtFile unless it was already set. I fixed it locally like this:

    --- a/openpyxl/worksheet/read_only.py   Sat Aug 04 17:48:15 2018 +0200
    +++ b/openpyxl/worksheet/read_only.py   Wed Aug 08 16:58:29 2018 +0200
    @@ -70,7 +70,13 @@
             self.shared_strings = shared_strings
             self.base_date = parent_workbook.epoch
             self.xml_source = xml_source
    -        dimensions = read_dimension(self.xml_source)
    +        try:
    +            source = self.xml_source
    +            dimensions = read_dimension(source)
    +        finally:
    +            from zipfile import ZipExtFile
    +            if isinstance(source, ZipExtFile):
    +                source.close()
             if dimensions is not None:
                 self.min_column, self.min_row, self.max_column, self.max_row = dimensions
    

    Certainly not the most elegant way of fixing it, but at least it does not break the other unit tests.

    A real fix should probably take care of the issue that the xml_source property leaks a file descriptor every single time it is used.

  5. CharlieC

    Right, I forgot about the hack we use to allow testing with bits of XML. I can probably fix this so that we always access an archive and this should allow us to clean things up. Any chance of a unit test to demonstrate the problem or is the above test the simplest assuming we're running on PyPy (which will be dog slow when it comes to writing)

  6. C. W. reporter

    First of all, @Charlie Gunyon, nice to have met you in the FAQ session in EuroPython2018.

    I have been reading the progress you two have been making. I will verify your changes and post my results here.

  7. Stefan

    With this test I can reproduce the issue on Python 3.6 and test the fix. Curiously, the gc.collect() is necessary, without it I ended up having less open files after the test than before...

    --- a/openpyxl/tests/test_iter.py       Sat Aug 04 17:48:15 2018 +0200
    +++ b/openpyxl/tests/test_iter.py       Fri Aug 10 14:02:37 2018 +0200
    @@ -85,6 +85,39 @@
         sheet2 = wb['Sheet2 - Numbers']
         assert sheet2.calculate_dimension() == 'D1:AA30'
    
    +
    +import gc
    +import os
    +def count_open_fds():
    +    count = 0
    +    for i in range(10000):
    +        try:
    +            os.fstat(i)
    +        except Exception:
    +            pass
    +        else:
    +            count += 1
    +    return count
    +
    +
    +def test_file_descriptor_leak(datadir):
    +    datadir.join("genuine").chdir()
    +
    +    try:
    +        gc.disable()
    +        gc.collect()
    +        num_fds_before = count_open_fds()
    +
    +        wb = load_workbook(filename="sample.xlsx", read_only=True)
    +        wb.close()
    +
    +        num_fds_after = count_open_fds()
    +    finally:
    +        gc.enable()
    +
    +    assert num_fds_after == num_fds_before
    +
     def test_nonstandard_name(datadir):
         datadir.join('reader').chdir()
    

    It makes a couple of assumptions (no multi-threading, all FDs are below 10000). On the other hand, it adds no new dependencies and should(!) also work on MacOS and Windows. Alternatively, we could use psutil to implement count_open_fds as shown by C. W.

  8. Stefan

    I installed the fixed version with pip install hg+https://bitbucket.org/openpyxl/openpyxl@75eb6dbc8d11172bff45dd96438f20e635e7bd05 and this got rid of the ResourceWarnings that Python was throwing at me.

  9. Log in to comment