Issue #154 resolved

build_zipmanifest results should be memoized

wickman
created an issue

as far as I can tell, build_zipmanifest is not cached.

from a recent profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.942    4.942 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/pex.py:124(_execute_internal)
        1    0.000    0.000    3.830    3.830 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:114(activate)
        1    0.001    0.001    3.823    3.823 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:121(_activate)
        1    0.000    0.000    3.737    3.737 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:105(update_candidate_distributions)
       56    0.003    0.000    3.731    0.067 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:84(load_internal_cache)
        1    0.032    0.032    3.728    3.728 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/environment.py:63(write_zipped_internal_cache)
       55    0.002    0.000    2.947    0.054 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/_twitter_common_python/util.py:46(distribution_from_path)
       57    0.005    0.000    2.945    0.052 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/pkg_resources.py:1703(__init__)
       57    0.183    0.003    2.938    0.052 /Users/wickman/clients/science/dist/aurora_client.pex/.bootstrap/pkg_resources.py:1452(build_zipmanifest)

This is the profile for starting up a PEX file (zipped python environment, see https://mail.python.org/pipermail/distutils-sig/2014-January/023727.html ) with a number of exploded eggs inside. build_zipmanifest is called with the same archive every time we construct a Distribution via EggMetadata:

    def __init__(self, module):
        EggProvider.__init__(self,module)
        self.zipinfo = build_zipmanifest(self.loader.archive)
        self.zip_pre = self.loader.archive+os.sep

It's not an unreasonable assumption that each time you construct a new Distribution, it will either be on disk or part of its own zip archive, meaning these would not be duplicated calls.

In our case, all eggs are in a single zip. This means 57 50ms calls instead of 1 50ms call in order to run this Python application which has 57 egg dependencies.

The proposal is to cache calls to zipmanifest (perhaps invalidating should os.stat/mtime change.)

Comments (15)

  1. philip_thiem

    Not a bad idea. Should be easy enough.

    It looks like code that I had worked on due to a issue with pypy. So just adding some additional information for anyone that might beat me to the punch.

    Originally distribute used zipimports._zip_directory_cache Which cached this, except the private interface is inconsistent between pypy/cpython/win32. https://bitbucket.org/tarek/distribute/issue/27

  2. Jason R. Coombs

    This work has been merged, but I have one concern about performance. We're making a speed vs. memory tradeoff here. This change will necessarily increase the memory usage of setuptools, potentially substantially if a lot of zip packages are used, and that memory won't be reclaimed for the life of the process. I'm now starting to question whether this functionality should be enabled by default.

    As a result, I've bookmarked the proposed commits with a 'issue154' bookmark, leaving 'master' in a separate ancestry.

  3. Jason R. Coombs

    The more I think about it, the speed vs. memory tradeoff is a hard one to make at a library like this, but I think the default position should be to favor memory usage to speed. What if instead each process was able to opt-in to enable caching?

  4. philip_thiem

    Looks like the response that I sent didn't get posted. I think opt in is fine. I can update my branch there to make the cache optional. I could certainly pull that information from a command line or env, I will have to refresh my memory where configuration is stored in memory.

  5. Senthil Kumaran

    I could not find this in master/default/tip. (Plus no reference to this change in NEWS file). Will this change be available part of setuptools release?

    [localhost setuptools]$ hg grep -a 'ZipManifests' -r tip
    [localhost setuptools]$ hg grep -a 'ZipManifests'
    pkg_resources.py:2884:class ZipManifests(dict):
    pkg_resources.py:2884:zip_manifests = ZipManifests()
    
  6. Jason R. Coombs
    • changed status to open

    You're right. Although the changes were merged into the master repo, they have not been merged into the master branch from which releases are made. I'm still pondering how opt in should work for this feature.

  7. Jason R. Coombs

    I just ran an experiment on my Python 3.4 Windows 64-bit environment where I have 54 zip eggs and found that enabling this feature adds 1.4Mb (+6.6%) to the heap usage versus the status quo, but I will see almost no speed benefit from the feature as my environment has an approximately one-to-one mapping of module to zip egg.

    That's a pretty substantial cost to impose on nearly all processes, especially when the benefits will only be had is specially-crafted scenarios, such as the OP presents.

  8. Jason R. Coombs

    This feature has been released in Setuptools 5.4. Please try it out and let me know what you think. As you can see above, it's an opt-in only feature, at least for now. A pretty compelling argument would be needed to justify enabling this behavior by default.

  9. Jason R. Coombs

    I have 54 zip eggs and found that enabling this feature adds 1.4Mb (+6.6%) to the heap usage versus the status quo

    I may have been mistaken here. Analyzing the code, I would have expected 5.3 to also be storing the zip manifests for every zip file (in the zipinfo property). I suspect when I said "status quo" here, I was simply disabling the caching, not actually measuring against 5.3 (the actual status quo). Since I didn't actually log my experiment, I can't say for sure at this point, but I'm going to assume that was the case.

  10. Log in to comment