git statistics do too much work, making them slow

Issue #630 resolved
Jameson Nash created an issue

In lib/celerylib/, every call to len(cs.changed) and len(cs.added) extracts the complete contents of that git commit into memory instead of just computing the length. Avoiding this can dramatically speed up statistics computation and reduce memory usage (in my repo, someone committed a large number of large files at several points).

E.g., I tested this by adding an additional method and changing the call sites to use it:

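    def len_added(self):
        # Root commit (no parents): every file in the tree counts as added.
        if not self.parents:
            return len(list(self._get_file_nodes()))
        # Otherwise count only the paths reported as added.
        return len(self._get_paths_for_status('added'))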

Presumably added() could instead return a lazy AddedFileNodesGenerator object; I was just not certain of this.
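
To illustrate the lazy idea, here is a minimal sketch. It is not the project's actual code: `LazyFileNodes`, `fake_load_node`, and the sample paths are hypothetical stand-ins, and the real AddedFileNodesGenerator may be structured differently. The assumption is that the list of paths is cheap to obtain while building each node is expensive, so `len()` should only ever touch the paths.

```python
class LazyFileNodes:
    """Hypothetical stand-in for a lazy collection such as AddedFileNodesGenerator."""

    def __init__(self, paths, load_node):
        self._paths = list(paths)    # cheap: path strings only
        self._load_node = load_node  # expensive: assumed to read file content

    def __len__(self):
        # Counting needs only the paths, so no node is ever loaded here.
        return len(self._paths)

    def __iter__(self):
        # Nodes are built one at a time, and only when a caller iterates.
        for path in self._paths:
            yield self._load_node(path)


loads = []  # records every expensive load so the laziness can be observed


def fake_load_node(path):
    # Hypothetical loader: pretends to read a file's content from the repo.
    loads.append(path)
    return {"path": path, "content": b"x" * 10}


added = LazyFileNodes(["a.txt", "b.bin", "c.bin"], fake_load_node)
count = len(added)            # uses __len__, triggers no loads
loads_after_len = len(loads)
nodes = list(added)           # iteration is what actually loads the nodes
```

With something along these lines, call sites could keep writing len(cs.added) unchanged, since the cost would move from construction to iteration.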

Comments (2)

  1. Marcin Kuzminski repo owner

    Thanks for posting. The bug here is that the added() method returns a list instead of an AddedFileNodesGenerator; I don't know how I missed that!
