RDF accumulate leaks memory and crashes

Issue #169 resolved
Åsmund Ervik created an issue

I am trying to make a very accurate computation of the RDF from a big-ish system of hard spheres (3D). I am using HOOMD-blue for the HPMC, and it works well, with ~22k spheres and 1e7 sweeps. I'm trying to compute the RDF using data every 1000 steps, with a cutoff (rmax) of 8 and a bin width (dr) of 0.001.

But when I try to compute the RDF using Freud following the tutorials, either online in the simulation through a callback or on a saved trajectory with many frames, Freud uses up all available memory as it iterates through the snapshots, then crashes with "MemoryError: std::bad_alloc" after 74 of the 1e4 snapshots in my trajectory. To my understanding, the memory should be freed once one snapshot is finished.

Currently I'm working around this by running one script with an outer loop that uses subprocess.Popen to spawn another Python script that runs the RDF computation on 10 snapshots at a time, then saves the result to a .npy file and exits. In the outer loop, I then add up the RDFs and average at the end. This works fine, but it's of course an ugly hack.
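For reference, the averaging in the outer loop can be sketched like this. The arrays and snapshot counts here are fabricated placeholders standing in for what each subprocess would write to its .npy file:

```python
import numpy as np

# Hypothetical per-chunk results: each subprocess would np.save() an RDF
# averaged over its 10 snapshots; here we fabricate two chunks in memory.
chunk_rdfs = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 2.0, 1.0])]
chunk_counts = [10, 10]  # snapshots per chunk

# Weighting each chunk's RDF by its snapshot count recovers the
# average over the full trajectory.
total = sum(n * rdf for n, rdf in zip(chunk_counts, chunk_rdfs))
average_rdf = total / sum(chunk_counts)
```

As long as every chunk covers the same number of snapshots, this weighted average equals the plain mean over all snapshots.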

I'm not doing anything special, so you should be able to replicate just by taking the example linked below, switching to 3D, increasing the number of spheres, setting rmax=8.0, dr=0.001, and doing enough callbacks (around 74 on my 64GB machine).

https://github.com/joaander/hoomd-examples/blob/master/Analysis%20-%20Quantitative%20-%20Online%20analysis%20with%20Freud.ipynb

Comments (8)

  1. Matthew Spellings

    Some notes for @vramasub and @bdice :

    The extra memory usage is due to creating the default neighbor list. In particular, the NeighborList object doesn't seem to be garbage collected, because it points back to the CellList that created it through its base attribute. This was intended to prevent garbage-collection problems, but it seems to have introduced its own (possibly due to some unintuitive behavior of how refcounting works inside Cython). Adding something like:

    if nlist is None:
        nlist_.base = None


    to the end of RDF.accumulate seems to fix the problem.

  2. Vyas Ramasubramani

    Thanks for reporting!

    Thanks for the info, Matt. I'll try to reproduce the problem and confirm that this fixes it.

  3. Vyas Ramasubramani

    @asmunder thanks again for finding this. We now have a fix on the master branch. We'll aim to make a bugfix release soon, but if you would like something immediately then feel free to clone the repo and confirm that this fix works.

  4. Matthew Spellings

    To clarify what we learned: the problem wasn't actually the circular references themselves. Rather, not enough was happening at the Python level to trigger garbage collection, which is required to clean up objects with circular references (so adding periodic calls to gc.collect() would fix the observed behavior even without updating freud). The solution we opted for is to explicitly break the circular reference for automatically generated neighbor lists, so they are cleaned up immediately by reference counting.
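    The distinction can be reproduced in plain Python. `Node` here is a hypothetical stand-in for the auto-generated NeighborList and its CellList, not freud code:

    ```python
    import gc
    import weakref

    class Node:
        """Stand-in for one of a pair of objects in a reference cycle,
        like the auto-generated NeighborList whose .base points at the
        CellList that created it."""
        def __init__(self):
            self.base = None

    gc.disable()  # make the demonstration deterministic

    # A cycle survives plain reference counting...
    a, b = Node(), Node()
    a.base, b.base = b, a
    alive = weakref.ref(a)
    del a, b
    assert alive() is not None   # still alive: refcounts never hit zero
    gc.collect()                 # ...until the cycle detector runs
    assert alive() is None

    # Breaking the cycle up front (the fix freud adopted) lets plain
    # refcounting free everything immediately, no collector needed.
    a, b = Node(), Node()
    a.base, b.base = b, a
    alive = weakref.ref(a)
    a.base = None                # analogous to `nlist_.base = None`
    del a, b
    assert alive() is None       # freed at once by refcounting

    gc.enable()
    ```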
