Flux hoomd 1.1.1-r1 builds w/ freud hang instead of exit

Issue #54 resolved
Richmond Newman created an issue

Attempting to use hoomd 1.1.1 with freud [0.1, 0.2, master, some obsolete version] leads to the environment hanging at the end of the script. The issue appears on flux but not on collins; I have not tested elsewhere.

Seems to be triggered by importing any freud module (e.g. from freud import trajectory).

The environment hangs when Python's sys.exit(0) is called, but not with os._exit(0), which may suggest there's a stray process or thread. The former (I believe) only ends the main thread and then goes through normal interpreter cleanup, while the latter terminates the whole process immediately without cleanup.
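The difference between the two exit calls can be seen without hoomd or freud. The following is a minimal sketch, not taken from either project: it runs a tiny child script (the `CHILD_TEMPLATE` and `run_child` names are made up for illustration) that registers an atexit handler and then exits one way or the other. It assumes Python 3.7 or newer for `subprocess.run(capture_output=...)`.

```python
import subprocess
import sys
import textwrap

# Hypothetical child script: registers an atexit handler, prints a marker,
# then leaves the interpreter using whatever call is substituted below.
CHILD_TEMPLATE = textwrap.dedent("""
    import atexit, os, sys
    atexit.register(lambda: print("atexit handler ran", flush=True))
    print("script body done", flush=True)
    {exit_call}
""")


def run_child(exit_call):
    """Run the child script in a fresh interpreter and return its stdout lines."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_TEMPLATE.format(exit_call=exit_call)],
        capture_output=True,
        text=True,
        timeout=60,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    # sys.exit raises SystemExit, so normal interpreter shutdown (including
    # atexit handlers) still happens.
    print("sys.exit(0):", run_child("sys.exit(0)"))
    # os._exit ends the process immediately and skips that shutdown work.
    print("os._exit(0):", run_child("os._exit(0)"))
```

If a library's shutdown code is what blocks, this would be consistent with the hang showing up under sys.exit(0) and not under os._exit(0), though the sketch itself does not prove where the hang is.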

An example script that recreates this:

from hoomd_script import *
from freud import trajectory
print("Done. Probably hangs if this is run on flux")

#import sys; sys.exit(0)  # Still hangs
#import os; os._exit(0)   # Causes exit, as long as sys.exit(0) above stays commented out so it doesn't kill main first

Simon tried compiling freud master with tbb 4.1 and 4.3 (gcc 4.4.7), and still experienced hangs on exit. It does seem to work on collins, which uses boost 1.56 instead of 1.55. Arrrrrgh.

Comments (6)

  1. Chrisy Du

    I've seen similar behavior on comet; however, it happens randomly. Some jobs exit on their own while others hang. On comet I'm using boost 1.58, the newest master branch of freud, and my own branch of hpmc.

    Update:

    After testing Richmond's script on both the comet login node and the shared queue, the hanging problem did occur, so I guess the comet admins use a different rule to notify me when a job finishes. So I guess boost 1.58 is ok.

  2. Joshua Anderson

    Backtraces are helpful in debugging these sorts of things. Compile in debug mode, run in a debugger, and give the process a signal when it hangs. Then you can find out where it is hanging.
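As a lighter-weight complement to the debugger approach described above, Python's standard faulthandler module (Python 3.3+) can dump the Python-level stack of every thread when the process receives a signal. This is only a sketch under assumptions: the helper name `install_hang_diagnostics` is made up, `faulthandler.register` is not available on Windows, and the output shows Python frames only, so a hang inside compiled code still needs a native debugger.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(stream=sys.stderr):
    """Dump the Python stacks of all threads to `stream` on SIGUSR1.

    `stream` must be a real file with a file descriptor, because
    faulthandler writes to it directly from the signal handler.
    """
    faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)


if __name__ == "__main__":
    install_hang_diagnostics()
    # ... run the script that hangs; from another shell, send the signal
    # with `kill -USR1 <pid>` to see where the Python threads are.
```

Call the helper at the top of the script before the imports that trigger the hang, then signal the stuck process from another terminal.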

    HOOMD does install some python atexit handlers that may (or may not) be ignored if you run os._exit(0).

  3. Joshua Anderson

    A full month and still no backtrace? sigh

    The problem is occurring deep within openmpi, somehow from a call originating within cufft. I have no idea how or why. It occurs even when hoomd_script is not imported - i.e. with this script:

    import freud
    

    As a workaround, launch your script with python3 instead of the hoomd executable. This will be the default operation mode in the future. Note that when running hoomd this way, context.initialize() is required. For now, I have only set up PYTHONPATH for this on flux. HOOMD v2.0 will officially support this mode of operation and then it will be available on all platforms.

    Please confirm that this fixes the issue for you so that we can close the issue.
