- changed status to resolved
Cannot get out-of-the-box Amp working on NERSC
I haven’t been able to run Amp consistently and without issue on Cori at NERSC (a Slurm-based, distributed, secure computing cluster). Parallel calculations, even on a single node with 64 processes (local connections only), fail because the parallel Python worker scripts (e.g. descriptor/gaussian.py) either do not start on the worker processes or do not communicate properly via pexpect. Attached is a set of files from a recent debug run in which such a failure occurred. Note in the Amp log file that roughly 30 connections are made apparently correctly, after which one of the scripts fails to print “<amp-connect>” and pexpect raises an EOF exception after the timeout period.
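For reference, the check that fails looks roughly like the following. This is a minimal sketch for debugging the handshake in isolation, not Amp’s actual code; the worker command and timeout below are placeholders of my own.

```python
# Minimal sketch of the pexpect handshake that fails (assumptions: the worker
# is expected to print "<amp-connect>" on startup, as in the attached log; the
# command and timeout are placeholders, not Amp's real internals).
import pexpect

def check_worker(command, timeout=60):
    """Spawn one worker process and confirm it prints the connect token."""
    child = pexpect.spawn(command, timeout=timeout)
    try:
        child.expect('<amp-connect>')
        print('worker connected')
        return True
    except pexpect.EOF:
        # The worker exited before printing the token; whatever it did print
        # (missing module, wrong python, unactivated environment, ...) is here.
        print('EOF before <amp-connect>; worker output:')
        print(child.before.decode(errors='replace'))
        return False
    except pexpect.TIMEOUT:
        print('timed out waiting for <amp-connect>')
        return False

# Sanity check that the worker-side python at least starts and prints the token.
check_worker("python -c \"print('<amp-connect>')\"")
```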
I have been facing this issue for several weeks and have gotten Amp to run to completion in parallel on NERSC only once since starting. I have not been able to reproduce that run, but I have its log files and a single trained .amp calculator showing that it can work on NERSC.
For the attached set of files, I did the following:
- created a new conda env
- installed numpy and then Amp into it
- tested importing Amp in the interactive python console: worked without raising errors
- ran the Python script from a login node (load a DFT relaxation trajectory from a .db file, create an Amp NN + Gaussian model, train; a minimal sketch follows this list): failed at the parallel step
- ran the Python script as a batch job on multiple nodes: failed at the parallel step
- ran the Python script as a batch job on a single node: failed at the parallel step
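The script in question is roughly the following. This is a minimal sketch; the database filename, hidden-layer sizes, and core count are placeholders for my actual values.

```python
# Minimal sketch of the training script (assumptions: 'relaxation.db', the
# hidden-layer sizes, and the core count are placeholders; the real script
# loads my DFT relaxation trajectory and trains a Gaussian + NN calculator).
from ase.io import read
from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork

images = read('relaxation.db', index=':')        # DFT relaxation trajectory
calc = Amp(descriptor=Gaussian(),                # Gaussian fingerprints
           model=NeuralNetwork(hiddenlayers=(10, 10)),
           label='calc',
           cores=32)                             # parallel step that fails
calc.train(images=images)
```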
I have tried using the envcommand parameter to load my conda environment on the worker nodes, and I have tried setting up SSH key certificates on NERSC (because at first pexpect.pxssh did not establish connections at all), but neither resolved the problem.
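Concretely, the envcommand attempt looked roughly like this. It is a minimal sketch: the environment name is a placeholder, and I am assuming envcommand is passed straight to the Amp constructor as in the version I installed.

```python
# Minimal sketch of the envcommand attempt (assumptions: 'amp-env' is a
# placeholder conda environment name; envcommand is accepted by the Amp
# constructor so that worker shells source the same environment).
from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork

calc = Amp(descriptor=Gaussian(),
           model=NeuralNetwork(),
           envcommand='source activate amp-env',
           cores=32)
```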
I don’t believe I am familiar enough with parallel programming to root out these issues on my own, so I humbly ask for some guidance. At this point I don’t care whether it is my misuse of Amp that is holding me back or an actual compatibility issue with NERSC; I will just be relieved when the problem is solved.
Thank you for your help!
Best regards - Eric Musa
Comments (1)
reporter:
Figured it out: the number of child processes being spawned exceeded the number of physical cores on a node, and there were some other, unrelated issues with "assign_cores" parsing the Slurm-set environment variables in an unexpected way.
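For anyone who hits the same thing, the workaround amounted to something like the following. This is a minimal sketch under my own assumptions (a single node, a per-node physical-core count of 32, and reading SLURM_CPUS_ON_NODE myself); passing an explicit cores dict avoids relying on assign_cores to parse the Slurm environment variables.

```python
# Minimal sketch of the workaround (assumptions: single node; the cap of 32
# physical cores and the use of SLURM_CPUS_ON_NODE are my own choices, not
# Amp's; an explicit {host: n} cores dict bypasses assign_cores' parsing of
# the Slurm-set environment variables).
import os
from amp import Amp
from amp.descriptor.gaussian import Gaussian
from amp.model.neuralnetwork import NeuralNetwork

physical_cores = 32                                # physical cores per node
slurm_cpus = int(os.environ.get('SLURM_CPUS_ON_NODE', physical_cores))
ncores = min(slurm_cpus, physical_cores)           # never exceed physical cores

calc = Amp(descriptor=Gaussian(),
           model=NeuralNetwork(),
           cores={'localhost': ncores})
```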