Cannot get out-of-box Amp working on NERSC

Issue #229 resolved
Eric Musa created an issue

I haven’t been able to run Amp consistently and without issue on Cori at NERSC (a Slurm-based, distributed, secure computing cluster). Calculations run in parallel, even on a single node with 64 processes (local connections), fail because the parallel Python scripts (such as descriptor/gaussian.py) either do not start on the worker processes or do not communicate properly via pexpect. Attached is a set of files from a recent debug run where such a failure occurred. Notice in the Amp log file that around 30 connections are made seemingly correctly, after which one of the scripts fails to print “<amp-connect>” and pexpect raises an EOF exception after a timeout period.
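For anyone debugging a similar failure, the handshake step can be reproduced outside of Amp by spawning the worker command with pexpect and waiting for the “<amp-connect>” marker directly. The sketch below uses a placeholder command, not Amp’s actual worker invocation; substitute the command line that appears in the Amp log.

    # Minimal sketch of the pexpect handshake Amp performs, for isolating the
    # failure outside of a full training run.  The spawned command below is a
    # placeholder -- replace it with the worker invocation from the Amp log.
    import pexpect

    # Placeholder worker that simply prints the marker and exits.
    child = pexpect.spawn('python', ['-c', "print('<amp-connect>')"], timeout=60)
    try:
        # Amp waits for the worker process to announce itself with this marker.
        child.expect('<amp-connect>')
        print('Worker started and printed the marker.')
    except pexpect.EOF:
        # The process exited before printing the marker; its captured output
        # usually shows the real error (import failure, missing module, ...).
        print('EOF before marker. Output was:\n', child.before.decode())
    except pexpect.TIMEOUT:
        print('Timed out waiting for the marker. Output so far:\n',
              child.before.decode())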

I have been facing this issue for a number of weeks and, since starting, have only gotten Amp to run to completion in parallel on NERSC once. I have not been able to recreate that run, but I have the log files and a single trained .amp calculator showing that it can work on NERSC.

For the attached set of files, I did the following:

  1. created a new conda environment
  2. loaded numpy and then Amp into it
  3. tested importing Amp in the interactive Python console: worked without raising errors
  4. ran a Python script from a login node {load DFT relaxation trajectory from .db, create Amp NN+Gaussian model, train}: failed at the parallel step (a minimal sketch of such a script follows this list)
  5. ran the Python script as a batch job on multiple nodes: failed at the parallel step
  6. ran the Python script as a batch job on a single node: failed at the parallel step
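For reference, a minimal version of the training script described in step 4 might look like the sketch below. The database filename, label, hidden-layer sizes, and core count are placeholders, and the calls follow Amp’s documented Gaussian/NeuralNetwork interface.

    # Minimal sketch of the step-4 training script.  The filename, label,
    # layer sizes, and core count are placeholders.
    from ase.io import read
    from amp import Amp
    from amp.descriptor.gaussian import Gaussian
    from amp.model.neuralnetwork import NeuralNetwork

    # Load the DFT relaxation trajectory from an ASE database file.
    images = read('relaxation.db', index=':')

    # Gaussian fingerprints + neural-network regression model.
    calc = Amp(descriptor=Gaussian(),
               model=NeuralNetwork(hiddenlayers=(10, 10)),
               label='amp-test',
               cores=1)  # serial first; raise cores once the parallel setup works

    calc.train(images=images)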

I have tried using the envcommand parameter to load my conda environment on the worker nodes, and I have tried using SSH key certificates on NERSC (because at first pexpect.pxssh would not establish connections at all), but both approaches failed.
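For reference, the envcommand route looks roughly like the sketch below, assuming (as Amp’s parallelization documentation describes) that the constructor accepts a host-to-core-count dictionary for cores and an envcommand string that is run on each worker before it starts; the environment name, host names, and core counts are placeholders.

    # Rough sketch of passing envcommand and an explicit cores dictionary to
    # Amp.  The environment name, host names, and core counts are placeholders.
    from amp import Amp
    from amp.descriptor.gaussian import Gaussian
    from amp.model.neuralnetwork import NeuralNetwork

    envcommand = 'source activate amp-env'     # run on each worker before launch
    cores = {'nid00001': 32, 'nid00002': 32}   # host name -> worker processes

    calc = Amp(descriptor=Gaussian(),
               model=NeuralNetwork(hiddenlayers=(10, 10)),
               cores=cores,
               envcommand=envcommand)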

I don’t believe I am familiar enough with parallel programming to root out these issues on my own, so I humbly ask for some guidance. At this point I don’t care whether it is my own misuse of Amp that is holding me back or an actual compatibility issue with NERSC; I will just be relieved when the problem is solved.

Thank you for your help!

Best regards - Eric Musa

Comments (1)

  1. Eric Musa reporter

    Figured it out: there were issues with the number of spawned child processes exceeding the number of physical cores on a node, and there were some other unrelated issues with "assign_cores" parsing the SLURM-set environment variables in an unexpected way. A sketch of a workaround is below.
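    For anyone hitting the same thing, a workaround along these lines is sketched below: read the core count SLURM reports for the node and pass it to Amp explicitly, so the number of spawned workers cannot exceed the physical cores. The choice of environment variable and the fallback value are assumptions for illustration, not Amp’s own logic.

        # Sketch of a workaround: take the per-node core count SLURM sets and
        # hand it to Amp explicitly, so the number of spawned workers never
        # exceeds the physical cores.  The environment variable and fallback
        # value are assumptions for illustration.
        import os

        from amp import Amp
        from amp.descriptor.gaussian import Gaussian
        from amp.model.neuralnetwork import NeuralNetwork

        ncores = int(os.environ.get('SLURM_CPUS_ON_NODE', 1))

        calc = Amp(descriptor=Gaussian(),
                   model=NeuralNetwork(hiddenlayers=(10, 10)),
                   cores=ncores)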
