amp crashes when more than 30 child processes are launched

Issue #182 resolved
Former user created an issue

Hello,

I've started using Amp and noticed that, when running the example scripts, it crashes when more than 30 child processes are launched. Everything looks fine when I explicitly restrict the number of tasks per node to 30 (using `export SLURM_NTASKS_PER_NODE=30`).
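For concreteness, the same cap can also be set from inside the script; here is a minimal sketch, assuming Amp reads the variable at worker-setup time (the variable name is the one from my shell workaround above):

```python
import os

# Cap the task count Amp's environment discovery will see. This must run
# before training starts spawning workers; same effect as the shell
# `export` above (assumption: Amp reads this variable at setup time).
os.environ['SLURM_NTASKS_PER_NODE'] = '30'
```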

Any suggestions?

Comments (7)

  1. andrew_peterson repo owner

    Can you provide the exact error message you get? Also, are you specifying the number of cores manually, or letting Amp discover it automatically from the environment variables?
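    For reference, manual specification looks roughly like the sketch below. The `cores` keyword and the module paths are taken from the current Amp layout, and the network size is just a placeholder, so check against your version.

    ```python
    from amp import Amp
    from amp.descriptor.gaussian import Gaussian
    from amp.model.neuralnetwork import NeuralNetwork

    # Pin the worker count explicitly instead of letting Amp read it from
    # the SLURM_* environment variables (keyword assumed; check your version).
    calc = Amp(descriptor=Gaussian(),
               model=NeuralNetwork(hiddenlayers=(10, 10, 10)),
               cores=30)
    calc.train(images='training.traj')
    ```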

  2. sebastien hamel

    Sorry about the anonymous tag; I forgot to log in. Here is the error message:

    ```
    $ python ./amp_example_script.py
    /g/g99/hamel2/python/lib/python2.7/site-packages/ase/lattice/surface.py:17: UserWarning: Moved to ase.build
      warnings.warn('Moved to ase.build')
    Traceback (most recent call last):
      File "./amp_example_script.py", line 38, in <module>
        calc.train(images='training.traj')
      File "/g/g99/hamel2/amp/amp/__init__.py", line 311, in train
        parallel=self.parallel)
      File "/g/g99/hamel2/amp/amp/model/neuralnetwork.py", line 230, in fit
        result = self.regressor.regress(model=self, log=log)
      File "/g/g99/hamel2/amp/amp/regression/__init__.py", line 85, in regress
        self.optimizer_kwargs)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 889, in fmin_bfgs
        res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 943, in _minimize_bfgs
        gfk = myfprime(x0)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
        return function(*(wrapper_args + args))
      File "/g/g99/hamel2/amp/amp/model/neuralnetwork.py", line 320, in get_lossprime
        lossprime=True)['dloss_dparameters']
      File "/g/g99/hamel2/amp/amp/model/__init__.py", line 569, in get_loss
        self._initialize(args={'lossprime': lossprime, 'd': self.d})
      File "/g/g99/hamel2/amp/amp/model/__init__.py", line 363, in _initialize
        setup_publisher=True)
      File "/g/g99/hamel2/amp/amp/utilities.py", line 223, in setup_parallel
        parallel['envcommand']))
      File "/g/g99/hamel2/amp/amp/utilities.py", line 257, in start_workers
        child.expect('<amp-connect>')
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/spawnbase.py", line 327, in expect
        timeout, searchwindowsize, async)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/spawnbase.py", line 355, in expect_list
        return exp.expect_loop(timeout)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/expect.py", line 104, in expect_loop
        return self.eof(e)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/expect.py", line 50, in eof
        raise EOF(msg)
    pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
    <pexpect.pty_spawn.spawn object at 0x2aaab394a910>
    command: /g/g99/hamel2/python/bin/python
    args: ['/g/g99/hamel2/python/bin/python', '-m', 'amp.model', '30', 'quartz1922:38879']
    buffer (last 100 chars): ''
    before (last 100 chars): 'ges/numpy/core/__init__.py", line 16, in <module>\r\n    from . import multiarray\r\nKeyboardInterrupt\r\n'
    after: <class 'pexpect.exceptions.EOF'>
    match: None
    match_index: None
    exitstatus: None
    flag_eof: True
    pid: 172997
    child_fd: 75
    closed: False
    timeout: 30
    delimiter: <class 'pexpect.exceptions.EOF'>
    logfile: None
    logfile_read: None
    logfile_send: None
    maxread: 2000
    ignorecase: False
    searchwindowsize: None
    delaybeforesend: 0.05
    delayafterclose: 0.1
    delayafterterminate: 0.1
    searcher: searcher_re:
        0: re.compile("<amp-connect>")
    ```

  3. sebastien hamel

    This happens if I let Amp choose the number of tasks itself, and as I said, I can prevent the crash by setting `SLURM_TASKS_PER_NODE=30`. This is on a cluster with 36 cores per node.

    I don't see this error on a different cluster we have with 16 cores per node.

  4. andrew_peterson repo owner

    That error's pretty tough to read without formatting. (Perhaps best to put it in a code block next time.) But it looks like pexpect is timing out before all the workers have started. Does the log file list the /tmp files made by the workers? I would start by looking in those files.
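    If the /tmp files are empty, a quick way to see what a worker prints before dying is to mirror its output into the job log. A sketch against pexpect's documented API, with the command line copied from your traceback:

    ```python
    import sys
    import pexpect

    # Echo everything the worker writes, so whatever kills it (here the
    # KeyboardInterrupt during numpy's import) lands in our own stdout.
    child = pexpect.spawn('/g/g99/hamel2/python/bin/python -m amp.model 30 quartz1922:38879',
                          timeout=30, encoding='utf-8')
    child.logfile_read = sys.stdout
    child.expect('<amp-connect>')
    ```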

  5. sebastien hamel

    Sorry about the formatting. FYI, I tried increasing the timeout value for the `pexpect.spawn` call in `amp/utilities.py`; while the increased value does show up in the error message (`timeout: 300` instead of `timeout: 30`), the same crash happens. I'll keep digging.
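    One thing I did notice: the exception is EOF rather than TIMEOUT, which as I understand pexpect means the worker exited before printing `<amp-connect>`, not that it was merely slow, so a larger timeout alone can't help. A sketch of the distinction (command line copied from the traceback):

    ```python
    import pexpect

    # EOF = the spawned python died (the 'before' buffer above shows a
    # KeyboardInterrupt inside numpy's import); TIMEOUT = still alive but
    # not yet through the handshake.
    child = pexpect.spawn('/g/g99/hamel2/python/bin/python -m amp.model 30 quartz1922:38879',
                          timeout=300, encoding='utf-8')
    try:
        child.expect('<amp-connect>')
    except pexpect.TIMEOUT:
        print('worker alive but slow; a longer timeout may help')
    except pexpect.EOF:
        print('worker exited early; its last output:')
        print(child.before)
    ```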
