amp crashes when more than 30 child processes are launched

Issue #182 resolved
Former user created an issue

Hello,

I've started using Amp and noticed that, when running the example scripts, it crashes when more than 30 child processes are launched. Everything looks fine when I explicitly restrict the number of tasks per node to 30 (using `export SLURM_NTASKS_PER_NODE=30`).
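For concreteness, the same cap can also be set from inside the script; here is a minimal sketch, assuming Amp reads the variable at worker-setup time (the variable name is the one from my shell workaround above):

```python
import os

# Cap the task count Amp's environment discovery will see. This must run
# before training starts spawning workers; same effect as the shell
# `export` above (assumption: Amp reads this variable at setup time).
os.environ['SLURM_NTASKS_PER_NODE'] = '30'
```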

Any suggestions?

Comments (7)

  1. andrew_peterson repo owner

    Can you provide the exact error message you get? Also, are you specifying the number of cores manually, or letting Amp discover it automatically from the environment variables?
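    For reference, manual specification looks roughly like the sketch below. The `cores` keyword and the module paths are taken from the current Amp layout, and the network size is just a placeholder, so check against your version.

    ```python
    from amp import Amp
    from amp.descriptor.gaussian import Gaussian
    from amp.model.neuralnetwork import NeuralNetwork

    # Pin the worker count explicitly instead of letting Amp read it from
    # the SLURM_* environment variables (keyword assumed; check your version).
    calc = Amp(descriptor=Gaussian(),
               model=NeuralNetwork(hiddenlayers=(10, 10, 10)),
               cores=30)
    calc.train(images='training.traj')
    ```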

  2. sebastien hamel

    Sorry about the anonymous tag; I forgot to log in. Here is the error message:

    ```
    $ python ./amp_example_script.py
    /g/g99/hamel2/python/lib/python2.7/site-packages/ase/lattice/surface.py:17: UserWarning: Moved to ase.build
      warnings.warn('Moved to ase.build')
    Traceback (most recent call last):
      File "./amp_example_script.py", line 38, in <module>
        calc.train(images='training.traj')
      File "/g/g99/hamel2/amp/amp/__init__.py", line 311, in train
        parallel=self.parallel)
      File "/g/g99/hamel2/amp/amp/model/neuralnetwork.py", line 230, in fit
        result = self.regressor.regress(model=self, log=log)
      File "/g/g99/hamel2/amp/amp/regression/__init__.py", line 85, in regress
        self.optimizer_kwargs)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 889, in fmin_bfgs
        res = _minimize_bfgs(f, x0, args, fprime, callback=callback, **opts)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 943, in _minimize_bfgs
        gfk = myfprime(x0)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/scipy/optimize/optimize.py", line 292, in function_wrapper
        return function(*(wrapper_args + args))
      File "/g/g99/hamel2/amp/amp/model/neuralnetwork.py", line 320, in get_lossprime
        lossprime=True)['dloss_dparameters']
      File "/g/g99/hamel2/amp/amp/model/__init__.py", line 569, in get_loss
        self._initialize(args={'lossprime': lossprime, 'd': self.d})
      File "/g/g99/hamel2/amp/amp/model/__init__.py", line 363, in _initialize
        setup_publisher=True)
      File "/g/g99/hamel2/amp/amp/utilities.py", line 223, in setup_parallel
        parallel['envcommand']))
      File "/g/g99/hamel2/amp/amp/utilities.py", line 257, in start_workers
        child.expect('<amp-connect>')
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/spawnbase.py", line 327, in expect
        timeout, searchwindowsize, async)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/spawnbase.py", line 355, in expect_list
        return exp.expect_loop(timeout)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/expect.py", line 104, in expect_loop
        return self.eof(e)
      File "/g/g99/hamel2/python/lib/python2.7/site-packages/pexpect/expect.py", line 50, in eof
        raise EOF(msg)
    pexpect.exceptions.EOF: End Of File (EOF). Exception style platform.
    <pexpect.pty_spawn.spawn object at 0x2aaab394a910>
    command: /g/g99/hamel2/python/bin/python
    args: ['/g/g99/hamel2/python/bin/python', '-m', 'amp.model', '30', 'quartz1922:38879']
    buffer (last 100 chars): ''
    before (last 100 chars): 'ges/numpy/core/__init__.py", line 16, in <module>\r\n    from . import multiarray\r\nKeyboardInterrupt\r\n'
    after: <class 'pexpect.exceptions.EOF'>
    match: None
    match_index: None
    exitstatus: None
    flag_eof: True
    pid: 172997
    child_fd: 75
    closed: False
    timeout: 30
    delimiter: <class 'pexpect.exceptions.EOF'>
    logfile: None
    logfile_read: None
    logfile_send: None
    maxread: 2000
    ignorecase: False
    searchwindowsize: None
    delaybeforesend: 0.05
    delayafterclose: 0.1
    delayafterterminate: 0.1
    searcher: searcher_re:
        0: re.compile("<amp-connect>")
    ```

  3. sebastien hamel

    This happens if I let Amp choose the number of tasks itself, and as I said, I can prevent the crash by setting `SLURM_TASKS_PER_NODE=30`. This is on a cluster with 36 cores per node.

    I don't see this error on a different cluster we have with 16 cores per node.

  4. andrew_peterson repo owner

    That error's pretty tough to read without formatting. (Perhaps best to put it in a code block next time.) But it looks like pexpect is timing out before all the workers have started. Does the log file list the /tmp files made by the workers? I would start by looking in those files.
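    If the /tmp files are empty, a quick way to see what a worker prints before dying is to mirror its output into the job log. A sketch against pexpect's documented API, with the command line copied from your traceback:

    ```python
    import sys
    import pexpect

    # Echo everything the worker writes, so whatever kills it (here the
    # KeyboardInterrupt during numpy's import) lands in our own stdout.
    child = pexpect.spawn('/g/g99/hamel2/python/bin/python -m amp.model 30 quartz1922:38879',
                          timeout=30, encoding='utf-8')
    child.logfile_read = sys.stdout
    child.expect('<amp-connect>')
    ```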

  5. sebastien hamel

    Sorry about the formatting. FYI, I tried increasing the timeout value for the `pexpect.spawn` call in `amp/utilities.py`; while the increased value does show up in the error message (`timeout: 300` instead of `timeout: 30`), the same crash happens. I'll keep digging.
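    One thing I did notice: the exception is EOF rather than TIMEOUT, which as I understand pexpect means the worker exited before printing `<amp-connect>`, not that it was merely slow, so a larger timeout alone can't help. A sketch of the distinction (command line copied from the traceback):

    ```python
    import pexpect

    # EOF = the spawned python died (the 'before' buffer above shows a
    # KeyboardInterrupt inside numpy's import); TIMEOUT = still alive but
    # not yet through the handshake.
    child = pexpect.spawn('/g/g99/hamel2/python/bin/python -m amp.model 30 quartz1922:38879',
                          timeout=300, encoding='utf-8')
    try:
        child.expect('<amp-connect>')
    except pexpect.TIMEOUT:
        print('worker alive but slow; a longer timeout may help')
    except pexpect.EOF:
        print('worker exited early; its last output:')
        print(child.before)
    ```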
