grlsd fails on ARCHER when PILOTSIZE > 24
Issue #10
resolved
I am using the latest NTL-9 setup workflow script from: https://bitbucket.org/extasy-project/extasy-experiments/src/980ac8d2dee63fa090e84bafcbb93830c360e1d9/grlsd-adaptive-on-archer/?at=master
My software versions are:
(extasy-env) macpro-ib:grlsd-test ibethune$ ensemblemd-version
0.3.14-35-g391c807
(extasy-env) macpro-ib:grlsd-test ibethune$ radicalpilot-version
0.40.1
If I run with PILOTSIZE=24 all is well, but if I set it to 48 (i.e. 2 nodes), I get an error shortly after the pilot starts:
Starting pattern execution ok
2016-03-28 10:40:09,603: radical.enmd.simulation_analysis_loop.static.default: MainProcess : MainThread : INFO : Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'
2016-03-28 10:40:09,603: radical.enmd.simulation_analysis_loop.static.default: MainProcess : MainThread : INFO : Waiting for pilot on epsrc.archer to go Active
Job waiting on queue...
2016-03-28 10:41:08,304: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : INFO : Resource epsrc.archer state has changed to Failed
2016-03-28 10:41:08,304: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Resource error:
2016-03-28 10:41:08,304: radical.enmd.SingleClusterEnvironment: MainProcess : Thread-1 : ERROR : Pattern execution FAILED.
2016-03-28 10:41:08,304: radical.pilot : MainProcess : Thread-1 : ERROR : sys.exit from callback
Traceback (most recent call last):
File "/Users/ibethune/Desktop/extasy/extasy-env/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
cb(self._shared_data[pilot_id]['facade_object'](), new_state)
File "/Users/ibethune/Desktop/extasy/extasy-env/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 168, in pilot_state_cb
sys.exit(1)
SystemExit: 1
2016-03-28 10:41:08,447: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : ERROR : Fatal error during execution: .
Fatal error during execution: .
Starting Deallocation..
2016-03-28 10:41:08,447: radical.enmd.SingleClusterEnvironment: MainProcess : MainThread : INFO : Deallocating Cluster
There is the following at the end of the agent_0.out (up to that point all looks OK):
FAILED startup
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 553, in bootstrap_3
logger = log)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 219, in create
return impl(cfg, logger)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/pbspro.py", line 20, in __init__
LRMS.__init__(self, cfg, logger)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 163, in __init__
% (str(cores_avail), str(self.requested_cores)))
bootstrap_3 done
atexit
And in agent_0.bootstrap_3.log:
2016-03-28 10:41:09,066: agent_0.bootstrap_3 : MainProcess : MainThread : INFO : python.interpreter version: 2.7.6 (default, Mar 10 2014, 14:13:45) [GCC 4.8.1 20130531 (Cray Inc.)]
2016-03-28 10:41:09,080: agent_0.bootstrap_3 : MainProcess : MainThread : INFO : pid: 1342
2016-03-28 10:41:09,080: agent_0.bootstrap_3 : MainProcess : MainThread : INFO : tid: MainThread
2016-03-28 10:41:09,080: agent_0.bootstrap_3 : MainProcess : MainThread : INFO : start
2016-03-28 10:41:09,277: agent_0.bootstrap_3 : MainProcess : MainThread : ERROR : Error running agent: agent_0
Traceback (most recent call last):
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 553, in bootstrap_3
logger = log)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 219, in create
return impl(cfg, logger)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/pbspro.py", line 20, in __init__
LRMS.__init__(self, cfg, logger)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 163, in __init__
% (str(cores_avail), str(self.requested_cores)))
ValueError: Not enough cores available (24) to satisfy allocation request (48).
2016-03-28 10:41:09,280: agent_0.bootstrap_3 : MainProcess : MainThread : ERROR : FAILED startup
2016-03-28 10:41:09,281: agent_0.bootstrap_3 : MainProcess : MainThread : ERROR : File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 553, in bootstrap_3
logger = log)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 219, in create
return impl(cfg, logger)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/pbspro.py", line 20, in __init__
LRMS.__init__(self, cfg, logger)
File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 163, in __init__
% (str(cores_avail), str(self.requested_cores)))
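For anyone reading the traceback: the ValueError is consistent with the agent counting only one node's worth of cores (24) while the pilot asked for 48. The following is only a minimal sketch of that kind of check, not the actual code in radical/pilot/agent/rm/base.py. The function name `check_allocation`, the node names, and the 24-cores-per-node constant are assumptions for illustration; only the error message format is taken from the log above.

```python
# Assumption: one ARCHER node exposes 24 cores, as implied by "48 (i.e. 2 nodes)".
CORES_PER_NODE = 24


def check_allocation(detected_nodes, requested_cores, cores_per_node=CORES_PER_NODE):
    """Hypothetical stand-in for the resource-manager check seen in the traceback.

    Returns the number of available cores, or raises ValueError with the same
    message format as the log when fewer cores are detected than requested.
    """
    cores_avail = len(detected_nodes) * cores_per_node
    if cores_avail < requested_cores:
        raise ValueError(
            "Not enough cores available (%s) to satisfy allocation request (%s)."
            % (str(cores_avail), str(requested_cores))
        )
    return cores_avail


if __name__ == "__main__":
    # Two nodes detected: a 48-core request is satisfied.
    print(check_allocation(["node-a", "node-b"], 48))
    # One node detected: the same request fails as in the log.
    try:
        check_allocation(["node-a"], 48)
    except ValueError as err:
        print(err)
```

If the agent really did see a single node, this would explain why PILOTSIZE=24 works and PILOTSIZE=48 fails, but the logs here do not show the detected node list, so that remains a guess.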
Comments (3)
-
reporter -
reporter -
assigned issue to
Right now, if I install ensemblemd from pip and get the released version of RP, this bug still exists:
(extasy-env2) bash-3.2$ pip install radical.ensemblemd
...
(extasy-env2) bash-3.2$ ensemblemd-version
0.3.14
(extasy-env2) bash-3.2$ radicalpilot-version
0.40.1
Can we have a new release of RP before the tutorial?
-
assigned issue to
-
reporter - changed status to resolved
This is now working with RP 0.40.2. Thanks!
Fixed in the RP devel branch; leave open until it's released.
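To check whether a local environment already has the fix, the output of `radicalpilot-version` can be compared against 0.40.2, the release the reporter confirmed as working. A minimal sketch, assuming version strings start with `X.Y.Z`; the helper names `release_tuple` and `has_fix` are hypothetical and not part of RP:

```python
import re

# Release reported as working in this thread.
FIXED_IN = (0, 40, 2)


def release_tuple(version):
    """Parse the leading 'X.Y.Z' of a version string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def has_fix(version):
    """True if the release part of the version is at least FIXED_IN."""
    return release_tuple(version) >= FIXED_IN


if __name__ == "__main__":
    for candidate in ("0.40.1", "0.40.2"):
        print(candidate, has_fix(candidate))
```

Only the release part is compared, so a devel build whose version string carries a git-describe suffix on top of an older release would be reported as not having the fix even if it does.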