grlsd fails on ARCHER when PILOTSIZE > 24

Issue #10 resolved
Iain Bethune created an issue

I am using the latest NTL-9 setup workflow script from: https://bitbucket.org/extasy-project/extasy-experiments/src/980ac8d2dee63fa090e84bafcbb93830c360e1d9/grlsd-adaptive-on-archer/?at=master

My software versions are:

(extasy-env) macpro-ib:grlsd-test ibethune$ ensemblemd-version 
0.3.14-35-g391c807
(extasy-env) macpro-ib:grlsd-test ibethune$ radicalpilot-version 
0.40.1

If I run with PILOTSIZE=24 all is well, but if I set it to 48 (i.e. 2 nodes), I get an error shortly after the pilot starts:

Starting pattern execution                                                    ok
2016-03-28 10:40:09,603: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : MainThread     : INFO    : Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'

--------------------------------------------------------------------------------
Executing simulation-analysis loop with 1 iterations on 48 allocated core(s) on 'epsrc.archer'

2016-03-28 10:40:09,603: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : MainThread     : INFO    : Waiting for pilot on epsrc.archer to go Active
Job waiting on queue...
2016-03-28 10:41:08,304: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : INFO    : Resource epsrc.archer state has changed to Failed
2016-03-28 10:41:08,304: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Resource error: 
2016-03-28 10:41:08,304: radical.enmd.SingleClusterEnvironment: MainProcess                     : Thread-1       : ERROR   : Pattern execution FAILED.
2016-03-28 10:41:08,304: radical.pilot       : MainProcess                     : Thread-1       : ERROR   : sys.exit from callback
Traceback (most recent call last):
  File "/Users/ibethune/Desktop/extasy/extasy-env/lib/python2.7/site-packages/radical/pilot/controller/pilot_manager_controller.py", line 258, in call_callbacks
    cb(self._shared_data[pilot_id]['facade_object'](), new_state)
  File "/Users/ibethune/Desktop/extasy/extasy-env/lib/python2.7/site-packages/radical/ensemblemd/single_cluster_environment.py", line 168, in pilot_state_cb
    sys.exit(1)
SystemExit: 1
2016-03-28 10:41:08,447: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : ERROR   : Fatal error during execution: .
Fatal error during execution: .
Starting Deallocation..
2016-03-28 10:41:08,447: radical.enmd.SingleClusterEnvironment: MainProcess                     : MainThread     : INFO    : Deallocating Cluster

There is the following at the end of the agent_0.out (up to that point all looks OK):

FAILED startup
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 553, in bootstrap_3
    logger = log)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 219, in create
    return impl(cfg, logger)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/pbspro.py", line 20, in __init__
    LRMS.__init__(self, cfg, logger)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 163, in __init__
    % (str(cores_avail), str(self.requested_cores)))

bootstrap_3 done
atexit

And in agent_0.bootstrap_3.log:

2016-03-28 10:41:09,066: agent_0.bootstrap_3 : MainProcess                     : MainThread     : INFO    : python.interpreter   version: 2.7.6 (default, Mar 10 2014, 14:13:45) [GCC 4.8.1 20130531 (Cray Inc.)]
2016-03-28 10:41:09,080: agent_0.bootstrap_3 : MainProcess                     : MainThread     : INFO    :                      pid: 1342
2016-03-28 10:41:09,080: agent_0.bootstrap_3 : MainProcess                     : MainThread     : INFO    :                      tid: MainThread
2016-03-28 10:41:09,080: agent_0.bootstrap_3 : MainProcess                     : MainThread     : INFO    : start
2016-03-28 10:41:09,277: agent_0.bootstrap_3 : MainProcess                     : MainThread     : ERROR   : Error running agent: agent_0
Traceback (most recent call last):
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 553, in bootstrap_3
    logger = log)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 219, in create
    return impl(cfg, logger)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/pbspro.py", line 20, in __init__
    LRMS.__init__(self, cfg, logger)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 163, in __init__
    % (str(cores_avail), str(self.requested_cores)))
ValueError: Not enough cores available (24) to satisfy allocation request (48).
2016-03-28 10:41:09,280: agent_0.bootstrap_3 : MainProcess                     : MainThread     : ERROR   : FAILED startup
2016-03-28 10:41:09,281: agent_0.bootstrap_3 : MainProcess                     : MainThread     : ERROR   :   File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/bin/radical-pilot-agent-multicore.py", line 553, in bootstrap_3
    logger = log)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 219, in create
    return impl(cfg, logger)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/pbspro.py", line 20, in __init__
    LRMS.__init__(self, cfg, logger)
  File "/work/e290/e290/e290ib/radical.pilot.sandbox/rp.session.macpro-ib.epcc.ed.ac.uk.ibethune.016888.0012-pilot.0000/rp_install/lib/python2.7/site-packages/radical/pilot/agent/rm/base.py", line 163, in __init__
    % (str(cores_avail), str(self.requested_cores)))
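
For reference, the ValueError in the log is consistent with a simple guard that compares the cores the agent detected against the cores the pilot requested. The snippet below is only a sketch reconstructed from the traceback and the error message, not the actual RADICAL-Pilot source: the function name `check_allocation` is hypothetical, and only the message text and the two values (24 detected, 48 requested) come from the log.

```python
def check_allocation(cores_avail, requested_cores):
    """Hypothetical stand-in for the guard that fails in the log.

    Raises ValueError when fewer cores were detected than requested;
    the message format is copied from agent_0.bootstrap_3.log.
    """
    if cores_avail < requested_cores:
        raise ValueError(
            "Not enough cores available (%s) to satisfy allocation request (%s)."
            % (str(cores_avail), str(requested_cores)))
    return True


if __name__ == "__main__":
    # Values taken from the failing run: 24 cores detected, PILOTSIZE=48.
    try:
        check_allocation(24, 48)
    except ValueError as err:
        print(err)

    # PILOTSIZE=24 passes the same guard, matching the working case.
    print(check_allocation(24, 24))
```

Read this way, the agent appears to have detected only one node's worth of cores (24) even though the job asked for 48, which would explain why PILOTSIZE=24 works and anything larger fails at startup.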

Comments (3)

  1. Iain Bethune reporter

    Right now, if I install ensemblemd from pip and get the released version of RP, this bug still exists:

    (extasy-env2) bash-3.2$ pip install radical.ensemblemd
    ...
    (extasy-env2) bash-3.2$ ensemblemd-version 
    0.3.14
    (extasy-env2) bash-3.2$ radicalpilot-version 
    0.40.1
    

    Can we have a new release of RP before the tutorial?
