Coco-amber and gromacs-lsdmap broken on ARCHER
Issue #11
resolved
Following the instructions from RTD, using the latest public release of enmd:
(extasy-test)[ibethune@workflow coam-on-archer]$ python extasy_amber_coco.py --RPconfig archer.rcfg --Kcon cocoamber.wcfg
================================================================================
EnsembleMD (0.3.14)
================================================================================
Starting Allocation ok
Verifying pattern ok
Starting pattern execution ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 2 iterations on 24 allocated core(s) on 'epsrc.archer'
Job waiting on queue...
Job is now running !
pre_loop() not defined. Skipping.
Iteration 1: Waiting for simulation tasks: custom.amber to complete
2016-05-11 10:49:54,805: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : ComputeUnit error: STDERR:
Unit 5 Error on OPEN: min.in
STOP PMEMD Terminated Abnormally!
, STDOUT:
Unit 5 Error on OPEN: min.in
STOP PMEMD Terminated Abnormally!
2016-05-11 10:49:54,806: radical.enmd.simulation_analysis_loop.static.default: MainProcess : Thread-3 : ERROR : Pattern execution FAILED.
2016-05-11 10:49:54,806: radical.pilot : MainProcess : Thread-3 : ERROR : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 262, in run
self.call_unit_state_callbacks(unit_id, new_state)
File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 199, in call_unit_state_callbacks
cb(self._shared_data[unit_id]['facade_object'], new_state)
File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
sys.exit(1)
SystemExit: 1
Execution interupted
Traceback (most recent call last):
File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 478, in execute_pattern
resource._umgr.wait_units()
File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 698, in wait_units
time.sleep (0.5)
KeyboardInterrupt
Starting Deallocation done
(extasy-test)[ibethune@workflow coam-on-archer]$
In one of the failing CUs:
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0005-pilot.0000> cat unit.000014/STDERR
Unit 5 Error on OPEN: min.in
STOP PMEMD Terminated Abnormally!
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0005-pilot.0000> ls -ltr unit.000014/
total 12
-rwx------ 1 d130ib d130 732 May 11 15:49 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 d130ib d130 17 May 11 15:49 penta.top -> $SHARED/penta.top
lrwxrwxrwx 1 d130ib d130 17 May 11 15:49 penta.crd -> $SHARED/penta.crd
lrwxrwxrwx 1 d130ib d130 14 May 11 15:49 min.in -> $SHARED/min.in
lrwxrwxrwx 1 d130ib d130 17 May 11 15:49 min1.rst7 -> $SHARED/penta.crd
-rw------- 1 d130ib d130 207 May 11 15:49 STDOUT
-rw------- 1 d130ib d130 143 May 11 15:49 STDERR
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0005-pilot.0000> ls -ltr \$SHARED/
total 0
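The literal $SHARED in the symlink targets looks odd to me; I don't remember seeing that before. The directory named $SHARED is empty, so every one of those links is dangling.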
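For anyone who wants to check other unit directories for the same symptom, here is a minimal diagnostic sketch. It is not part of enmd or RADICAL-Pilot; the function names are hypothetical, and it only assumes that the broken inputs show up as symlinks whose targets do not exist, as in the listing above.

```python
import os
import sys


def find_dangling_symlinks(directory):
    """Return (link_name, target) pairs for symlinks whose target is missing."""
    dangling = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        # os.path.exists() follows symlinks, so it is False for a broken link.
        if os.path.islink(path) and not os.path.exists(path):
            dangling.append((name, os.readlink(path)))
    return dangling


def has_unexpanded_placeholder(target):
    """True if the link target still contains a literal '$' (e.g. '$SHARED/min.in')."""
    return "$" in target


if __name__ == "__main__":
    # Directory to inspect; defaults to the current directory.
    unit_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name, target in find_dangling_symlinks(unit_dir):
        note = " (unexpanded placeholder?)" if has_unexpanded_placeholder(target) else ""
        print("%s -> %s%s" % (name, target, note))
```

Run against a unit directory such as unit.000014, this should list each broken link together with its raw target, which makes an unexpanded placeholder easy to spot.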
Comments (6)
-
reporter Gromacs-lsdmap is also broken in a similar way. The first CU fails because the file 'spliter.py' is not available:
d130ib@eslogin003:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0006-pilot.0000/unit.000000> cat STDERR
python: can't open file 'spliter.py': [Errno 2] No such file or directory
d130ib@eslogin003:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0006-pilot.0000/unit.000000> ls -l
total 12
-rwx------ 1 d130ib d130 628 May 11 16:50 radical_pilot_cu_launch_script.sh
-rw------- 1 d130ib d130 74 May 11 16:50 STDERR
-rw------- 1 d130ib d130 130 May 11 16:50 STDOUT
It looks like something has gone wrong with the data staging, i.e. the copying or linking of input files into the unit directories.
-
The released version (0.3.14) doesn't support staging in shared data, which is why you see this error. You shouldn't see it in master or 0.4-RC0.
-
reporter Amber-coco is working with 0.4-RC0...
-
reporter ... and so is gromacs-lsdmap! I'll leave this open until we have a release version.
-
reporter - changed status to resolved
Fixed in RP 0.40.02