Coco-amber and gromacs-lsdmap broken on ARCHER

Issue #11 resolved
Iain Bethune created an issue

Following the instructions from RTD, using the latest public release of enmd:

(extasy-test)[ibethune@workflow coam-on-archer]$ python extasy_amber_coco.py --RPconfig archer.rcfg --Kcon cocoamber.wcfg 

================================================================================
 EnsembleMD (0.3.14)                                                            
================================================================================

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
--------------------------------------------------------------------------------
Executing simulation-analysis loop with 2 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
pre_loop() not defined. Skipping.
Iteration 1: Waiting for simulation tasks: custom.amber to complete2016-05-11 10:49:54,805: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: 
  Unit    5 Error on OPEN: min.in                                                                          
STOP PMEMD Terminated Abnormally!
, STDOUT: 
  Unit    5 Error on OPEN: min.in                                                                          
STOP PMEMD Terminated Abnormally!

2016-05-11 10:49:54,806: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2016-05-11 10:49:54,806: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 262, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/unit_manager_controller.py", line 199, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 141, in unit_state_cb
    sys.exit(1)
SystemExit: 1
Execution interuptedTraceback (most recent call last):
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/static.py", line 478, in execute_pattern
    resource._umgr.wait_units()
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/unit_manager.py", line 698, in wait_units
    time.sleep (0.5)
KeyboardInterrupt

        Starting Deallocation                                               done 
(extasy-test)[ibethune@workflow coam-on-archer]$

In one of the failing CUs:

d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0005-pilot.0000> cat unit.000014/STDERR 

  Unit    5 Error on OPEN: min.in                                                                          
STOP PMEMD Terminated Abnormally!
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0005-pilot.0000> ls -ltr unit.000014/
total 12
-rwx------ 1 d130ib d130 732 May 11 15:49 radical_pilot_cu_launch_script.sh
lrwxrwxrwx 1 d130ib d130  17 May 11 15:49 penta.top -> $SHARED/penta.top
lrwxrwxrwx 1 d130ib d130  17 May 11 15:49 penta.crd -> $SHARED/penta.crd
lrwxrwxrwx 1 d130ib d130  14 May 11 15:49 min.in -> $SHARED/min.in
lrwxrwxrwx 1 d130ib d130  17 May 11 15:49 min1.rst7 -> $SHARED/penta.crd
-rw------- 1 d130ib d130 207 May 11 15:49 STDOUT
-rw------- 1 d130ib d130 143 May 11 15:49 STDERR
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0005-pilot.0000> ls -ltr \$SHARED/
total 0

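The literal $SHARED in the symlink targets looks odd to me; I don't remember seeing that before. It appears the variable was never expanded, and the directory literally named $SHARED is empty.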
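
For anyone wanting to check other unit directories for the same symptom, here is a minimal sketch in Python. It is not part of enmd or RADICAL-Pilot; the function name `find_broken_links` and the rule "a `$` in the link target means an unexpanded variable" are my own assumptions for illustration.

```python
import os


def find_broken_links(unit_dir):
    """Return (name, target, reason) for each suspicious symlink in unit_dir.

    Hypothetical helper: flags links whose target still contains a '$'
    (assumed to be an unexpanded variable) or whose target is missing.
    """
    broken = []
    for name in sorted(os.listdir(unit_dir)):
        path = os.path.join(unit_dir, name)
        if not os.path.islink(path):
            continue  # regular files such as STDOUT/STDERR are ignored
        target = os.readlink(path)
        if "$" in target:
            reason = "unexpanded variable in target"
        elif not os.path.exists(path):
            # os.path.exists follows the link, so False means it dangles
            reason = "target does not exist"
        else:
            continue
        broken.append((name, target, reason))
    return broken
```

Calling `find_broken_links("unit.000014")` from inside the pilot sandbox should then list every link that cannot be resolved, together with its raw target.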

Comments (6)

  1. Iain Bethune reporter

    Gromacs-lsdmap is also broken in a similar way. The first CU fails because the file 'spliter.py' is not available:

    d130ib@eslogin003:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0006-pilot.0000/unit.000000> cat STDERR
    python: can't open file 'spliter.py': [Errno 2] No such file or directory
    d130ib@eslogin003:/work/d130/d130/d130ib/radical.pilot.sandbox/rp.session.workflow.iu.xsede.org.ibethune.016932.0006-pilot.0000/unit.000000> ls -l
    total 12
    -rwx------ 1 d130ib d130 628 May 11 16:50 radical_pilot_cu_launch_script.sh
    -rw------- 1 d130ib d130  74 May 11 16:50 STDERR
    -rw------- 1 d130ib d130 130 May 11 16:50 STDOUT
    

    It looks like something has gone wrong with the data staging, i.e. the copying of input files into the unit directories.

  2. Vivek Balasubramanian

    The released version (0.3.14) doesn't support staging in shared data, which is why you see this error. You shouldn't see it in master or 0.4-RC0.
