Coco-amber and gromacs-lsdmap broken on ARCHER

Issue #11 resolved
Iain Bethune created an issue

Following the instructions from RTD, using the latest public release of enmd:

(extasy-test)[ibethune@workflow coam-on-archer]$ python --RPconfig archer.rcfg --Kcon cocoamber.wcfg 

 EnsembleMD (0.3.14)                                                            

Starting Allocation                                                           ok
        Verifying pattern                                                     ok
        Starting pattern execution                                            ok
Executing simulation-analysis loop with 2 iterations on 24 allocated core(s) on 'epsrc.archer'

Job waiting on queue...
Job is now running !
pre_loop() not defined. Skipping.
Iteration 1: Waiting for simulation tasks: custom.amber to complete
2016-05-11 10:49:54,805: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : ComputeUnit error: STDERR: 
  Unit    5 Error on OPEN:                                                                          
STOP PMEMD Terminated Abnormally!
  Unit    5 Error on OPEN:                                                                          
STOP PMEMD Terminated Abnormally!

2016-05-11 10:49:54,806: radical.enmd.simulation_analysis_loop.static.default: MainProcess                     : Thread-3       : ERROR   : Pattern execution FAILED.
2016-05-11 10:49:54,806: radical.pilot       : MainProcess                     : Thread-3       : ERROR   : unit manager controller thread caught system exit -- forcing application shutdown
Traceback (most recent call last):
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/", line 262, in run
    self.call_unit_state_callbacks(unit_id, new_state)
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/controller/", line 199, in call_unit_state_callbacks
    cb(self._shared_data[unit_id]['facade_object'], new_state)
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/", line 141, in unit_state_cb
SystemExit: 1
Execution interupted
Traceback (most recent call last):
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/ensemblemd/exec_plugins/simulation_analysis_loop/", line 478, in execute_pattern
  File "/home/ibethune/extasy-test/lib/python2.7/site-packages/radical/pilot/", line 698, in wait_units
    time.sleep (0.5)

        Starting Deallocation                                               done 
(extasy-test)[ibethune@workflow coam-on-archer]$

In one of the failing CUs:

d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/> cat unit.000014/STDERR 

  Unit    5 Error on OPEN:                                                                          
STOP PMEMD Terminated Abnormally!
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/> ls -ltr unit.000014/
total 12
-rwx------ 1 d130ib d130 732 May 11 15:49
lrwxrwxrwx 1 d130ib d130  17 May 11 15:49 -> $SHARED/
lrwxrwxrwx 1 d130ib d130  17 May 11 15:49 penta.crd -> $SHARED/penta.crd
lrwxrwxrwx 1 d130ib d130  14 May 11 15:49 -> $SHARED/
lrwxrwxrwx 1 d130ib d130  17 May 11 15:49 min1.rst7 -> $SHARED/penta.crd
-rw------- 1 d130ib d130 207 May 11 15:49 STDOUT
-rw------- 1 d130ib d130 143 May 11 15:49 STDERR
d130ib@eslogin007:/work/d130/d130/d130ib/radical.pilot.sandbox/> ls -ltr \$SHARED/
total 0

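The literal $SHARED in the symlink targets looks odd to me; I don't remember seeing that before. It looks as though the placeholder was never expanded to a real path.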
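
A minimal sketch for checking a unit sandbox for this symptom, assuming the problem shows up as symlinks whose targets still contain the literal text `$SHARED` or that do not resolve to an existing file. The function name `find_unexpanded_links` and the `placeholder` parameter are hypothetical helpers for illustration; they are not part of enmd or radical.pilot.

```python
import os


def find_unexpanded_links(unit_dir, placeholder="$SHARED"):
    """Return (name, target) pairs for suspicious symlinks in unit_dir.

    A link is reported if its target still contains the literal
    placeholder string, or if the target does not exist (dangling link).
    """
    suspicious = []
    for name in sorted(os.listdir(unit_dir)):
        path = os.path.join(unit_dir, name)
        if not os.path.islink(path):
            continue  # regular files such as STDOUT/STDERR are ignored
        target = os.readlink(path)
        # os.path.exists() follows the link, so False means it is dangling.
        dangling = not os.path.exists(path)
        if placeholder in target or dangling:
            suspicious.append((name, target))
    return suspicious
```

Calling `find_unexpanded_links("unit.000014")` from the sandbox directory would be expected to list entries such as `penta.crd` with target `$SHARED/penta.crd` if the staging placeholder was left unexpanded.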

  1. Iain Bethune reporter

    Gromacs-lsdmap is also broken in a similar way. The first CU fails because the file '' is not available:

    d130ib@eslogin003:/work/d130/d130/d130ib/radical.pilot.sandbox/> cat STDERR
    python: can't open file '': [Errno 2] No such file or directory
    d130ib@eslogin003:/work/d130/d130/d130ib/radical.pilot.sandbox/> ls -l
    total 12
    -rwx------ 1 d130ib d130 628 May 11 16:50
    -rw------- 1 d130ib d130  74 May 11 16:50 STDERR
    -rw------- 1 d130ib d130 130 May 11 16:50 STDOUT

    Looks like something has gone wrong relating to the data staging / copying of files...

  2. Vivek Balasubramanian

    The released version (0.3.14) doesn't support staging in shared data, which is why you see this error. You shouldn't see it in master or 0.4-RC0.

