gmxcoco workflow failing on Blue Waters

Issue #18 resolved
Charlie Laughton created an issue

Running the gmxcoco workflow fails at the first coco step. In the offending unit.XXXXXX directory I see the following:

laughton@h2ologin1:~/scratch/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016954.0001-pilot.0000/unit.000064> more radical_pilot_cu_launch_script.sh 
#!/bin/sh

# Change to working directory for unit
cd /mnt/c/scratch/sciteam/laughton/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016954.0001-pilot.0000/unit.000064
# Pre-exec commands
module use --append /projects/sciteam/gkd/modules
module load openmpi
module load bwpy
source /projects/sciteam/gkd/virtenvs/coco_test/bin/activate
export PATH=$PATH:/projects/sciteam/gkd/virtenvs/coco_test/bin
export PYTHONPATH=$PYTHONPATH:/projects/sciteam/gkd/virtenvs/coco_test/lib/python-2.7/site-packages
# Environment variables
export RP_SESSION_ID=rp.session.poirot.pharm.nottingham.ac.uk.charlie.016954.0001 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_1 RP_SPAWNER_ID=agent_1.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000064
# The command to run
/projects/sciteam/gkd/openmpi/20151210-DYN/bin/orte-submit --hnp "3918069760.0;tcp://10.128.99.114:48656" -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH -np 16 -host 10620,10620,10620,10620,10620,10620,10620,10620,10620,10620,10620,10620,10620,10620,10620,10620 pyCoCo "--grid" "30" "--dims" "3" "--frontpoints" "16" "--topfile" "md-0_0.gro" "--mdfile" "*.xtc" "--output" "coco_out_0.gro" "--logfile" "coco.log" "--selection" "name CA"
RETVAL=$?
# Exit the script with the return code from the command
exit $RETVAL

laughton@h2ologin1:~/scratch/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016954.0001-pilot.0000/unit.000064> more STDERR
MPI functionality is now available through bwpy-mpi.
To enable MPI packages `module load bwpy-mpi` after bwpy
Thu Jun  2 02:31:46 2016: [PE_0]:inet_listen_socket_setup:inet_setup_listen_socket: bind failed port 1371 listen_sock = 8 Address already in use
Thu Jun  2 02:31:46 2016: [PE_0]:_pmi_inet_listen_socket_setup:socket setup failed
Thu Jun  2 02:31:46 2016: [PE_0]:_pmi_init:_pmi_inet_listen_socket_setup (full) returned -1
[Thu Jun  2 02:31:46 2016] [c18-5c0s1n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......:  PMI2 init failed: 1

[... the same bind-failed / PMPI_Init_thread stanza repeats verbatim for the remaining ranks ...]

Thu Jun  2 02:31:46 2016: [PE_0]:_pmi_alps_sync:alps response not OKAY
Thu Jun  2 02:31:46 2016: [PE_0]:_pmi_init:_pmi_alps_sync failed -1
[Thu Jun  2 02:31:46 2016] [c18-5c0s1n0] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(525):
MPID_Init(225).......: channel initialization failed
MPID_Init(598).......:  PMI2 init failed: 1
laughton@h2ologin1:~/scratch/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016954.0001-pilot.0000/unit.000064> 

Comments (15)

  1. Andre Merzky

    Hi Charlie,

    we suspect that this is caused by a mismatch in the mpi4py version used. The RP support on BW is based on ORTE, which in turn is built on OpenMPI. That implies that only MPI applications compiled against the same OpenMPI version are supported. We are aware that this is quite a limitation and are looking into relaxing it, but the limitation is likely to stay around for several months.

    Could you please check what MPI4PY installation you use, to ensure that this is indeed the problem?

    You can use our OpenMPI installation via:

      module switch PrgEnv-cray PrgEnv-gnu
      module load bwpy
      module use --append /projects/sciteam/gkd/modules
      module load openmpi
    

    Best, Andre.
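
    One quick way to act on this question: the MPI stack behind an mpi4py build can often be inferred from the library paths in `ldd` output for its compiled extension (e.g. `ldd .../site-packages/mpi4py/MPI.so`). The helper below is a sketch of that heuristic; the `mpi_flavor` function name and the sample paths are illustrative, not from the thread.

```python
# Heuristic sketch: classify which MPI stack a shared object links against,
# given the text output of `ldd some_module.so`. The path keywords are a
# guess based on the install locations mentioned in this thread; adjust
# them for your own system.
def mpi_flavor(ldd_output):
    for line in ldd_output.splitlines():
        if "openmpi" in line:
            return "openmpi"       # e.g. /projects/sciteam/gkd/openmpi/...
        if "mpich" in line or "cray" in line:
            return "cray-mpich"    # Cray MPT is MPICH-derived
    return "unknown"

# Illustrative ldd line for an OpenMPI-linked mpi4py extension:
sample = "\tlibmpi.so.1 => /projects/sciteam/gkd/openmpi/20151210-DYN/lib/libmpi.so.1"
print(mpi_flavor(sample))  # -> openmpi
```

    An mpi4py that reports `cray-mpich` here would confirm the mismatch Andre suspects.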

  2. Charlie Laughton reporter

    Hi Andre,

    Are you saying I should NOT be loading module bwpy-mpi?

    I followed your instructions above, then started a Python shell and tried to ‘import mpi4py’, but I got a ‘no such package’ error.

    I then did ‘module load bwpy-mpi’ and tried again, this time it loaded OK. ‘help(mpi4py)’ gives:

    Help on package mpi4py:

    NAME
        mpi4py - This is the MPI for Python package.

    FILE
        /sw/xe/bwpy-mpi/0.2.0/usr/lib/python2.7/site-packages/mpi4py/__init__.py

    Does this help?

    Cheers,

    Charlie

  3. marksant

    Hi Charlie,

    Indeed, we need to use an mpi4py that is compiled against OpenMPI.

    Back in December we started with this organisation, and Vivek and the Rice folks began building such an mpi4py; I don't know what the exact status of that is now.
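
    One way to produce such a build is sketched below, assuming the module and virtualenv layout quoted earlier in this thread and that `pip` is available inside the virtualenv; the rebuild command is an assumption, not a recipe from the thread.

```shell
# Sketch: rebuild mpi4py inside the coco_test virtualenv against the gkd
# OpenMPI, so its compiled extension links that libmpi rather than the
# Cray MPI libraries that bwpy-mpi provides.
module switch PrgEnv-cray PrgEnv-gnu
module load bwpy
module use --append /projects/sciteam/gkd/modules
module load openmpi
source /projects/sciteam/gkd/virtenvs/coco_test/bin/activate
# MPICC points mpi4py's build at the OpenMPI compiler wrapper
env MPICC="$(which mpicc)" pip install --no-cache-dir --force-reinstall mpi4py
```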

  4. Charlie Laughton reporter

    Well, thanks for confirming the issue. This could be a big problem: I was planning to burn some serious BW time on big gmxcoco runs, so I may have to rethink…


  5. Andre Merzky

    Vivek, can you help Charlie to set this up? This should not turn out to be a major usability issue for BW, or so I hope...

    Thanks, Andre.

  6. Vivek Balasubramanian

    Hey Charlie,

    Giving it another run now. We did use CoCo for the runs for the ExTASY paper; let me see if anything has changed in the last few days that is causing the error.

  7. Vivek Balasubramanian

    Could you try again with the following in the pre-exec for "ncsa.bw" in kernel_defs/coco.py:

    "pre_exec" : [
        "module load bwpy",
        "source /projects/sciteam/gkd/virtenvs/coco_test/bin/activate"
    ],
    
  8. Charlie Laughton reporter

    Hi Vivek - OK, I'll try this a bit later today. Can I just check, though: this script appears to load bwpy but not bwpy-mpi - is that right? CoCo uses mpi4py.

    Charlie

  9. marksant

    Yes, that's the whole intent: bwpy-mpi is linked against the Cray MPI libraries, and that's exactly what we are trying to avoid.

  10. Charlie Laughton reporter

    Hi Vivek,

    The job fails, but with a different error. All 16 units running the first grompp step complete OK, but then all 16 units running mdrun fail with the same error message in STDERR:

    laughton@h2ologin2:~/scratch/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016961.0000-pilot.0000/unit.000016> more STDERR
    [nid19353:03494] *** Process received signal ***
    [nid19353:03494] Signal: Segmentation fault (11)
    [nid19353:03494] Signal code: Address not mapped (1)
    [nid19353:03494] Failing at address: (nil)
    [nid19353:03494] [ 0] /lib64/libpthread.so.0(+0xf850)[0x2aaaab90f850]
    [nid19353:03494] *** End of error message ***
    /mnt/c/scratch/sciteam/laughton/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016961.0000-pilot.0000/unit.000016/radical_pilot_cu_launch_script.sh: line 16: 3494 Segmentation fault (core dumped) /projects/sciteam/gkd/openmpi/20151210-DYN/bin/orte-submit --hnp "2286747648.0;tcp://10.128.108.209:52151" -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH -np 32 -host 19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357 gmx_mpi mdrun "-deffnm" "md-0_0"

    For reference, here is the run script:

    laughton@h2ologin2:~/scratch/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016961.0000-pilot.0000/unit.000016> more radical_pilot_cu_launch_script.sh
    #!/bin/sh

    # Change to working directory for unit
    cd /mnt/c/scratch/sciteam/laughton/radical.pilot.sandbox/rp.session.poirot.pharm.nottingham.ac.uk.charlie.016961.0000-pilot.0000/unit.000016
    # Pre-exec commands
    export PATH=$PATH:/projects/sciteam/gkd/gromacs/5.1.1/20151210_OMPI20151210-DYN/install-cpu/bin
    export GROMACS_LIB=/projects/sciteam/gkd/gromacs/5.1.1/20151210_OMPI20151210-DYN/install-cpu/lib64
    export GROMACS_INC=/projects/sciteam/gkd/gromacs/5.1.1/20151210_OMPI20151210-DYN/install-cpu/include
    export GROMACS_BIN=/projects/sciteam/gkd/gromacs/5.1.1/20151210_OMPI20151210-DYN/install-cpu/bin
    export GROMACS_DIR=/projects/sciteam/gkd/gromacs/5.1.1/20151210_OMPI20151210-DYN/install-cpu
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/projects/sciteam/gkd/gromacs/5.1.1/20151210_OMPI20151210-DYN/install-cpu/lib64
    # Environment variables
    export RP_SESSION_ID=rp.session.poirot.pharm.nottingham.ac.uk.charlie.016961.0000 RP_PILOT_ID=pilot.0000 RP_AGENT_ID=agent_1 RP_SPAWNER_ID=agent_1.AgentExecutingComponent.0.child RP_UNIT_ID=unit.000016
    # The command to run
    /projects/sciteam/gkd/openmpi/20151210-DYN/bin/orte-submit --hnp "2286747648.0;tcp://10.128.108.209:52151" -x LD_LIBRARY_PATH -x PATH -x PYTHONPATH -np 32 -host 19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357,19357 gmx_mpi mdrun "-deffnm" "md-0_0"
    RETVAL=$?
    # Exit the script with the return code from the command
    exit $RETVAL

  11. Andre Merzky

    Vivek, can you look into that, since this seems to work for you? Please ping Mark if you get stuck though...

  12. Vivek Balasubramanian

    Hey Charlie,

    I tried out the tarball that you sent this morning. It ran successfully. Do you have any settings configured in your bashrc/bash_profile that might be conflicting?

    [nid19353:03494] *** Process received signal ***
    [nid19353:03494] Signal: Segmentation fault (11)
    [nid19353:03494] Signal code: Address not mapped (1)
    [nid19353:03494] Failing at address: (nil)
    [nid19353:03494] [ 0] /lib64/libpthread.so.0(+0xf850)[0x2aaaab90f850]

    I think I might have seen something like this a while back, but I couldn't reproduce it. Could you confirm whether this error repeats for you?

  13. Charlie Laughton reporter

    Hi Vivek,

    Good news – I cleaned out my .bashrc file and now it seems to be working, at least for short runs. I will test at scale shortly.

    Many thanks for your help,
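
    For future reference, one systematic way to find which .bashrc entry is responsible is to bisect the file. A minimal sketch follows; the four rc lines in it are placeholders, not Charlie's actual settings:

```shell
# Sketch of bisecting an rc file: split it in half, source each half in a
# fresh login shell, re-run the failing unit, and recurse into whichever
# half still reproduces the error. Placeholder rc content below.
rc=$(mktemp)
printf 'module load a\nmodule load b\nexport X=1\nexport Y=2\n' > "$rc"
n=$(wc -l < "$rc")
head -n $((n / 2))     "$rc" > "${rc}.first"   # top half of the rc file
tail -n $((n - n / 2)) "$rc" > "${rc}.second"  # bottom half
wc -l "${rc}.first" "${rc}.second"
```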

  14. Vivek Balasubramanian

    That's great! If possible, could you paste the content that you think might have conflicted or caused this error? Just so we know what causes this issue and can avoid or debug it in the future.
