Deadlock: Native HDF5, 2 ranks, Intel MPI

Issue #92 new
Alex created an issue

Observed a deadlock in MOAB on 2 ranks when using load_file() and the Intel tools. I consider this good news for issue #61, since that one might be triggered by the same problem at a much smaller scale. I wrote a reproducer, which is attached to this issue (a minimal sketch of it follows below).
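
For orientation, here is a minimal sketch of what the reproducer does. The attached reproducer.cpp is authoritative; the option string below is an assumption, inferred from the "Read mode is READ_PART" and "Resolving shared entities" messages in the output further down.

#include <iostream>
#include <mpi.h>
#include "moab/Core.hpp"

// Sketch only -- see the attached reproducer.cpp for the real code.
int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  std::cout << "welcome to main from rank " << rank << " of " << size << " ranks" << std::endl;

  std::cout << "initializing MOAB on rank " << rank << std::endl;
  moab::Core mb;

  std::cout << "loading mesh on rank: " << rank << std::endl;
  // On 2 ranks with Intel MPI 2019 this call never returns (deadlock below).
  // The option string is inferred from the logged read mode; the real
  // reproducer may use a different one.
  moab::ErrorCode rval = mb.load_file("meshes/tet4_300_2.h5m", 0,
      "PARALLEL=READ_PART;PARTITION=PARALLEL_PARTITION;PARALLEL_RESOLVE_SHARED_ENTS");
  if (rval != moab::MB_SUCCESS)
    std::cerr << "load_file failed on rank " << rank << std::endl;
  else
    std::cout << "loaded mesh on rank: " << rank << std::endl;

  std::cout << "finalizing from rank " << rank << " of " << size << " ranks" << std::endl;
  MPI_Finalize();
  return 0;
}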

I was able to reproduce the same behavior on three systems (2x CentOS 7 and a Debian 8), all running Intel Parallel Studio XE 2019 Cluster Edition:

alex@deb:~/Dropbox/tmp/18_10_28_moab_reproducer/meshes$ icpc --version
icpc (ICC) 19.0.0.117 20180804
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

alex@deb:~/Dropbox/tmp/18_10_28_moab_reproducer/meshes$ icc --version
icc (ICC) 19.0.0.117 20180804
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

alex@deb:~/Dropbox/tmp/18_10_28_moab_reproducer/meshes$ mpiexec --version
Intel(R) MPI Library for Linux* OS, Version 2019 Build 20180829 (id: 15f5d6c0c)
Copyright 2003-2018, Intel Corporation.

When using GCC + OpenMPI, everything ran smoothly in my tests.

Here's the mesh info:

alex@deb:~/Dropbox/tmp/18_10_28_moab_reproducer/meshes$ ../libs/bin/h5minfo tet4_300_2.h5m
tet4_300_2.h5m:
  Entities:
    Nodes:
      dimension : 3
      entities  : 172122 [1 - 172122]
      dense tags: "GLOBAL_ID"
    Tet4 (Tet):
      nodes per element: 4
      entities         : 942369 [227399 - 1169767]
      no adjencies     
      dense tags       : "GLOBAL_ID", "PARALLEL_PARTITION"
    Tri3 (Tri):
      nodes per element: 3
      entities         : 55276 [172123 - 227398]
      no adjencies     
      dense tags       : "GLOBAL_ID"
    Sets:
      entities  : 18 [1169768 - 1169785]
      dense tags: "GLOBAL_ID"
  Tags:
    "DIRICHLET_SET":
      type    : integer
      size    : 1 (4 bytes)
      flags   : 1
      default : -1
      mesh val: -1
      tables  : (none)
    "GEOM_DIMENSION":
      type    : integer
      size    : 1 (4 bytes)
      flags   : 1
      default : -1
      mesh val: -1
      tables  : (sparse)
    "GLOBAL_ID":
      type    : integer
      size    : 1 (4 bytes)
      flags   : 2
      default : 0
      mesh val: 0
      tables  : Sets, Nodes, Tet4, Tri3
    "MATERIAL_SET":
      type    : integer
      size    : 1 (4 bytes)
      flags   : 1
      default : -1
      mesh val: -1
      tables  : (sparse)
    "NEUMANN_SET":
      type    : integer
      size    : 1 (4 bytes)
      flags   : 1
      default : -1
      mesh val: -1
      tables  : (none)
    "PARALLEL_PARTITION":
      type    : integer
      size    : 1 (4 bytes)
      flags   : 1
      default : -1
      mesh val: -1
      tables  : (sparse), Tet4

Here's the output of the reproducer:

alex@deb:~/Dropbox/tmp/18_10_28_moab_reproducer$ VT_DEADLOCK_TIMEOUT=120 VT_CHECK_TRACING=on mpiexec -check_mpi -n 2 ./a.out

[0] INFO: CHECK LOCAL:EXIT:SIGNAL ON
[0] INFO: CHECK LOCAL:EXIT:BEFORE_MPI_FINALIZE ON
[0] INFO: CHECK LOCAL:MPI:CALL_FAILED ON
[0] INFO: CHECK LOCAL:MEMORY:OVERLAP ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_MODIFICATION ON
[0] INFO: CHECK LOCAL:MEMORY:INACCESSIBLE ON
[0] INFO: CHECK LOCAL:MEMORY:ILLEGAL_ACCESS OFF
[0] INFO: CHECK LOCAL:MEMORY:INITIALIZATION OFF
[0] INFO: CHECK LOCAL:REQUEST:ILLEGAL_CALL ON
[0] INFO: CHECK LOCAL:REQUEST:NOT_FREED ON
[0] INFO: CHECK LOCAL:REQUEST:PREMATURE_FREE ON
[0] INFO: CHECK LOCAL:DATATYPE:NOT_FREED ON
[0] INFO: CHECK LOCAL:BUFFER:INSUFFICIENT_BUFFER ON
[0] INFO: CHECK GLOBAL:DEADLOCK:HARD ON
[0] INFO: CHECK GLOBAL:DEADLOCK:POTENTIAL ON
[0] INFO: CHECK GLOBAL:DEADLOCK:NO_PROGRESS ON
[0] INFO: CHECK GLOBAL:MSG:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:MSG:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:MSG:PENDING ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATATYPE:MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:DATA_TRANSMISSION_CORRUPTED ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:SIZE_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:REDUCTION_OPERATION_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:ROOT_MISMATCH ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:INVALID_PARAMETER ON
[0] INFO: CHECK GLOBAL:COLLECTIVE:COMM_FREE_MISMATCH ON
[0] INFO: maximum number of errors before aborting: CHECK-MAX-ERRORS 1
[0] INFO: maximum number of reports before aborting: CHECK-MAX-REPORTS 0 (= unlimited)
[0] INFO: maximum number of times each error is reported: CHECK-SUPPRESSION-LIMIT 10
[0] INFO: timeout for deadlock detection: DEADLOCK-TIMEOUT 120s
[0] INFO: timeout for deadlock warning: DEADLOCK-WARNING 300s
[0] INFO: maximum number of reported pending messages: CHECK-MAX-PENDING 20

welcome to main from rank welcome to main from rank 1 of 2 ranks
0 of 2 ranks
initializing MOAB on rank 0
initializing MOAB on rank 1
MOAB api: MOAB API version 1.01 impl: MOAB 5.0.1
MOAB api: MOAB API version 1.01 impl: MOAB 5.0.1
loading mesh on rank: 0
loading mesh on rank: 1
  1  ReadPara(0.00 s) Setting up...
  0  1  ReadParaRead mode is READ_PART
  ReadPara(0.00 s) Setting up...
  0  ReadParaRead mode is READ_PART
  1  ReadPara(0.00 s) Reading file: "meshes/tet4_300_2.h5m"
  0  ReadPara(0.00 s) Reading file: "meshes/tet4_300_2.h5m"
  0  ReadPara(4.23 s) Resolving shared entities.
  1  ReadPara(4.23 s) Resolving shared entities.
  0  ParallelComm(4.23 s) Resolving shared entities.
  1  ParallelComm(4.23 s) Resolving shared entities.
  0  ParallelComm(7.67 s) Found skin, now resolving.
  1  ParallelComm(7.69 s) Found skin, now resolving.
  0  ParallelComm(9.41 s)  shared verts size 3304 
  1  ParallelComm(9.41 s)  shared verts size 3304 
  1  ParallelComm resolve shared ents:  proc verts  Vertex 1-86399,
  0  ParallelComm resolve shared ents:  proc verts  Vertex 1-89027,
  1  ParallelComm(10.30 s) Iface:  0
  0  ParallelComm(10.33 s) Iface:  1
  0  ParallelComm(10.35 s) Entering exchange_ghost_cells with num_layers = 0
  0  ParallelComm(10.35 s) Irecv, 0<-1, buffer ptr = 0x35596d0, tag=1, size=1024, incoming1=1
  0  ParallelComm(10.35 s) allsent ents compactness (size) = 0.064197 (15951)
  0  ParallelComm(10.35 s) Sent ents compactness (size) = 0.064197 (15951)
  0  ParallelComm(10.35 s) estimate buffer size for 15951 entities: 433452 
  0  ParallelComm(10.39 s) after some pack int  446648 
  0  ParallelComm(10.40 s) Packed 9627 ents of type Edge
  0  ParallelComm(10.40 s) after some pack int  600692 
  0  ParallelComm(10.41 s) Packed 6324 ents of type Tri
  0  ParallelComm(10.41 s) Irecv, 0<-1, buffer ptr = 0x7ffcb34e6b68, tag=0, size=4, incoming1=2
  0  ParallelComm(10.41 s) Isend, 0->1, buffer ptr = 0xa315930, tag=1, size=1024
  0  ParallelComm(10.41 s) Waitany, p=0, , recv_ent_reqs= 0x2c00000e 0x2c000000 0x2c00000f
  1  ParallelComm(10.42 s) Entering exchange_ghost_cells with num_layers = 0
  1  ParallelComm(10.42 s) Irecv, 1<-0, buffer ptr = 0x8d6b570, tag=1, size=1024, incoming1=1
  1  ParallelComm(10.43 s) allsent ents compactness (size) = 0.063412 (15959)
  1  ParallelComm(10.43 s) Sent ents compactness (size) = 0.063412 (15959)
  1  ParallelComm(10.43 s) estimate buffer size for 15959 entities: 433676 
  1  ParallelComm(10.46 s) after some pack int  446872 
  1  ParallelComm(10.47 s) Packed 9631 ents of type Edge
  1  ParallelComm(10.47 s) after some pack int  600980 
  1  ParallelComm(10.48 s) Packed 6328 ents of type Tri
  1  ParallelComm(10.48 s) Irecv, 1<-0, buffer ptr = 0x7ffd1cfefee8, tag=0, size=4, incoming1=2
  1  ParallelComm(10.48 s) Isend, 1->0, buffer ptr = 0x98925e0, tag=1, size=1024
  1  ParallelComm(10.48 s) Waitany, p=1, , recv_ent_reqs= 0x2c00000e 0x2c000000 0x2c00000f
[0] ERROR: no progress observed in any process for over 2:09 minutes, aborting application
[0] WARNING: starting emergency trace file writing

[0] ERROR: GLOBAL:DEADLOCK:HARD: fatal error
[0] ERROR:    Application aborted because no progress was observed for over 2:09 minutes,
[0] ERROR:    check for real deadlock (cycle of processes waiting for data) or
[0] ERROR:    potential deadlock (processes sending data to each other and getting blocked
[0] ERROR:    because the MPI might wait for the corresponding receive).
[0] ERROR:    [0] no progress observed for over 2:09 minutes, process is currently in MPI call:
[0] ERROR:       MPI_Waitany(count=3, *array_of_requests=0x507a710, *index=0x7ffcb34e6aec, *status=0x7ffcb34e6b88)
[0] ERROR:       _ZN4moab12ParallelComm20exchange_ghost_cellsEiiiibbPm (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ParallelComm.cpp:5624)
[0] ERROR:       _ZN4moab12ParallelComm19resolve_shared_entsEmRNS_5RangeEiiPS1_PKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ParallelComm.cpp:4050)
[0] ERROR:       _ZN4moab12ParallelComm19resolve_shared_entsEmiiPKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ParallelComm.cpp:3827)
[0] ERROR:       _ZN4moab12ReadParallel9load_fileEPPKciPKmiRSsRSt6vectorIiSaIiEEbbSA_RKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoEibiiiiii (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ReadParallel.cpp:556)
[0] ERROR:       _ZN4moab12ReadParallel9load_fileEPPKciPKmRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ReadParallel.cpp:256)
[0] ERROR:       _ZN4moab12ReadParallel9load_fileEPKcPKmRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/../src/parallel/ReadParallel.hpp:119)
[0] ERROR:       _ZN4moab4Core9load_fileEPKcPKmS2_S2_PKii (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/Core.cpp:503)
[0] ERROR:       main (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/a.out)
[0] ERROR:       __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[0] ERROR:       (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/a.out)
[0] ERROR:    [1] no progress observed for over 2:09 minutes, process is currently in MPI call:
[0] ERROR:       MPI_Waitany(count=3, *array_of_requests=0x3b71720, *index=0x7ffd1cfefe6c, *status=0x7ffd1cfeff08)
[0] ERROR:       _ZN4moab12ParallelComm20exchange_ghost_cellsEiiiibbPm (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ParallelComm.cpp:5624)
[0] ERROR:       _ZN4moab12ParallelComm19resolve_shared_entsEmRNS_5RangeEiiPS1_PKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ParallelComm.cpp:4050)
[0] ERROR:       _ZN4moab12ParallelComm19resolve_shared_entsEmiiPKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ParallelComm.cpp:3827)
[0] ERROR:       _ZN4moab12ReadParallel9load_fileEPPKciPKmiRSsRSt6vectorIiSaIiEEbbSA_RKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoEibiiiiii (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ReadParallel.cpp:556)
[0] ERROR:       _ZN4moab12ReadParallel9load_fileEPPKciPKmRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/parallel/ReadParallel.cpp:256)
[0] ERROR:       _ZN4moab12ReadParallel9load_fileEPKcPKmRKNS_11FileOptionsEPKNS_11ReaderIface10SubsetListEPKPNS_7TagInfoE (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/../src/parallel/ReadParallel.hpp:119)
[0] ERROR:       _ZN4moab4Core9load_fileEPKcPKmS2_S2_PKii (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/moab/src/Core.cpp:503)
[0] ERROR:       main (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/a.out)
[0] ERROR:       __libc_start_main (/lib/x86_64-linux-gnu/libc-2.19.so)
[0] ERROR:       (/home/alex/Dropbox/tmp/18_10_28_moab_reproducer/a.out)
[0] INFO: Writing tracefile a.out.stf in /home/alex/Dropbox/tmp/18_10_28_moab_reproducer
[0] WARNING: message logging: Intel(R) Trace Collector could not find pairs for 2 (14.3%) sends and 0 (0.0%) receives

[0] INFO: GLOBAL:DEADLOCK:HARD: found 1 time (1 error + 0 warnings), 0 reports were suppressed
[0] INFO: Found 1 problem (1 error + 0 warnings), 0 reports were suppressed.

alex@deb:~/Dropbox/tmp/18_10_28_moab_reproducer$ 

Comments (16)

  1. Vijay M

    Alex, I apologize about the delay. I'll take a look at the attached file and look at what is going on. Quite often the deadlock could happen when there are empty parts in the mesh. This should be easy to check and I'll get back to you once we understand the cause of the issue.

  2. Vijay M

    Alex, I just built your reproducer.cpp file (linked against a debug build of master) and ran it on two processes with the mesh in the distribution. It finished fine without any deadlocks on my macbook. I don't see anything obviously wrong in your configuration workflow either. Can you try to run on a different machine to verify if this still fails ? If it doesn't, we can look more closely at the configuration on the machine specifically.

  3. Alex reporter

    Thanks, at least it is good to hear that nothing obvious is off... ;-)

    What tools did you use to build MOAB? I am only seeing the deadlock when using the Intel tools, not when using gcc+OpenMPI. I am not seeing anything obviously wrong in MOAB either: it could well be a bug in MPI_Waitany within Intel MPI, and I don't think that function is used much in other codes. If that's the case, writing a workaround wouldn't be too hard (see the sketch below).
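
    A rough sketch of such a workaround (illustrative only, not a tested patch against ParallelComm.cpp) would be to poll with MPI_Testany instead of blocking in MPI_Waitany; the helper name below is hypothetical:

    #include <mpi.h>

    // Hypothetical helper: busy-poll with MPI_Testany instead of blocking in
    // MPI_Waitany, assuming the hang is inside the blocking wait itself.
    static int waitany_by_polling(int count, MPI_Request requests[],
                                  int* index, MPI_Status* status) {
      int flag = 0;
      while (!flag) {
        int err = MPI_Testany(count, requests, index, &flag, status);
        if (err != MPI_SUCCESS)
          return err;  // propagate MPI errors to the caller
      }
      return MPI_SUCCESS;
    }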

  4. Vijay M

    Alex, I tested with gcc+mpich and clang+mpich and didn't see any issues. I'll test with Intel today and see if I can replicate the issue.

    Have you had success replicating the issue on a different machine ?

  5. Alex reporter

    Yes and no: I tested reproducibility on three machines with Intel tools before submitting the issue, but didn't do any further tests.

  6. Vijay M

    Ok, thanks for confirming that it was reproducible on other machines with Intel. I read that in your original description but forgot about it.

    I'll check this on our cluster now and see if I can understand the cause of this issue better. @iulian07 I only see 2019 Intel compilers on MCS. Do we have Parallel Studio installed somewhere?

  7. Alex reporter

    Ok, cool.

    Let me know if you can't reproduce the issue. In that case, I can spin up a cloud VM for you and give you SSH access (via e-mail).

  8. Iulian Grindeanu

    I rebuilt MOAB with the Intel compilers in fathom's home folder, using dependencies built by buildbot; I configured MOAB on gnep in /homes/fathom/intel/build with a configuration similar to the one we use on buildbot here: http://gnep.mcs.anl.gov:8010/builders/moab-download-intel/builds/921

    gnep~/intel/moab/examples/basic/18_10_28_moab_reproducer> mpiexec -np 2 ../reproducer 
    welcome to main from rank 0 of 2 ranks
    initializing MOAB on rank 0
    MOAB api: MOAB API version 1.01 impl: MOAB 5.0.1
    loading mesh on rank: 0
    welcome to main from rank 1 of 2 ranks
    initializing MOAB on rank 1
    MOAB api: MOAB API version 1.01 impl: MOAB 5.0.1
    loading mesh on rank: 1
      0  ReadPara(0.00 s) Setting up...
      0  ReadParaRead mode is READ_PART
      0  ReadPara(0.00 s) Reading file: "meshes/tet4_300_2.h5m"
      1  ReadPara(0.00 s) Setting up...
      1  ReadParaRead mode is READ_PART
      1  ReadPara(0.00 s) Reading file: "meshes/tet4_300_2.h5m"
      0  1  ReadPara(3.65 s) Resolving shared entities.
      1    ParallelComm(3.65 s) Resolving shared entities.
    ReadPara(3.65 s) Resolving shared entities.
      0  ParallelComm(3.65 s) Resolving shared entities.
      0  ParallelComm(4.99 s) Found skin, now resolving.
      1  ParallelComm(4.99 s) Found skin, now resolving.
      0  ParallelComm  1  ParallelComm(5.10 s)  shared verts size 3304 
    (5.10 s)  shared verts size 3304 
      1  ParallelComm resolve shared ents:  proc verts  Vertex 1-86399,
      0  ParallelComm resolve shared ents:  proc verts  Vertex 1-89027,
      1  ParallelComm(5.38 s) Iface:  0
      1  ParallelComm(5.39 s) Entering exchange_ghost_cells with num_layers = 0
      1  ParallelComm(5.39 s) Irecv, 1<-0, buffer ptr = 0x92234e0, tag=1, size=1024, incoming1=1
      1  ParallelComm(5.40 s) allsent ents compactness (size) = 0.063412 (15959)
      1  ParallelComm(5.40 s) Sent ents compactness (size) = 0.063412 (15959)
      1  ParallelComm(5.40 s) estimate buffer size for 15959 entities: 433676 
      1  ParallelComm(5.41 s) after some pack int  446872 
      1  ParallelComm(5.42 s) Packed 9631 ents of type Edge
      1  ParallelComm(5.42 s) after some pack int  600980 
      1  ParallelComm(5.43 s) Packed 6328 ents of type Tri
      1  ParallelComm(5.43 s) Irecv, 1<-0, buffer ptr = 0x7ffce873637c, tag=0, size=4, incoming1=2
      1  ParallelComm(5.43 s) Isend, 1->0, buffer ptr = 0x9e50390, tag=1, size=1024
      1  ParallelComm(5.43 s) Waitany, p=1, , recv_ent_reqs= 0xffffffffac000000 0x2c000000 0xffffffffac000001
      0  ParallelComm(5.44 s) Iface:  1
      0  ParallelComm(5.45 s) Entering exchange_ghost_cells with num_layers = 0
      0  ParallelComm(5.45 s) Irecv, 0<-1, buffer ptr = 0xbfb5290, tag=1, size=1024, incoming1=1
      0  ParallelComm(5.45 s) allsent ents compactness (size) = 0.064197 (15951)
      0  ParallelComm(5.45 s) Sent ents compactness (size) = 0.064197 (15951)
      0  ParallelComm(5.45 s) estimate buffer size for 15951 entities: 433452 
      0  ParallelComm(5.47 s) after some pack int  446648 
      0  ParallelComm(5.47 s) Packed 9627 ents of type Edge
      0  ParallelComm(5.48 s) after some pack int  600692 
      0  ParallelComm(5.48 s) Packed 6324 ents of type Tri
      0  ParallelComm(5.48 s) Irecv, 0<-1, buffer ptr = 0x7fff49b84b7c, tag=0, size=4, incoming1=2
      0  ParallelComm(5.48 s) Isend, 0->1, buffer ptr = 0xc9f7ab0, tag=1, size=1024
      0  1  ParallelComm(5.48 s) Waitany, p=0, , recv_ent_reqs= 0xffffffffac000002 0x2c000000 0xffffffffac000004
      0  ParallelComm  ParallelComm(5.48 s) Received from 0, count = 1024, tag = 1(5.48 s) Received from 1, count = 1024, tag = 1
      1  ParallelComm(5.48 s) Irecv, 1<-0, buffer ptr = 0x9d927b0, tag=2, size=751448, incoming1=2
      1  ParallelComm(5.48 s) Isend, 1->0, buffer ptr = 0x9d923b0, tag=0, size=4
      1  ParallelComm(5.48 s) Waitany, p=1, , recv_ent_reqs= 0x2c000000 0xffffffffac000000 0xffffffffac000001
    
      0  ParallelComm(5.48 s) Irecv, 0<-1, buffer ptr = 0xc93a060, tag=2, size=751832, incoming1=2
      0  ParallelComm(5.48 s) Isend, 0->1, buffer ptr = 0xc939c60, tag=0, size=4
      0  ParallelComm(5.48 s) Waitany, p=0, , recv_ent_reqs= 0x2c000000 0xffffffffac000002 0xffffffffac000004
      1  0  ParallelComm(5.48 s) Received from 1, count = 4, tag = 0
      0  ParallelComm(5.48 s) Isend, 0->1, buffer ptr = 0xc9f7eb0, tag=2, size=751448
      0  ParallelComm(5.48 s) Waitany, p=0, , recv_ent_reqs= 0x2c000000 0xffffffffac000002 0x2c000000
      ParallelComm(5.48 s) Received from 0, count = 4, tag = 0
      1  ParallelComm(5.48 s) Isend, 1->0, buffer ptr = 0x9e50790, tag=2, size=751832
      1  ParallelComm(5.48 s) Waitany, p=1, , recv_ent_reqs= 0x2c000000 0xffffffffac000000 0x2c000000
      0  1  ParallelComm(5.49 s) Received from 0, count = 751448, tag = 2
      ParallelComm(5.49 s) Received from 1, count = 751832, tag = 2
      0  ParallelComm(5.54 s) Total number of shared entities = 19251.
      0  ParallelComm(5.54 s) Exiting exchange_ghost_cells
      1  ParallelComm(5.55 s) Total number of shared entities = 19251.
      1  ParallelComm(5.55 s) Exiting exchange_ghost_cells
      0  ReadPara(5.56 s) Resolving shared sets.
      1  ReadPara(5.59 s) Resolving shared sets.
    loaded mesh on rank: 0
    loaded mesh on rank: 1
    finalizing from rank finalizing from rank 1 of 2 ranks
    0 of 2 ranks
    

    It seems to run fine, so I cannot reproduce the issue with our Intel compiler (version 17):

    gnep~/intel/moab/examples/basic/18_10_28_moab_reproducer> mpicxx --version
    icpc (ICC) 17.0.0 20160721

  9. Alex reporter

    I am assuming this was with Intel MPI (not only the Intel compilers)?

    If yes: can you send me a public ssh key to anbreuer AT uscd.edu? I'll then fire up a VM that reproduces the issue.

  10. Iulian Grindeanu

    Is make check working fine for you? In your test you are just loading a mesh file partitioned into 2 parts; does it freeze waiting for some requests?

  11. Iulian Grindeanu

    It looks like the message I sent to anbreuer AT uscd.edu bounced; can you check the address? Maybe it is ucsd? Is it UC San Diego?

  12. Vijay M

    @iulian07 Were you able to resolve this? Or is the issue still specific to the Intel v19 cluster edition tools?

  13. Alex reporter

    No, this is not resolved, but my test didn't deadlock when using impi 18. For the time being, I am using OpenMPI.
