Errors on multiple nodes with Intel MPI (MPI/Intel/IMPI/4.1.3.048 and MPI/Intel/IMPI/5.0.2.044)

Issue #38 invalid
Xiaohong Zheng created an issue

When I import mpi4py on two or more nodes with Intel MPI, it always produces the following errors. On a single node, it runs well. Part of the error messages are:

cn9829:UCM:6a9f:a7699aa0: 152 us(152 us): open_hca: ibv_get_device_list() failed
cn9829:UCM:6a9e:77ae5aa0: 151 us(151 us): open_hca: ibv_get_device_list() failed
cn9829:UCM:6a9f:a7699aa0: 218 us(218 us): open_hca: ibv_get_device_list() failed
cn9834:CMA:5bcc:c766daa0: 248 us(248 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
cn9834:CMA:5bcb:a52eeaa0: 207 us(207 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
cn9829:UCM:6a9e:77ae5aa0: 151 us(151 us): open_hca: ibv_get_device_list() failed
cn9829:UCM:6a9f:a7699aa0: 165 us(165 us): open_hca: ibv_get_device_list() failed
cn9834:SCM:5bcc:c766daa0: 212 us(212 us): open_hca: ibv_get_device_list() failed
cn9834:SCM:5bcb:a52eeaa0: 156 us(156 us): open_hca: ibv_get_device_list() failed
cn9829:CMA:6a9e:77ae5aa0: 224 us(224 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
cn9834:SCM:5bcc:c766daa0: 221 us(221 us): open_hca: ibv_get_device_list() failed
cn9829:CMA:6a9f:a7699aa0: 193 us(193 us): open_hca: getaddr_netdev ERROR:No such device. Is eth2 configured?
cn9834:SCM:5bcb:a52eeaa0: 157 us(157 us): open_hca: ibv_get_device_list() failed
DAT Registry: sysconfdir, bad filename - /etc/rdma/compat-dapl/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/rdma/compat-dapl/dat.conf, retry default at /etc/dat.conf
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
cn9829:CMA:6a9e:77ae5aa0: 239 us(239 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
cn9829:CMA:6a9f:a7699aa0: 215 us(215 us): open_hca: getaddr_netdev ERROR:No such device. Is eth3 configured?
cn9829:SCM:6a9e:77ae5aa0: 186 us(186 us): open_hca: ibv_get_device_list() failed
cn9829:SCM:6a9f:a7699aa0: 162 us(162 us): open_hca: ibv_get_device_list() failed
cn9829:SCM:6a9e:77ae5aa0: 189 us(189 us): open_hca: ibv_get_device_list() failed
cn9829:SCM:6a9f:a7699aa0: 177 us(177 us): open_hca: ibv_get_device_list() failed
DAT Registry: sysconfdir, bad filename - /etc/rdma/compat-dapl/dat.conf, retry default at /etc/dat.conf
DAT Registry: sysconfdir, bad filename - /etc/rdma/compat-dapl/dat.conf, retry default at /etc/dat.conf
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list
librdmacm: Warning: couldn't read ABI version.
librdmacm: Warning: assuming: 4
librdmacm: Fatal: unable to get RDMA device list

This is the run command: mpirun -n 4 --ppn 2 -f hostfile python c.py

The file c.py only has the following lines:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

Could you have a look at it, please?
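
For reference, a slightly extended version of the test script can show which node each rank actually lands on before the fabric errors appear. This is only a minimal sketch; the MPI.Get_processor_name() call is a standard mpi4py function but is not part of the original c.py:

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
# Print the host each rank runs on, to confirm processes start on both nodes
print("rank %d of %d on %s" % (rank, size, MPI.Get_processor_name()))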

Comments (10)

  1. Lisandro Dalcin

    Have you tried with a pure C example?

    What's the output of "ldd /path/to/site-packages/mpi4py/MPI.so" ?

  2. Xiaohong Zheng reporter

    I have not tried a pure C example.

    There is no /path/to/site-packages/mpi4py/MPI.so. However, I found the following two files under mpi4py/: MPI.cpython-35m-x86_64-linux-gnu.so and MPI.pxd
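
    The exact path of the compiled extension for the ldd check can also be printed directly (a minimal sketch using only standard Python/mpi4py attributes):

    # Print the full path of the compiled mpi4py extension module, suitable for ldd
    from mpi4py import MPI
    print(MPI.__file__)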

  3. Xiaohong Zheng reporter
    Here is the output of ldd on MPI.cpython-35m-x86_64-linux-gnu.so:

        linux-vdso.so.1 =>  (0x00007fff97dff000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00002ab8bbddb000)
        libpython3.5m.so.1.0 => /HOME/mcgill_hongguo_1/WORKSPACE/xhzheng/anaconda3-impi2/lib/libpython3.5m.so.1.0 (0x00002ab8bbfdf000)
        libmpigf.so.4 => /WORK/app/osenv/ln1/impi/lib/libmpigf.so.4 (0x00002ab8bc4ce000)
        libmpi.so.4 => /WORK/app/osenv/ln1/impi/lib/libmpi.so.4 (0x00002ab8bc6ff000)
        librt.so.1 => /lib64/librt.so.1 (0x00002ab8bcd69000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00002ab8bcf71000)
        libc.so.6 => /lib64/libc.so.6 (0x00002ab8bd18f000)
        /lib64/ld-linux-x86-64.so.2 (0x0000003223400000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00002ab8bd523000)
        libm.so.6 => /lib64/libm.so.6 (0x00002ab8bd726000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00002ab8bd9ab000)

    Thank you very much.

  4. Lisandro Dalcin

    That looks OK. I'm really not sure what's wrong. This smells like a missing library in your system, or some other related issue. Let's try something else. At the very beginning of your c.py file (right above the "from mpi4py import MPI" line), add the following lines:

    from mpi4py import dl
    dl.dlopen("/WORK/app/osenv/ln1/impi/lib/libmpi.so.4", dl.RTLD_NOW|dl.RTLD_GLOBAL)
    assert not dl.dlerror()

    If that still does not work, please try to compile and run some C code using MPI. I still think that this issue is not in mpi4py but in your system.

  5. Xiaohong Zheng reporter

    I have added the following three lines before "from mpi4py import MPI":

    from mpi4py import dl
    dl.dlopen("/WORK/app/osenv/ln1/impi/lib/libmpi.so.4", dl.RTLD_NOW|dl.RTLD_GLOBAL)
    assert not dl.dlerror()

    I still got the same problems. It is very strange, since on a single node there is no problem; we see it only when two or more nodes are used. I will try a C or Fortran code using MPI and see what happens.

    Thank you.

  6. Lisandro Dalcin

    Well, my bet is that when you try to run on multiple nodes, your MPI implementation figures that out and tries to use librdma for remote direct memory access.
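
    A test that forces actual point-to-point traffic between the ranks would exercise the inter-node fabric more directly than MPI_Init alone. This is only a minimal sketch using the standard mpi4py sendrecv call; it is not part of the original report:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()
    # Pass each rank's number to its right-hand neighbour in a ring, so ranks
    # on different nodes are forced to communicate with each other.
    token = comm.sendrecv(rank, dest=(rank + 1) % size, source=(rank - 1) % size)
    print("rank %d received token from rank %d" % (rank, token))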

  7. Xiaohong Zheng reporter

    Thank you for the hints.

    I have just tried a very simple Fortran code:

    program main
      include 'mpif.h'
      integer ierr, node, Nodes
      call MPI_INIT( ierr )
      call MPI_COMM_RANK( MPI_COMM_WORLD, node, ierr )
      call MPI_COMM_SIZE( MPI_COMM_WORLD, Nodes, ierr )
      print *, 'I am', node, 'of', Nodes
      call MPI_FINALIZE( ierr )
    end

    I still get the same problems. Your conclusion is right: it is not related to mpi4py, but to the system. It is a very large supercomputer. I do not know whether they use librdma for remote direct memory access. I will contact the technical support.

    Thank you very much again.
