MPI/msgpickle.pxi EOFError

Issue #39 closed
Theo Steininger created an issue

Hi,

got some problems with the sendrecv method of COMM_WORLD.

Running the following code with 32 processes on 2 nodes (16 processes each)

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.size
rank = comm.rank

shape = (20000, 2000)
data = np.random.randint(1000, size=np.prod(shape)).reshape(shape)

partner = size - 1 - rank

for i in xrange(10):
    data2 = comm.sendrecv(sendobj=data,
                          dest=partner,
                          source=partner)

produces the following error:

Traceback (most recent call last):
  File "mpi_sendrecv.py", line 16, in <module>
    source=partner)
  File "MPI/Comm.pyx", line 1200, in mpi4py.MPI.Comm.sendrecv (src/mpi4py.MPI.c:107121)
  File "MPI/msgpickle.pxi", line 377, in mpi4py.MPI.PyMPI_sendrecv (src/mpi4py.MPI.c:43994)
  File "MPI/msgpickle.pxi", line 292, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:43053)
  File "MPI/msgpickle.pxi", line 143, in mpi4py.MPI.Pickle.load (src/mpi4py.MPI.c:41248)
EOFError

The error occurs if:

  • the array size is large enough
  • the number of repetitions (in the for loop) is large enough
  • there is internode communication involved (32 processes on 1 node do not produce the error)
  • IntelMPI is used for compiling mpi4py

I think this issue is identical to what has been discussed here https://groups.google.com/forum/#!topic/mpi4py/OJG5eZ2f-Pg and here https://github.com/dfm/emcee/issues/151

Citing the most important part from the second link:

If you read the discussion, it turns out that the problem is that when not using loadbalancing, the default strategy is to send out ALL of the tasks to the workers asynchronously and then wait for them to respond. This relies on MPI's own buffering to handle the build up of any messages. When the number of tasks is much greater than the number of workers, MPI is more likely to drop messages and cause this approach to fail.
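
Roughly, the dispatch pattern described there looks like the following. This is only a hypothetical sketch (not the actual emcee code; the task count, tags, and round-robin assignment are made up for illustration): the master posts nonblocking sends for all tasks up front and only afterwards collects results, so every in-flight message has to be held in MPI's internal buffers.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size
NTASKS = 10000  # many more tasks than workers; assumes size >= 2

if rank == 0:
    # master: fire off ALL task messages at once, then gather the results
    reqs = [comm.isend(i, dest=1 + i % (size - 1), tag=i) for i in range(NTASKS)]
    results = [comm.recv(source=MPI.ANY_SOURCE) for _ in range(NTASKS)]
    MPI.Request.Waitall(reqs)
else:
    # worker: receive its share of tasks (in tag order) and echo a result back
    for i in range(rank - 1, NTASKS, size - 1):
        task = comm.recv(source=0, tag=i)
        comm.send(task * task, dest=0)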

Any ideas how to overcome those buffer issues? Thanks!

Comments (10)

  1. Lisandro Dalcin

    It seems this is not an issue in mpi4py, but in the backend MPI implementation. Unfortunately, there is nothing mpi4py can do to alleviate/work around the issue. This should be reported upstream to the Intel MPI developers.

    PS: Regarding ideas on how to overcome these buffer issues, this is not the right place to ask for help; please submit your question to our mailing list.

  2. Alexey Krukow

    Well, it seems that it is not an issue in the Intel MPI implementation either, because this FORTRAN code works just fine on the same computing cluster.

    program pasr
      implicit none
      include 'mpif.h'
    
      integer, allocatable, dimension(:,:) :: ia_glbl_send_2d, ia_glbl_recv_2d
      integer(KIND=8) :: i8_cnt
      integer :: i_alloc
      integer :: i_rank, i_size, i_error
      integer :: i_prtnr
      integer :: a,b,i
    
      call mpi_init(i_error)
      call mpi_comm_rank(mpi_comm_world, i_rank, i_error)
      call mpi_comm_size(mpi_comm_world, i_size, i_error)
    
      a=20000
      b=2000
      allocate(ia_glbl_send_2d(a,b),stat=i_alloc) 
      if (i_alloc.ne.0)then
        write(*,*) "Can not allocate memory"
      endif
    
      allocate(ia_glbl_recv_2d(a,b),stat=i_alloc) 
      if (i_alloc.ne.0)then
        write(*,*) "Can not allocate memory"
      endif
    
      ia_glbl_send_2d=i_rank
      ia_glbl_recv_2d=0
    
      i_prtnr=i_size-1-i_rank
      i8_cnt=a*b
    
      do i=1,10
        call mpi_sendrecv(ia_glbl_send_2d,i8_cnt,MPI_INTEGER,i_prtnr,1,ia_glbl_recv_2d,i8_cnt,MPI_INTEGER,i_prtnr,1,mpi_comm_world,MPI_STATUS_IGNORE,i_error)
      enddo 
    
      call mpi_finalize(i_error)
    
    end program pasr
    

    I increased the array size up to (200000, 2000) and made 100 iterations of the loop. I used up to 64 MPI tasks distributed over 4 different nodes, and the Intel MPI sendrecv works excellently.

    Cheers,

    Alexey

  3. Lisandro Dalcin

    @ak-80 Your code is not checking that the received message contents match the sent ones. In the mpi4py example, the communication seems to succeed just fine. However, the received messages are likely corrupted; that is why Python's pickle module cannot reconstruct the Python object.
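
    Just to illustrate, a content check along the lines of the Fortran example could look like this in the Python reproducer (a sketch, not code from this thread; the int64 dtype is assumed): fill the send buffer with the sender's rank, so the receiver knows exactly what to expect.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.rank
    partner = comm.size - 1 - rank

    # deterministic payload: every element equals the sender's rank
    data = np.full((20000, 2000), rank, dtype=np.int64)

    for i in range(10):
        data2 = comm.sendrecv(sendobj=data, dest=partner, source=partner)
        # the receiver knows the partner's rank, so it can verify the payload
        assert np.all(data2 == partner), "received message contents are corrupted"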

    Also, note that mpi4py does not implement MPI.Comm.sendrecv() using MPI_Sendrecv() (because it is not possible to do it that way). The actual implementation (for an MPI-3 implementation with matched probes) is:

    1. Post MPI_Isend() to destination.
    2. Do an MPI_Mprobe() from source.
    3. Allocate receive buffer large enough (use MPI_Get_count() to get buf size).
    4. Do the MPI_Mrecv() with the message handle matched in (2).
    5. Do MPI_Wait() to complete the send in (1).

    It has to be done that way because the receiving side does not know the message size in advance, so mpi4py has to probe for the incoming message and query its size in order to allocate buffer space.

    The actual code should not be that hard to follow:

    https://bitbucket.org/mpi4py/mpi4py/src/master/src/MPI/msgpickle.pxi#msgpickle.pxi-372

    https://bitbucket.org/mpi4py/mpi4py/src/master/src/MPI/msgpickle.pxi#msgpickle.pxi-248
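
    For orientation, here is a rough Python-level sketch of those five steps, written against mpi4py's pickle-based isend/mprobe/Message.recv API (the actual implementation is the Cython code linked above; this only illustrates the control flow):

    from mpi4py import MPI

    def sendrecv_sketch(comm, sendobj, dest, source, tag=0):
        req = comm.isend(sendobj, dest=dest, tag=tag)  # (1) nonblocking send
        msg = comm.mprobe(source=source, tag=tag)      # (2) matched probe; size is now known
        recvobj = msg.recv()                           # (3)+(4) allocate and receive
        req.wait()                                     # (5) complete the send
        return recvobj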

  4. Theo Steininger reporter

    Hi @dalcinl !

    Yes, indeed; I forgot to post the solution. The problem was related to IntelMPI and how it uses the Infiniband hardware. IntelMPI supports several different communication 'network fabrics' for inter- and intranode communication (https://software.intel.com/en-us/node/528821). The odd thing was that by switching to 'tcp' for internode communication, everything seemed to work fine. I finally found that setting I_MPI_OFA_TRANSLATION_CACHE to off via export I_MPI_OFA_TRANSLATION_CACHE=off solved the issue (https://software.intel.com/en-us/node/528827). The documentation says "The cache substantially increases performance, but may lead to correctness issues in certain situations. See product Release Notes for further details." For me, no loss of performance was measurable, and there is also no such note in any of the release notes... :/

    Thanks for your support!

    P.S.: I was not able to fix this issue using the DAPL fabric...

  5. Lisandro Dalcin

    @saurabhjha1 Not that I know. I would say 1) ask Cray support, and 2) send a message with a full description of your issue to our mailing list mpi4py@googlegroups.com, maybe some user can provide some insight.
