MPI/msgpickle.pxi EOFError
Hi,
I'm having some problems with the sendrecv method of COMM_WORLD.
Running the following code with 32 processes on 2 nodes (16 processes each):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.size
rank = comm.rank

shape = (20000, 2000)
data = np.random.randint(1000, size=np.prod(shape)).reshape(shape)
partner = size - 1 - rank

for i in xrange(10):
    data2 = comm.sendrecv(sendobj=data,
                          dest=partner,
                          source=partner)
```
produces the following error:
```
Traceback (most recent call last):
  File "mpi_sendrecv.py", line 16, in <module>
    source=partner)
  File "MPI/Comm.pyx", line 1200, in mpi4py.MPI.Comm.sendrecv (src/mpi4py.MPI.c:107121)
  File "MPI/msgpickle.pxi", line 377, in mpi4py.MPI.PyMPI_sendrecv (src/mpi4py.MPI.c:43994)
  File "MPI/msgpickle.pxi", line 292, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:43053)
  File "MPI/msgpickle.pxi", line 143, in mpi4py.MPI.Pickle.load (src/mpi4py.MPI.c:41248)
EOFError
```
The error occurs if:
- the array size is large enough
- the number of repetitions (in the for loop) is large enough
- there is internode communication involved (32 processes on 1 node do not produce the error)
- mpi4py is compiled against Intel MPI
I think this issue is identical to the one discussed here (https://groups.google.com/forum/#!topic/mpi4py/OJG5eZ2f-Pg) and here (https://github.com/dfm/emcee/issues/151).
Quoting the most important part of the second link:
> If you read the discussion, it turns out that the problem is that when not using loadbalancing, the default strategy is to send out ALL of the tasks to the workers asynchronously and then wait for them to respond. This relies on MPI's own buffering to handle the build up of any messages. When the number of tasks is much greater than the number of workers, MPI is more likely to drop messages and cause this approach to fail.
Any ideas how to overcome those buffer issues? Thanks!
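One commonly suggested mitigation (a sketch, not from the original thread; the helper names and the chunk size are illustrative assumptions) is to exchange the array in smaller pieces, so each individual message stays well below the transport's buffering limits:

```python
import numpy as np

def iter_chunks(arr, max_elems=500_000):
    """Yield consecutive 1-D slices of `arr` with at most max_elems elements each."""
    flat = arr.reshape(-1)
    for start in range(0, flat.size, max_elems):
        yield flat[start:start + max_elems]

def chunked_exchange(data, max_elems=500_000):
    """Exchange `data` with the partner rank piece by piece via sendrecv."""
    from mpi4py import MPI  # imported here so the pure helper above stays testable
    comm = MPI.COMM_WORLD
    partner = comm.size - 1 - comm.rank
    # Each sendrecv now moves a bounded amount of data instead of ~320 MB at once.
    received = [comm.sendrecv(chunk, dest=partner, source=partner)
                for chunk in iter_chunks(data, max_elems)]
    return np.concatenate(received).reshape(data.shape)
```

A script calling `chunked_exchange` would be launched as usual, e.g. `mpiexec -n 32 python script.py`. Whether this avoids the fabric-level corruption is untested here; it only reduces per-message size.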
Comments (10)
- marked as minor
Well, it seems that it is not an issue in the Intel MPI implementation either, because this Fortran code works just fine on the same computing cluster.
```fortran
program pasr
  implicit none
  include 'mpif.h'
  integer, allocatable, dimension(:,:) :: ia_glbl_send_2d, ia_glbl_recv_2d
  integer(KIND=8) :: i8_cnt
  integer :: i_alloc
  integer :: i_rank, i_size, i_error
  integer :: i_prtnr
  integer :: a, b, i

  call mpi_init(i_error)
  call mpi_comm_rank(mpi_comm_world, i_rank, i_error)
  call mpi_comm_size(mpi_comm_world, i_size, i_error)

  a = 20000
  b = 2000
  allocate(ia_glbl_send_2d(a,b), stat=i_alloc)
  if (i_alloc .ne. 0) then
    write(*,*) "Can not allocate memory"
  endif
  allocate(ia_glbl_recv_2d(a,b), stat=i_alloc)
  if (i_alloc .ne. 0) then
    write(*,*) "Can not allocate memory"
  endif

  ia_glbl_send_2d = i_rank
  ia_glbl_recv_2d = 0
  i_prtnr = i_size - 1 - i_rank
  i8_cnt = a * b

  do i = 1, 10
    call mpi_sendrecv(ia_glbl_send_2d, i8_cnt, MPI_INTEGER, i_prtnr, 1, &
                      ia_glbl_recv_2d, i8_cnt, MPI_INTEGER, i_prtnr, 1, &
                      mpi_comm_world, MPI_STATUS_IGNORE, i_error)
  enddo

  call mpi_finalize(i_error)
end program pasr
```
I increased the array size up to 200000 × 2000, ran 100 iterations of the loop, and used up to 64 MPI tasks distributed over 4 different nodes; the Intel MPI sendrecv works excellently.
Cheers,
Alexey
@ak-80 Your code does not check that the received message contents match what was sent. In the mpi4py example, the communication itself seems to succeed just fine; however, the received messages are likely corrupted, which is why Python's pickle module cannot reconstruct the Python object.
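To make such a test actually detect corruption, each rank can send a buffer filled with its own rank number and assert that everything it receives equals the partner's rank. This is a sketch; the function names, shape, and repetition count are illustrative, and the check itself is pure NumPy:

```python
import numpy as np

def exchange_ok(received, partner_rank):
    """True iff every element of the received buffer equals the partner's rank."""
    return bool(np.all(np.asarray(received) == partner_rank))

def run_check(shape=(20000, 2000), reps=10):
    """Each rank sends a buffer filled with its own rank and validates the reply."""
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    partner = comm.size - 1 - comm.rank
    sendbuf = np.full(shape, comm.rank, dtype=np.int64)
    recvbuf = np.empty_like(sendbuf)
    for _ in range(reps):
        # Uppercase Sendrecv exchanges raw buffers and bypasses pickle entirely,
        # so any mismatch points at the transport, not at serialization.
        comm.Sendrecv(sendbuf, dest=partner, recvbuf=recvbuf, source=partner)
        assert exchange_ok(recvbuf, partner), "received data is corrupted"
```

Run under the same job configuration that triggers the original failure (e.g. 32 processes on 2 nodes) to see whether the raw data survives the exchange.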
Also, note that mpi4py does not implement `MPI.Comm.sendrecv()` using `MPI_Sendrecv()` (because it is not possible to do it that way). The actual implementation (for an MPI-3 backend with matched probes) is:
1. Post an `MPI_Isend()` to the destination.
2. Do an `MPI_Mprobe()` on the source.
3. Allocate a receive buffer that is large enough (using `MPI_Get_count()` to get the buffer size).
4. Do the `MPI_Mrecv()` with the message handle matched in (2).
5. Do an `MPI_Wait()` to complete the send posted in (1).
It has to be done that way because the receiving side does not know the message size in advance, so mpi4py has to probe for the incoming message and query its size in order to allocate buffer space.
The actual code should be not that hard to follow:
https://bitbucket.org/mpi4py/mpi4py/src/master/src/MPI/msgpickle.pxi#msgpickle.pxi-372
https://bitbucket.org/mpi4py/mpi4py/src/master/src/MPI/msgpickle.pxi#msgpickle.pxi-248
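The five steps above can also be spelled out at the Python level with mpi4py's matched-probe API. This is a sketch, not the actual implementation; `pickled_sendrecv` and `partner_rank` are illustrative names, and it assumes an MPI-3 backend:

```python
def partner_rank(rank, size):
    """Pairing used throughout this thread: rank i talks to rank size-1-i."""
    return size - 1 - rank

def pickled_sendrecv(comm, obj):
    """Mirror of mpi4py's sendrecv strategy using its matched-probe API."""
    partner = partner_rank(comm.rank, comm.size)
    req = comm.isend(obj, dest=partner)   # (1) post the nonblocking pickled send
    msg = comm.mprobe(source=partner)     # (2) matched probe; message size is now known
    data = msg.recv()                     # (3)+(4) allocate the buffer and receive
    req.wait()                            # (5) complete the send posted in (1)
    return data
```

Because the probe and the receive are split, a transport that delivers a truncated or corrupted payload makes the final unpickling step fail with `EOFError`, which matches the traceback above.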
@ultimanet Any news on your side?
reporter Hi @dalcinl!
Yes, indeed; I forgot to post the solution. The problem was related to Intel MPI and how it uses the InfiniBand hardware. Intel MPI supports several different communication 'network fabrics' for inter- and intranode communication (https://software.intel.com/en-us/node/528821). The odd thing was that by switching to 'tcp' for internode communication, everything seemed to work fine. I finally found that turning the translation cache off with `export I_MPI_OFA_TRANSLATION_CACHE=off` solved the issue (https://software.intel.com/en-us/node/528827). The documentation says: "The cache substantially increases performance, but may lead to correctness issues in certain situations. See product Release Notes for further details." For me, no loss of performance was measurable, and there is also no such note in any of the release notes... :/
Thanks for your support!
P.S.: I was not able to fix this issue using the DAPL fabric...
reporter - changed status to closed
Many thanks for your detailed update.
I am having a similar issue on a Cray XC system. Any suggestions for Cray MPI?
@saurabhjha1 Not that I know of. I would suggest: 1) ask Cray support, and 2) send a message with a full description of your issue to our mailing list mpi4py@googlegroups.com; maybe some user can provide insight.
It seems this is not an issue in mpi4py, but in the backend MPI implementation. Unfortunately, there is nothing mpi4py can do to alleviate or work around the issue. This should be reported upstream to the Intel MPI developers.
PS: Regarding ideas on how to overcome these buffering issues, this is not the right place to ask for help; please submit your question to our mailing list.