MPI/msgpickle.pxi EOFError
Hi,
I'm having some problems with the sendrecv method of COMM_WORLD.
Running the following code with 32 processes on 2 nodes (16 processes each):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.size
rank = comm.rank

shape = (20000, 2000)
data = np.random.randint(1000, size=np.prod(shape)).reshape(shape)
partner = size - 1 - rank

for i in xrange(10):
    data2 = comm.sendrecv(sendobj=data,
                          dest=partner,
                          source=partner)
```
produces the following error:
```
Traceback (most recent call last):
  File "mpi_sendrecv.py", line 16, in <module>
    source=partner)
  File "MPI/Comm.pyx", line 1200, in mpi4py.MPI.Comm.sendrecv (src/mpi4py.MPI.c:107121)
  File "MPI/msgpickle.pxi", line 377, in mpi4py.MPI.PyMPI_sendrecv (src/mpi4py.MPI.c:43994)
  File "MPI/msgpickle.pxi", line 292, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:43053)
  File "MPI/msgpickle.pxi", line 143, in mpi4py.MPI.Pickle.load (src/mpi4py.MPI.c:41248)
EOFError
```
The error occurs if:
- the array size is large enough
- the number of repetitions (in the for loop) is large enough
- there is internode communication involved (32 processes on 1 node do not produce the error)
- mpi4py is compiled against Intel MPI
I think this issue is identical to the one discussed here (https://groups.google.com/forum/#!topic/mpi4py/OJG5eZ2f-Pg) and here (https://github.com/dfm/emcee/issues/151).
Quoting the most important part of the second link:
> If you read the discussion, it turns out that the problem is that when not using loadbalancing, the default strategy is to send out ALL of the tasks to the workers asynchronously and then wait for them to respond. This relies on MPI's own buffering to handle the build up of any messages. When the number of tasks is much greater than the number of workers, MPI is more likely to drop messages and cause this approach to fail.
Any ideas how to overcome those buffer issues? Thanks!
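One commonly suggested mitigation (a sketch, not from the original thread; the helper names and the chunk size are illustrative assumptions) is to exchange the array in smaller pieces, so each individual message stays well below the transport's buffering limits:

```python
import numpy as np

def iter_chunks(arr, max_elems=500_000):
    """Yield consecutive 1-D slices of `arr` with at most max_elems elements each."""
    flat = arr.reshape(-1)
    for start in range(0, flat.size, max_elems):
        yield flat[start:start + max_elems]

def chunked_exchange(data, max_elems=500_000):
    """Exchange `data` with the partner rank piece by piece via sendrecv."""
    from mpi4py import MPI  # imported here so the pure helper above stays testable
    comm = MPI.COMM_WORLD
    partner = comm.size - 1 - comm.rank
    # Each sendrecv now moves a bounded amount of data instead of ~320 MB at once.
    received = [comm.sendrecv(chunk, dest=partner, source=partner)
                for chunk in iter_chunks(data, max_elems)]
    return np.concatenate(received).reshape(data.shape)
```

A script calling `chunked_exchange` would be launched as usual, e.g. `mpiexec -n 32 python script.py`. Whether this avoids the fabric-level corruption is untested here; it only reduces per-message size.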
Comments (10)
- marked as minor
Well, it seems that it is not an issue in the Intel MPI implementation either, because this Fortran code works just fine on the same computing cluster.
```fortran
program pasr
  implicit none
  include 'mpif.h'
  integer, allocatable, dimension(:,:) :: ia_glbl_send_2d, ia_glbl_recv_2d
  integer(KIND=8) :: i8_cnt
  integer :: i_alloc
  integer :: i_rank, i_size, i_error
  integer :: i_prtnr
  integer :: a, b, i

  call mpi_init(i_error)
  call mpi_comm_rank(mpi_comm_world, i_rank, i_error)
  call mpi_comm_size(mpi_comm_world, i_size, i_error)

  a = 20000
  b = 2000
  allocate(ia_glbl_send_2d(a,b), stat=i_alloc)
  if (i_alloc .ne. 0) then
    write(*,*) "Can not allocate memory"
  endif
  allocate(ia_glbl_recv_2d(a,b), stat=i_alloc)
  if (i_alloc .ne. 0) then
    write(*,*) "Can not allocate memory"
  endif

  ia_glbl_send_2d = i_rank
  ia_glbl_recv_2d = 0
  i_prtnr = i_size - 1 - i_rank
  i8_cnt = a * b

  do i = 1, 10
    call mpi_sendrecv(ia_glbl_send_2d, i8_cnt, MPI_INTEGER, i_prtnr, 1, &
                      ia_glbl_recv_2d, i8_cnt, MPI_INTEGER, i_prtnr, 1, &
                      mpi_comm_world, MPI_STATUS_IGNORE, i_error)
  enddo

  call mpi_finalize(i_error)
end program pasr
```
I increased the array size up to 200000 × 2000, ran 100 iterations of the loop, and used up to 64 MPI tasks distributed over 4 different nodes; the Intel MPI sendrecv works excellently.
Cheers,
Alexey
@ak-80 Your code does not check that the received message contents match what was sent. In the mpi4py example, the communication itself seems to succeed just fine; however, the received messages are likely corrupted, which is why Python's pickle module cannot reconstruct the Python object.
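To make such a test actually detect corruption, each rank can send a buffer filled with its own rank number and assert that everything it receives equals the partner's rank. This is a sketch; the function names, shape, and repetition count are illustrative, and the check itself is pure NumPy:

```python
import numpy as np

def exchange_ok(received, partner_rank):
    """True iff every element of the received buffer equals the partner's rank."""
    return bool(np.all(np.asarray(received) == partner_rank))

def run_check(shape=(20000, 2000), reps=10):
    """Each rank sends a buffer filled with its own rank and validates the reply."""
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    partner = comm.size - 1 - comm.rank
    sendbuf = np.full(shape, comm.rank, dtype=np.int64)
    recvbuf = np.empty_like(sendbuf)
    for _ in range(reps):
        # Uppercase Sendrecv exchanges raw buffers and bypasses pickle entirely,
        # so any mismatch points at the transport, not at serialization.
        comm.Sendrecv(sendbuf, dest=partner, recvbuf=recvbuf, source=partner)
        assert exchange_ok(recvbuf, partner), "received data is corrupted"
```

Run under the same job configuration that triggers the original failure (e.g. 32 processes on 2 nodes) to see whether the raw data survives the exchange.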
Also, note that mpi4py does not implement `MPI.Comm.sendrecv()` using `MPI_Sendrecv()` (because it is not possible to do it that way). The actual implementation (for an MPI-3 backend with matched probes) is:
1. Post an `MPI_Isend()` to the destination.
2. Do an `MPI_Mprobe()` on the source.
3. Allocate a receive buffer that is large enough (using `MPI_Get_count()` to get the buffer size).
4. Do the `MPI_Mrecv()` with the message handle matched in (2).
5. Do an `MPI_Wait()` to complete the send posted in (1).
It has to be done that way because the receiving side does not know the message size in advance, so mpi4py has to probe for the incoming message and query its size in order to allocate buffer space.
The actual code should be not that hard to follow:
https://bitbucket.org/mpi4py/mpi4py/src/master/src/MPI/msgpickle.pxi#msgpickle.pxi-372
https://bitbucket.org/mpi4py/mpi4py/src/master/src/MPI/msgpickle.pxi#msgpickle.pxi-248
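The five steps above can also be spelled out at the Python level with mpi4py's matched-probe API. This is a sketch, not the actual implementation; `pickled_sendrecv` and `partner_rank` are illustrative names, and it assumes an MPI-3 backend:

```python
def partner_rank(rank, size):
    """Pairing used throughout this thread: rank i talks to rank size-1-i."""
    return size - 1 - rank

def pickled_sendrecv(comm, obj):
    """Mirror of mpi4py's sendrecv strategy using its matched-probe API."""
    partner = partner_rank(comm.rank, comm.size)
    req = comm.isend(obj, dest=partner)   # (1) post the nonblocking pickled send
    msg = comm.mprobe(source=partner)     # (2) matched probe; message size is now known
    data = msg.recv()                     # (3)+(4) allocate the buffer and receive
    req.wait()                            # (5) complete the send posted in (1)
    return data
```

Because the probe and the receive are split, a transport that delivers a truncated or corrupted payload makes the final unpickling step fail with `EOFError`, which matches the traceback above.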
@ultimanet Any news on your side?
reporter Hi @dalcinl!
Yes, indeed; I forgot to post the solution. The problem was related to Intel MPI and how it uses the InfiniBand hardware. Intel MPI supports several different communication 'network fabrics' for inter- and intranode communication (https://software.intel.com/en-us/node/528821). The odd thing was that by switching to 'tcp' for internode communication, everything seemed to work fine. I finally found that turning the translation cache off with `export I_MPI_OFA_TRANSLATION_CACHE=off` solved the issue (https://software.intel.com/en-us/node/528827). The documentation says: "The cache substantially increases performance, but may lead to correctness issues in certain situations. See product Release Notes for further details." For me, no loss of performance was measurable, and there is also no such note in any of the release notes... :/
Thanks for your support!
P.S.: I was not able to fix this issue using the DAPL fabric...
reporter - changed status to closed
Many thanks for your detailed update.
I am having a similar issue on a Cray XC system. Any suggestions for Cray MPI?
@saurabhjha1 Not that I know of. I would suggest: 1) ask Cray support, and 2) send a message with a full description of your issue to our mailing list mpi4py@googlegroups.com; maybe some user can provide insight.
It seems this is not an issue in mpi4py, but in the backend MPI implementation. Unfortunately, there is nothing mpi4py can do to alleviate or work around the issue. This should be reported upstream to the Intel MPI developers.
PS: Regarding ideas on how to overcome these buffering issues, this is not the right place to ask for help; please submit your question to our mailing list.