UnpicklingError on comm.recv after iprobe
In reference to the working example below, I have an intermittent error on comm.recv:

```
_pickle.UnpicklingError: invalid load key, '\x00'.
```

The message includes a numpy array, along with some other information, in a dictionary. The total message size in the actual code is around 1.6 MB; I have made it about 8 MB in the example. I have seen this error on runs with as few as 4 processors, though it usually occurs on much larger parallel runs.
```python
"""Standalone comms test

Testing for message errors and correctness of data
"""
from __future__ import division
from __future__ import absolute_import

import sys
import os
import time

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
num_workers = MPI.COMM_WORLD.Get_size() - 1
rank = comm.Get_rank()
worker_ranks = set(range(1, MPI.COMM_WORLD.Get_size()))

rounds = 20
total_num_mess = rounds * num_workers
array_size = int(1e6)  # Size of large array in sim_specs

sim_specs = {'out': [('arr_vals', float, array_size),
                     ('scal_val', float),  # Test if get error without this
                     ]}

start_time = time.time()

if rank == 0:
    print("Running comms test on {} processors with {} workers".format(
        MPI.COMM_WORLD.Get_size(), num_workers))
    status = MPI.Status()
    alldone = False
    mess_count = 0
    while not alldone:
        for w in worker_ranks:
            if comm.Iprobe(source=w, tag=MPI.ANY_TAG, status=status):
                D_recv = comm.recv(source=w, tag=MPI.ANY_TAG, status=status)
                mess_count += 1
                # To test values
                x = w * 1000.0
                assert np.all(D_recv['arr_vals'] == x), "Array values do not all match"
                assert D_recv['scal_val'] == x + x / 1e7, "Scalar values do not all match"
        if mess_count >= total_num_mess:
            alldone = True
            print('Manager received and checked {} messages'.format(mess_count))
            print('Manager finished in time', time.time() - start_time)
else:
    output = np.zeros(1, dtype=sim_specs['out'])
    for _ in range(rounds):
        x = rank * 1000.0
        output.fill(x)
        output['scal_val'] = x + x / 1e7
        comm.send(obj=output, dest=0)
```
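For reference, the per-message payload size in the example can be confirmed with numpy's `nbytes` (a minimal sketch using the same dtype as `sim_specs['out']` above):

```python
import numpy as np

array_size = int(1e6)
out_dtype = [('arr_vals', float, array_size), ('scal_val', float)]

# One record: 1e6 float64 values plus one float64 scalar.
output = np.zeros(1, dtype=out_dtype)
print(output.nbytes)  # 8000008 bytes, i.e. roughly the 8 MB quoted above
```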
System info:

- A Cray CS400 cluster with KNLs: https://www.lcrc.anl.gov/systems/resources/bebop/
- Compiler: intel/17.0.4
- intel-mpi/2017.3
- Python 3.6.3 |Intel Corporation
- CentOS Linux release 7.4.1708 (Core)
- mpi4py v3.0.0
- numpy 1.14.3
- Env: export I_MPI_FABRICS=shm:ofa
What I have tried:

I have got the worker to dump/load the message to file using pickle, with no errors. I also used MPI.pickle.dumps and MPI.pickle.loads at the worker end, again with no errors.

I have also tried workarounds such as catching the error and having the worker resend the message, but this generally results in the error being repeated.

I have tried both the default pickle and dill:

```python
from mpi4py import MPI
import dill
MPI.pickle.__init__(dill.dumps, dill.loads)
```

I thought at first that using dill fixed it, as the error went away, but it is still occurring, possibly less frequently.
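The worker-side dump/load check described above can be reproduced without MPI at all. A minimal sketch, using a reduced-size version of the message (the array size and the rank-2 values here are illustrative):

```python
import pickle
import numpy as np

# Reduced-size stand-in for the message in the example above;
# the real test uses array_size = int(1e6).
array_size = 1000
out_dtype = [('arr_vals', float, array_size), ('scal_val', float)]

output = np.zeros(1, dtype=out_dtype)
x = 2 * 1000.0  # the values a worker with rank 2 would send
output['arr_vals'] = x
output['scal_val'] = x + x / 1e7

# Round-trip through pickle, as in the worker-side check described above.
blob = pickle.dumps(output, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

assert np.all(restored['arr_vals'] == x)
assert np.all(restored['scal_val'] == x + x / 1e7)
```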
Comments (5)

- Maybe related to the following documented issue in Intel MPI?
  https://software.intel.com/en-us/articles/python-mpi4py-on-intel-true-scale-and-omni-path-clusters
  Can you confirm whether changing the fabric solves the problem? Alternatively, you can start your script with

- reporter: I will try turning off matching probes. However, I'm already using the OFA fabric (I don't think OFI was available). When I used the default TMI, internode comms did not work at all.

- reporter: I have tried a number of runs with the TMI fabric with matching probes disabled:

  ```python
  import mpi4py
  mpi4py.rc.recv_mprobe = False
  ```

  and so far I have not got the error. This is complicated by the fact that I have a different problem running with TMI: each of my MPI tasks is launching a job, and I cannot have more than 8 doing this at a time with TMI, or the jobs fail. I don't know why that is, but if I run within that constraint it is working.

- OK, so then I guess you can close this issue. Fortunately, I added a way to disable matching probes! If the Intel folks cannot support matching probes with some fabrics, we should at least get a hard failure, or at least a warning printed to stderr, or something, right?

  Matching probes are good for thread safety; that's the primary reason they were added in MPI-3. Note that if you disable mpi4py's use of matched probes, mpi4py should still be thread-safe, because I'm using a per-communicator lock to perform regular probes. Of course, this works as long as your communicator is not being concurrently used in some other thread not using mpi4py.

- reporter changed status to resolved