UnpicklingError on comm.recv after iprobe

Issue #102 resolved
Stephen created an issue

In reference to the working example below, I have an intermittent error on the comm.recv:

_pickle.UnpicklingError: invalid load key, '\x00'.

Each message includes a numpy array, along with some other information, in a dictionary. The total message size in the actual code is around 1.6 MB; I have made it about 8 MB in the example. I have got this error on runs with as few as 4 processors, though it usually occurs on much larger parallel runs.

"""Standalone comms test

Testing for message errors and correctness of data
"""

from __future__ import division
from __future__ import absolute_import

from mpi4py import MPI
import numpy as np
import time

comm = MPI.COMM_WORLD
num_workers = MPI.COMM_WORLD.Get_size() - 1
rank = comm.Get_rank()
worker_ranks = set(range(1,MPI.COMM_WORLD.Get_size()))
rounds = 20
total_num_mess = rounds*num_workers

array_size = int(1e6)   # Size of large array in sim_specs

sim_specs = {'out': [
                     ('arr_vals',float,array_size),
                     ('scal_val',float), # Test whether the error occurs without this field
                    ],
             }

start_time = time.time()

if rank == 0:
    print("Running comms test on {} processors with {} workers".format(MPI.COMM_WORLD.Get_size(),num_workers))
    status = MPI.Status()
    alldone = False
    mess_count = 0
    while not alldone:
        for w in worker_ranks:
            if comm.Iprobe(source=w, tag=MPI.ANY_TAG, status=status):
                D_recv = comm.recv(source=w, tag=MPI.ANY_TAG, status=status)
                mess_count += 1

                #To test values
                x = w * 1000.0
                assert np.all(D_recv['arr_vals'] == x), "Array values do not all match"
                assert D_recv['scal_val'] == x + x/1e7, "Scalar values do not all match"
        if mess_count >= total_num_mess:
            alldone = True

    print('Manager received and checked {} messages'.format(mess_count))        
    print('Manager finished in time', time.time() - start_time)

else:
    output = np.zeros(1, dtype=sim_specs['out'])
    x = rank * 1000.0
    for _ in range(rounds):
        output.fill(x)
        output['scal_val'] = x + x/1e7
        comm.send(obj=output, dest=0)

System info:
A Cray CS400 cluster with KNLs.
https://www.lcrc.anl.gov/systems/resources/bebop/
Compiler: intel/17.0.4
intel-mpi/2017.3
Python 3.6.3 |Intel Corporation
CentOS Linux release 7.4.1708 (Core)
mpi4py v3.0.0
numpy 1.14.3
Env: export I_MPI_FABRICS=shm:ofa

What I have tried:

I have got the worker to dump and load the message to file using pickle, with no errors. I also used MPI.pickle.dumps and MPI.pickle.loads at the worker end, again with no errors.
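
For reference, the worker-side check was along these lines (a simplified sketch using the names from the example above; the helper name and dump file are only illustrative):

import pickle
import numpy as np
from mpi4py import MPI

def check_serialization(output, path='mess_dump.pkl'):
    # Round-trip through the standard pickle module via a file
    with open(path, 'wb') as f:
        pickle.dump(output, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)
    assert np.all(restored['arr_vals'] == output['arr_vals'])

    # Round-trip through mpi4py's own serializer
    buf = MPI.pickle.dumps(output)
    restored = MPI.pickle.loads(buf)
    assert np.all(restored['arr_vals'] == output['arr_vals'])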

I have also tried workarounds such as catching the error and having the worker resend the message, but this generally results in the error being repeated.
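
On the manager side the resend workaround was roughly the following (a sketch only, reusing the names from the example above; RESEND_TAG and the worker-side handling of it are illustrative):

import pickle

RESEND_TAG = 99   # illustrative tag reserved for resend requests

try:
    D_recv = comm.recv(source=w, tag=MPI.ANY_TAG, status=status)
except pickle.UnpicklingError:
    # Ask the worker to send the same message again; the worker probes
    # for RESEND_TAG and repeats its comm.send.
    comm.send(obj=None, dest=w, tag=RESEND_TAG)
    D_recv = comm.recv(source=w, tag=MPI.ANY_TAG, status=status)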

I have tried both the default pickle and dill:

from mpi4py import MPI  
import dill  
MPI.pickle.__init__(dill.dumps, dill.loads)  

I thought at first that using dill fixed it, as the error went away, but it is still occurring, possibly less frequently.

Comments (5)

  1. Stephen reporter

    I will try turning off matching probes. However, I'm already using the OFA fabric (I don't think OFI was available). When I used the default TMI, internode comms did not work at all.

  2. Stephen reporter

    I have tried a number of runs with the tmi fabric and matching probes disabled:

    import mpi4py
    mpi4py.rc.recv_mprobe = False   # must be set before "from mpi4py import MPI"
    from mpi4py import MPI
    

    and so far I have not seen the error. This is complicated by a different problem when running with tmi: each of my MPI tasks launches a job, and I cannot have more than 8 doing so at a time or the jobs fail. I don't know why that is, but if I run within that constraint it is working.

  3. Lisandro Dalcin

    OK. So then I guess you can close this issue. Fortunately, I added a way to disable matching probes! If the Intel folks cannot support matching probes with some fabrics, we should at least get a hard failure, or a warning printed to stderr, or something, right?

    Matching probes are good for thread safety, that's the primary reason they were added in MPI-3. Note that if you disable mpi4py's use of matched probes, mpi4py should still be thread-safe because I'm using a per-communicator lock to perform regular probes. Of course, this works as long as your communicator is not being concurrently used in some other thread not using mpi4py.
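
    For anyone following along, the difference amounts to roughly this (a simplified sketch, not mpi4py's actual implementation):

    import threading
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    lock = threading.Lock()   # stands in for mpi4py's per-communicator lock

    def recv_with_mprobe(source, tag):
        # Matched probe: the probe returns a handle to one specific message,
        # so the matching receive cannot pick up a different message.
        msg = comm.mprobe(source=source, tag=tag)
        return msg.recv()

    def recv_with_probe(source, tag):
        # Regular probe: hold a lock so that the probe and the following
        # recv refer to the same message when other threads use this comm.
        with lock:
            comm.Probe(source=source, tag=tag)
            return comm.recv(source=source, tag=tag)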
