dist_object can hang execution

Issue #129 resolved
BrianS created an issue

I noticed this while trying to write an rput_irregular example:

    dist_object<uvector> dParticles(uvector(500));
    future<global_ptr<particle_t> > hiVectorF = rpc(nebrHi, [](dist_object<uvector>& d){
        return global_ptr<particle_t>(&((*d).front()));}, dParticles);
    std::cout<<"rpc called "<<me<<std::endl;
    hiVectorF.wait();
    std::cout<<"pointer arrived "<<me<<std::endl;

This can produce output like the following:

rpc called 5
rpc called 0
rpc called 2
rpc called 4
pointer arrived 4
rpc called 3
rpc called 7
rpc called 1
pointer arrived 0
pointer arrived 2
pointer arrived 5
rpc called 6
pointer arrived 1
pointer arrived 7
pointer arrived 6

In this case, rank 3 does not return from the wait call. The code hangs and the processes stay pegged at full CPU rate.

Comments (4)

  1. Dan Bonachea

    @bvstraalen - you left out a very important part of your code: whatever lines come next.

    In particular, the code shown has an RPC race: each rank will service RPCs while waiting for the acknowledgement of its own RPC, but may exit that wait (and fall off the end of your snippet) before servicing all incoming RPCs. You need some other later call with user-level progress (like a barrier) to ensure global quiescence and completion of all the RPCs.
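
The race described above can be illustrated without UPC++. What follows is a minimal, deterministic toy model in plain C++, not UPC++ code and not how the runtime is implemented. The names `Message`, `Rank`, and `simulate` are hypothetical and exist only for this sketch. It assumes each rank sends its request to rank `(r + 1) % n` (the report does not show how `nebrHi` is computed) and that all deliverable messages are processed before the next rank starts. A rank services incoming requests only while it is waiting for its own reply, unless `quiesce` is true, in which case it keeps servicing afterwards, roughly as a barrier with user-level progress would allow.

```cpp
#include <deque>
#include <vector>

// Hypothetical message type for this toy model: a request or its reply.
struct Message {
    bool is_reply;
    int from;
};

// Hypothetical per-rank state.
struct Rank {
    std::deque<Message> inbox;
    bool started = false;    // has issued its own request
    bool got_reply = false;  // its own request has been acknowledged
};

// Returns the ranks whose request is never answered.
// start_order: the order in which ranks reach their request (assumed input).
// quiesce:     if true, ranks keep servicing requests after their own reply
//              arrives; if false, they stop making progress at that point.
std::vector<int> simulate(int n, const std::vector<int>& start_order, bool quiesce) {
    std::vector<Rank> ranks(n);
    for (int r : start_order) {
        // Rank r issues its request to its (assumed) upper neighbor.
        ranks[r].started = true;
        ranks[(r + 1) % n].inbox.push_back({false, r});

        // Let every rank that is still making progress drain its inbox
        // until no rank has anything left that it is willing to process.
        bool changed = true;
        while (changed) {
            changed = false;
            for (int i = 0; i < n; ++i) {
                Rank& me = ranks[i];
                bool in_progress = me.started && (!me.got_reply || quiesce);
                if (!in_progress) continue;
                while (!me.inbox.empty()) {
                    Message m = me.inbox.front();
                    me.inbox.pop_front();
                    if (m.is_reply) {
                        me.got_reply = true;  // own request acknowledged
                    } else {
                        // Service the request by replying to the sender.
                        ranks[m.from].inbox.push_back({true, i});
                    }
                    changed = true;
                }
            }
        }
    }

    std::vector<int> hung;
    for (int i = 0; i < n; ++i) {
        if (!ranks[i].got_reply) hung.push_back(i);
    }
    return hung;
}
```

In this model, `simulate(4, {0, 1, 2, 3}, false)` leaves rank 3 unanswered, because rank 0 has already received its reply and stopped servicing requests by the time rank 3 sends to it; with `quiesce` set to `true` no rank is left waiting. The model's timing is far more pessimistic than a real run, so it should not be expected to reproduce the exact output in the report. In the reported snippet, the analogous change would be a later call with user-level progress, such as `upcxx::barrier()`, after `hiVectorF.wait()`.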
