Running time when broadcast a lot of global pointers.

Issue #279 resolved
Ngoc Phuong Chau created an issue

I am using global pointers to store a lot of arrays

n =640 000;

upcxx::global_ptr<double> *Q = new upcxx::global_ptr<double> [n];

Each Q[i] is create as an array with 100 element

Q[i] = upcxx::new_array<double>(100);

//all element is 1.

each rank will hold a part of Q as below, and then they will broadcast to other ranks.

rank 3 is creating Q from 240000 to 320000
rank 2 is creating Q from 160000 to 240000
rank 5 is creating Q from 400000 to 480000
rank 7 is creating Q from 560000 to 640000
rank 6 is creating Q from 480000 to 560000
rank 1 is creating Q from 80000 to 160000
rank 4 is creating Q from 320000 to 400000

rank 0 is creating Q from 0 to 80000

Broadcast.

nlocal = 80000;

for(int i=0; i<col; i++)
{

Q[j] = upcxx::broadcast(Q[j], i/nlocal).wait();
}

I run with 64000 global pointers, and it worked wells. However, I run with 640 000 global pointer; the program was very slow (after 30 minutes, the program was still running).

After around 40 minutes, the server output is

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

2 total processes killed (some possibly by mpirun during cleanup)

I think that the program stop due to the limited time.

Does UPCXX provide another way to broadcast faster?

Thanks,

Comments (3)

  1. Dan Bonachea

    Hello Ngoc -

    For this situation you should use a vector broadcast, that sends many elements in a one collective operation. Ex:

    #include <upcxx/upcxx.hpp>
    #include <iostream>
    #include <cassert>
    
    int main(){
    
      upcxx::init();
    
      long n =640000;
      long nlocal = n / upcxx::rank_n();
    
      if (!upcxx::rank_me()) std::cout << "n=" << n << " nlocal=" << nlocal << std::endl;
    
      upcxx::global_ptr<double> *Q = new upcxx::global_ptr<double> [n];
      //Each Q[i] is create as an array with 100 element
      for (long i = upcxx::rank_me()*nlocal; i < (upcxx::rank_me()+1)*nlocal; i++) {
        Q[i] = upcxx::new_array<double>(100);
      }
      #if 0
        // fully blocking version
        for (long r = 0; r < upcxx::rank_n(); r++) {
          upcxx::broadcast(Q+r*nlocal, nlocal, r).wait(); // vector broadcast
        }
      #else
        // overlapped version
        upcxx::promise<> p;
        for (long r = 0; r < upcxx::rank_n(); r++) {
          upcxx::broadcast(Q+r*nlocal, nlocal, r, upcxx::world(), upcxx::operation_cx::as_promise(p));
        }
        p.finalize().wait();
      #endif
    
      // verify
      for (long i=0; i<n; i++) {
        assert(Q[i]);
        assert(Q[i].where() == i/nlocal);
      }
      upcxx::barrier();
    
      if (!upcxx::rank_me()) std::cout << "SUCCESS" << std::endl;
      upcxx::finalize();
    }
    

    I've included two variants above - one where the collectives are fully blocking and another where the collectives are overlapped with each other.

    If these Q objects are "permanent" - ie never deleted during the program run - then you can potentially make this even more efficient by allocating and broadcasting a single large array from each rank, and carving it up into 100-element blocks. Eg:

    #include <upcxx/upcxx.hpp>
    #include <iostream>
    #include <cassert>
    
    int main(){
    
      upcxx::init();
    
      long n =640000;
      long nlocal = n / upcxx::rank_n();
    
      if (!upcxx::rank_me()) std::cout << "n=" << n << " nlocal=" << nlocal << std::endl;
    
      upcxx::global_ptr<double> *Q = new upcxx::global_ptr<double> [n];
      // Each Q needs to point to 100 doubles
      upcxx::global_ptr<double> myelems = upcxx::new_array<double>(100*nlocal);
    
      for (long r = 0; r < upcxx::rank_n(); r++) {
        upcxx::global_ptr<double> hiselems = upcxx::broadcast(myelems, r).wait();
        for (long i=r*nlocal; i<(r+1)*nlocal; i++) {
          Q[i] = hiselems + i*100;
        }
      }
    
      // verify
      for (long i=0; i<n; i++) {
        assert(Q[i]);
        assert(Q[i].where() == i/nlocal);
      }
      upcxx::barrier();
    
      if (!upcxx::rank_me()) std::cout << "SUCCESS" << std::endl;
      upcxx::finalize();
    }
    

    Hope this helps..

  2. Log in to comment