- edited description
Running time when broadcast a lot of global pointers.
I am using global pointers to store a lot of arrays
n =640 000;
upcxx::global_ptr<double> *Q = new upcxx::global_ptr<double> [n];
Each Q[i] is create as an array with 100 element
Q[i] = upcxx::new_array<double>(100);
//all element is 1.
each rank will hold a part of Q as below, and then they will broadcast to other ranks.
rank 3 is creating Q from 240000 to 320000
rank 2 is creating Q from 160000 to 240000
rank 5 is creating Q from 400000 to 480000
rank 7 is creating Q from 560000 to 640000
rank 6 is creating Q from 480000 to 560000
rank 1 is creating Q from 80000 to 160000
rank 4 is creating Q from 320000 to 400000rank 0 is creating Q from 0 to 80000
Broadcast.
nlocal = 80000;
for(int i=0; i<col; i++)
{Q[j] = upcxx::broadcast(Q[j], i/nlocal).wait();
}
I run with 64000 global pointers, and it worked wells. However, I run with 640 000 global pointer; the program was very slow (after 30 minutes, the program was still running).
After around 40 minutes, the server output is
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.2 total processes killed (some possibly by mpirun during cleanup)
I think that the program stop due to the limited time.
Does UPCXX provide another way to broadcast faster?
Thanks,
Comments (3)
-
reporter -
Hello Ngoc -
For this situation you should use a vector broadcast, that sends many elements in a one collective operation. Ex:
#include <upcxx/upcxx.hpp> #include <iostream> #include <cassert> int main(){ upcxx::init(); long n =640000; long nlocal = n / upcxx::rank_n(); if (!upcxx::rank_me()) std::cout << "n=" << n << " nlocal=" << nlocal << std::endl; upcxx::global_ptr<double> *Q = new upcxx::global_ptr<double> [n]; //Each Q[i] is create as an array with 100 element for (long i = upcxx::rank_me()*nlocal; i < (upcxx::rank_me()+1)*nlocal; i++) { Q[i] = upcxx::new_array<double>(100); } #if 0 // fully blocking version for (long r = 0; r < upcxx::rank_n(); r++) { upcxx::broadcast(Q+r*nlocal, nlocal, r).wait(); // vector broadcast } #else // overlapped version upcxx::promise<> p; for (long r = 0; r < upcxx::rank_n(); r++) { upcxx::broadcast(Q+r*nlocal, nlocal, r, upcxx::world(), upcxx::operation_cx::as_promise(p)); } p.finalize().wait(); #endif // verify for (long i=0; i<n; i++) { assert(Q[i]); assert(Q[i].where() == i/nlocal); } upcxx::barrier(); if (!upcxx::rank_me()) std::cout << "SUCCESS" << std::endl; upcxx::finalize(); }
I've included two variants above - one where the collectives are fully blocking and another where the collectives are overlapped with each other.
If these Q objects are "permanent" - ie never deleted during the program run - then you can potentially make this even more efficient by allocating and broadcasting a single large array from each rank, and carving it up into 100-element blocks. Eg:
#include <upcxx/upcxx.hpp> #include <iostream> #include <cassert> int main(){ upcxx::init(); long n =640000; long nlocal = n / upcxx::rank_n(); if (!upcxx::rank_me()) std::cout << "n=" << n << " nlocal=" << nlocal << std::endl; upcxx::global_ptr<double> *Q = new upcxx::global_ptr<double> [n]; // Each Q needs to point to 100 doubles upcxx::global_ptr<double> myelems = upcxx::new_array<double>(100*nlocal); for (long r = 0; r < upcxx::rank_n(); r++) { upcxx::global_ptr<double> hiselems = upcxx::broadcast(myelems, r).wait(); for (long i=r*nlocal; i<(r+1)*nlocal; i++) { Q[i] = hiselems + i*100; } } // verify for (long i=0; i<n; i++) { assert(Q[i]); assert(Q[i].where() == i/nlocal); } upcxx::barrier(); if (!upcxx::rank_me()) std::cout << "SUCCESS" << std::endl; upcxx::finalize(); }
Hope this helps..
-
- changed status to resolved
- Log in to comment