Multipole is not OpenMP parallelized
The Multipole thorn is completely serial, despite being essentially a series of for loops that could easily be parallelized with OpenMP. (In one of my typical BBH simulations, Multipole accounts for roughly 1-2% of the execution time.)
Comments (8)
-
reporter I was looking at the inner function `Multipole_Integrate` (i.e., with radius, l, and m fixed). I think the loops there (which are over theta and phi) can be parallelized with no problems. However, you are probably right that the simplest thing to do would be to add pragmas over l and m. The call site is:

```c++
for (int l=0; l <= lmax; l++) {
  for (int m=-l; m <= l; m++) {
    // Integrate sYlm (real + i imag) over the sphere at radius r
    Multipole_Integrate(array_size, ntheta,
                        reY[si][l][m+l], imY[si][l][m+l],
                        real, imag, th, ph,
                        &modes(v, i, l, m, 0), &modes(v, i, l, m, 1));
  } // loop over m
} // loop over l
```

`Multipole_Integrate` does not touch any of its inputs except the last two, so the inputs can be shared. `modes` is a custom class that contains all the output data, and `&modes(v, i, l, m, 0)` is just a pointer to a `CCTK_REAL`. Different iterations get different pointers, so I think the iterations are all independent without any further change. `l` and `m` are already private, so to parallelize most of the thorn we probably just need to add `#pragma omp for collapse(2)`.
(Plus, we'd have to remove an inner pragma in one of the integration methods.)
You may actually see an extra speedup by switching from ASCII output to HDF5 output, if you are not doing so already.
Yes, I am already doing that (which, incidentally, speeds up significantly kuibit as well).
-
When you say that Multipole is taking 1-2% of the execution time in your BBH simulation, I would say:
- That’s not very much
- It is possible that what the timer is measuring is the time one process spends waiting for the other processes to catch up. If you want to know, you can set `Carpet::schedule_barriers = yes` and `Carpet::sync_barriers = yes` and then look at the timer output for the barriers.
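For reference, the corresponding parameter-file fragment would look like this (a sketch based on the two parameter names above; check the Carpet documentation for your version):

```
# Insert barriers around scheduled function calls and syncs, so the
# barrier timers separate waiting time from actual Multipole cost.
Carpet::schedule_barriers = yes
Carpet::sync_barriers     = yes
```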
In general, I would only invest time in optimising if you have profiled first. So if you have measured that the integration loops are taking a significant amount of time, then by all means parallelise them. They don’t scale with the grid resolution (unless you manually scale the angular resolution parameters), and I would expect interpolation to be by far the dominant cost here.
Having said that, since it's just a matter of adding one pragma, maybe it's not worth investigating too much.
-
reporter It turns out that the problem cannot be solved with one single pragma. It is still very easy to sprinkle pragmas over the for loops in the integrators. I did that in
All the tests still pass.
I used the new code to run the next checkpoint in one of my BBH simulations, and I found that approximately 50% less time is spent in the Multipole thorn when using 4 threads.
-
@Ian Hinder it always needs profiling. Adding OpenMP pragmas may make things slower after all :-) (it would not be the first time).
-
@Gabriele Bozzola that’s good news!
-
@Gabriele Bozzola has a final conclusion been reached for this?
-
reporter If I did reach a final conclusion, I forgot about it. I see I have a commit about it: https://bitbucket.org/Sbozzolo/einsteinanalysis/commits/69b64606b6a1ac35405a2361b84fef4a2ca87a1b
I think I claim that this is (marginally) faster in production runs.
If the loops (e.g. the ones over the radii) contain calls to the interpolator, then they cannot be OpenMP parallelized, since the interpolator (in fact, no Cactus call) is not thread-safe.
One would have to MPI parallelize instead, but that is not trivial, since the interpolator is an MPI collective call (different ranks may pass different arguments, though, so what is needed is "dummy" interpolator calls on the ranks that have run out of radii to work on). It is also not clear whether this would actually speed things up (it would not if the inter-process communication dominates).
The loops over (l,m), on the other hand, contain no interpolator call and can be parallelized with OpenMP (though watch out for shared temporary arrays and static variables).