Multipole is not OpenMP parallelized

Issue #2626
Gabriele Bozzola created an issue

The Multipole thorn is completely serial, even though it is essentially a series of for loops that could easily be parallelized with OpenMP.

(In one of my typical BBH simulations, Multipole accounts for roughly 1-2 % of the execution time.)

Comments (6)

  1. Roland Haas

    If the loops (e.g. the ones over radii) contain calls to the interpolator, then they cannot be OpenMP parallelized, since the interpolator (in fact, no Cactus call) is not thread safe.

    One would have to MPI parallelize instead, but that is not trivial, since the interpolator is an MPI collective call (different ranks may pass different arguments, though, so what is needed are “dummy” interpolator calls on the ranks that have run out of radii to work on). It is also not clear whether this would actually speed things up; the inter-process communication might dominate.

    The loops over (l,m) contain no interpolator calls and can be parallelized with OpenMP (though watch out for shared temporary arrays and static variables).
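    A minimal, self-contained sketch of that approach (hypothetical code, not the actual thorn source): flatten the triangular (l,m) range into a single index so that a plain parallel-for applies, and keep every temporary thread-private, which avoids the shared-temporary hazard. Without OpenMP the pragma is simply ignored and the loop runs serially.

    ```cpp
    // Hypothetical sketch (not the Multipole source): parallelize the (l,m)
    // loop by flattening the triangular index range into one index k, so a
    // plain "#pragma omp parallel for" applies.  Each iteration allocates
    // its own scratch buffer, so no temporaries are shared across threads.
    #include <cstdio>
    #include <vector>

    int main() {
      const int lmax = 4;
      const int nmodes = (lmax + 1) * (lmax + 1);  // sum over l of (2l+1)
      std::vector<double> modes(nmodes, 0.0);

      // Flattened loop: one iteration per (l,m) pair, each independent.
      #pragma omp parallel for
      for (int k = 0; k < nmodes; k++) {
        int l = 0;
        while ((l + 1) * (l + 1) <= k) l++;  // recover l from k
        const int m = k - l * l - l;         // m runs over [-l, l]

        std::vector<double> scratch(16, 1.0);  // thread-private temporary
        double sum = 0.0;
        for (double x : scratch) sum += x * (l + m + 2);  // stand-in "integral"
        modes[k] = sum;
      }

      std::printf("modes[0]=%g modes[%d]=%g\n", modes[0], nmodes - 1,
                  modes[nmodes - 1]);
      return 0;
    }
    ```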

    You may actually see an extra speedup, if you are using it, by switching from ASCII output to HDF5 output.

  2. Gabriele Bozzola reporter

    I was looking at the inner function Multipole_Integrate (i.e., at fixed radius, l, and m). I think the loops there (over theta and phi) can be parallelized with no problems. However, you are probably right that the simplest thing to do would be to add pragmas over l and m. The function call is

          for (int l=0; l <= lmax; l++)
          {
            for (int m=-l; m <= l; m++)
            {
              // Integrate sYlm (real + i imag) over the sphere at radius r
              Multipole_Integrate(array_size, ntheta,
                                  reY[si][l][m+l], imY[si][l][m+l],
                                  real, imag, th, ph,
                                  &modes(v, i, l, m, 0), &modes(v, i, l, m, 1));
    
            }//loop over m
          }//loop over l
    

    Multipole_Integrate does not touch any input except the last two arguments, so the inputs can be shared. modes is a custom class that contains all the output data, and &modes(v, i, l, m, 0) is just a pointer to a CCTK_REAL. Different iterations get different pointers, so I think the iterations are all independent without any changes. l and m are already private, so to parallelize most of the thorn we probably just need to add

    #pragma omp parallel for collapse(2)

    (Plus, we’d have to remove an inner pragma in one of the integration methods)
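    A note of caution: on a triangular loop nest like this one, where the m bounds depend on l, collapse(2) is only valid from OpenMP 5.0 onward (earlier versions require rectangular nests). A portable fallback is to parallelize the outer l loop alone with a dynamic schedule. A minimal sketch, where integrate_stub is a hypothetical stand-in for Multipole_Integrate that writes only through its two output pointers:

    ```cpp
    #include <cstdio>

    // Hypothetical stand-in for Multipole_Integrate: it writes only through
    // the two output pointers, so iterations with distinct pointers are
    // independent of one another.
    static void integrate_stub(int l, int m, double *re, double *im) {
      *re = l + 0.5 * m;
      *im = l - 0.5 * m;
    }

    int main() {
      const int lmax = 2;
      double re[lmax + 1][2 * lmax + 1];
      double im[lmax + 1][2 * lmax + 1];

      // Portable alternative to collapse(2): parallelize l only; a dynamic
      // schedule balances the varying number of m iterations per l.
      #pragma omp parallel for schedule(dynamic)
      for (int l = 0; l <= lmax; l++)
        for (int m = -l; m <= l; m++)
          integrate_stub(l, m, &re[l][m + l], &im[l][m + l]);

      std::printf("re[2][4]=%g im[2][0]=%g\n", re[2][4], im[2][0]);
      return 0;
    }
    ```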

    > You may actually see an extra speedup, if you are using it, by switching from ASCII output to HDF5 output.

    Yes, I am already doing that (which, incidentally, significantly speeds up kuibit as well).

  3. Ian Hinder

    When you say that Multipole is taking 1-2% of the execution time in your BBH simulation, I would say:

    1. That’s not very much 🙂
    2. It is possible that what the timer is measuring is the time spent on one process waiting for the other processes to catch up. If you want to know, you can set `Carpet::schedule_barriers = yes` and `Carpet::sync_barriers = yes` and then look at the timer output for the barriers.
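    For reference, enabling both barriers is a two-line parameter-file fragment (only the two parameters named above are involved):

    ```
    # Add barriers around scheduled routines and syncs, so the timer output
    # separates real work from time spent waiting on other processes.
    Carpet::schedule_barriers = yes
    Carpet::sync_barriers     = yes
    ```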

    In general, I would only invest time in optimising after profiling. So if you have measured that the integration loops take a significant amount of time, then by all means parallelise them. They don’t scale with the grid resolution (unless you manually scale the angular-resolution parameters), and I would expect the interpolation to be by far the dominant cost here.

    Having said that, since it’s just adding one pragma, maybe it’s not worth investigating too much 🙂

  4. Gabriele Bozzola reporter

    It turns out that the problem cannot be solved with a single pragma, but it is still very easy to sprinkle pragmas over the for loops in the integrators. I did that in

    https://bitbucket.org/einsteintoolkit/einsteinanalysis/pull-requests/19/multipole-add-openmp-parallelization

    All the tests still pass.

    I used the new code to run the next checkpoint of one of my BBH simulations and found that approximately 50 % less time is spent in the Multipole thorn when using 4 threads.

  5. Roland Haas

    @Ian Hinder it always needs profiling; adding OpenMP pragmas may well make things slower :-) (it would not be the first time).
