OpenMP parallelization within Exact broken

Issue #928 closed
Frank Löffler created an issue

Exact uses OpenMP for the loop over grid points. However, quite a few metrics (which get called pointwise) use static (saved) data. In all cases I have looked at, this is only used to initialize some local variables with values from Cactus parameters, and only as long as a 'global' variable 'firstcall' is true. This variable is set to 'false' after the other variables have been initialized, which should be OK even when using multiple threads. However, the compiler can swap the two stores (and does, according to the assembly output), leading to another thread 'seeing' firstcall being false while the global variables are not initialized yet.

The right solution would be to remove these variables. They are not really necessary, because the Cactus parameters could directly be used. However, that patch would be quite large.

A simple and quick workaround would be to remove the openmp parallelization for that loop, at least for the upcoming release.

We have to do one of the two - or something else in case someone comes up with another idea. This is currently breaking several testsuites (sometimes).

In case you want to see an example: look at de_Sitter.F77 and firstcall and arad. Thoughts?


Comments (20)

  1. Ian Hinder

    I think we should remove the openmp parallelization. Exact is overwhelmingly used for initial data, so performance should not be too much of an issue. We can mention in the release notes that we have removed this parallelization, so if someone needs it to be fast for evaluating exact solutions as evolution, they can run with pure MPI.

  2. Erik Schnetter

    Correctness is most important, and not parallelising this loop is the safest solution for this.

    Later, we could (instead of using saved variables) use regular local variables (or place the "firstcall" stuff into a "single" region, but that's probably not worthwhile).

  3. Roland Haas

    I believe that as long as all the threads do is read a parameter and then, based upon its value, write some possibly static variables, but all with the same data (which, e.g., the boost routine does), then the race condition is benign, since they all just race to write the same data (unless of course the Fortran compiler does something strange, such as first setting variables to zero and only later changing them...). This of course fails as soon as there is a statement of the form "a = a + b" involving the saved variable "a".

    A safer way is to put a single-threaded call to the routine somewhere earlier and throw away its results; that way the variables are initialized. Or use an OpenMP THREADPRIVATE statement (they are designed for precisely this application), which probably gives you the smallest diff and is also the most "standard-conforming" way of handling this, I think.

  4. Erik Schnetter

    No, this is not safe. Concurrent write access is not permitted by OpenMP. For example, a compiler may detect that this variable will be written to, and then use it as temporary storage for some other quantity, which may even have a different type. I believe gcc 4.7 has a respective optimisation (re-use of stack slots to reduce the size of stack frames), although I don't know whether it would be triggered in this case.

    In this case, the problem is that thread A begins to initialise this variable, then thread B notices this, skips the initialisation, and goes on to use the presumed-initialised value, only it hasn't actually finished being initialised.

    There are two other ways to correct this: either one does not "save" these variables (because the initialisation is very cheap anyway), or one initialises them in an "openmp single" region.

  5. Frank Löffler reporter

    OpenMP is now disabled in Exact for the ET_2012_05 release. I'll leave this ticket open as a reminder to fix and test this the proper way - something we don't have time for now, and something which isn't that terribly important.

  6. Roland Haas

    The attached patch replaces the SAVE variables with THREADPRIVATE variables (i.e., the per-thread equivalent). I also did this substitution in boost.F77, where currently a CRITICAL section is used. I believe that the way the critical section was used does not properly address the race condition (and I think I have sprinkled a couple of such constructs through the Cactus code). The code and reasoning go like this:

        logical, save :: firstcall = .true.
        integer, save :: somevalue

        if (firstcall) then
        !$omp critical
          if (firstcall) then
            somevalue = 42
            firstcall = .false.
          end if
        !$omp end critical
        end if

    with the aim of skipping the whole critical section once somevalue has been initialized and firstcall has been set to .false. on the first go. The two if statements on the same condition work as follows: the first is a speculative check that allows a quick skip, and only if we think we might have to do something (i.e., if firstcall is .true.) do we enter the expensive section of code. The critical section is intended to ensure we really only do so once.

    The mistake now is that, as far as I know, an optimizing compiler is free to actually execute the following code:

        logical, save :: firstcall = .true.
        integer, save :: somevalue

        if (firstcall) then
        !$omp critical
          if (firstcall) then
            ! order of statements reversed
            firstcall = .false.
            somevalue = 42
          end if
        !$omp end critical
        end if

    since there is no ordering required that is visible to the compiler. This can go wrong in the following way:

    1. Thread 0 sets firstcall = .false. and flushes this value to memory.
    2. Before thread 0 sets somevalue, thread 1 comes along, encounters the outer if statement, and finds that firstcall = .false.
    3. Thread 1 proceeds with the wrong value of somevalue.

    There seems to be no way to force a compiler (C or Fortran) *not* to change a value temporarily (or to explicitly enforce an ordering on the statements). C's volatile attribute (think of using it to access a hardware register) seems the closest (and C also defines synchronization points where the compiler must ensure that all variable data is in memory, e.g. before function calls, though optimization might do funny things as well if the compiler thinks it can be sure that a variable is not visible to anyone else). None of this helps with Fortran code, of course.

  7. Erik Schnetter

    Your reasoning is correct; one needs a barrier as well. Given how the function is called, however, a barrier cannot be used here. - Maybe one could use atomic access to first_call instead? - I wonder whether one could use an OpenMP flush statement here to ensure a correct ordering of access to first_call.

    I've had problems with threadprivate variables in the past. This was in C++ (Carpet). I don't recall whether this was with "relevant" compilers, and/or what linkage these variables had. I suggest to try this patch, and be careful.

    The subdirectory metrics contains many save statements as well. Some of these are trivial and should simply be omitted, others may require special handling as well.

  8. Roland Haas

    I don't know if an atomic would help us. The problem is that we would need to guarantee that first_call is only updated (in memory) after all the other variables have been flushed to memory. I.e., the problem is not that threads see an inconsistent first_call, but that first_call does not match the status of the other saved variables. A flush would seem to have the same problem: we don't know if the compiler might move the setting and flushing of first_call to the beginning of the block (OpenMP says nothing about ordering, it seems, and as far as Fortran itself goes, I don't know either). The attached patch also contains OpenMP statements for the metrics subdirectory. The only reason I did not drop them is that some of the code inside the if (first_call) sections calls CCTK_Equals, which can be quite expensive (if, e.g., Exact runs at each timestep) when run for each grid point (or multiple times when derivatives are taken inside Exact).

  9. Erik Schnetter

        if (first_call) then
        !$omp critical
          if (first_call) then
            ! initialise stuff
            !$omp flush
            first_call = .false.
          end if
        !$omp end critical
        end if

    I believe (would need to check the standard) that another flush is implied at the end of the critical region.

    I am not sure whether it is legal to read first_call while it may simultaneously be written inside the critical region. I believe this is not legal. One would need to move the read of first_call into the critical region as well (very expensive, bad idea), or use an atomic operation to read/write first_call:

        !$omp atomic
        my_first_call = first_call
        if (my_first_call) then
        !$omp critical
          if (first_call) then
            ! initialise stuff
          end if
        !$omp end critical
        !$omp atomic
        first_call = .false.
        end if

    In any case, this is getting far more complicated than it should be. Threadprivate variables are clearly much simpler, and if they work, I would prefer to use them.

  10. Frank Löffler reporter

    Why was the suggestion of a 'single' region dropped? It sounds quite clean to me.

  11. Erik Schnetter

    "single" alone would not solve the issue. It ensures that only a single thread executes the code (similar to "critical"), but the code would still be executed for every grid point by that thread. To ensure that the other threads see the result of the single region, one would need some kind of synchronisation, probably either a barrier or (again) an if statement checking first_call. I don't see any advantage. Could you give a particular example?

  12. Frank Löffler reporter

    A single region implies a barrier at the end, doesn't it?

    Of course it would be nice to avoid any barrier if possible. Exact was designed such that the metric functions are 'grid-point local'. They should not depend on something non-local like a 'saved variable'. In the current case, would it lead to regression of performance to simply remove 'firstcall' and 'arad' and use the parameter de_Sitterscale in the computation of 'am'? What is the reason for having 'arad' in the first place?

  13. Erik Schnetter

    A single region implies a barrier, but this barrier can be omitted with a nowait clause. If there were a barrier, things would break, since a barrier requires that each thread encounter it the same number of times, i.e. that each thread process exactly the same number of grid points, which we cannot guarantee.

    In many cases, saved variables can be omitted, especially if they only provide a shorter name for a parameter that has a much longer name. However, in many other cases saved variables really do speed things up, e.g. if they hold large precomputed matrices for boosts or evaluated Bessel functions.

    In this case, the simplest solution is to use threadprivate variables.

  14. Ian Hinder

    On Datura (Intel 11):

    arrangements/EinsteinInitialData/Exact/src/metrics/Bertotti.F77(41): error #6415: This name cannot be assigned this data type because it conflicts with prior uses of the name.   [BAZA]
          REAL*8 baza

  15. Erik Schnetter

    This particular use of the save attribute is a good example of what not to do... removing the save and threadprivate attributes, removing firstcall, and setting the variable at each iteration would probably save about 10 lines of code and speed things up a little.

  16. Erik Schnetter
    • changed status to resolved

    Made it compile by rearranging the declarations. Apparently, threadprivate has to come after the type declaration for the Intel compiler.
