Enable detection of `attribute(pure)`

Issue #2512 resolved
Erik Schnetter created an issue

Cactus has the ability (currently disabled) to detect whether the compiler supports attribute(pure). Pure functions do not have side effects and are only called for their result. This allows the compiler to avoid calling a function if the result is not used.

In the flesh, notably CCTK_VarIndex and CCTKi_VarDataPtrI are declared pure. These functions are called via the macro DECLARE_CCTK_ARGUMENTS. I observe cases (e.g. when using CarpetX thorns) where enabling this attribute reduces the size of functions significantly.

Note that CCTKi_VarDataPtrI is somewhat expensive. (I traced its call chain, and the way it checks whether the scheduled function has access to a given variable and time level is surprisingly convoluted.) Independent of that, not calling this function if the grid variable isn’t actually used would be good, and attribute(pure) is exactly the annotation that allows the compiler to do that.

In the past, I suspected this attribute to cause miscompiled code, at least on some systems with some compilers. We’ll have to test carefully when and where to enable it.

Comments (27)

  1. Roland Haas

Having attribute(pure) working would be great, in particular for CCTKi_VarDataPtrI, since it could reduce use of this function quite a lot in scheduled functions that only end up accessing a fraction of the variables in a thorn. Since you looked at them, do you have a suggestion where one could improve access checking in CCTKi_VarDataPtrI? This issue does not exist with DECLARE_CCTK_ARGUMENTS_Foo, which only declares variables that are listed in the READS/WRITES schedule statements, does it?

  2. Erik Schnetter reporter

Access checking should instead be done by the driver when it calls a scheduled function. Before the call, the driver fills in cctkGH with the pointers to all (!) variables and time levels. Checking access there would be straightforward (in Carpet and CarpetX, not in PUGH), and would also be cheaper since group-wise calculations have to be done only once per group, not once per variable. CCTK_VarDataPtrI would then simply look up the pointer in cctkGH, with proper index checking. The latter could also be avoided if we trust the driver.

    The checked macros also call CCTK_VarDataPtrI.

    Overall, this would reduce overhead both in the driver and in the flesh.

  3. Roland Haas

    Isn’t that array filled in at the enter_local_mode function and the enter_global_mode function for grid scalars and grid arrays, which do not (right now, and in the case of enter_global_mode cannot) know which scheduled function is to be run?

    The checked macros must not call CCTK_VarDataPtrI (without the “i”) since that function does not check access (it cannot, since it is called by non-presync-aware thorns). Probably this also means that the driver must monitor calls to Driver_RequireValidData et al. if it sets cctkGH->data[] to NULL, since a scheduled function may call that function to gain access at runtime.

  4. Gabriele Bozzola

    Just a quick comment: I recently did some extensive benchmarking of Einstein Toolkit and found that for my applications the function CCTKi_VarDataPtrI was taking up to 21.7 % of the entire simulation time!

    So, this change could lead to a noticeable speed up.

  5. Erik Schnetter reporter

    enter_local_mode is generally called from the scheduler, in CallFunction, when the driver knows which scheduled function is going to be called.

    enter_global_mode is usually called further outwards when a scheduled function is not known. We could change that, so that it is only called for particular scheduled functions. This would be straightforward in CarpetX, probably less so in Carpet.

    I would not worry about Driver_RequireValidData at the moment, as this is probably a function that is very rarely called. If we can set up a special case that speeds up thorns that don’t call this function I’d be very happy already. For example, such thorns could call CCTK_VarDataPtr instead of looking into the cGH structure.

  6. Roland Haas

    Can you provide the value of the parameter Cactus::presync_mode you are using?

    The CCTKi_VarDataPtrI function’s “new” thing due to presync is the bit:

      if(!CCTK_HasAccess(GH,vindex))
      {
        return NULL;
      }
    

    and CCTK_HasAccess has a short-circuit:

      static bool presync_only = CCTK_Equals(presync_mode, "presync-only");
    
      if(!presync_only)
        return true;
    

    which returns early if the static variable indicates that the mode is not presync-only. So if that mode is not set, the expense would seem to be the (implicit) mutex that makes the static initialization thread safe.
    The numbers in your table add up to 169.9%. Is it possible to see which function calls which other function?

  7. Gabriele Bozzola

    I set Cactus::presync_mode = "mixed-error" as not all the thorns I use have READ/WRITE statements (e.g., IllinoisGRMHD).

    This is the call tree for the functions I reported.

    EDIT: As is clear from this table, IllinoisGRMHD is the culprit. If adding the READ/WRITE statements to this thorn removes the need to call this function, this would make the code almost twice as fast. If this is the case, then it is high priority to do this, as it would literally save several days of computations. I am not sure I have the required knowledge to do this properly, but I can try (if doing that would lead to such a speed up).

  8. Roland Haas

    Linux perf timing for a 1 thread qc0-mclachlan.par test run (until it aborts due to multiple components on the single MPI rank).

    Most time is spent in:

    # Overhead  Command     Shared Object                Symbol                                             # 
        14.17%  cactus_sim  cactus_sim                   [.] ML_BSSN::ML_BSSN_EvolutionInteriorSplitBy3_Body
        11.16%  cactus_sim  cactus_sim                   [.] ML_BSSN::ML_BSSN_EvolutionInteriorSplitBy2_Body
        10.29%  cactus_sim  cactus_sim                   [.] CarpetLib::prolongate_3d_rf2<double, 5>
         6.58%  cactus_sim  cactus_sim                   [.] ML_BSSN::ML_BSSN_EvolutionInteriorSplitBy1_Body
         4.64%  cactus_sim  cactus_sim                   [.] apply_dissipation_._omp_fn.2
         4.38%  cactus_sim  cactus_sim                   [.] ML_BSSN::ML_BSSN_ADMBaseInterior_Body
         4.19%  cactus_sim  libm-2.31.so                 [.] __ieee754_pow_sse2
    

    and CCTKi_VarDataPtrI is 0.01%. So this could be due to McLachlan being C and Lean being Fortran (hence no Fortran wrappers). The Fortran wrappers are terrible in that they contain calls to CCTKi_GroupLengthAsPointer for all vectors of grid functions (which takes a group name, i.e. a string, as an argument).

  9. Gabriele Bozzola

    I updated my previous comment with a better table from which we can see that the problem is IllinoisGRMHD

  10. Roland Haas

    Thank you. The table is odd. It shows CCTKi_VarDataPtrI calls from within compute_tau_rhs_extrinsic_curvature_terms_and_TUPmunu, but that function has no such calls (certainly not explicitly, but also not via DECLARE_CCTK_ARGUMENTS). Such a call inside of an omp parallel region in that function (in wvuthorn’s master) would be somewhat suspicious. I have the suspicion that the Intel compiler (that is the one you used) is confusing the profiler. The source file is compute_tau_rhs_extrinsic_curvature_terms_and_TUPmunu.C.

  11. Roland Haas

    If you want to test whether the overhead is an issue (you cannot really avoid CCTKi_VarDataPtrI, but other than the CCTK_HasAccess call that function is basically a cctkGH->data[varindex][tl] lookup), you would have to comment out the CCTK_HasAccess call in src/main/GroupsOnGH.c.

  12. Gabriele Bozzola

    I am using vtune, and I would expect the Intel compiler to work well with the Intel profiler, but maybe the various macros in Cactus are throwing it off.

    This is the reported callee tree:

    The time in the ones that are 0.00% is not identically 0, so it is not that everything was erroneously attributed to the first entry.

    I can test a TOV star with McLachlan to see if this is still there.

  13. Roland Haas

    Just to be sure: you are running master of IllinoisGRMHD, not some local version where someone stuck a DECLARE_CCTK_ARGUMENTS into the innermost loop?

  14. Gabriele Bozzola

    Just to be sure: you are running master of IllinoisGRMHD, not some local version where someone stuck a DECLARE_CCTK_ARGUMENTS into the innermost loop?

    Good point. I am running a modified version. Currently, there’s no DECLARE_CCTK_ARGUMENTS in the body of compute_tau_rhs_extrinsic_curvature_terms_and_TUPmunu.C, but I cannot guarantee that it was not there when I did the benchmark. I can probably repeat a smaller scale benchmark to see if the problem is still there.

  15. Roland Haas

    @Gabriele Bozzola did you manage to find out if indeed CCTK_VarDataPtrI is as expensive as originally thought or if this is an issue with the performance monitor mis-representing data?

  16. Gabriele Bozzola

    Hi Roland. I have yet to redo the test, but I am running full simulations with a similar setup. If I look at the timer report from Cactus, I find that IllinoisGRMHD_RHS_eval takes about twice as much time (precisely, 2.5 times as much) as LeanBSSN_CalcRHS. This seems to be consistent with my initial finding. I cannot exclude that we introduced some big inefficiency in our private version of IllinoisGRMHD. This simulation is on 498 MPI processes, if that matters.

  17. Roland Haas

    A ratio of 2:1 for time spent in the hydro RHS compared to the vacuum RHS is actually what I used to find when running GRHydro simulations, so that ratio in itself does not seem too unusual to me.

  18. Gabriele Bozzola

    If the performance is as expected, and no one has ever observed this behavior before, I would conclude that the problem very likely lies in VTune (or in how I used it).

    The next time I profile my code, I will try the option you posted in a previous comment and report here in case the results persist.

  19. Erik Schnetter reporter

    Gabriele

    There are two kinds of performance measurement tools: profilers (which instrument, i.e. modify, the code) and tracers (which examine snapshots of the run every so many milliseconds).

    Profilers have the advantage that they see all the code, whereas tracers are blind to what happens in between the snapshots. The big disadvantage of profilers is that the code they insert might run too often, and this might slow down or distort the run. For example, if a profiler counts how often a particular routine is executed, and that routine ends up compiling to just a few machine instructions after inlining, then the overhead from profiling (at least two atomic read/write operations to memory) might be much larger than the cost of the routine, and this might produce very distorted and unrealistic results.

    These days, almost all tools are tracers. I’ve been using HPCToolkit recently. I’m sure VTune has a tracing mode as well. I think GCC has an option -pg to compile with profiling information – I would not use this method these days.

    (I say “these days” because it used to be the case that all routines were expensive. With the advent of C++, functions can be small and can be completely optimized away, and this has made profiling with traditional tools useless.)

  20. Roland Haas

    This ticket seems to have gone off topic. The original issue, namely (re-)activating checks and use of attribute(pure) in Cactus, would still seem worthwhile.

  21. Roland Haas

    I will re-enable CCTK_ATTRIBUTE_PURE shortly after the 2021_11 release so that we can test if this breaks anything.
