Enable detection of `attribute(pure)`
Cactus has the ability (currently disabled) to detect whether the compiler supports `attribute(pure)`. Pure functions do not have side effects and are called only for their result. This allows the compiler to avoid calling a function if the result is not used.
In the flesh, notably `CCTK_VarIndex` and `CCTKi_VarDataPtrI` are declared pure. These functions are called via the macro `DECLARE_CCTK_ARGUMENTS`. I observe cases (e.g. when using CarpetX thorns) where enabling this attribute reduces the size of a function significantly.
Note that `CCTKi_VarDataPtrI` is somewhat expensive. (I traced its call chain, and the way it checks whether the scheduled function has access to a given variable and time level is surprisingly convoluted.) Independent of that, not calling this function if the grid variable isn't actually used would be good, and `attribute(pure)` is exactly the annotation that allows the compiler to do that.
In the past, I suspected this attribute to cause miscompiled code, at least on some systems with some compilers. We’ll have to test carefully when and where to enable it.
Comments (27)
- reporter Access checking should instead be done by the driver when it calls a scheduled function. Before the call, the driver fills in `cctkGH` with the pointers to all (!) variables and time levels. Checking access there would be straightforward (in Carpet and CarpetX, not in PUGH), and would also be cheaper, since group-wise calculations have to be done only once per group, not once per variable. `CCTK_VarDataPtrI` would then simply look up the pointer in `cctkGH`, with proper index checking. The latter could also be avoided if we trust the driver. The checked macros also call `CCTK_VarDataPtrI`. Overall, this would reduce overhead both in the driver and in the flesh.
- Isn't that array filled in by the `enter_local_mode` function, and by the `enter_global_mode` function for grid scalars and grid arrays? Which do not (right now, and in the case of `enter_global_mode` cannot) know which scheduled function is to be run? The checked macros must not call `CCTK_VarDataPtrI` (without the "i") since that function does not check access (it cannot, since it is called by non-presync-aware thorns). Probably this also means that the driver must monitor calls to `Driver_RequireValidData` et al. if it sets `cctkGH->data[]` to `NULL`, since a scheduled function may call that function to gain access at runtime.
- Just a quick comment: I recently did some extensive benchmarking of the Einstein Toolkit and found that for my applications the function `CCTKi_VarDataPtrI` was taking up to 21.7 % of the entire simulation time! So, this change could lead to a noticeable speed-up.
- reporter `enter_local_mode` is generally called from the scheduler, in `CallFunction`, when the driver knows which scheduled function is going to be called. `enter_global_mode` is usually called further outwards, when a scheduled function is not known. We could change that, so that it is only called for particular scheduled functions. This would be straightforward in CarpetX, probably less so in Carpet. I would not worry about `Driver_RequireValidData` at the moment, as this is probably a function that is very rarely called. If we can set up a special case that speeds up thorns that don't call this function, I'd be very happy already. For example, such thorns could call `CCTK_VarDataPtr` instead of looking into the `cGH` structure.
- Can you provide the value of the parameter `Cactus::presync_mode` you are using? The "new" thing in the `CCTKi_VarDataPtrI` function due to presync is the bit:

  ```c
  if (!CCTK_HasAccess(GH, vindex)) {
    return NULL;
  }
  ```

  and `CCTK_HasAccess` has a short-circuit:

  ```c
  static bool presync_only = CCTK_Equals(presync_mode, "presync-only");
  if (!presync_only) return true;
  ```

  which returns early if the static variable indicates that the mode is not `presync-only`. So if that mode is not set, the expense would seem to be the (implicit) mutex that makes the static initialization thread-safe.
The numbers in your table add up to 169.9 %; is it possible to see which function calls which other function?
-
I set `Cactus::presync_mode = "mixed-error"`, as not all the thorns I use have READ/WRITE statements (e.g., `IllinoisGRMHD`). This is the call tree for the functions I reported.

EDIT: As is clear from this table, `IllinoisGRMHD` is the culprit. If adding the READ/WRITE statements to this thorn removes the need to call this function, it would make the code almost twice as fast. If that is the case, then it is high priority to do this, as it would literally save several days of computation. I am not sure I have the required knowledge to do this properly, but I can try (if doing so would lead to such a speed-up).
- attached perf-qc0-mclachlan.txt
Linux perf timing for a 1 thread qc0-mclachlan.par test run (until it aborts due to multiple components on the single MPI rank).
Most time is spent in:
```
# Overhead  Command     Shared Object  Symbol
#
   14.17%   cactus_sim  cactus_sim     [.] ML_BSSN::ML_BSSN_EvolutionInteriorSplitBy3_Body
   11.16%   cactus_sim  cactus_sim     [.] ML_BSSN::ML_BSSN_EvolutionInteriorSplitBy2_Body
   10.29%   cactus_sim  cactus_sim     [.] CarpetLib::prolongate_3d_rf2<double, 5>
    6.58%   cactus_sim  cactus_sim     [.] ML_BSSN::ML_BSSN_EvolutionInteriorSplitBy1_Body
    4.64%   cactus_sim  cactus_sim     [.] apply_dissipation_._omp_fn.2
    4.38%   cactus_sim  cactus_sim     [.] ML_BSSN::ML_BSSN_ADMBaseInterior_Body
    4.19%   cactus_sim  libm-2.31.so   [.] __ieee754_pow_sse2
```
and `CCTKi_VarDataPtrI` is at 0.01 %. So this could be due to McLachlan being C and Lean being Fortran (hence no Fortran wrappers). The Fortran wrappers are terrible in that they contain calls to `CCTKi_GroupLengthAsPointer` for all vectors of grid functions (which takes a group name, i.e. a string, as an argument).
-
I updated my previous comment with a better table, from which we can see that the problem is `IllinoisGRMHD`.
-
Thank you. The table is odd. It shows `CCTKi_VarDataPtrI` calls from within `compute_tau_rhs_extrinsic_curvature_terms_and_TUPmunu`, but that function has no such calls (certainly not explicitly, but also not in `DECLARE_CCTK_ARGUMENTS`). That function (in wvuthorns' master) being inside of an omp parallel region would be somewhat suspicious. I have the suspicion that the Intel compiler (the one you used) is confusing the profiler. The source file is `compute_tau_rhs_extrinsic_curvature_terms_and_TUPmunu.C`.
-
If you want to test whether the overhead is an issue, you would have to comment out the `CCTK_HasAccess` call in `src/main/GroupsOnGH.c`. (You cannot really avoid `CCTKi_VarDataPtrI`, but, other than the `CCTK_HasAccess` call, that function is basically a `cctkGH->data[varindex][tl]` lookup.)
-
I am using VTune, and I would expect the Intel compiler to work well with the Intel profiler, but maybe the various macros in Cactus are throwing it off.
This is the reported callee tree:
The time in the ones that are 0.00% is not identically 0, so it is not that everything was erroneously attributed to the first entry.
I can test a TOV star with McLachlan to see if this is still there.
-
Just to be sure: you are running master of IllinoisGRMHD, not some local version where someone stuck a `DECLARE_CCTK_ARGUMENTS` into the innermost loop?
-
> Just to be sure: you are running master of IllinoisGRMHD, not some local version where someone stuck a `DECLARE_CCTK_ARGUMENTS` into the innermost loop?

Good point. I am running a modified version. Currently, there's no `DECLARE_CCTK_ARGUMENTS` in the body of `compute_tau_rhs_extrinsic_curvature_terms_and_TUPmunu.C`, but I cannot guarantee that it was not there when I did the benchmark. I can probably repeat a smaller-scale benchmark to see if the problem is still there.
-
You may also find useful the suggested options on https://software.intel.com/content/www/us/en/develop/documentation/vtune-help/top/set-up-analysis-target/linux-targets/compiler-switches-for-performance-analysis-on-linux-targets.html, where e.g. `-parallel-source-info=2` would be one option that they recommend but that the ET's option lists would typically not set (we tend to set `-g`, and `-O2` is the Intel compiler's default).
-
@Gabriele Bozzola did you manage to find out whether `CCTK_VarDataPtrI` is as expensive as originally thought, or whether this is an issue with the performance monitor misrepresenting data?
-
Hi Roland. I have yet to redo the test, but I am running full simulations with a similar setup. If I look at the timer report from Cactus, I find that `IllinoisGRMHD_RHS_eval` takes about twice as long (precisely, 2.5 times as long) as `LeanBSSN_CalcRHS`. This seems to be consistent with my initial finding. It cannot be excluded that we introduced some big inefficiency in our private version of `IllinoisGRMHD`. This simulation is on 498 MPI processes, if that matters.
-
A ratio of 2:1 for time spent in the hydro RHS compared to the vacuum RHS is actually what I used to find when running GRHydro simulations, so that ratio in itself does not seem too unusual to me.
-
If the performance is as expected, and no one has ever observed this behavior before, I would conclude that the problem very likely lies in VTune (or in how I used it). The next time I profile my code, I will try the option that you posted in a previous comment and report here if the results persist.
-
reporter Gabriele,

There are two kinds of performance measurement tools: profilers (which instrument, i.e. modify, the code) and tracers (which examine snapshots of the run every so many milliseconds).

Profilers have the advantage that they see all the code, whereas tracers are blind to what happens in between the snapshots. The big disadvantage of profilers is that the code they insert might run too often, and this might slow down or distort the run. For example, if a profiler counts how often a particular routine is executed, and that routine ends up compiling to just a few machine instructions after inlining, then the overhead from profiling (at least two atomic read/write operations to memory) might be much larger than the cost of the routine, and this might produce very distorted and unrealistic results.

These days, almost all tools are tracers. I've been using HPCToolkit recently. I'm sure VTune has a tracing mode as well. I think GCC has an option `-pg` to compile with profiling information; I would not use this method these days. (I say "these days" because it used to be the case that all routines were expensive. With the advent of C++, functions can be small and can be completely optimized away, and this made profiling with traditional tools useless.)
-
This ticket seems to have gone off topic. The original issue, namely (re-)activating detection and use of `attribute(pure)` in Cactus, would still seem worthwhile.
-
I will re-enable `CCTK_ATTRIBUTE_PURE` shortly after the 2021_11 release so that we can test whether this breaks anything.
- changed status to open
- changed status to resolved
- Having `attribute(pure)` working would be great, in particular for `CCTKi_VarDataPtrI`, since it could reduce use of this function quite a lot in scheduled functions that only end up accessing a fraction of the variables in a thorn. Since you looked at them, do you have a suggestion where one could improve access checking in `CCTKi_VarDataPtrI`? This issue does not exist with `DECLARE_CCTK_ARGUMENTS_Foo`, which only declares variables that are listed in the `READS`/`WRITES` schedule statements, does it?