Different simulation results with different numbers of cores

Issue #2516 open
Former user created an issue

I have been using the Mayer version (2019_10) of the EinsteinToolkit to do some BNS simulations, but only very recently did I realise that there is some discrepancy in the outputs between simulations run with different numbers of cores. The discrepancy starts to grow after quite a while, and at some point the relative error can reach several tens of percent.

I have got mainly two sets of output after running the same simulation with different numbers of cores: (i) 8 / 12 / 32 / 36 and (ii) 16 / 17 / 24 / 48. Within each set, the outputs agree to roughly 1 part in 10^10 even for a long simulation time. I simply do not know which data I should trust...

And when I tried to use more cores to test (e.g. 64), the code crashed almost immediately with (a bunch of) the following errors:

[ml=0 rl=0 c=6] The following grid structure consistency check failed:
   The communicated region must be contained in the active part of the domain
   comm <= domain_active

and

Cactus/configs/ET2/build/CarpetLib/dh.cc:2123:
   -> The grid structure is inconsistent. It is impossible to continue.

The relevant parameter settings for the grid are:

CoordBase::xmin =    0.00
CoordBase::ymin = -492.80
CoordBase::zmin =    0.00
CoordBase::xmax = +492.80
CoordBase::ymax = +492.80
CoordBase::zmax = +492.80
CoordBase::dx   =   22.4
CoordBase::dy   =   22.4
CoordBase::dz   =   22.4
...
Carpet::max_refinement_levels    = 8 
Carpet::prolongation_order_space = 5
Carpet::prolongation_order_time  = 2
Carpet::refinement_centering     = "vertex"
...
CarpetRegrid2::num_levels_2 = 6
CarpetRegrid2::position_x_2 = -34.0
CarpetRegrid2::radius_2[1]  =240.0
CarpetRegrid2::radius_2[2]  =120.0
CarpetRegrid2::radius_2[3]  = 52.0
CarpetRegrid2::radius_2[4]  = 40.0
CarpetRegrid2::radius_2[5]  = 30.0

Has anyone encountered this kind of problem before? I thought the rule of thumb was to use 2^N cores, but it turns out that the outputs for 16 and 32 cores are quite different. What could be the issue here? Please direct me to any tickets concerning a similar issue. Any comments would also be appreciated.

Comments (9)

  1. Roland Haas

    If you run on (way) too many MPI ranks then the number of points in a direction can become too small and all kinds of things may break. While one would not like this to happen, it may be hard to avoid in extreme situations.

    There are some things that can affect results:

    • for hydro runs (not usually for pure vacuum runs), checkpoint / recovery is not bitwise identical, due to the MoL_PostStep group in POST_RECOVER_VARIABLES, which contains a con2prim that is not a projection (i.e. applying it twice changes the result). This can be disabled with a MoL parameter.
    • OpenMP reductions (used in all reductions) are not guaranteed to give the same result each time they are used, because the order of operations is not fixed, so the results differ at the roundoff level. This affects only output (usually; unless you feed a reduction result back into the simulation, e.g. via an integral). See the small sketch after this list.
    • MPI reductions are the same, though: the MPI standard encourages implementations to return the same answer no matter the number of MPI ranks or the ordering of the ranks.
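
    As a rough illustration of the OpenMP point (this is not ET code, just a minimal Python sketch): floating-point addition is not associative, so combining partial sums in a different order changes the last bits of the result.

    # Floating-point addition is not associative: a parallel reduction that
    # groups the partial sums differently can change the last bits of the result.
    a, b, c = 0.1, 0.2, 0.3
    left_to_right = (a + b) + c            # one possible reduction order
    right_to_left = a + (b + c)            # another possible order
    print(left_to_right)                   # 0.6000000000000001
    print(right_to_left)                   # 0.6
    print(left_to_right == right_to_left)  # False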

    There is nothing special about 2^N cores from Cactus’s perspective (it’s nice that it has only a single, small prime factor, which may make the domain decomposition a bit easier than, say, 68 = 17*4 cores).

    My first suggestion would be to set the MoL parameter that leads to bit identical recovery:

    MoL::run_MoL_PostStep_in_Post_Recover_Variables = "no"
    

  2. Ken Hui

    Thanks for the suggestion on the MoL parameter setting, I will give that a try.

    You mentioned that OpenMP might give different results due to the unfixed order of operations, but I get exactly the same output data (what I have been comparing is mainly the h+, hx in mp_Psi4_l2_m2_rXXX.00.asc and the rho maximum in hydrobase-rho.scalars.asc), with no discrepancy at all, if I run the simulation again with the same number-of-cores setting. Does that mean the problem is not related to OpenMP?

    For all simulations I have done, I start the simulation from scratch each time (reading the BNS initial data all over again), so the issue I am having is probably not related to the checkpoint/recovery part.

    Anyway, thanks again for the suggestion; I will try re-running the simulations to see if I still get very different output using different core counts.

  3. Roland Haas

    Unfortunately, running twice with the same settings and getting the same result does not mean that the result does not depend on the order of operations in OpenMP. It only shows that OpenMP chose the same order both times. You would have to check that you get the same result with 1 OpenMP thread as you get with 2, 3, 4, etc. (It’s essentially the difference between deterministic but unpredictable and truly random.)

    To be clear: there is no checkpoint / recovery cycle in any of your runs?

    Since you are looking at timeseries data in mp_psi4: do you observe all data to be off or is it only individual timesteps that are off?

    You may want to check the actual grid data (e.g. ask for 3D output and compare the values in the output files) rather than the multipole output, to see whether the issue is in the evolution or in the analysis (Multipole itself, or the interpolators).
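
    For the 3D comparison, a hypothetical Python sketch could look like the following. It is not part of the ET; it assumes HDF5 grid output (e.g. from CarpetIOHDF5) and uses placeholder file names.

    # Compare the same 3D HDF5 output file from two runs, dataset by dataset,
    # and report the largest relative difference found. File names and the
    # dataset layout are assumptions; adjust them to your actual output.
    import h5py
    import numpy as np

    def max_relative_difference(file_a, file_b):
        worst = 0.0
        with h5py.File(file_a, "r") as fa, h5py.File(file_b, "r") as fb:
            for name in fa:
                if name not in fb or not isinstance(fa[name], h5py.Dataset):
                    continue                      # skip groups and missing datasets
                da, db = np.asarray(fa[name]), np.asarray(fb[name])
                if da.shape != db.shape or not np.issubdtype(da.dtype, np.floating):
                    continue                      # only compare matching float data
                denom = np.maximum(np.abs(da), np.abs(db))
                rel = np.max(np.abs(da - db) / np.where(denom > 0.0, denom, 1.0))
                worst = max(worst, rel)
        return worst

    # Example with hypothetical paths:
    # print(max_relative_difference("run_16cores/rho.h5", "run_32cores/rho.h5"))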

  4. Ken Hui

    For all the simulations I did for this test there is no checkpoint/recovery cycle, and the time-series h+, hx data in mp_psi4 are different on (almost) every line. I mentioned that there are two groups of output:

    • (i) number of cores = 8 / 12 / 32 / 36
    • (ii) number of cores = 16 / 17 / 24 / 48

    What I meant is that, among all simulations in the same group, the mp_psi4 outputs differ only after about 10 decimal places, even after a long simulation time. If I compare group (i) against group (ii), for example the mp_psi4 results from 16 cores versus 32 cores, at first they differ at roughly 1 part in 10^10, but later on the discrepancy starts growing (see figure, viewed with vim -d).

    For the OpenMP part, if I understand correctly, you mean that even if I issue

    export OMP_NUM_THREADS=1
    

    before the mpirun command every time, there might still be problems caused by it?

  5. Roland Haas

    Setting

    export OMP_NUM_THREADS=1
    

    would at least remove the known and documented source of differences, given that you have no checkpointing / recovery. With a single thread I would expect your results to be exactly identical no matter how many MPI ranks you use. If they are not identical, then that could indicate a bug in the ET (or a bug in my understanding of the ET).
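
    If it helps, a simple (hypothetical) way to check for bit-identical output between two single-thread runs is to hash the corresponding files. The run directory names below are placeholders, and any header lines containing run-specific metadata would need to be stripped first.

    # Check whether two runs produced bit-identical output files by hashing them.
    # The run directories are placeholders for your own runs.
    import hashlib
    from pathlib import Path

    def file_digest(path):
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    run_a = Path("run_16cores")   # placeholder run directories
    run_b = Path("run_32cores")

    for f in sorted(run_a.glob("mp_Psi4_l2_m2_r*.asc")):
        other = run_b / f.name
        if other.exists():
            same = file_digest(f) == file_digest(other)
            print(f"{f.name}: {'identical' if same else 'DIFFERS'}")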

  6. Ken Hui

    That really sounds bad... I shall assume this potential “bug” is not easy to find. My original thought, after seeing the two separate self-consistent data groups, was that there might somehow be some intrinsic selection rule on the core number; that is why I tried running with 17 cores, hoping to see it give weird results, but contrary to expectation the data from this run also fit into one of the two data groups.

    Well at least allow me to make sure I have understood the following correctly:

    1. The MoL parameter you mentioned earlier appears to be related only to recovery, so there is no need to try setting it for this test.
    2. The “The grid structure is inconsistent. It is impossible to continue.” error I got with 64 cores is probably simply because I requested too many cores.
    3. There is no explicit rule for choosing the number of cores in Cactus. I had thought that setting some ugly number N (let's say 17) might give bad results because the domain cannot be divided well into N parts.

  7. Zach Etienne

    @Ken Hui : If I were you I would first measure the level of roundoff error by performing a simulation with the initial data perturbed at the least significant digit(s), and monitoring how the result differs over time. GRMHD codes are particularly susceptible to rapid growth of roundoff errors, largely due to the conservatives-to-primitives solver. My guess is you might be surprised at how quickly significant digits of agreement are lost, especially when magnetic fields are nonzero.
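
    A hypothetical sketch of this kind of monitoring (assuming the mp_psi4 ASCII files have columns time, Re, Im; the file names are placeholders):

    # Print roughly how many decimal digits two runs agree on, as a function of
    # time, using the l=2, m=2 Psi4 output. The column layout (time, Re, Im)
    # and the file paths are assumptions.
    import numpy as np

    def digits_of_agreement(file_a, file_b, column=1):
        da, db = np.loadtxt(file_a), np.loadtxt(file_b)
        n = min(len(da), len(db))
        t = da[:n, 0]
        x, y = da[:n, column], db[:n, column]
        denom = np.maximum(np.abs(x), np.abs(y))
        rel = np.abs(x - y) / np.where(denom > 0.0, denom, 1.0)
        rel = np.maximum(rel, 1e-16)          # cap at double precision
        return t, -np.log10(rel)              # ~16 means full agreement

    # Example with hypothetical paths:
    # t, d = digits_of_agreement("run_unperturbed/mp_Psi4_l2_m2_r300.00.asc",
    #                            "run_perturbed/mp_Psi4_l2_m2_r300.00.asc")
    # for ti, di in zip(t[::100], d[::100]):
    #     print(f"t = {ti:8.1f}   agreeing digits ~ {di:4.1f}")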

  8. Ken Hui

    I have been trying to slightly change the parameter settings to narrow down what the problem is related to. I realised that if I allow more space between the different refinement levels, i.e. change the “radius” values of CarpetRegrid2, I can now generate pretty much the same result with different numbers of cores.

    CarpetRegrid2::radius_2[1]  =240.0
    CarpetRegrid2::radius_2[2]  =120.0
    CarpetRegrid2::radius_2[3]  = 52.0 ===> 64 !!!!!!
    CarpetRegrid2::radius_2[4]  = 40.0
    CarpetRegrid2::radius_2[5]  = 30.0 ===> 28 !!!!!!
    

    Originally I thought the coarser levels only need to cover the finer levels plus 12 points of buffer; allowing a buffer zone of 16 points somehow solves the problem. In my setup, the resolution dx on the first few finest levels is 0.7M, 1.4M, etc., and I have reset the “radius” values accordingly (a small check of this arithmetic follows the list):

    • 28 + 0.7 * (16 points buffer) = 39.2 < 40 (next level radius)
    • 40 + 1.4 * 16 = 62.4 < 64
    • 64 + 2.8 * 16 = 108.8 < 120
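
    Here is the small check of this arithmetic mentioned above. This is just my reading of the setup: an assumed buffer of 16 points on each finer level, with dx doubling on each coarser level.

    # Check that each level's radius exceeds the next finer level's radius plus
    # buffer_points grid spacings of that finer level. Values are from the post;
    # the 16-point buffer is an assumption.
    finest_dx = 0.7                            # finest-level grid spacing
    buffer_points = 16                         # assumed buffer width in points
    radii = [28.0, 40.0, 64.0, 120.0, 240.0]   # finest to coarsest (revised values)

    for lvl, (r_fine, r_coarse) in enumerate(zip(radii, radii[1:])):
        dx_fine = finest_dx * 2**lvl           # spacing of the finer level
        required = r_fine + buffer_points * dx_fine
        status = "OK" if r_coarse > required else "too tight"
        print(f"{r_fine:6.1f} + {buffer_points}*{dx_fine:4.1f} = {required:6.1f}"
              f" < {r_coarse:6.1f}?  {status}")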

    However, for the BNS benchmark run (nsnstohmns) I do not see a difference in simulation results with different numbers of cores even when putting the refinement levels around the stars closer to each other, which makes me wonder whether the above “radius” settings were really the cause.

    Any comments or updates on this issue are appreciated!
