Refinement prolongation: All points must have been received

Issue #2670 closed
Former user created an issue
-----------------------------------------------------------------------------------------------------------------------
Iteration      Time | *me_per_hour |             GRHYDRO::dens | *ROBASE::rho | *::w_lorentz | *STRAINTS::H | *axrss_mb
                    |              |          sum      maximum |      maximum |      maximum |      maximum |   maximum
-----------------------------------------------------------------------------------------------------------------------
     9840   184.500 |   98.5293558 |    0.0004144    0.0038553 |    0.0011696 |    1.2604477 |    0.0020844 |      2839
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.375,12.5625,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (0.375,-12.5625,0)
INFO (NSTracker): Found star at (-0.375,12.5625,0)
     9844   184.575 |   98.5440274 |    0.0004144    0.0038556 |    0.0011696 |    1.2596245 |    0.0020791 |      2839
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.375,12.5625,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (0.375,-12.5625,0)
INFO (NSTracker): Found star at (-0.5625,12.5625,0)
     9848   184.650 |   98.5682477 |    0.0004144    0.0038559 |    0.0011696 |    1.2568424 |    0.0020807 |      2839
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.5625,12.5625,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (0.5625,-12.5625,0)
INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 0
INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 1

==> projectdns_maxwell_65_300_1_11251505_.err <==
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([box.active=bboxset<CCTK_INT4,3>928,8416,768]:[1056,8668,1048]:[4,4,4]/[232,2104,192]:[264,(set<bbox>:{([768,8416,768]:[924,8668,1048]:[4,4,4]/[192,2104,192]:[2312167,262]/[33,64,71]/149952)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>,2167,262]/[40,64,71]/181760)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([928,8416,768]:[1056,8516,1048]:[4,4,4]/[232,2104,192]:[264,2129(set<bbox>:{([768,8416,768]:[924,8516,1048]:[4,4,4]/[192,2104,192]:[231,2129,262]/[40,26,262]/[33,26,71]/60918)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 1
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->,71]/73840)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 0
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
   [ml=0 rl=6 c=1] The following grid structure consistency check failed:
   Refinement prolongation: All points must have been received
   needrecv.empty()

/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
   [ml=0 rl=6 c=0] The following grid structure consistency check failed:
   Refinement prolongation: All points must have been received
   needrecv.empty()

==> projectdns_maxwell_65_300_1_11251505_.out <==
WARNING level 1 from host node2 process 0
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
   [ml=0 rl=6 c=0] The following grid structure consistency check failed:
   Refinement prolongation: All points must have been received
   needrecv.empty()

==> projectdns_maxwell_65_300_1_11251505_.err <==
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9400,768]:[1056,9504,1048]:[4,4,4]/[192,2350,192]:[264,2376,262]/[73,27,71]/139941)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9404,768]:[1056,9504,1048]:[4,4,4]/[192,2351,192]:[264,2376,262]/[73,26,71]/134758)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 7
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
   [ml=0 rl=6 c=7] The following grid structure consistency check failed:
   Refinement prolongation: All points must have been received
   needrecv.empty()
WARNING level 1 from host node2 process 0
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
   [ml=0 rl=6 c=0] The following grid structure consistency check failed:
   Synchronisation and boundary prolongation: All points must have been received
   needrecv.empty()

==> projectdns_maxwell_65_300_1_11251505_.out <==
WARNING level 1 from host node2 process 0
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
   [ml=0 rl=6 c=0] The following grid structure consistency check failed:
   Synchronisation and boundary prolongation: All points must have been received
   needrecv.empty()

==> projectdns_maxwell_65_300_1_11251505_.err <==
WARNING level 1 from host node2 process 1
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
   [ml=0 rl=6 c=1] The following grid structure consistency check failed:
   Synchronisation and boundary prolongation: All points must have been received
   needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 0 with PID 285012 received signal 6
Writing backtrace to dns/backtrace.0.txt
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 1 with PID 285013 received signal 6
Writing backtrace to dns/backtrace.1.txt
WARNING level 1 from host node2 process 6
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
   [ml=0 rl=6 c=6] The following grid structure consistency check failed:
   Synchronisation and boundary prolongation: All points must have been received
   needrecv.empty()
WARNING level 1 from host node2 process 7
  in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
  ->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
   [ml=0 rl=6 c=7] The following grid structure consistency check failed:
   Synchronisation and boundary prolongation: All points must have been received
   needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 6 with PID 285018 received signal 6
Writing backtrace to dns/backtrace.6.txt
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 7 with PID 285019 received signal 6
Writing backtrace to dns/backtrace.7.txt

==> projectdns_maxwell_65_300_1_11251505_.out <==

Can anyone point me in some direction on how to solve this problem? Thanks so much.

Comments (22)

  1. Roland Haas

    These errors are usually caused by issues with the grid setup that develop over time in the simulation. The error is unfortunately only detectable deep inside of Carpet where the original cause is no longer obvious. With just the error message it is not really possible to help diagnose it.

    To help with this, more information is needed:

    1. the .out and .err files produced by the simulation
    2. the RunScript as found in output-NNNN/SIMFACTORY
    3. the SubmitScript as found in output-NNNN/SIMFACTORY
    4. if a public cluster, the name of the cluster used, otherwise the option list used

    With those it may be possible to provide meaningful advice. Otherwise, shot-in-the-dark suggestions would be:

    1. do not use too many MPI ranks (no more than 128 for a mid-res BNS run); this avoids the domain parts assigned to each MPI rank becoming too small
    2. check if your box locations have moved out of the grid (i.e. something went wrong with tracking)
    3. check for NaNs in evolution variables
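
The second suggestion can be checked from the .out file alone. The following is a minimal sketch, assuming the CarpetTracker line format seen in the log above; the `half_extent` argument is a hypothetical stand-in for the outer boundary of your own grid, which is not given in this thread.

```python
import re

# Matches lines such as:
# INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.375,12.5625,0)
TRACKER_RE = re.compile(
    r"INFO \(CarpetTracker\): Setting position of refined region #(\d+) "
    r"from surface #(\d+) to \(([^)]*)\)"
)


def tracked_positions(log_text):
    """Yield (region, (x, y, z)) for every CarpetTracker position line."""
    for match in TRACKER_RE.finditer(log_text):
        region = int(match.group(1))
        x, y, z = (float(v) for v in match.group(3).split(","))
        yield region, (x, y, z)


def positions_outside(log_text, half_extent):
    """Return positions with a coordinate that is NaN or beyond +/- half_extent.

    half_extent is an assumption: set it to the outer boundary of your grid.
    """
    bad = []
    for region, pos in tracked_positions(log_text):
        # `not (abs(c) <= half_extent)` is also true when c is NaN.
        if any(not (abs(c) <= half_extent) for c in pos):
            bad.append((region, pos))
    return bad
```

An empty result only says the tracked centres stayed inside the chosen extent; it does not rule out NaNs in the evolution variables themselves.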

  2. Roland Haas

    Hmm, one thing I would suggest changing would be to use fewer threads per MPI rank. Right now you have set:

    export OMP_NUM_THREADS=24
    

    which gives you 24 OpenMP threads per MPI rank. The SubmitScript you use is (essentially, up to comments) generic.sub as far as I can tell, so really only designed for a non-cluster environment (it may work on a cluster, but that would be kind of accidental).

    Right now it seems that you are using 8 MPI ranks each with 24 OpenMP threads. So this is a total of 192 cores. So this should be somewhere between 4 and 8 nodes, yes?

    My suggestion would be to try and use only about 8 OpenMP threads and correspondingly more MPI ranks, so use --cores 192 --num-threads 8 instead of --cores 192 --num-threads 24 which is what you seem to have used.

    In principle, more threads should of course not make things fail (this would indeed be a bug), though it may be quite hard to reproduce since it would, most likely, be a race condition that only shows up with large thread counts. Also note that multi-threading in Cactus/Carpet tends not to be extremely efficient (since it was added to an existing MPI-parallel code instead of being integrated from the beginning), which is why I suggest using fewer threads. Usually you want to use as many MPI ranks and as few threads as you can get away with before you are limited by added communication overhead (which scales with the number of MPI ranks and is constant in the number of OpenMP threads).
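
The arithmetic behind the suggested options can be sketched as follows. This assumes, consistent with the figures quoted above (8 ranks with 24 threads each on 192 cores), that the number of MPI ranks is the total core count divided by the threads per rank; the function name is illustrative only.

```python
def mpi_ranks(cores, threads_per_rank):
    """Number of MPI ranks when `cores` are split into ranks of equal thread count."""
    if cores <= 0 or threads_per_rank <= 0:
        raise ValueError("cores and threads_per_rank must be positive")
    if cores % threads_per_rank != 0:
        raise ValueError("cores must be a multiple of threads_per_rank")
    return cores // threads_per_rank


# Layout from the thread: 192 cores with 24 threads per rank.
current_ranks = mpi_ranks(192, 24)
# Suggested layout: the same 192 cores with 8 threads per rank.
suggested_ranks = mpi_ranks(192, 8)
print(current_ranks, suggested_ranks)
```

With the suggested setting the same cores are spread over three times as many ranks, each running a third as many threads.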

  3. Roland Haas

    Just to be clear, by 8 CPU you mean 8 CPU sockets (which would make this a very large node indeed)? Or 8 cores (which would make this almost small by today’s standards)? Basically if you run cat /proc/cpuinfo, what is the highest value for processor that you see? I would guess either 7 or 191. You can also try to see if lscpu exists which will give output in a bit of a nicer form than the raw cpuinfo output.
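
The same check can be scripted. This is a minimal sketch, assuming the usual Linux layout in which /proc/cpuinfo contains one `processor : N` line per logical CPU; the function name is hypothetical.

```python
import re

# One "processor : N" line per logical CPU is the usual Linux layout.
PROCESSOR_RE = re.compile(r"^processor\s*:\s*(\d+)\s*$", re.MULTILINE)


def highest_processor_index(cpuinfo_text):
    """Return the largest `processor` value in cpuinfo text, or None if absent."""
    indices = [int(m.group(1)) for m in PROCESSOR_RE.finditer(cpuinfo_text)]
    return max(indices) if indices else None
```

On the node itself, pass in the contents of /proc/cpuinfo, for example `highest_processor_index(open("/proc/cpuinfo").read())`. The number of logical CPUs is that value plus one.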

  4. Artectek

    I’m sorry, but the result is the same.

    INFO (NSTracker): Found star at (0,-12.75,0)
        10248   192.150 |  111.3619840 |    0.0003162    0.0026300 |    0.0010288 |    1.0506992 |    0.0011846 |      2909
    INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (0,-12.75,0)
    INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (-0,12.75,0)
    INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 0
    INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 1
    
    ==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.err <==
    box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,8408,768]:[1048,8664,920]:[4,4,4]/[192,2102,192]:[262,2166,230]/[71box.active=bboxset<CCTK_INT4,3>(set<bbox>:,65,39]/179985)},stride:[4,4,4],offset:[0,0,0])
    needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,8408,768]:[1048{([768,8408,924]:[1048,8664,1048]:[4,4,4]/[192,2102,231]:[262,2166,262]/[71,65,32]/147680)},stride:[4,4,4],8516,920]:[4,4,4]/[192,2102,192]:[262,2129,230]/[71,28,39]/77532)},stride:[4,4,4],offset:[0,0,0])
    WARNING level 1 from host node2 process 0
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      -> ,offset:[0,0,0])
    needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,8408,924]:[1048,8516,1048]:[4,4,4]
    ==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.out <==
    WARNING level 1 from host node2 process 0
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
       [ml=0 rl=6 c=0] The following grid structure consistency check failed:
       Refinement prolongation: All points must have been received
       needrecv.empty()
    
    ==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.err <==
    
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
       [ml=0 rl=6 c=0] The following grid structure consistency check failed:
       Refinement prolongation: All points must have been received
       needrecv.empty()
    /[192,2102,231]:[262,2129,262]/[71,28,32]/63616)},stride:[4,4,4],offset:[0,0,0])
    WARNING level 1 from host node2 process 1
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
       [ml=0 rl=6 c=1] The following grid structure consistency check failed:
       Refinement prolongation: All points must have been received
       needrecv.empty()
    box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9408,768]:[1048,9512,1048]:[4,4,4]/[192,2352,192]:[262,2378,262]/[71,27,71]/136107)},stride:[4,4,4],offset:[0,0,0])
    needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9408,768]:[1048,9512,1048]:[4,4,4]/[192,2352,192]:[262,2378,262]/[71,27,71]/136107)},stride:[4,4,4],offset:[0,0,0])
    WARNING level 1 from host node2 process 7
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
       [ml=0 rl=6 c=7] The following grid structure consistency check failed:
       Refinement prolongation: All points must have been received
       needrecv.empty()
    box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9260,768]:[1048,9404,1048]:[4,4,4]/[192,2315,192]:[262,2351,262]/[71,37,71]/186517)},stride:[4,4,4]
    ==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.out <==
    WARNING level 1 from host node2 process 0
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
       [ml=0 rl=6 c=0] The following grid structure consistency check failed:
       Synchronisation and boundary prolongation: All points must have been received
       needrecv.empty()
    
    ==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.err <==
    WARNING level 1 from host node2 process 0
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
       [ml=0 rl=6 c=0] The following grid structure consistency check failed:
       Synchronisation and boundary prolongation: All points must have been received
       needrecv.empty()
    ,offset:[0,0,0])
    needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9404,768]:[1048,9404,1048]:[4,4,4]/[192,2351,192]:[262,2351,262]/[71,1,71]/5041)},stride:[4,4,4],offset:[0,0,0])
    WARNING level 1 from host node2 process 6
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
       [ml=0 rl=6 c=6] The following grid structure consistency check failed:
       Refinement prolongation: All points must have been received
       needrecv.empty()
    cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
    Rank 0 with PID 768170 received signal 6
    Writing backtrace to dns/backtrace.0.txt
    WARNING level 1 from host node2 process 1
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
       [ml=0 rl=6 c=1] The following grid structure consistency check failed:
       Synchronisation and boundary prolongation: All points must have been received
       needrecv.empty()
    cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
    Rank 1 with PID 768171 received signal 6
    Writing backtrace to dns/backtrace.1.txt
    WARNING level 1 from host node2 process 7
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
       [ml=0 rl=6 c=7] The following grid structure consistency check failed:
       Synchronisation and boundary prolongation: All points must have been received
       needrecv.empty()
    WARNING level 1 from host node2 process 6
      in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
      ->
    /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
       [ml=0 rl=6 c=6] The following grid structure consistency check failed:
       Synchronisation and boundary prolongation: All points must have been received
       needrecv.empty()
    cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
    Rank 7 with PID 768177 received signal 6
    Writing backtrace to dns/backtrace.7.txt
    cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
    Rank 6 with PID 768176 received signal 6
    Writing backtrace to dns/backtrace.6.txt
    

    It sped up a lot, but it ended the same way.

  5. Roland Haas

    Ok, in some sense this is actually good news (at least it’s not obviously a race condition). Being paranoid, did you check in RunScript that it does indeed set OMP_NUM_THREADS=8? Carpet also prints out the number of threads used at the top of the out file, in a line such as INFO (Carpet): There are 24 threads per process.
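
Checking the thread count that Carpet reports can be automated. The following is a minimal sketch, assuming the line format quoted above; the function name and the idea of comparing against the requested value are illustrative.

```python
import re

# Line format as quoted in the thread:
# INFO (Carpet): There are 24 threads per process
THREADS_RE = re.compile(r"INFO \(Carpet\): There are (\d+) threads per process")


def reported_threads(out_text):
    """Return the thread count Carpet reports in the .out text, or None."""
    match = THREADS_RE.search(out_text)
    return int(match.group(1)) if match else None


def threads_match(out_text, requested):
    """True only if Carpet reports exactly the requested number of threads."""
    return reported_threads(out_text) == requested
```

A `False` result for the requested value suggests that the setting in the RunScript did not reach the executable.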

  6. Artectek

    Indeed, it sets

    CACTUS_NUM_THREADS=8
    

    but the INFO (Carpet) line shows:

    INFO (Carpet): There are 24 threads per process
    

    Shouldn't it be 8 threads per process?

    It is also worth mentioning this warning in the output:

    Although OpenMP is enabled, the environment variable CACTUS_NUM_THREADS is not set.
    

    I submitted this job on a login node, but OpenMPI ran it on a compute node. Could this be the problem?

  7. Roland Haas

    The variable that (usually) controls how many threads are used is called OMP_NUM_THREADS. CACTUS_NUM_THREADS is a copy that records the number of threads requested from simfactory. In case something interferes with simfactory (mpirun sometimes does this, or SLURM, or Cray’s alps system) then Cactus / Carpet can use CACTUS_NUM_THREADS to detect this. That is more or less what you see (only it would produce an error instead of just a warning).

    If neither OMP_NUM_THREADS nor CACTUS_NUM_THREADS is set then OpenMP would default to the total number of cores for the number of threads (ie 24) and Carpet would output such a warning message.

    Some MPI stacks (eg OpenMPI, note the “I” at the end) can be configured to not pass environment variables to the Cactus executable. They may need to be given an explicit list of environment variables to pass to Cactus using the -x option. See eg: https://bitbucket.org/simfactory/simfactory2/src/master/mdb/runscripts/cygwin.run where you can see the various -x options in the mpirun line. I would give adding those a try.
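
To illustrate the -x options, here is a minimal sketch that assembles an Open MPI mpirun command line forwarding both thread variables. The helper name and the parameter file name `example.par` are hypothetical, and the exact options your RunScript needs may differ; `-np` and `-x NAME=value` are standard Open MPI mpirun options.

```python
def mpirun_argv(num_ranks, num_threads, executable, parfile):
    """Build an Open MPI mpirun argument list that exports the thread settings.

    Each variable is forwarded explicitly with -x NAME=value, so it does not
    depend on the launcher passing the caller's environment through.
    """
    exported = {
        "OMP_NUM_THREADS": str(num_threads),
        "CACTUS_NUM_THREADS": str(num_threads),
    }
    argv = ["mpirun", "-np", str(num_ranks)]
    for name, value in exported.items():
        argv += ["-x", f"{name}={value}"]
    argv += [executable, parfile]
    return argv


# Hypothetical example: 24 ranks with 8 threads each.
print(" ".join(mpirun_argv(24, 8, "cactus_sim", "example.par")))
```

The list form can be passed directly to `subprocess.run` if you prefer to launch from Python rather than paste the printed line into a RunScript.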

  8. Artectek

    I tried what you said. The thread count is now correct, but the run still breaks down exactly as before.

    Is there any other suggestion regarding this problem?

  9. Roland Haas

    Very good. I also just noticed I quoted the wrong time of day for the call. It is actually 9:00 Central Time, that is, 10:00 Eastern (US) time.
