Refinement prolongation: All points must have been received
-----------------------------------------------------------------------------------------------------------------------
Iteration Time | *me_per_hour | GRHYDRO::dens | *ROBASE::rho | *::w_lorentz | *STRAINTS::H | *axrss_mb
| | sum maximum | maximum | maximum | maximum | maximum
-----------------------------------------------------------------------------------------------------------------------
9840 184.500 | 98.5293558 | 0.0004144 0.0038553 | 0.0011696 | 1.2604477 | 0.0020844 | 2839
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.375,12.5625,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (0.375,-12.5625,0)
INFO (NSTracker): Found star at (-0.375,12.5625,0)
9844 184.575 | 98.5440274 | 0.0004144 0.0038556 | 0.0011696 | 1.2596245 | 0.0020791 | 2839
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.375,12.5625,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (0.375,-12.5625,0)
INFO (NSTracker): Found star at (-0.5625,12.5625,0)
9848 184.650 | 98.5682477 | 0.0004144 0.0038559 | 0.0011696 | 1.2568424 | 0.0020807 | 2839
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (-0.5625,12.5625,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (0.5625,-12.5625,0)
INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 0
INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 1
==> projectdns_maxwell_65_300_1_11251505_.err <==
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([box.active=bboxset<CCTK_INT4,3>928,8416,768]:[1056,8668,1048]:[4,4,4]/[232,2104,192]:[264,(set<bbox>:{([768,8416,768]:[924,8668,1048]:[4,4,4]/[192,2104,192]:[2312167,262]/[33,64,71]/149952)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>,2167,262]/[40,64,71]/181760)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([928,8416,768]:[1056,8516,1048]:[4,4,4]/[232,2104,192]:[264,2129(set<bbox>:{([768,8416,768]:[924,8516,1048]:[4,4,4]/[192,2104,192]:[231,2129,262]/[40,26,262]/[33,26,71]/60918)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 1
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->,71]/73840)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=1] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
==> projectdns_maxwell_65_300_1_11251505_.out <==
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
==> projectdns_maxwell_65_300_1_11251505_.err <==
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9400,768]:[1056,9504,1048]:[4,4,4]/[192,2350,192]:[264,2376,262]/[73,27,71]/139941)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9404,768]:[1056,9504,1048]:[4,4,4]/[192,2351,192]:[264,2376,262]/[73,26,71]/134758)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 7
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=7] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
==> projectdns_maxwell_65_300_1_11251505_.out <==
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
==> projectdns_maxwell_65_300_1_11251505_.err <==
WARNING level 1 from host node2 process 1
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=1] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 0 with PID 285012 received signal 6
Writing backtrace to dns/backtrace.0.txt
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 1 with PID 285013 received signal 6
Writing backtrace to dns/backtrace.1.txt
WARNING level 1 from host node2 process 6
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=6] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
WARNING level 1 from host node2 process 7
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=7] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 6 with PID 285018 received signal 6
Writing backtrace to dns/backtrace.6.txt
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 7 with PID 285019 received signal 6
Writing backtrace to dns/backtrace.7.txt
==> projectdns_maxwell_65_300_1_11251505_.out <==
Could someone point me in some direction on how to solve this problem? Thanks so much.
Comments (22)
-
-
- attached projectdns_maxwell_65_300_1_11251505_.err
- attached projectdns_maxwell_65_300_1_11251505_.out
-
- attached SubmitScript
- attached RunScript
Here are the files you requested. If you can shed some light on this matter, I would be very grateful. @Roland Haas
-
Hmm, one thing I would suggest changing would be to use fewer threads per MPI rank. Right now you have set:
export OMP_NUM_THREADS=24
which gives you 24 OpenMP threads per MPI rank. The SubmitScript you use is (essentially, up to comments) generic.sub as far as I can tell, so really only designed for a non-cluster environment (it may work on a cluster, but that would be kind of accidental).
Right now it seems that you are using 8 MPI ranks each with 24 OpenMP threads. So this is a total of 192 cores. So this should be somewhere between 4 and 8 nodes, yes?
My suggestion would be to try and use only about 8 OpenMP threads and correspondingly more MPI ranks, so use
--cores 192 --num-threads 8
instead of
--cores 192 --num-threads 24
which is what you seem to have used.
In principle, more threads should of course not make things fail (this would indeed be a bug), though it may be quite hard to reproduce since it would, most likely, be a race condition that only shows up with large thread counts. Also note that multi-threading in Cactus/Carpet tends to not be extremely efficient (since it was added to an existing MPI parallel code instead of being integrated from the beginning), which is why I suggest using fewer threads. Usually you want to use as many MPI ranks and as few threads as you can get away with before you are limited by added communication overhead (which scales with the number of MPI ranks and is constant with the number of OpenMP threads).
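The relation between the two options above is simple arithmetic: the number of MPI ranks is the total core count divided by the threads per rank. A minimal sketch, where ranks_for is a hypothetical helper written for illustration and not part of simfactory:

```shell
# Number of MPI ranks implied by a total core count and a thread count per rank.
# ranks_for is a hypothetical helper for illustration only.
ranks_for() {
    cores=$1
    threads=$2
    # Refuse combinations that would leave cores unused.
    if [ $((cores % threads)) -ne 0 ]; then
        echo "cores ($cores) is not a multiple of threads ($threads)" >&2
        return 1
    fi
    echo $((cores / threads))
}

ranks_for 192 24   # the original run: prints 8
ranks_for 192 8    # the suggested run: prints 24
```

With 192 cores, going from 24 to 8 threads per rank therefore triples the number of MPI ranks from 8 to 24.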
-
Yes, it's only one node with 8 CPUs. I will try your solution right away, thank you very much!
-
Just to be clear, by 8 CPU you mean 8 CPU sockets (which would make this a very large node indeed)? Or 8 cores (which would make this almost small by today’s standards)? Basically if you run
cat /proc/cpuinfo
, what is the highest value for processor that you see? I would guess either 7 or 191. You can also try to see if lscpu exists, which will give output in a bit of a nicer form than the raw cpuinfo output.
-
Yes, we have 8 CPUs per node and 24 cores per CPU, so that makes 192 cores per node.
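The socket and core counts discussed here can be read from lscpu rather than worked out by hand. The sketch below assumes the English field names "Socket(s)" and "Core(s) per socket" that lscpu from util-linux prints; total_cores is a hypothetical helper, and the output may differ under other locales.

```shell
# Total physical cores = sockets x cores per socket, parsed from lscpu-style text on stdin.
# total_cores is a hypothetical helper for illustration only.
total_cores() {
    awk -F: '
        /^Socket\(s\)/          { gsub(/[ \t]/, "", $2); sockets = $2 }
        /^Core\(s\) per socket/ { gsub(/[ \t]/, "", $2); cores = $2 }
        END { print sockets * cores }
    '
}

# On the node itself:
#   lscpu | total_cores
# Logical processor count, for comparison with the raw cpuinfo output:
#   grep -c '^processor' /proc/cpuinfo
```

For 8 sockets with 24 cores each this gives 192. The logical processor count from /proc/cpuinfo can be larger if hardware multithreading is enabled.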
-
I’m sorry, but the result is the same.
INFO (NSTracker): Found star at (0,-12.75,0)
10248 192.150 | 111.3619840 | 0.0003162 0.0026300 | 0.0010288 | 1.0506992 | 0.0011846 | 2909
INFO (CarpetTracker): Setting position of refined region #1 from surface #0 to (0,-12.75,0)
INFO (CarpetTracker): Setting position of refined region #2 from surface #1 to (-0,12.75,0)
INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 0
INFO (CarpetRegrid2): Enforcing grid structure properties, iteration 1
==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.err <==
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,8408,768]:[1048,8664,920]:[4,4,4]/[192,2102,192]:[262,2166,230]/[71box.active=bboxset<CCTK_INT4,3>(set<bbox>:,65,39]/179985)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,8408,768]:[1048{([768,8408,924]:[1048,8664,1048]:[4,4,4]/[192,2102,231]:[262,2166,262]/[71,65,32]/147680)},stride:[4,4,4],8516,920]:[4,4,4]/[192,2102,192]:[262,2129,230]/[71,28,39]/77532)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
-> ,offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,8408,924]:[1048,8516,1048]:[4,4,4]
==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.out <==
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.err <==
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
/[192,2102,231]:[262,2129,262]/[71,28,32]/63616)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 1
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=1] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9408,768]:[1048,9512,1048]:[4,4,4]/[192,2352,192]:[262,2378,262]/[71,27,71]/136107)},stride:[4,4,4],offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9408,768]:[1048,9512,1048]:[4,4,4]/[192,2352,192]:[262,2378,262]/[71,27,71]/136107)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 7
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=7] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
box.active=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9260,768]:[1048,9404,1048]:[4,4,4]/[192,2315,192]:[262,2351,262]/[71,37,71]/186517)},stride:[4,4,4]
==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.out <==
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
==> projectdns_maxwell_65_300_1_maxrho=10_11291630_threads=8_core64.err <==
WARNING level 1 from host node2 process 0
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=0] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
,offset:[0,0,0])
needrecv=bboxset<CCTK_INT4,3>(set<bbox>:{([768,9404,768]:[1048,9404,1048]:[4,4,4]/[192,2351,192]:[262,2351,262]/[71,1,71]/5041)},stride:[4,4,4],offset:[0,0,0])
WARNING level 1 from host node2 process 6
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:829:
[ml=0 rl=6 c=6] The following grid structure consistency check failed:
Refinement prolongation: All points must have been received
needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 0 with PID 768170 received signal 6
Writing backtrace to dns/backtrace.0.txt
WARNING level 1 from host node2 process 1
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=1] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 1 with PID 768171 received signal 6
Writing backtrace to dns/backtrace.1.txt
WARNING level 1 from host node2 process 7
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=7] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
WARNING level 1 from host node2 process 6
in thorn CarpetLib, file /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:161:
->
/public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:971:
[ml=0 rl=6 c=6] The following grid structure consistency check failed:
Synchronisation and boundary prolongation: All points must have been received
needrecv.empty()
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 7 with PID 768177 received signal 6
Writing backtrace to dns/backtrace.7.txt
cactus_sim: /public/home/guolj/nr/etk/Cactus/configs/sim/build/CarpetLib/dh.cc:1036: void CarpetLib::dh::regrid(bool): Assertion `(allrestricted & obox.buffers).empty()' failed.
Rank 6 with PID 768176 received signal 6
Writing backtrace to dns/backtrace.6.txt
It sped up a lot, but it ended the same way.
-
Ok, in some sense this is actually good news (at least it’s not obviously a race condition). Being paranoid, did you check in RunScript that it does indeed set
OMP_NUM_THREADS=8
(Carpet also prints out the number of threads used at the top of the out file, in a line like INFO (Carpet): There are 24 threads per process)?
-
Indeed, it sets
CACTUS_NUM_THREADS=8
but the INFO (Carpet) line shows:
INFO (Carpet): There are 24 threads per process
Shouldn't it be 8 threads per process?
It is also worth mentioning this message:
Although OpenMP is enabled, the environment variable CACTUS_NUM_THREADS is not set.
I submitted this job on a login node, but OpenMPI ran it on a compute node. Could this be the problem?
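One way to confirm which thread count Carpet actually used is to search the .out file for the INFO (Carpet) line quoted in this thread. A minimal sketch, where carpet_threads is a hypothetical helper and the file name in the usage comment is a placeholder:

```shell
# Print the thread count Carpet reported in a Cactus .out file.
# carpet_threads is a hypothetical helper; it matches the log line format quoted in this thread.
carpet_threads() {
    sed -n 's/^INFO (Carpet): There are \([0-9][0-9]*\) threads per process.*/\1/p' "$1" | head -n 1
}

# Usage (placeholder file name):
#   carpet_threads simulation.out
```

If the printed value differs from the value set in the RunScript, the environment variable is not reaching the Cactus processes.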
-
The variable that (usually) controls how many threads are used is called OMP_NUM_THREADS. CACTUS_NUM_THREADS is a copy that records the number of threads requested from simfactory. In case something interferes with simfactory (mpirun sometimes does this, or SLURM, or Cray’s alps system) then Cactus / Carpet can use CACTUS_NUM_THREADS to detect this. More or less what you see (only it would produce an error instead of just a warning).
If neither OMP_NUM_THREADS nor CACTUS_NUM_THREADS is set then OpenMP would default to the total number of cores for the number of threads (ie 24) and Carpet would output such a warning message.
Some MPI stacks (eg OpenMPI, note the “I” at the end) can be configured to not pass environment variables to the Cactus executable. They may need to be given an explicit list of environment variables to pass to Cactus using the -x option. See eg: https://bitbucket.org/simfactory/simfactory2/src/master/mdb/runscripts/cygwin.run where you can see the various -x options in the mpirun line. I would give adding those a try.
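As a rough sketch of what such an mpirun line might look like with Open MPI, the snippet below exports both variables and forwards them with -x. The rank count, thread count, executable path, and parameter file name are assumptions for illustration and are not taken from the linked RunScript. The command is only printed here, not executed.

```shell
# Assumed values for illustration; substitute your own.
NUM_PROCS=24
NUM_THREADS=8
EXE=./cactus_sim          # placeholder path to the Cactus executable
PARFILE=simulation.par    # placeholder parameter file

export OMP_NUM_THREADS=$NUM_THREADS
export CACTUS_NUM_THREADS=$NUM_THREADS

# -x VAR asks Open MPI's mpirun to export VAR to the launched processes.
MPIRUN_CMD="mpirun -np $NUM_PROCS -x OMP_NUM_THREADS -x CACTUS_NUM_THREADS $EXE $PARFILE"
echo "$MPIRUN_CMD"
```

After such a change, the INFO (Carpet) line at the top of the out file should report the requested number of threads per process.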
-
Words cannot express my gratitude. Thank you, I will try that right away.
-
I tried what you said. The thread count is now correct, but the run still breaks down exactly as before.
Is there any other suggestion regarding this problem?
-
I am out of ideas. During the weekly ET calls (Thu 9am ET) the participants cover open tickets. Your ticket will be covered tomorrow. You are welcome to call in: https://docs.einsteintoolkit.org/et-docs/Meeting_agenda
-
- removed milestone
-
Thank you, I will join the meeting.
-
Very good. I also just noticed I quoted the wrong time of day for the call. It is actually 9:00 Central Time, that is 10:00 Eastern (US) time.
-
ok,thanks
-
This was discussed in today’s ET call: http://lists.einsteintoolkit.org/pipermail/users/2022-December/008770.html
-
@Artectek did Gabriele’s suggestion help fix the issue for you?
-
@Artectek I will close this ticket as “resolved” unless there are objections before 2022-12-28
-
- changed status to closed
closed due to inactivity
These errors are usually caused by issues with the grid setup that develop over time in the simulation. The error is unfortunately only detectable deep inside of Carpet where the original cause is no longer obvious. With just the error message it is not really possible to help diagnose it.
To help with this, more information is needed:
.out and .err files produced by the simulation
With those it may be possible to provide meaningful advice. Otherwise a shot-in-the-dark suggestion would be: