NaNs when running static tov on >40 cores
I've been trying to run the static tov example parameter file on an HPC cluster using >40 cores, but this results in NaNs in the data. I can't remember whether the issue first appears at 40 or 41 cores (and won't be able to check this for the next few days), but using 41+ cores definitely gives me NaNs. I remember testing with 39 cores and several other lower values (down to 4), but these runs all seemed fine.
So far, I've been able to run larger simulations (e.g. BBHs) on more than 40 cores (same cluster) without any apparent issues.
I'll attach the static tov parameter file I used (I think I changed one or two outdated parameters), as well as the PBS script and error/output files from a static tov run on 44 cores.
The static tov runs were intended for speed test purposes. The results of the speed tests I've run so far (on fewer than 40 cores) seem quite strange to me, so I'm going to attach a text file with walltimes and CPU times for these runs, too. Any comments would be appreciated!
Here are some comments I received from the person who compiled the Einstein Toolkit on the cluster I'm using. He ran tests with the static tov star parameter file I attached, and sent me the following:
"The times follow a standard MPI reduction pattern. Out beyond 24 cores the code/data does not scale well and latency from message passing starts to increase rather than reduce run times. Some increases in ppn values increase the run times; this may be due to how data objects are handled in the code. There is also some fluctuation in the data runs which can only be caused by the software as I ran all tests on node 607 while no one else was using it."
He suggested that if I want to do runs on more than 20 cores (which I do), then I should ask for advice from people more familiar with the Einstein Toolkit.
This is a small example, and I don't think the results you are seeing are surprising. Think of the number of points in the grid on each refinement level, then divide that by the number of cores. Each core will have to evolve that number of points. If you are using standard MPI without OpenMP, this block of points will be surrounded by a shell of ghost points which need to be synchronised with the other processes. As the size of the component on each process decreases, the ratio between the number of ghost points which need to be communicated and the number of real points which need to be evolved will increase. Eventually, you will have so few points evolved on each process that the communication cost dominates (and is not expected to scale well with the number of cores). I suggest that you calculate how many points are evolved on each process in your tests.
One way to improve scalability may be to use OpenMP parallelisation in addition to MPI. Are you doing this already?
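To make the suggestion concrete: with hybrid MPI+OpenMP you keep the same core count but use fewer, larger MPI domains, with threads sharing each domain's memory. A minimal launch sketch — the executable name, launcher syntax, and thread count here are illustrative, not taken from the attached scripts, and will differ depending on your MPI stack:

```shell
# Hybrid MPI+OpenMP sketch: 44 cores as 11 MPI processes x 4 OpenMP threads.
# "cactus_sim" and the mpirun syntax are placeholders; adapt to your build/stack.
export OMP_NUM_THREADS=4
mpirun -np 11 ./cactus_sim static_tov.par
```

With 11 processes instead of 44, each process owns a cube four times larger, so the ghost-to-evolved ratio discussed below drops substantially.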
Note that the scaling of this example is probably a separate issue to the problem of it crashing on >40 cores.
Also, just because this specific example doesn't scale effectively beyond 20 cores, this doesn't say anything about other examples, or cases that you want to run for science.
I can confirm that this is a real problem. I have run your parameter file using the current master branch of the ET on 44 processes, and I get NaNs at iteration 736. This is earlier than you saw them, suggesting some sort of non-deterministic effect. The command I used to run this was
You can see the number of processes being used in the output file:
When I ran on 40 processes, this didn't happen.
This seems to be a bug in Carpet.
If I instead run with Carpet::processor_topology = "recursive", which uses a different algorithm for splitting the domain among processes, the code instead aborts with an error:
"recursive" works on 40 processes. It's possible that there is no sensible way to split such a small domain among 44 processes, and that aborting with an error is the correct behaviour. If that is the case, then this should be caught at a higher level in the code and a more sensible error message should be generated.
Thank you very much for looking into it!
We can't seem to disable OpenMP in the compile. However, it's effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not play nicely with other software, especially in the hybridized domain of combining OpenMPI and OpenMP, where multiple users share nodes", they are not willing to enable it for me.
Excluding boundary points and symmetry points, I find 31 evolved points in each spatial direction for the most refined region. That gives 29 791 points in three dimensions. For each of the other four regions I find 25 695 points; that's 132 571 in total. Does Carpet provide any output I can use to verify this?
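The totals above are easy to cross-check; a quick sketch, using only the per-region counts quoted in this comment:

```python
# Rough check of the evolved-point counts quoted above.
finest = 31 ** 3        # 31 evolved points per direction on the finest region
coarser = 4 * 25695     # the four coarser regions, as counted above
total = finest + coarser
print(finest, total)    # 29791 132571
```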
My contact writes the following: "I fully understand the [comments about the scalability]. However we see a similar decrement with cctk_final_time=1000 [he initially tested with smaller times] and hence I would assume a larger solution space. Unless your problem is what is called embarrassingly parallel you will always be faced with a communication issue."
This is incorrect, right? My understanding is that the size of the solution space should remain the same regardless of cctk_final_time.
Replying to [comment:3 hinder]:
Replying to [comment:6 allgwy001@…]:
I have never run on a system where multiple users share nodes; it doesn't really fit very well with the sort of application that Cactus is. You don't want to be worrying about whether other processes are competing with you for memory, memory bandwidth, or cores. When you have exclusive access to each node, OpenMP is usually a good idea. By the way: what sort of interconnect do you have? Gigabit ethernet, or infiniband, or something else? If users are sharing nodes, then I suspect that this cluster is gigabit ethernet only, and you may be limited to small jobs, since the performance of gigabit ethernet will quickly become your bottleneck. What cluster are you using? From your email address, I'm guessing that it is one of the ones at http://hpc.uct.ac.za? If so, or you are using a similar scheduler, then you should be able to do this, as in their documentation:
Once you are the exclusive user of a node, I don't see a problem with enabling OpenMP. Also note: OpenMP is not something that needs to be enabled by the system administrator; it is determined by your compilation flags (on by default in the ET) and activated with OMP_NUM_THREADS. Is it possible that there was a confusion, and the admins were talking about hyperthreading instead, which is a very different thing, and which I agree you probably don't want to have enabled (it would have to be enabled by the admins)?
Carpet provides a lot of output :) You may get something useful by setting
CarpetLib::output_bboxes = "yes"
On the most refined region, making the approximation that the domain is divided into N identical cubical regions, then for N = 40, you would have 29791/40 ≈ 745 ≈ 9^3^, so about 9 evolved points in each direction. The volume of ghost plus evolved points would be (9+3+3)^3^ = 15^3^, so the number of ghost points is 15^3^ - 9^3^, and the ratio of ghost to evolved points is (15^3^ - 9^3^)/9^3^ = (15/9)^3^ - 1 ≈ 3.6. So you have 3.6 times as many points being communicated as you have being evolved. Especially if the interconnect is only gigabit ethernet, I'm not surprised that the scaling flattens off by this point. Note that if you use OpenMP, this ratio will be much smaller, because OpenMP threads communicate using shared memory, not ghost zones. Essentially you will have fewer processes, each with a larger cube, and multiple threads working on that cube in shared memory.
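The ratio estimate above can be reproduced in a few lines; a sketch, assuming the cubical-decomposition approximation and 3 ghost zones per face as stated:

```python
# Verify the ghost-to-evolved ratio estimate for 40 processes.
evolved_per_proc = 29791 / 40            # ~745 evolved points per process
n = round(evolved_per_proc ** (1 / 3))   # edge of an equivalent cube: 9 points
ghost = 3                                # ghost zones on each face
ratio = ((n + 2 * ghost) ** 3 - n ** 3) / n ** 3
print(n, round(ratio, 2))                # 9 3.63
```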
Yes - it looks like your contact doesn't know that the code is iterative, and cctk_final_time simply determines the number of iterations, not the size of the grid. In order to test with a larger problem size, you would need to reduce CoordBase::dx, dy and dz, so that there is higher overall resolution, and hence more points. I would expect the scalability to improve with larger problem sizes.
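For example, a parameter-file fragment along these lines would enlarge the problem — the numerical values are illustrative only, not the ones from the attached parameter file:

```
# Halving dx, dy, dz gives ~8x as many points in 3D (values are examples only)
CoordBase::dx = 1.0
CoordBase::dy = 1.0
CoordBase::dz = 1.0
```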
Yes, we're running on the HEX cluster at the UCT HPC facility. The admin agrees that node sharing isn't desirable, but UCT is in the middle of a financial crisis and we simply can't afford dedicated access to a single node right now.
We use 56G infiniband interconnect, but since the admin ran his tests on a single node, it doesn't really matter.
He tried running on two nodes at 19:30 this evening and found that the IB traffic (bottom right) was minimal.
He's also well aware of the difference between OpenMP and hyperthreading (which is disabled on all HPC nodes), as well as the OMP_NUM_THREADS environment variable.
Replying to [comment:7 hinder]:
Details about the build:
where sles.cfg contains:
My PBS script:
I simply ran using
These PBS options look as if you ran 44 processes on a single node. "nodes=1:ppn=44" means "use 1 node, run 44 processes on that node." Is that intended? If so, there would be no Infiniband traffic generated.
Yes, and sorry for the confusion. To clarify: this is the script I used for the run in which the NaNs appeared. The admin would have used a different script when looking at the traffic. If you like, I can ask him for it.
Replying to [comment:10 eschnett]:
This does not look to have quite the same characteristics as #2175, which was fixed in #2182, but given that both concern NaN/poison values appearing when running on many MPI ranks, it may be worthwhile to check again whether there is a reproducer for this issue.
I have verified (on LSU’s melete05 machine which has 80 cores) that indeed git hash 49d31796 "ML_BSSN: SYNC after computing initial Gamma, dtalpha and dtbeta vars" of mclachlan fixes this issue.
This was fixed as part of
Gwyneth if this still happens to your runs, please re-open the ticket.