NaNs when running static tov on >40 cores

Issue #2008 closed
Gwyneth Allwright created an issue

I've been trying to run the static tov example parameter file on an HPC cluster using >40 cores, but this results in NaNs in the data. I can't remember whether the issue first appears at 40 or 41 cores (and won't be able to check this for the next few days), but using 41+ cores definitely gives me NaNs. I remember testing with 39 cores and several other lower values (down to 4), but these runs all seemed fine.

So far, I've been able to run larger simulations (e.g. BBHs) on more than 40 cores (same cluster) without any apparent issues.

I'll attach the static tov parameter file I used (I think I changed one or two outdated parameters), as well as the PBS script and error/output files from a static tov run on 44 cores.

The static tov runs were intended for speed test purposes. The results of the speed tests I've run so far (on fewer than 40 cores) seem quite strange to me, so I'm going to attach a text file with walltimes and CPU times for these runs, too. Any comments would be appreciated!

Comments (14)

  1. Gwyneth Allwright reporter

    Here are some comments I received from the person who compiled the Einstein Toolkit on the cluster I'm using. He ran tests with the static tov star parameter file I attached, and sent me the following:

    "The times follow a standard MPI reduction pattern. Out beyond 24 cores the code/data does not scale well and latency from message passing starts to increase rather than reduce run times. Some increases in ppn values increase the run times; this may be due to how data objects are handled in the code. There is also some fluctuation in the data runs which can only be caused by the software as I ran all tests on node 607 while no one else was using it."

    He suggested that if I want to do runs on more than 20 cores (which I do), then I should ask for advice from people more familiar with the Einstein Toolkit.

  2. Ian Hinder

    This is a small example, and I don't think the results you are seeing are surprising. Think of the number of points in the grid on each refinement level, then divide that by the number of cores. Each core will have to evolve that number of points. If you are using standard MPI without OpenMP, this block of points will be surrounded by a shell of ghost points which need to be synchronised with the other processes. As the size of the component on each process decreases, the ratio between the number of ghost points which need to be communicated and the number of real points which need to be evolved will increase. Eventually, you will have so few points evolved on each process that the communication cost dominates (and is not expected to scale well with the number of cores). I suggest that you calculate how many points are evolved on each process in your tests.

    One way to improve scalability may be to use OpenMP parallelisation in addition to MPI. Are you doing this already?

    Note that the scaling of this example is probably a separate issue to the problem of it crashing on >40 cores.

    Also, just because this specific example doesn't scale effectively beyond 20 cores, this doesn't say anything about other examples, or cases that you want to run for science.

  3. Ian Hinder

    I can confirm that this is a real problem. I have run your parameter file using the current master branch of the ET on 44 processes, and I get NaNs at iteration 736. This is earlier than you saw them, suggesting some sort of non-deterministic effect. The command I used to run this was

    sim create-submit static_tov_mod_2 --parfile par/static_tov_mod_2.par --procs 44 --num-threads 1
    

    You can see the number of processes being used in the output file:

    $ grep "Carpet is running on" simulations/static_tov_mod_2/output-0000/static_tov_mod_2.out 
    INFO (Carpet): Carpet is running on 44 processes
    

    When I ran on 40 processes, this didn't happen.

    This seems to be a bug in Carpet.

    If I instead run with Carpet::processor_topology = "recursive", which uses a different algorithm for splitting the domain among processes, the code instead aborts with an error:

    terminate called after throwing an instance of 'std::out_of_range'
      what():  vector::_M_range_check: __n (which is 44) >= this->size() (which is 44)
    

    Backtrace is:

    Backtrace from rank 0 pid 18049:
    1. /usr/lib64/libc.so.6(+0x35670) [0x7f6cf5284670]
    2. /usr/lib64/libc.so.6(gsignal+0x37) [0x7f6cf52845f7]
    3. /usr/lib64/libc.so.6(abort+0x148) [0x7f6cf5285ce8]
    4. __gnu_cxx::__verbose_terminate_handler()   [/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d) [0x7f6cf5887d2d]]
    5. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dd86) [0x7f6cf5885d86]
    6. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5ddd1) [0x7f6cf5885dd1]
    7. /cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(+0x5dfe9) [0x7f6cf5885fe9]
    8. std::__throw_out_of_range_fmt(char const*, ...)   [/cluster/apps/gcc/4.9.3/lib64/libstdc++.so.6(_ZSt24__throw_out_of_range_fmtPKcz+0x11f) [0x7f6cf58dbfef]]
    9. std::vector<int, std::allocator<int> >::_M_range_check(unsigned long) const   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZNKSt6vectorIiSaIiEE14_M_range_checkEm+0x20) [0x1342e60]]
    a. Carpet::SplitRegionsMaps_Recursively(_cGH const*, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet28SplitRegionsMaps_RecursivelyEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11d7) [0x2f48f87]]
    b. Carpet::SplitRegionsMaps(_cGH const*, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&, std::vector<std::vector<region_t, std::allocator<region_t> >, std::allocator<std::vector<region_t, std::allocator<region_t> > > >&)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet16SplitRegionsMapsEPK4_cGHRSt6vectorIS3_I8region_tSaIS4_EESaIS6_EES9_+0x11c) [0x2ef6e5c]]
    c. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim() [0x2f05767]
    d. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim() [0x2f03b1c]
    e. Carpet::SetupGH(tFleshConfig*, int, _cGH*)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet7SetupGHEP12tFleshConfigiP4_cGH+0x18e6) [0x2eff606]]
    f. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CCTKi_SetupGHExtensions+0xc4) [0xd20924]
    10. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(CactusDefaultSetupGH+0x30e) [0xd3ad2e]
    11. Carpet::Initialise(tFleshConfig*)   [/scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(_ZN6Carpet10InitialiseEP12tFleshConfig+0x54) [0x2ee2414]]
    12. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim(main+0x96) [0xcbbef6]
    13. /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x7f6cf5270b15]
    14. /scratch/ianhin/simulations/EinsteinToolkitGit/static_tov_mod_2_rec/SIMFACTORY/exe/cactus_sim() [0xcbbbe9]
    

    "recursive" works on 40 processes. It's possible that there is no sensible way to split such a small domain among 44 processes, and that aborting with an error is the correct behaviour. If that is the case, then this should be caught at a higher level in the code and a more sensible error message should be generated.

  4. Gwyneth Allwright reporter

    Thank you very much for looking into it!

    We can't seem to disable OpenMP in the compile. However, it's effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not play nicely with other software, especially in the hybridized domain of combining OpenMPI and OpenMP, where multiple users share nodes", they are not willing to enable it for me.

    Excluding boundary points and symmetry points, I find 31 evolved points in each spatial direction for the most refined region. That gives 29 791 points in three dimensions. For each of the other four regions I find 25 695 points; that's 132 571 in total. Does Carpet provide any output I can use to verify this?
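
    (Just to re-do the arithmetic above in a form that is easy to tweak; these numbers are the ones quoted in this comment, not anything read out of Carpet.)

    # Re-checking the point counts quoted above; not derived from Carpet output.
    finest = 31 ** 3            # 31 evolved points per direction on the finest level
    others = 4 * 25695          # the four coarser regions, as counted above
    print(finest, others, finest + others)   # 29791 102780 132571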

    My contact writes the following: "I fully understand the [comments about the scalability]. However we see a similar decrement with cctk_final_time=1000 [he initially tested with smaller times] and hence I would assume a larger solution space. Unless your problem is what is called embarrassingly parallel you will always be faced with a communication issue."

    This is incorrect, right? My understanding is that the size of the solution space should remain the same regardless of cctk_final_time.

  5. Ian Hinder

    Replying to [comment:6 allgwy001@…]:

    Thank you very much for looking into it!

    We can't seem to disable OpenMP in the compile. However, it's effectively disabled by means of OMP_NUM_THREADS=1. Since "OpenMP does not play nicely with other software, especially in the hybridized domain of combining OpenMPI and OpenMP, where multiple users share nodes", they are not willing to enable it for me.

    I have never run on a system where multiple users share nodes; it doesn't really fit very well with the sort of application that Cactus is. You don't want to be worrying about whether other processes are competing with you for memory, memory bandwidth, or cores. When you have exclusive access to each node, OpenMP is usually a good idea. By the way: what sort of interconnect do you have? Gigabit ethernet, or infiniband, or something else? If users are sharing nodes, then I suspect that this cluster is gigabit ethernet only, and you may be limited to small jobs, since the performance of gigabit ethernet will quickly become your bottleneck. What cluster are you using? From your email address, I'm guessing that it is one of the ones at http://hpc.uct.ac.za? If so, or you are using a similar scheduler, then you should be able to do this, as in their documentation:

    NB3: If your software prefers to use all cores on a computer then make sure that you reserve these cores. For example running on an 800 series server which has 8 cores per server change the directive line in your script as follows:
    
      #PBS -l nodes=1:ppn=8:series800
    

    Once you are the exclusive user of a node, I don't see a problem with enabling OpenMP. Also note: OpenMP is not something that needs to be enabled by the system administrator; it is determined by your compilation flags (on by default in the ET) and activated with OMP_NUM_THREADS. Is it possible that there was some confusion, and the admins were actually talking about hyperthreading? That is a very different thing, and I agree you probably don't want it enabled (and it would have to be enabled by the admins).

    Excluding boundary points and symmetry points, I find 31 evolved points in each spatial direction for the most refined region. That gives 29 791 points in three dimensions. For each of the other four regions I find 25 695 points; that's 132 571 in total. Does Carpet provide any output I can use to verify this?

    Carpet provides a lot of output :) You may get something useful by setting

    CarpetLib::output_bboxes = "yes"

    On the most refined region, making the approximation that the domain is divided into N identical cubical regions, then for N = 40, you would have 29791/40 = 745 ~ 9^3^, so about 9 evolved points in each direction. The volume of ghost plus evolved points would be (9+3+3)^3^ = 15^3^, so the number of ghost points is 15^3^ - 9^3^, and the ratio of ghost to evolved points is (15^3^ - 9^3^)/9^3^ = (15/9)^3^ - 1 = 3.6. So you have 3.6 times as many points being communicated as you have being evolved. Especially if the interconnect is only gigabit ethernet, I'm not surprised that the scaling flattens off by this point. Note that if you use OpenMP, this ratio will be much smaller, because openmp threads communicate using shared memory, not ghost zones. Essentially you will have fewer processes, each with a larger cube, and multiple threads working on that cube in shared memory.
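
    A minimal sketch of that estimate (same approximation as above: N identical cubes with 3 ghost zones on each face; the 29791 finest-level points are the number quoted earlier in this thread), so the ratio can be redone for other core counts:

    # Ghost-to-evolved ratio under the rough "N identical cubes" approximation.
    finest_points = 29791   # 31^3 evolved points on the finest level (from this thread)
    ghosts = 3              # ghost-zone width

    for nprocs in (4, 20, 40, 44):
        side = (finest_points / nprocs) ** (1.0 / 3.0)   # evolved points per direction
        ratio = ((side + 2 * ghosts) / side) ** 3 - 1    # ghost points per evolved point
        print(f"{nprocs:3d} processes: ~{side:4.1f} points per direction, "
              f"ghost/evolved ratio ~{ratio:.1f}")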

    My contact writes the following: "I fully understand the [comments about the scalability]. However we see a similar decrement with cctk_final_time=1000 [he initially tested with smaller times] and hence I would assume a larger solution space. Unless your problem is what is called embarrassingly parallel you will always be faced with a communication issue."

    This is incorrect, right? My understanding is that the size of the solution space should remain the same regardless of cctk_final_time.

    Yes - it looks like your contact doesn't realise that the code is iterative: cctk_final_time only determines how many iterations are run, not the size of the grid. In order to test with a larger problem size, you would need to reduce CoordBase::dx, dy and dz, so that there is higher overall resolution and hence more points. I would expect the scalability to improve with larger problem sizes.
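
    To make the distinction concrete, here is a minimal sketch of how the grid size depends on the spacing, whereas cctk_final_time only changes how many time steps are taken. The domain extent and spacings below are placeholders, not the values from this parameter file.

    # The grid size depends on the spacing: points per direction ~ extent/dx + 1.
    # cctk_final_time only changes the number of iterations, not the grid size.
    extent = 15.0                      # placeholder half-width of one level
    for dx in (1.0, 0.5, 0.25):
        n = int(round(2 * extent / dx)) + 1
        print(f"dx = {dx:5.2f}: ~{n:3d} points per direction, ~{n**3:,} points on this level")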

  6. Gwyneth Allwright reporter

    Thanks Ian!

    Yes, we're running on the HEX cluster at the UCT HPC facility. The admin agrees that node sharing isn't desirable, but UCT is in the middle of a financial crisis and we simply can't afford dedicated access to a single node right now.

    We use a 56 Gb/s InfiniBand interconnect, but since the admin ran his tests on a single node, it doesn't really matter here.

    He tried running on two nodes at 19:30 this evening and found that the IB traffic (bottom right) was minimal.

    [[Image(traffic.png)]]

    He's also well aware of the difference between OpenMP and hyperthreading (which is disabled on all HPC nodes), and of the OMP_NUM_THREADS environment variable.

  7. Gwyneth Allwright reporter

    Details about the build:

    zypper in libnuma-devel
    wget https://bitbucket.org/einsteintoolkit/manifest/raw/ET_2016_11/einsteintoolkit.th

    # edit einsteintoolkit.th:
    #   remove ExternalLibraries/OpenBLAS, ExternalLibraries/OpenCL,
    #          Carpet/ReductionTest and CactusTest/TestFortranCrayPointers
    #   add Llama/WaveExtractCPM and EinsteinAnalysis/ADMConstraints

    curl -kLO https://raw.githubusercontent.com/gridaphobe/CRL/ET_2016_11/GetComponents
    chmod a+x GetComponents
    ./GetComponents --parallel einsteintoolkit.th
    module load compilers/gcc530 mpi/openmpi-1.10.1

    cd Cactus
    cp simfactory/mdb/optionlists/centos.cfg ./sles.cfg
    # edit sles.cfg: add MPI_DIR=/opt/exp_soft/openmpi-1.10.1/

    make ET-config options=sles.cfg THORNLIST=thornlists/einsteintoolkit.th
    make -j 64 ET
    

    where sles.cfg contains:

    # SLES
    #
    # Whenever this version string changes, the application is configured
    # and rebuilt from scratch
    VERSION = 2016-05-06
    
    CPP = cpp
    FPP = cpp
    CC  = gcc
    CXX = g++
    F77 = gfortran
    F90 = gfortran
    
    CPPFLAGS = -DMPICH_IGNORE_CXX_SEEK
    FPPFLAGS = -traditional
    
    CFLAGS   =  -g3 -std=gnu99 -rdynamic -Wunused-variable
    CXXFLAGS =  -g3 -std=c++0x -rdynamic -Wunused-variable -fno-inline
    F77FLAGS =  -g3 -fcray-pointer -ffixed-line-length-none
    F90FLAGS =  -g3 -fcray-pointer -ffixed-line-length-none
    
    LIBS = numa gfortran
    
    LIBDIRS =
    
    MPI_DIR=/opt/exp_soft/openmpi-1.10.1/
    
    DEBUG           = no
    CPP_DEBUG_FLAGS = -DCARPET_DEBUG
    FPP_DEBUG_FLAGS = -DCARPET_DEBUG
    C_DEBUG_FLAGS   = -O0
    CXX_DEBUG_FLAGS = -O0
    F77_DEBUG_FLAGS = -O0 -ffixed-line-length-none
    F90_DEBUG_FLAGS = -O0 -ffree-line-length-none
    

    My PBS script:

    #PBS -q UCTlong
    #PBS -N static_tov_mod_2
    #PBS -m abe
    #PBS -M allgwy001@myuct.ac.za
    #PBS -l nodes=1:ppn=44:series600
    #PBS -o /home/allgwy001/Output/static_tov_mod_2
    #PBS -e /home/allgwy001/Output/static_tov_mod_2
    module load software/EinsteinToolkit
    cd /home/allgwy001/Output
    export OMP_NUM_THREADS=1
    mpirun -np 44 -hostfile $PBS_NODEFILE -v cactus_ET /home/allgwy001/Parameter_Files/static_tov_mod_2.par
    

    I simply ran using

    qsub <PBS script>
    
  8. Erik Schnetter

    Gwyneth

    These PBS options look as if you ran 44 processes on one single node. "nodes=1:ppn=44" means "use 1 node, run 44 processes on that node." Is that intended? If so, there would be no Infiniband traffic generated.

    -erik

  9. Gwyneth Allwright reporter

    Hi Erik,

    Yes, and sorry for the confusion: this is the script I used for the run in which the NaNs appeared. The admin would have used a different script when looking at the traffic. If you like, I can ask for it.

  10. Roland Haas

    This does not look to have quite the same characteristics as #2175, which was fixed in #2182, but given that both concern NaN/poison values appearing when running on many MPI ranks, it may be worthwhile to check again whether there is a reproducer for this issue.

  11. Roland Haas
    • edited description
    • changed status to resolved

    I have verified (on LSU's melete05 machine, which has 80 cores) that git hash 49d31796 "ML_BSSN: SYNC after computing initial Gamma, dtalpha and dtbeta vars" of McLachlan does indeed fix this issue.

    This was fixed as part of #2182 on 2018-08-22.
