Prolongation fails with vectorisation enabled

Issue #499 closed
Ian Hinder created an issue

The development version of Carpet uses vectorisation to speed up prolongation. This fails with various errors, including corruption of the malloc heap.

I am reducing the priority, and still hope to get to the bottom of this before the release.

Comments (24)

  1. Barry Wardell
    • removed comment

    This happens only when vectorisation is enabled through VECTORISE="yes" in the optionlist. It also seems to happen only in runs where the grids are moved, and in that case it happens at a random iteration before the first regridding.

  2. Erik Schnetter
    • removed comment

    Do you have more information?

    For example: Machine name, hardware architecture (which SSE version?), compiler, compiler version, etc.

    Also, do you have a stack backtrace or a core file that could lead to a line number? Alternatively, do you have the value of the instruction pointer and the disassembled executable?

  3. Barry Wardell
    • removed comment

    This happens on Datura, which has Intel Xeon X5650 processors. We are using SSE4.1 and the Intel compiler version 11.1.072. Below is a representative backtrace:

    * Process received signal *
    Signal: Segmentation fault (11)
    Signal code: (128)
    Failing at address: (nil)
    [ 0] /lib64/libpthread.so.0 [0x2ba555f47b10]
    [ 1] /lib64/libc.so.6 [0x2ba55a6cbcc8]
    [ 2] /lib64/libc.so.6(libc_malloc+0x6e) [0x2ba55a6cdcde]
    [ 3] /usr/lib64/libstdc++.so.6(_Znwm+0x1d) [0x2ba55a00317d]
    [ 4] cactus_sim(_ZNSt6vectorIcSaIcEE6resizeEmc+0x137) [0x13025a7]
    [ 5] cactus_sim(_ZN10comm_state4stepEv+0xab9) [0x12fd459]
    [ 6] cactus_sim(_ZN6Carpet10SyncGroupsEPK4_cGHRKSt6vectorIiSaIiEE+0x3fa) [0x126b85a]
    [ 7] cactus_sim(_ZN6Carpet20SyncProlongateGroupsEPK4_cGHRKSt6vectorIiSaIiEE+0x49e) [0x126b28e]
    [ 8] cactus_sim [0x12c8798]
    [ 9] cactus_sim(_ZN6Carpet12CallFunctionEPvP13cFunctionDataS0_+0x14bb) [0x12c84db]
    [10] cactus_sim [0x4e83da]
    [11] cactus_sim [0x4eb570]
    [12] cactus_sim [0x4eb668]
    [13] cactus_sim [0x4eb668]
    [14] cactus_sim(CCTKi_DoScheduleTraverse+0x294) [0x4eb2f4]
    [15] cactus_sim(CCTK_ScheduleTraverse+0x199) [0x4e4509]
    [16] cactus_sim [0x126e341]
    [17] cactus_sim(_ZN6Carpet6EvolveEP12tFleshConfig+0x34b) [0x126c18b]
    [18] cactus_sim(main+0xa5) [0x4dc2b5]
    [19] /lib64/libc.so.6(libc_start_main+0xf4) [0x2ba55a676994]
    [20] cactus_sim [0x4dbfc9]

  4. Erik Schnetter
    • removed comment

    I looked at the vectorised code in Carpet's prolongation operator, and I see that all the vectorisation happens for reading, multiplying, and adding numbers. Storing the result into the target array is untouched and independent of vectorisation.

    According to the backtrace above, the actual error occurs while resizing a std::vector, probably while allocating communication buffers. It could be the system runs out of memory, or there is internal memory corruption.

    Since progress on this problem has stalled, I suggest disabling vectorisation in Carpet's prolongation operator. There is a statement "#if 0" in line 226; changing this to "#if 1" should enable the scalar code and thus circumvent the vectorised code.
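
To illustrate the kind of preprocessor switch described in comment 4, here is a minimal sketch, not Carpet's actual prolongation operator. The macro USE_SCALAR_PROLONGATION, the function prolongate_1d and the 1D linear interpolation stencil are all assumptions made up for this example. As in the description above, only the loads, multiplications and additions are vectorised; the stores into the target array are scalar in both paths.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical switch modelled on the "#if 0" / "#if 1" toggle discussed
// above: 1 selects the scalar path, 0 selects the SSE2 path (x86 only).
#ifndef USE_SCALAR_PROLONGATION
#define USE_SCALAR_PROLONGATION 1
#endif

#if !USE_SCALAR_PROLONGATION
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Linear 1D prolongation: n coarse points -> 2n-1 fine points.
// Illustrative stencil only; not taken from Carpet.
std::vector<double> prolongate_1d(const std::vector<double>& coarse) {
  const std::size_t n = coarse.size();
  if (n == 0) return {};
  std::vector<double> fine(2 * n - 1);

  // Fine points that coincide with coarse points are copied.
  for (std::size_t i = 0; i < n; ++i) fine[2 * i] = coarse[i];

  std::size_t i = 0;
#if !USE_SCALAR_PROLONGATION
  // Vectorised path: two midpoints per iteration. Reading, adding and
  // multiplying use SSE2; the stores into `fine` remain scalar.
  const __m128d half = _mm_set1_pd(0.5);
  for (; i + 2 < n; i += 2) {
    const __m128d left = _mm_loadu_pd(&coarse[i]);       // c[i],   c[i+1]
    const __m128d right = _mm_loadu_pd(&coarse[i + 1]);  // c[i+1], c[i+2]
    const __m128d mid = _mm_mul_pd(half, _mm_add_pd(left, right));
    double tmp[2];
    _mm_storeu_pd(tmp, mid);
    fine[2 * i + 1] = tmp[0];
    fine[2 * i + 3] = tmp[1];
  }
#endif
  // Scalar path; also handles the remainder of the vectorised loop.
  for (; i + 1 < n; ++i) fine[2 * i + 1] = 0.5 * (coarse[i] + coarse[i + 1]);
  return fine;
}
```

Building with -DUSE_SCALAR_PROLONGATION=0 selects the vectorised path; comparing the output of both builds on the same input is one way to rule the kernel in or out as the source of an error.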

  5. Barry Wardell
    • removed comment

    Replying to [comment:4 eschnett]:

    > According to the backtrace above, the actual error occurs while resizing a std::vector, probably while allocating communication buffers. It could be the system runs out of memory, or there is internal memory corruption.

    I don't think it is a case of running out of memory, but it seems likely that it could be memory corruption. Upon closer inspection, it looks like the segfault is happening in recompose:

    Backtrace from rank 16 pid 4306:
    1. /lib64/libc.so.6(gsignal+0x35) [0x2b51371a1265]
    2. /lib64/libc.so.6(abort+0x110) [0x2b51371a2d10]
    3. /lib64/libc.so.6 [0x2b51371db84b]
    4. /lib64/libc.so.6 [0x2b51371e330f]
    5. /lib64/libc.so.6(cfree+0x4b) [0x2b51371e376b]
    6. mem<double>::mem()(.../cactus_Datura-carpet-hg-test)
    7. data<double>::data()(.../cactus_Datura-carpet-hg-test)
    8. ggf::recompose_free_old(int)(.../cactus_Datura-carpet-hg-test)
    9. dh::recompose(int, bool)(.../cactus_Datura-carpet-hg-test)
    a. gh::recompose(int, bool)(.../cactus_Datura-carpet-hg-test)
    b. Carpet::Recompose(_cGH const*, int, bool)(.../cactus_Datura-carpet-hg-test)
    c. .../cactus_Datura-carpet-hg-test [0x13adabc]
    d. Carpet::Evolve(tFleshConfig*)(.../cactus_Datura-carpet-hg-test)
    e. .../cactus_Datura-carpet-hg-test(main+0xa5) [0x4de135]
    f. /lib64/libc.so.6(libc_start_main+0xf4) [0x2b513718e994]
    10. .../cactus_Datura-carpet-hg-test [0x4dde49]

    > Since progress on this problem has stalled, I suggest disabling vectorisation in Carpet's prolongation operator. There is a statement "#if 0" in line 226; changing this to "#if 1" should enable the scalar code and thus circumvent the vectorised code.

    I have disabled vectorisation in this section of the code and still get a segfault, so I guess the problem must be elsewhere (LoopControl?). It certainly has to be somewhere in Carpet, because when I replace the Mercurial version with the git version my simulation runs without any problems.

    I also tried debugging with gdb, but unfortunately once I disabled optimisation the crash no longer happened!

  6. Erik Schnetter
    • removed comment

    LoopControl is another candidate that can create problems when vectorising. It has a self-check built in that is enabled via LoopControl::do_selftest = yes (and which is somewhat expensive). Could you give this a try?

  7. Barry Wardell
    • removed comment

    Replying to [comment:6 eschnett]:

    > LoopControl is another candidate that can create problems when vectorising. It has a self-check built in that is enabled via LoopControl::do_selftest = yes (and which is somewhat expensive). Could you give this a try?

    I have tried this and the crash still happens as before without any error detected by LoopControl. The only way I am able to avoid the crash is to disable optimisation (which annoyingly makes debugging a pain). Do you think this could point to a compiler issue? I'm going to try the same run on Kraken (the crash happens on Datura) to see if the problem happens there too.

    Has anybody else encountered a segfault in CarpetHG with vectorisation enabled and AMR? Or has anybody else tried this combination successfully?

  8. Erik Schnetter
    • removed comment

    Other random ideas coming to my mind:

    • running without OpenMP (with a single thread)
    • building with gcc (and with optimisation) instead of Intel, still on Datura

  9. Barry Wardell
    • removed comment

    I have just tried running the same job on Kraken and it works perfectly fine, without any segfault. I'm using the current SimFactory2 optionlists kraken-intel.cfg and datura.cfg. This means that on Kraken I'm using a slightly different version of the Intel compiler (11.1.038 vs 11.1.072) and some different optimisation settings (in particular SSE2 instead of SSE4.1). Note, however, that when I run the job on Damiana (damiana.cfg), which uses SSE2 but is otherwise identical to Datura, I get the same segfault.

    Replying to [comment:8 eschnett]:

    > Other random ideas coming to my mind:
    > - running without OpenMP (with a single thread)
    > - building with gcc (and with optimisation) instead of Intel, still on Datura

    I will try these suggestions.

  10. Barry Wardell
    • removed comment

    Replying to [comment:9 barry.wardell]:

    > - running without OpenMP (with a single thread)

    Running on a single thread no longer triggers the segfault. So to summarize the combination required to trigger the problem:

    • Mercurial version of Carpet
    • Vectorisation enabled
    • Datura or Damiana (or at least not Kraken)
    • Optimisation (-O2) enabled
    • >1 OpenMP thread
    • Moving boxes AMR

    and then the segfault happens in ggf::recompose_free_old(int).

  11. Erik Schnetter
    • removed comment

    It could also be the particular version of the Intel compiler. Your licence should be good for other versions as well; if you kept the install image for a previous version around, you could give this a try.

  12. Barry Wardell
    • removed comment

    Replying to [comment:11 eschnett]:

    > It could also be the particular version of the Intel compiler. Your licence should be good for other versions as well; if you kept the install image for a previous version around, you could give this a try.

    I have now tried with Intel Compiler 12.0.2 (I was previously using 11.1.072) and with OpenMPI 1.5.4 (previously 1.4.3) and the segfault happens just as before.

  13. anonymous
    • removed comment

    Attached is the thornlist, optionlist and parameter file I use to reproduce the problem

  14. Barry Wardell
    • removed comment

    Replying to [comment:15 anonymous]:

    > Attached is the thornlist, optionlist and parameter file I use to reproduce the problem

    I forgot to mention that I run this with 120 cores. On Datura, I run with 6 threads and on Damiana with 2 threads. The crash usually happens within the first 1500 iterations, although the exact iteration differs from run to run.

  15. Erik Schnetter
    • removed comment

    I believe that setting

    LoopControl::use_random_restart_hill_climbing = no

    circumvents the segfault. Could you check this?

  16. Barry Wardell
    • removed comment

    I can confirm that when I use this setting the segfault no longer happens. Does this mean that you have identified the problem? Is it a bug in LoopControl? Or a compiler bug?

  17. Erik Schnetter
    • removed comment

    No, I have not identified the problem, but I assume that it is either an error in LoopControl's optimization mechanism or in the compiler.

    I have asked Nico to install a malloc debugger on Damiana; could you follow up? This may be the easiest way to debug this.
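
To illustrate the idea behind a malloc debugger mentioned in comment 17, here is a minimal sketch of guard-word ("canary") checking. It is not the tool installed on Damiana and not part of Carpet; the class GuardedBuffer and the constants kGuardWords and kCanary are hypothetical names chosen for this example. The buffer is padded on both sides with a known bit pattern, and a later check reveals whether any code wrote past either end.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical constants for this sketch.
constexpr std::size_t kGuardWords = 4;   // guard words on each side
constexpr double kCanary = -1.2345e300;  // unlikely to occur in real data

// A buffer of doubles surrounded by guard words. Writes that stray a few
// elements past either end land in the guard region and can be detected
// afterwards, instead of silently corrupting neighbouring heap data.
class GuardedBuffer {
 public:
  explicit GuardedBuffer(std::size_t n)
      : n_(n), storage_(n + 2 * kGuardWords, kCanary) {
    // Zero the usable part; the guard regions keep the canary value.
    for (std::size_t i = 0; i < n_; ++i) storage_[kGuardWords + i] = 0.0;
  }

  double* data() { return storage_.data() + kGuardWords; }
  std::size_t size() const { return n_; }

  // Returns true if no guard word has been overwritten.
  bool intact() const {
    for (std::size_t i = 0; i < kGuardWords; ++i) {
      if (storage_[i] != kCanary) return false;                     // underrun
      if (storage_[kGuardWords + n_ + i] != kCanary) return false;  // overrun
    }
    return true;
  }

 private:
  std::size_t n_;
  std::vector<double> storage_;
};
```

Calling intact() after each suspect kernel narrows down which piece of code overran its buffer, which a crash inside malloc or free, as in the backtraces above, does not reveal on its own.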

  18. Erik Schnetter

    I have made

    LoopControl::use_random_restart_hill_climbing = no

    the default in Carpet since this seems to avoid the segfault.

  19. Ian Hinder reporter
    • marked as
    • removed milestone
    • removed comment

    I am removing the release milestone and reducing the priority to major, since it no longer occurs with the default parameters and the workaround is straightforward.

  20. Ian Hinder reporter
    • marked as
    • removed comment

    According to the comments above, the default for LoopControl::use_random_restart_hill_climbing has been changed to "no" and this avoids the problem. Unless someone wants to use that feature, it sounds like the problem is effectively fixed. I do not even know if the problem can be reproduced now. If it can, then probably the ticket should be considered a bug in LoopControl's optimisation code.

  21. Roland Haas
    • edited description
    • changed status to closed

    Datura and Damiana (the clusters mentioned here) no longer exist. However, most clusters now use VECTORISE=yes by default and the issue is not seen there. The parameter use_random_restart_hill_climbing no longer exists, its function having (likely) been taken over by use_random_restart_hill_climbing, which defaults to 0, since otherwise LoopControl remembers too many tries, becoming slow and using too much memory (the likely cause of the issue here).
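
For readers unfamiliar with the technique named by the parameter, here is a generic sketch of random-restart hill climbing over a single integer setting, with a memory of tried configurations. This is not LoopControl's code; the function hill_climb, the struct SearchResult and the cap max_remembered are assumptions made up for this example. It shows why an unbounded memory of tries grows with every new configuration explored, and how a cap limits it.

```cpp
#include <cstddef>
#include <map>
#include <random>

struct SearchResult {
  int best_x;              // best configuration found
  double best_cost;        // its cost
  std::size_t remembered;  // number of configurations kept in memory
};

// Minimise cost(x) for integer x in [lo, hi]. Each restart begins at a
// random x and moves to the cheaper neighbour until none is cheaper.
template <typename Cost>
SearchResult hill_climb(Cost cost, int lo, int hi, int restarts,
                        std::size_t max_remembered, unsigned seed) {
  std::mt19937 rng(seed);
  std::uniform_int_distribution<int> pick(lo, hi);
  std::map<int, double> tried;  // memory of evaluated configurations

  // Look up a remembered cost, or evaluate and remember it if there is room.
  auto eval = [&](int x) -> double {
    const auto it = tried.find(x);
    if (it != tried.end()) return it->second;
    const double c = cost(x);
    if (tried.size() < max_remembered) tried.emplace(x, c);
    return c;
  };

  int best_x = lo;
  double best_cost = eval(lo);
  for (int r = 0; r < restarts; ++r) {
    int x = pick(rng);
    double c = eval(x);
    for (;;) {
      int next = x;
      double next_cost = c;
      if (x > lo) {
        const double left = eval(x - 1);
        if (left < next_cost) { next = x - 1; next_cost = left; }
      }
      if (x < hi) {
        const double right = eval(x + 1);
        if (right < next_cost) { next = x + 1; next_cost = right; }
      }
      if (next == x) break;  // local minimum reached
      x = next;
      c = next_cost;
    }
    if (c < best_cost) { best_x = x; best_cost = c; }
  }
  return {best_x, best_cost, tried.size()};
}
```

Without the cap, the map holds one entry per distinct configuration ever evaluated, so memory use and lookup time grow as the search keeps exploring.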
