issue138 exit-time hangs on aries/debug

Issue #505 resolved
Dan Bonachea created an issue

Nightly testing shows that we appear to have a persistent exit-time hang in issue138 when running in debug mode on 8 nodes of coriKNL or theta.

There are at least two potentially distinct failure modes:

  1. Here is a representative example, where the final "SUCCESS" test output has NOT been printed (or at least not reached the console) before timing out and being forcibly terminated by SLURM.
  2. Here is a representative example, where the "SUCCESS" output demonstrates the test has collectively passed the barrier inside print_test_success() and entered finalize(), and then apparently hung before any of the processes have exited (as reported by SLURM logging).

This test performs some pretty intense CPU overhead and rpc communication, and the first failure mode might indicate the test has truly just run too long. However the second failure mode is FAR more concerning, as the user program should be "done" by then and we've somehow still timed out.

Increasing the test timeout by 45 seconds (as of 9-22) does not appear to have helped.

Comments (6)

  1. Paul Hargrove

    A quick look at today's results:

    hargrove@cori06:~> grep issue138 upcnightly-*/EX-cori{-aries-gnu-pthr,_knl-aries-intel_old}/runtime/log/*/run.rpt
    upcnightly-hsw/EX-cori-aries-gnu-pthr/runtime/log/harness-opt/run.rpt:[external-upcxx/issue138-seq]   7sec  20210922_032418  SUCCESS
    upcnightly-hsw/EX-cori-aries-gnu-pthr/runtime/log/harness-opt/run.rpt:[external-upcxx/issue138-par]   6sec  20210922_033440  SUCCESS
    upcnightly-hsw/EX-cori-aries-gnu-pthr/runtime/log/harness/run.rpt:[external-upcxx/issue138-seq]   34sec  20210922_014524  SUCCESS
    upcnightly-hsw/EX-cori-aries-gnu-pthr/runtime/log/harness/run.rpt:[external-upcxx/issue138-par]   32sec  20210922_015715  SUCCESS
    upcnightly-knl/EX-cori_knl-aries-intel_old/runtime/log/harness-opt/run.rpt:[external-upcxx/issue138-seq]   31sec  20210922_053145  SUCCESS
    upcnightly-knl/EX-cori_knl-aries-intel_old/runtime/log/harness-opt/run.rpt:[external-upcxx/issue138-par]   32sec  20210922_055302  SUCCESS
    upcnightly-knl/EX-cori_knl-aries-intel_old/runtime/log/harness/run.rpt:[external-upcxx/issue138-seq]   251sec  20210922_024202  FAILED (TIME/NEW)
    upcnightly-knl/EX-cori_knl-aries-intel_old/runtime/log/harness/run.rpt:[external-upcxx/issue138-par]   241sec  20210922_031439  FAILED (TIME/NEW)
    

    My first observation is that Haswell results ran quickly and did not time out.

    My second observation is that on Haswell the opt->debug expansion factor is about 5x.
    So, one might have expected the KNL debug runs to complete in around 160s (but are timing out at beyond 240s).

  2. Paul Hargrove

    Manually testing 1, 2 and 4 ranks in a 2-node allocation on Cori KNL.
    I am using the issue138-seq executable in /global/cscratch1/sd/hargrove/upcnightly-knl/EX-cori_knl-aries-intel_old/runtime/work/dbg/gasnet/tests/upcr-harness/external-upcxx

    I find 1 and 2 ranks (thus 1ppn) complete in about 2:45 (give or take a few).
    Meanwhile, 4 ranks (thus 2ppn) completed in 4:08.

    Since 4 ranks ran in 248 seconds for me, I am now less certain that the TIME failures against a 225s limit last night are actual hangs. However, the real runs are 8 nodes X 1 ppn, which I've not yet attempted.

  3. Paul Hargrove

    In manual testing, five attempts at 8node X 1ppn ran in 243s, 243s, 293s, 239s and 246s. This is the same job topology as in the nightly tests.

    So, it looks like the test is just unexpectedly slow and harness.conf needs to raise the timeout beyond its current 225s. Given the 293s outlier above, I am thinking 2x multiplier for a 360s timeout.

    @Dan Bonachea, do you concur?

  4. Dan Bonachea reporter

    I am thinking 2x multiplier for a 360s timeout. do you concur?

    I certainly have no objection to raising the time limit to give this test a few more minutes, and hopefully that clears the issue.

    It's also worth noting that in addition to being a brutal test, the debug executable from harness that was timing out is 126MB, which is probably also slowing spawn time.

  5. Paul Hargrove
    • changed status to open

    The timeout is now 360s.

    This issue remains open pending multiple days of successful runs to confirm that there is not a hang which was simply not observed in manual tests.

    Moving status to "open" to hopefully remind us that this is in need of attention.

  6. Dan Bonachea reporter

    Multiple nights of testing confirm the persistent timeouts are resolved on cori and theta, as well as sparc64.

  7. Log in to comment