issue138 exit-time hangs on aries/debug
Nightly testing shows that we appear to have a persistent exit-time hang in issue138 when running in debug mode on 8 nodes of Cori KNL or Theta.
There are at least two potentially distinct failure modes:
- Here is a representative example, where the final "SUCCESS" test output has NOT been printed (or at least not reached the console) before timing out and being forcibly terminated by SLURM.
- Here is a representative example, where the "SUCCESS" output demonstrates the test has collectively passed the barrier inside print_test_success() and entered finalize(), and then apparently hung before any of the processes have exited (as reported by SLURM logging).
This test involves some pretty intense CPU work and RPC communication, and the first failure mode might indicate the test has truly just run too long. However, the second failure mode is FAR more concerning: the user program should be "done" by then, yet we've somehow still timed out.
Increasing the test timeout by 45 seconds (as of 9-22) does not appear to have helped.
Comments (6)
-
-
Manually testing 1, 2 and 4 ranks in a 2-node allocation on Cori KNL.
I am using the issue138-seq executable in /global/cscratch1/sd/hargrove/upcnightly-knl/EX-cori_knl-aries-intel_old/runtime/work/dbg/gasnet/tests/upcr-harness/external-upcxx
I find 1 and 2 ranks (thus 1ppn) complete in about 2:45 (give or take a few seconds). Meanwhile, 4 ranks (thus 2ppn) completed in 4:08.
Since 4 ranks ran in 248 seconds for me, I am now less certain that the TIME failures against a 225s limit last night are actual hangs. However, the real runs are 8 nodes X 1ppn, which I've not yet attempted.
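As a quick illustrative check (the variable names here are my own, not from the test), the minute:second figures above convert to seconds as:

```python
# Convert the reported minute:second runtimes to seconds.
one_ppn = 2 * 60 + 45   # 2:45 for 1 and 2 ranks (1 ppn)
two_ppn = 4 * 60 + 8    # 4:08 for 4 ranks (2 ppn)

# Going from 1 ppn to 2 ppn slowed the test by roughly 1.5x.
print(one_ppn, two_ppn, round(two_ppn / one_ppn, 2))  # -> 165 248 1.5
```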
-
In manual testing, five attempts at 8 nodes X 1ppn ran in 243s, 243s, 293s, 239s and 246s. This is the same job topology as in the nightly tests.
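For a quick sanity check (illustrative only), those five measurements can be compared against the current limit and the proposed one:

```python
# Manual 8-node x 1ppn runtimes in seconds, from the runs above.
runtimes = [243, 243, 293, 239, 246]
old_limit, proposed_limit = 225, 360

# Every manual run exceeds the old 225s limit, consistent with the
# nightly TIME failures being slowness rather than a true hang.
print(sum(t > old_limit for t in runtimes))  # -> 5

# Headroom of the proposed limit over the worst observed run.
print(round(proposed_limit / max(runtimes), 2))  # -> 1.23
```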
So, it looks like the test is just unexpectedly slow and harness.conf needs to raise the timeout beyond its current 225s. Given the 293s outlier above, I am thinking a 2x multiplier for a 360s timeout. @Dan Bonachea, do you concur?
-
reporter wrote: "I am thinking 2x multiplier for a 360s timeout. Do you concur?"
I certainly have no objection to raising the time limit to give this test a few more minutes, and hopefully that clears the issue.
It's also worth noting that, in addition to this being a brutal test, the debug executable from harness that was timing out is 126MB, which probably also slows spawn time.
-
- changed status to open
The timeout is now 360s.
This issue remains open pending multiple days of successful runs to confirm that there is not a hang which was simply not observed in manual tests.
Moving status to "open" to hopefully remind us that this is in need of attention.
-
reporter - changed status to resolved
Multiple nights of testing confirm the persistent timeouts are resolved on Cori and Theta, as well as on sparc64.
A quick look at today's results:
My first observation is that Haswell results ran quickly and did not time out.
My second observation is that on Haswell the opt->debug expansion factor is about 5x.
So, one might have expected the KNL debug runs to complete in around 160s, but instead they are timing out beyond 240s.
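The arithmetic behind that expectation can be sketched as follows; note the ~32s opt-mode KNL runtime is an inference (160s divided by the 5x factor), not a reported measurement:

```python
# Haswell shows roughly a 5x opt -> debug runtime expansion.
expansion = 5
opt_runtime_knl = 32  # seconds; inferred as 160 / 5, not measured

expected_debug = opt_runtime_knl * expansion
print(expected_debug)  # -> 160

# Actual debug runs are timing out beyond 240s, i.e. at least
# 1.5x slower than the expansion factor alone would predict.
print(round(240 / expected_debug, 2))  # -> 1.5
```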