Confusing behavior if timeout is present on front-end but not on compute node

Issue #584 resolved
Paul Hargrove created an issue

The current logic behind make check (among other targets) attempts to use the timeout utility to detect hung processes. It tries to validate that the utility exists. However, that validation is performed on the host running make, while the eventual launch of a test will run timeout on the compute nodes.
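One way to close that gap would be to probe for timeout through the same launcher that will run the tests, instead of on the host running make. A minimal sketch, assuming the launcher takes the remote command as its trailing arguments and that the compute nodes' timeout accepts the GNU coreutils form `timeout DURATION COMMAND`. `have_remote_timeout` is a hypothetical name, not something in the current makefiles:

```shell
# Hypothetical helper: check for `timeout` where the tests will actually run.
# Usage: have_remote_timeout LAUNCHER [LAUNCHER_ARGS...]
have_remote_timeout() {
  # Run a trivial command under timeout via the launcher. Any non-zero
  # status (missing utility, launcher failure) is treated as "not usable".
  "$@" timeout 1 true >/dev/null 2>&1
}
```

For example, `have_remote_timeout ssh pcp-d-8` would exercise timeout on that node. The probe cannot tell a missing utility apart from a launcher that failed for some other reason, so it is only safe as a reason to skip the use of timeout.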

Absent any special-casing, when timeout is missing on (only) the compute nodes the failure shows up as a message like "FAILED (exitcode=1)", where the exit code may vary. I think this could be improved upon.

If the command is run by bash (or another shell we can test), then it may be sufficient to scan the output for ": command not found", as we already do for messages regarding fatal signals. However, it is also possible that the command is run directly by the batch system without an intervening shell. I need to look into that.
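A minimal sketch of such a scan, assuming the output of the failed run has been captured to a file. `scan_for_missing_command` is a hypothetical name used only for illustration:

```shell
# Hypothetical helper: print any "command not found" lines from a test's output.
# $1 is a file holding the captured stdout+stderr of the failed run.
scan_for_missing_command() {
  # -F treats the pattern as a fixed string rather than a regular expression.
  grep -F ': command not found' "$1"
}
```

The function's exit status is grep's: 0 if at least one matching line was printed, non-zero otherwise. This only helps when a shell is what reports the missing command.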

Comments (4)

  1. Paul Hargrove reporter

    However, it is also possible that the command is run directly by the batch system without an intervening shell.

    Sigh. Even setting aside batch systems, treatment of "wrapper" elements in the command by our own ssh-spawner yields an error message other than "command not found" from bash:

    {phargrov@pcp-d-5 ibv-conduit}$ which wrap
    /tmp/wrap
    {phargrov@pcp-d-5 ibv-conduit}$ cat /tmp/wrap
    #!/bin/bash
    exec time "$@"
    {phargrov@pcp-d-5 ibv-conduit}$ ssh pcp-d-8 wrap
    bash: wrap: command not found
    {phargrov@pcp-d-5 ibv-conduit}$ ssh pcp-d-9 wrap
    bash: wrap: command not found
    {phargrov@pcp-d-5 ibv-conduit}$ export GASNET_SSH_SERVERS='pcp-d-8 pcp-d-9'
    {phargrov@pcp-d-5 ibv-conduit}$ ./contrib/gasnetrun_ibv -n2 wrap ./testgasnet
    bash: /tmp/wrap: No such file or directory
    *** SSH-SPAWNER (pcp-d-5:29571): Failed to start processes on pcp-d-8, possibly due to an inability to establish an ssh connection from pcp-d-5 without interactive authentication.
    *** FATAL ERROR (pcp-d-5:29571): in reap_one() at runtime/gasnet/other/ssh-spawner/gasnet_bootstrap_ssh.c:521: One or more processes died before setup was completed
    *** WARNING (pcp-d-5:29571): Ignoring call to gasneti_print_backtrace_ifenabled before gasneti_backtrace_init
    Aborted
    

    I am not feeling as comfortable about special-casing ": No such file or directory" as I was with ": command not found".

  2. Dan Bonachea

    I am not feeling as comfortable about special-casing ": No such file or directory" as I was with ": command not found".

    I'm not worried about that, I'd say add both.

    This is only a filter, and it is only active when the test has already been determined to have failed, so at worst it prints irrelevant lines in an oddball failure mode.

  3. Paul Hargrove reporter

    Argh. Chrome "ate my homework". So here is a short version of what I'd typed up (with complete outputs) last night:

    At least srun, jsrun and hydra from mpich3 include No such file or directory in their error output for this case.

    OpenMPI has a long message w/o that string, but I don't currently care sufficiently to worry about recognizing it.

  4. Paul Hargrove reporter

    Resolve issue 584

    This commit resolves issue #584 "Confusing behavior if timeout is present on front-end but not on compute node" by adding two new strings to those we grep for in the output of failing test runs.

    → <<cset bffd3fe2e419>>

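The behavior described in the resolving commit can be sketched roughly as below. This is an illustration under assumptions, not the code from the changeset: `report_failure` and its arguments are hypothetical, and the real harness already greps for other strings (such as fatal-signal messages) that are omitted here.

```shell
# Hypothetical helper: report a failed test and surface likely "missing command" lines.
# $1 = exit code of the test run, $2 = file with its captured output.
report_failure() {
  # The filter is only active for runs already known to have failed.
  if [ "$1" -eq 0 ]; then
    return 0
  fi
  echo "FAILED (exitcode=$1)"
  # Fixed-string match on either message; finding no match is not an error.
  grep -F -e ': command not found' -e ': No such file or directory' "$2" || true
}
```

Because the filter runs only after a failure, a false match costs nothing more than a few irrelevant lines in the report.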