exec'd program aborts with SIGPIPE when writing to stderr

Issue #434 new
Don created an issue

My exec program (which is bash script) stopped to work recently. Journal is getting such error messages:

Aug 05 16:28:59 live monit[1101]: 'apache' exec: /root/apache-fail.sh
Aug 05 16:28:59 live systemd[1]: Received SIGCHLD from PID 26094 (apache-fail.sh).
Aug 05 16:28:59 live systemd[1]: Child 26094 (apache-fail.sh) died (code=killed, status=13/PIPE)

Script was run with #!/bin/bash -x shebang, thus, outputing every command into stderr. When I remove -x and inside of script redirect stdout and stderr into a file script started to work again.

lsof -nP -p $$ inside of exec script shows:

COMMAND    PID USER   FD   TYPE             DEVICE  SIZE/OFF    NODE NAME
apache-fa 8012 root  cwd    DIR              253,0      4096       2 /
apache-fa 8012 root  rtd    DIR              253,0      4096       2 /
apache-fa 8012 root  txt    REG              253,0    960376  917674 /usr/bin/bash
apache-fa 8012 root  mem    REG              253,0     61928  927293 /usr/lib64/libnss_files-2.17.so
apache-fa 8012 root  mem    REG              253,0 106065056  925303 /usr/lib/locale/locale-archive
apache-fa 8012 root  mem    REG              253,0   2112384  927116 /usr/lib64/libc-2.17.so
apache-fa 8012 root  mem    REG              253,0     19520  927141 /usr/lib64/libdl-2.17.so
apache-fa 8012 root  mem    REG              253,0    174520  927410 /usr/lib64/libtinfo.so.5.9
apache-fa 8012 root  mem    REG              253,0    164440  927070 /usr/lib64/ld-2.17.so
apache-fa 8012 root  mem    REG              253,0     26254 1182178 /usr/lib64/gconv/gconv-modules.cache
apache-fa 8012 root    0r   CHR                1,3       0t0    1028 /dev/null
apache-fa 8012 root    1u  unix 0xffff880036ac5a00       0t0   22102 socket
apache-fa 8012 root    2u  unix 0xffff880036ac5a00       0t0   22102 socket
apache-fa 8012 root  255r   REG              253,0       796 2108312 /root/apache-fail.sh

Is it correct that descriptors 1 and 2 are connected to some unix/socket instead of FIFO/pipe?

This is in monit 5.18, I didn't notice such behavior in 5.14.

Comments (6)

  1. Tildeslash repo owner

    Please provide the following so we can reproduce:

    • The Monit statement
    • A minimum script which demonstrate the problem
  2. Don reporter

    Do you think having socket on the stdout and stderr is ok? I can reproduce that (just lsof -nP -p $$ >> /root/exec.log at the top of the script).

    Problem with script aborting with SIGPIPE is intermittent. I did not investigate it deeper than finding socket on stdout and stderr — and asking you question, which you did not answer. Month before it was working without problem, but since last week I got a lot of failures.

  3. Don reporter

    I am unable to reproduce SIGPIPE abort on other box in minimal configuration. On where it aborts there is many monit rules (maybe that's important factor), and example simplified rule is:

    check file restart-apache-test with path /root/restart-apache-test
            if changed timestamp then exec "/root/apache-fail.sh"
    

    Simplified version of script is:

    #!/bin/bash
    
    set -x
    cd /root
    D=`date +"%Y-%m-%dT%H:%M:%S"`
    mkdir $D
    cd $D
    exec >> stdout.log 2>&1
    
    date
      ps axuwwwf
    date
      systemctl restart httpd.service
    date
    

    Experiment:

    root@live:~# touch /root/restart-apache-test
    
    # I wait for alert, then
    
    root@live:~# ls -altr|tail
    -rwxr-xr-x  1 root root         678 Aug  6 20:25 apache-fail.sh
    -rw-r--r--  1 root root           0 Aug  6 20:26 restart-apache-test
    
    # dir with date is not created.
    
    root@live:~# journalctl -r|grep 'apache-fail.sh'
    Aug 06 20:30:02 live systemd[1]: Child 7168 (apache-fail.sh) died (code=killed, status=13/PIPE)
    Aug 06 20:30:02 live systemd[1]: Received SIGCHLD from PID 7168 (apache-fail.sh).
    Aug 06 20:30:02 live monit[1101]: 'restart-apache-test' exec: /root/apache-fail.sh
    ...
    
  4. Don reporter

    Experiment 2. Script:

    #!/bin/bash
    
    date >> /root/restart-apache.log
    echo stdout
    echo stdout >> /root/restart-apache.log
    echo stderr >&2
    echo stderr >> /root/restart-apache.log
    
    set -x
    echo after set-x >> /root/restart-apache.log
    
    cd /root
    echo after cd/root >> /root/restart-apache.log
    D=`date +"%Y-%m-%dT%H:%M:%S"`
    mkdir $D
    cd $D
    exec >> stdout.log 2>&1
    
    date
      ps axuwwwf
    date
      systemctl restart httpd.service
    date
    

    After restart-apache-test is triggered and script aborted with SIGPIPE:

    root@live:~# cat restart-apache.log
    Sat Aug  6 20:39:26 MSK 2016
    
  5. Don reporter
    root@live:~# ps ax|grep monit
     1101 ?        Ssl    2:01 /usr/bin/monit -I
    root@live:~# strace -o monit.strace.log -ff -p 1101 &
    root@live:~# touch /root/restart-apache-test
    

    After alert is received:

    root@live:~# egrep -i 'execve|clone|pipe' monit.strace.log.*
    monit.strace.log.1101:write(2, "'restart-apache-test' timestamp "..., 130) = -1 EPIPE (Broken pipe)
    monit.strace.log.1101:--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=1101, si_uid=0} ---
    monit.strace.log.1101:write(1, "'restart-apache-test' exec: /roo"..., 49) = -1 EPIPE (Broken pipe)
    monit.strace.log.1101:--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=1101, si_uid=0} ---
    monit.strace.log.1101:clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc5bfa64b10) = 8339
    monit.strace.log.8339:clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc5bfa64b10) = 8341
    monit.strace.log.8341:rt_sigaction(SIGPIPE, {SIG_DFL, [], SA_RESTORER|SA_RESTART, 0x7fc5bf13c100}, {SIG_IGN, [], SA_RESTORER|SA_RESTART, 0x7fc5bf13c100}, 8) = 0
    monit.strace.log.8341:execve("/root/apache-fail.sh", ["/root/apache-fail.sh"], [/* 7 vars */]) = 0
    monit.strace.log.8341:clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f6139457a10) = 8344
    monit.strace.log.8341:write(1, "stdout\n", 7)                 = -1 EPIPE (Broken pipe)
    monit.strace.log.8341:--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=8341, si_uid=0} ---
    monit.strace.log.8341:+++ killed by SIGPIPE +++
    monit.strace.log.8344:execve("/usr/bin/date", ["date"], [/* 10 vars */]) = 0
    

    No pipe() syscalls.

    root@live:~# monit -V
    This is Monit version 5.18
    Built with ssl, with pam and with large files
    Copyright (C) 2001-2016 Tildeslash Ltd. All Rights Reserved.
    
  6. Log in to comment