monit crashes at stop action

Issue #72 resolved
Anonymous created an issue

Hi

Monit version = 5.8.1
OS = Red Hat Enterprise Linux Server release 6.4, 64-bit

I tested first with the precompiled binary, then with a build from source.

Below is an extract of my config file:

    check process stress matching "stress*"
        stop program = "/bin/true"
        if cpu > 24% for 1 cycles then alert
        if cpu > 24% for 2 cycles then restart

The goal is to monitor a process and kill it when it uses more than 24% of CPU (about 100% of one core on a 4-core CPU). The restart is useless in this context, because the process is launched by a third-party process.

Regardless of the stop program, and whether the action is stop or restart, monit crashes each time it performs the stop action:

    Jul 22 11:23:08 pedtanya03 monit[18859]: 'stress' process is running with pid 18861
    Jul 22 11:23:18 pedtanya03 monit[18859]: 'stress' cpu usage of 24.9% matches resource limit [cpu usage>24.0%]
    Jul 22 11:23:18 pedtanya03 monit[18859]: 'stress' cpu usage of 24.9% matches resource limit [cpu usage>24.0%]
    Jul 22 11:23:28 pedtanya03 monit[18859]: 'stress' cpu usage of 24.9% matches resource limit [cpu usage>24.0%]
    Jul 22 11:23:28 pedtanya03 monit[18859]: 'stress' trying to restart
    Jul 22 11:23:28 pedtanya03 monit[18859]: 'stress' stop: /bin/true
    Jul 22 11:23:28 pedtanya03 abrtd: Directory 'ccpp-2014-07-22-11:23:28-18859' creation detected
    Jul 22 11:23:28 pedtanya03 abrt[18870]: Saved core dump of pid 18859 (/root/monit-5.8.1/monit) to /var/spool/abrt/ccpp-2014-07-22-11:23:28-18859 (11440128 bytes)
    Jul 22 11:23:28 pedtanya03 abrtd: Executable '/root/monit-5.8.1/monit' doesn't belong to any package
    Jul 22 11:23:28 pedtanya03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2014-07-22-11:23:28-18859' exited with 1
    Jul 22 11:23:28 pedtanya03 abrtd: Corrupted or bad directory '/var/spool/abrt/ccpp-2014-07-22-11:23:28-18859', deleting

Unfortunately, the core dump is systematically deleted by the abrtd daemon.

Do you have any idea?

Comments (8)

  1. Tildeslash repo owner

    Hello,

    I'm unable to reproduce the issue (on CentOS 6.5) using the following configuration:

    set daemon 5
    set httpd port 2812 allow monit:monit
    set logfile /var/log/monit.log
    check process stress matching "stress*"
    stop program = "/bin/true"
    if cpu > 24% for 1 cycles then alert
    if cpu > 24% for 2 cycles then restart
    

    You can enable the coredump by modification of /etc/abrt/abrt-action-save-package-data.conf and changing ProcessUnpackaged=no to ProcessUnpackaged=yes and then running "service abrtd restart".

    Once you have the core dump, please send it to support@mmonit.com along with the binary /root/monit-5.8.1/monit for which it was created.

    Note that you have defined only the stop program but use the restart action. In that case the process will only be stopped, as Monit has no "start program" defined, and you'll see output like this:

    'stress' process is not running
    'stress' trying to restart
    'stress' start skipped -- method not defined
    
  2. vincent giacomini

    Some news: the backtrace is below:

    0 0x0000003c63c328a5 in raise () from /lib64/libc.so.6
    1 0x0000003c63c34085 in abort () from /lib64/libc.so.6
    2 0x0000003c63c2ba1e in __assert_fail_base () from /lib64/libc.so.6
    3 0x0000003c63c2bae0 in __assert_fail () from /lib64/libc.so.6
    4 0x0000000000418867 in wait_process (s=0x19d0690, expect=Process_Stopped) at src/control.c:92
    5 0x0000000000418ca1 in do_stop (s=0x19d0690, flag=0) at src/control.c:170
    6 0x00000000004196e9 in control_service (S=0x19d0650 "stress", A=2) at src/control.c:417
    7 0x000000000041b266 in handle_action (E=0x19e4fa0, A=0x19d13e0) at src/event.c:709
    8 0x000000000041b01a in handle_event (E=0x19e4fa0) at src/event.c:653
    9 0x000000000041a134 in Event_post (service=0x19d0690, id=2, state=1, action=0x19d13c0, s=0x47bc29 "%s") at src/event.c:223
    10 0x00000000004319cb in check_process_resources (s=0x19d0690, r=0x19d1420) at src/validate.c:400
    11 0x0000000000433d79 in check_process (s=0x19d0690) at src/validate.c:1026
    12 0x0000000000433aba in validate () at src/validate.c:973
    13 0x0000000000417190 in do_default () at src/monit.c:572
    14 0x0000000000416a2f in do_action (args=0x7fff6696cf98) at src/monit.c:412
    15 0x00000000004164fe in main (argc=3, argv=0x7fff6696cf98) at src/monit.c:166

    I'll send you the core dump and the binary

  3. Tildeslash repo owner

    Thanks for the data. Monit stopped because neither a start nor a restart program was defined. To fix the issue, just add "start program". On RHEL6 we also recommend "restart program", as upstart has synchronization problems if start is called before stop has finished (when "restart program" is defined, Monit calls only that instead of stop + start).
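
    As an illustration of the advice above, a minimal sketch of the reporter's check with both programs added. The two script paths are placeholders invented for this example, not taken from the report; substitute whatever actually starts and restarts the process on your system.

    ```monit
    check process stress matching "stress*"
        # Placeholder paths (assumptions): replace with your real commands.
        start program   = "/path/to/start-stress"
        stop program    = "/bin/true"
        restart program = "/path/to/restart-stress"
        if cpu > 24% for 1 cycles then alert
        if cpu > 24% for 2 cycles then restart
    ```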

    The start/restart handling has already been refactored in the development version, and Monit won't stop even if no start/restart program is defined and the restart action is called.

    You can get the development snapshot here: https://bitbucket.org/tildeslash/monit/get/master.tar.gz

    To compile:

    tar -xzf master.tar.gz
    cd tildeslash*
    ./bootstrap
    ./configure
    make
    
  4. Tildeslash repo owner

    If you don't want to restart, then just don't define any stop/start/restart program or use just the alert action - for example:

    check process stress matching "stress*"
        if cpu > 24% for 2 cycles then alert
    

    There is also the "mode passive" option, which makes Monit ignore any restart attempts.
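
    For reference, a minimal sketch of how "mode passive" might be combined with the reporter's check, assuming the same service name and threshold as in the original report:

    ```monit
    check process stress matching "stress*"
        # Passive mode: Monit only monitors and alerts, it does not try to fix.
        mode passive
        if cpu > 24% for 2 cycles then alert
    ```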

  5. vincent giacomini

    Oops, I misspoke. My need is as follows: I need to check the CPU usage of a process launched by a third-party application. If the process's CPU usage is too high, Monit has to stop it with a kill signal. The process is launched on demand by the third-party application.

    Does "mode passive" meet my need?

  6. Tildeslash repo owner

    Then use the "stop" action instead of restart:

    check process stress matching "stress*"
        stop program = "..."
        if cpu > 24% for 1 cycles then alert
        if cpu > 24% for 2 cycles then stop
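
    Since the stated goal is to stop the process with a kill signal, here is one way the stop program could be filled in. This is only a sketch: the use of pkill, its path, and the exact process name "stress" are assumptions for illustration, not something confirmed in this thread.

    ```monit
    check process stress matching "stress*"
        # Assumption: pkill lives at /usr/bin/pkill and the process is named exactly "stress".
        # pkill sends SIGTERM by default; -x matches the process name exactly.
        stop program = "/usr/bin/pkill -x stress"
        if cpu > 24% for 1 cycles then alert
        if cpu > 24% for 2 cycles then stop
    ```

    Note that pkill signals every process with that name, so this only fits if no other "stress" processes should survive.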
    