monit reports wrong status

Issue #810 open
dmitry babitsky created an issue

Monit version 5.25.3 (this form doesn't offer that choice)

The program returns 0 (success), but Monit considers it a failure, and vice versa.

Please see the picture; it's worth a thousand words.

Also note that I simplified the config, compared to what's shown in the picture, to this:

IF status != 100 THEN EXEC ...

Comments (18)

  1. Tildeslash repo owner

    The test works as designed:

    The "if changed" test will trigger any time the value changes. The new value becomes the new baseline: the test will turn to "OK" in the next cycle and remain OK until the value changes again.

    In your case the "if changed" test should trigger when the status changes from 0 to 1. If the value is still 1 in the next cycle, the status is changed to OK (as the value remains 1 => no change occurred). When the value changes from 1 to something else, the test will match again.

  2. dmitry babitsky reporter
    • changed status to open

    The test is not "IF CHANGED". The test is "IF status != 100".

    Please look at the picture.

  3. Tildeslash repo owner

    Can you please run Monit in debug mode and attach the output?

    1.) stop monit

    2.) start it in debug mode: monit -vI

  4. dmitry babitsky reporter

    This CHECK PROGRAM generates random successes and failures, using bash's $RANDOM to trigger divide by 0.

    As you see, in case of SUCCESS, MONIT_EVENT still shows 'Status failed'

    In fact, it always shows this status. The UI paints accordingly.

    Relevant configuration:
    CHECK PROGRAM NodeManager WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
      IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"
      REPEAT EVERY 1 CYCLES

      IF STATUS !=0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"
      REPEAT EVERY 1 CYCLES
    
    [dab@d129668-005:/home/dab] /home/dab/scripts/monit -c /home/dab/UserDirs/dba/dev/utils/monit.conf -v | egrep -v 'proc file|system statistic|M/Monit'
    M/Monit enabled but no httpd allowed -- please add 'set httpd' statement
    Runtime constants:
     Control file       = /home/dab/UserDirs/dba/dev/utils/monit.conf
     Log file           = (not defined)
     Pid file           = /home/dab/.monit.pid
     Id file            = /local/data/scratch/monit.id
     State file         = /local/data/scratch/monit.state
     Debug              = True
     Log                = False
     Use syslog         = False
     Is Daemon          = True
     Use process engine = True
     Limits             = {
                        =   programOutput:     512 B
                        =   sendExpectBuffer:  256 B
                        =   fileContentBuffer: 512 B
                        =   httpContentBuffer: 1 kB
                        =   networkTimeout:    5 s
                        =   programTimeout:    30 s
                        =   stopTimeout:       10 s
                        =   startTimeout:      10 s
                        =   restartTimeout:    10 s
                        = }
     On reboot          = start
     Poll time          = 3 seconds with start delay 0 seconds
     Mail from          = Monit Support <monit@foo.bar>
     Mail reply to      = support@domain.com
     Mail subject       = $SERVICE $EVENT at $DATE
     Mail message       = Monit $ACTION $SERVI..(truncated)
     Start monit httpd  = False
    
    The service list contains the following entries:
    
    System Name           = d341369-005
     Monitoring mode      = active
     On reboot            = start
     Every                = Check service every 1 cycles
    
    Program Name          = NodeManager
     Path                 = /bin/bash -c echo $(( 1/($RANDOM%2) ))
     Monitoring mode      = active
     On reboot            = start
     Program timeout      = terminate the program if not finished within 30 s
     Status               = if exit value != 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo FAILURE;echo)' repeat every 1 cycle(s)
     Status               = if exit value = 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)' repeat every 1 cycle(s)
    
    -------------------------------------------------------------------------------
    pidfile '/home/dab/.monit.pid' does not exist
    Starting Monit 5.25.3 daemon
    'd341369-005' Monit 5.25.3 started
    'NodeManager' program started
    'NodeManager' status succeeded (0) -- 1
    'NodeManager' status failed (0) -- 1
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Tue, 29 Jan 2019 11:47:44
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (0) -- 1
    SUCCESS
    
    'NodeManager' program started
    'NodeManager' status failed (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo FAILURE;echo)'
    'NodeManager' status succeeded (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
    'NodeManager' program started
    MONIT_PROGRAM_STATUS=1
    MONIT_DATE=Tue, 29 Jan 2019 11:47:47
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
    FAILURE
    
    'NodeManager' status succeeded (0) -- 1
    'NodeManager' status failed (0) -- 1
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Tue, 29 Jan 2019 11:47:50
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (0) -- 1
    SUCCESS
    
    'NodeManager' program started
    'NodeManager' status succeeded (0) -- 1
    'NodeManager' status failed (0) -- 1
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Tue, 29 Jan 2019 11:47:53
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (0) -- 1
    SUCCESS
    
  5. Tildeslash repo owner

    The problem in the NodeManager example is a configuration issue: each "if status" test is a standalone statement, not a branch of an if-else.

    Monit thus gets the status and evaluates each rule independently. If the first rule matches, it sets the error state; the second rule is then evaluated and resets the error (as it is the negation of the first rule).

    If you need both tests without conflict, you should split the configuration into two services:

    CHECK PROGRAM NodeManager_zero WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
             IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"
    
    CHECK PROGRAM NodeManager_nonzero WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
             IF STATUS !=0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"
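
The independent evaluation described in this comment can be pictured with a small shell model. This is only a sketch, not Monit code: the function name evaluate_rules and the eq/ne rule encoding are invented for illustration. It reproduces the pair of log lines seen earlier in the thread, where a rule that matches is reported as "failed" and a rule that does not match is reported as "succeeded", regardless of the exit status itself.

```shell
#!/bin/sh
# Toy model only, not Monit code. Each rule is "<op> <value>", with op = eq or ne.
# The function body runs in a subshell, so the helper variables stay local.
evaluate_rules() (
    exit_status=$1
    shift
    for rule in "$@"; do
        op=${rule%% *}        # text before the first space
        value=${rule#* }      # text after the first space
        matched=0
        case "$op" in
            eq) [ "$exit_status" -eq "$value" ] && matched=1 ;;
            ne) [ "$exit_status" -ne "$value" ] && matched=1 ;;
        esac
        # A matching rule is reported as "failed", whatever the exit status is.
        if [ "$matched" -eq 1 ]; then
            echo "status failed ($exit_status)"
        else
            echo "status succeeded ($exit_status)"
        fi
    done
)

# Exit status 0 checked against "!= 0" and "= 0", in the order Monit listed them.
evaluate_rules 0 "ne 0" "eq 0"
# prints:
#   status succeeded (0)
#   status failed (0)
```

With two rules that negate each other, exactly one of them matches in every cycle, so one "failed" line appears in every cycle no matter what the program returned.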
    
  6. dmitry babitsky reporter

    This will show two rows in the UI instead of one. It will also probably show 2 rows per host per Test in the M/Monit UI. Is this correct?

    We are evaluating M/Monit for company-wide use. That's a lot of hosts and checks, so creating a separate config entry for each potential return code doesn't sound feasible.

    I also tried: IF STATUS !=100, to handle both 0s and 1s, but this also produced the same result.

    My goal is to handle different statuses of the same check differently. What's the best way to accomplish this?

    Thanks for your help.

  7. dmitry babitsky reporter

    Actually, forget multiple tests: your suggestion doesn't work even in the simplest single test. Note how it says 'status failed' on success.

    CHECK PROGRAM TestSuccess WITH PATH "/bin/true" 
           IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'" 
           REPEAT EVERY 1 CYCLES
    
    Program Name          = TestSuccess
     Path                 = /bin/true
     Monitoring mode      = active
     On reboot            = start
     Program timeout      = terminate the program if not finished within 5 s
     Status               = if exit value = 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)' repeat every 1 cycle(s)
     Every                = Check service every 1 cycles
    
    Starting Monit 5.25.3 daemon
    'd341369-005' Monit 5.25.3 started
    'TestSuccess' program started
    'TestSuccess' status failed (0) -- no output
    'TestSuccess' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    'TestSuccess' program started
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Wed, 30 Jan 2019 10:15:23
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=TestSuccess
    MONIT_DESCRIPTION=status failed (0) -- no output
    SUCCESS
    
  8. Henning Bopp

    I think that Monit "thinks" differently: if you define an event and it gets triggered, it is meant to be "a failure". So every IF results in a failure state if it evaluates to true.

    In your example, IF STATUS = 0 evaluates to true ==> the situation should have been different from this condition ==> failure!

  9. dmitry babitsky reporter

    Thanks Henning. It sounds like a bug to me though, not sure if in the implementation or in design. Monit's own documentation says "By convention, 0 means the program exited normally."

  10. Henning Bopp

    Monit's own documentation says "By convention, 0 means the program exited normally."

    Sure, but that refers to the exit status of the program.

    Think of Monit as an incident reporting system. So you specify incidents. In your case it is: my incident happens if my program exits with 0. Basically that means: my program should fail. So Monit's internal status is "failed" because the execution succeeded.

  11. dmitry babitsky reporter

    Thanks again, Henning. I understand.

    I think it's mightily confusing and, at a minimum, requires clearer documentation.

    Looks like this achieves the desired effect:

    # CANNOT use IF STATUS = 0; MUST always test != 0 and use ELSE IF SUCCEEDED
    
    CHECK PROGRAM TestFailure WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'" WITH TIMEOUT 5 SECONDS EVERY 1 CYCLES
            IF STATUS != 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"     REPEAT EVERY 1 CYCLES
            ELSE IF SUCCEEDED THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"  REPEAT EVERY 1 CYCLES
    
  12. dmitry babitsky reporter

    Update: for the config shown above, consecutive successes do not result in the handler being executed. Consecutive failures do. I see that this is consistent with the docs. So I’m not out of the woods yet. My goal is to have the handler executed ALWAYS, on success or failure, with the MONIT_ env variables set accordingly.

    The larger goal is to have an entry in my database that shows an up-to-date status of all monitored services.

    If the SUCCESS timestamp is not updated, it could mean that my service is still good, or that the Monit agent is dead, or that it no longer monitors that service for whatever reason. This is why I want continuous heartbeats for each of my services.

    How can I achieve this?

    Thanks.

    @boppy @tildeslash
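
The behaviour reported in this update can be modelled roughly as follows. This is a toy model built only from the observations in this thread, not Monit's implementation: the function name simulate_cycles and the assumed initial "succeeded" state are illustrative. It treats the failure handler as running in every failing cycle (REPEAT EVERY 1 CYCLES) and the ELSE IF SUCCEEDED handler as running only on the transition from failed to succeeded.

```shell
#!/bin/sh
# Toy model of the reported behaviour, not Monit's implementation.
# Arguments are the program's exit statuses over consecutive cycles.
simulate_cycles() (
    previous=succeeded    # assumed initial state after a start or reload
    for status in "$@"; do
        if [ "$status" -ne 0 ]; then
            # IF STATUS != 0 ... REPEAT EVERY 1 CYCLES: runs in every failing cycle
            echo "status=$status: exec FAILURE handler"
            previous=failed
        elif [ "$previous" = failed ]; then
            # ELSE IF SUCCEEDED: runs once, on recovery
            echo "status=$status: exec SUCCESS handler"
            previous=succeeded
        else
            echo "status=$status: no handler"
        fi
    done
)

simulate_cycles 0 1 1 0 0
# prints:
#   status=0: no handler
#   status=1: exec FAILURE handler
#   status=1: exec FAILURE handler
#   status=0: exec SUCCESS handler
#   status=0: no handler
```

Under this model a service that never fails never triggers the success handler, which is why the configuration cannot serve as a heartbeat on its own.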

  13. Clive Lawrence

    I'm adding a comment to say that I have the same use case and "issue" as Dmitry. I have a healthcheck script that returns either a zero exit code (indicating a successful healthcheck) or a non-zero exit code for a healthcheck failure. I want Monit to run one script to set a "Healthy" status in a service registry every time the exit code is zero, and another script when the exit code is non-zero.

    I've split this into two services as above, with the outcome being that Monit considers one of the two monitors to be in a "failure" state, and it executes my "healthy_update" script each time, which is exactly what I want except for the web status page showing it as "status failed". I tried using the "else if succeeded" logic to execute the "healthy" script, but that only triggers on a change of state. It isn't triggered when Monit first starts or reloads, and I need at least one "healthy" update to be fired into the service registry to update the initial "unhealthy" state that is the default on instance registration.

    It would be great if there was a way to tell Monit that a script was actually an "if success" test rather than an "if failed" one, or to specify in a "check program" monitor that a particular exit code is to be considered a "success" while still taking the action.

    On a related note, I had to override the alert with "alert mail@address not {status}" so I don't get an email every 30 seconds when Monit thinks it has detected a failure and runs my "healthy" script.
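
For the registry and heartbeat use case raised in the last two comments, the handler side might look like the sketch below. It is an illustration under stated assumptions, not a confirmed recipe: heartbeat_line and HEARTBEAT_FILE are hypothetical names, and the file stands in for the real database or service registry. The one idea it shows is to derive health from MONIT_PROGRAM_STATUS rather than MONIT_EVENT, because the logs earlier in this thread show MONIT_EVENT=Status failed even when the exit code is 0. It does not change when Monit invokes the handler; that still depends on the rules discussed above.

```shell
#!/bin/sh
# Hypothetical handler sketch. HEARTBEAT_FILE stands in for the real database
# or service registry; heartbeat_line is an illustrative helper name.
heartbeat_line() {
    # Trust the exit code; in the logs above MONIT_EVENT says "Status failed"
    # for exit code 0 as well.
    if [ "${MONIT_PROGRAM_STATUS:-1}" -eq 0 ]; then
        state=healthy
    else
        state=unhealthy
    fi
    # One tab-separated record: host, service, state, UTC timestamp.
    printf '%s\t%s\t%s\t%s\n' \
        "${MONIT_HOST:-unknown}" "${MONIT_SERVICE:-unknown}" \
        "$state" "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
}

# Append one record per invocation; the default path is an assumption.
heartbeat_line >> "${HEARTBEAT_FILE:-/tmp/monit_heartbeat.tsv}"
```

A missing MONIT_PROGRAM_STATUS is treated as unhealthy, so a handler run outside Monit never records a false "healthy" entry.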

  14. Sébastien Kirche

    I second Dmitry's questions about the behavior of program test actions. In my case, I need to monitor changes of state of a web service (which uses NTLM authentication, BTW) on a different server. So I wrote a custom script that returns 0 or 1, and this is correctly interpreted by Monit.

    But I also need another specific notification in addition to the one by mail. I therefore defined 2 additional EXEC actions based on the status returned by the program, and that part is broken.

    This is counter-intuitive given what is described in the manual:

    Multiple status tests can be used, for example:
    
    check program hwtest with path /usr/local/bin/hwtest.sh
          with timeout 500 seconds
          if status = 1 then alert
          if status = 3 for 5 cycles then exec "/usr/local/bin/emergency.sh"
    

    But it is consistent with the code, which considers the status of the service "failed" as soon as any condition matches (wait, what?) (see validate.c:1632):

                    // Evaluate program's exit status against our status checks.
                    const char *output = StringBuffer_length(s->program->inprogressOutput) ? StringBuffer_toString(s->program->inprogressOutput) : "no output";
                    for (Status_T status = s->statuslist; status; status = status->next) {
                            if (status->operator == Operator_Changed) {
                                    if (status->initialized) {
                                            if (Util_evalQExpression(status->operator, s->program->exitStatus, status->return_value)) {
                                                    Event_post(s, Event_Status, State_Changed, status->action, "status changed (%d -> %d) -- %s", status->return_value, s->program->exitStatus, output);
                                                    status->return_value = s->program->exitStatus;
                                            } else {
                                                    Event_post(s, Event_Status, State_ChangedNot, status->action, "status didn't change (%d) -- %s", s->program->exitStatus, output);
                                            }
                                    } else {
                                            status->initialized = true;
                                            status->return_value = s->program->exitStatus;
                                    }
                            } else {
                                    if (Util_evalQExpression(status->operator, s->program->exitStatus, status->return_value)) {
    /* there ===> */                        rv = State_Failed;
                                            Event_post(s, Event_Status, State_Failed, status->action, "status failed (%d) -- %s", s->program->exitStatus, output);
                                    } else {
                                            Event_post(s, Event_Status, State_Succeeded, status->action, "status succeeded (%d) -- %s", s->program->exitStatus, output);
                                    }
                            }
                    }
    

    It would make sense to me that

    • the status of a service is deduced from its returned value
    • we could trigger an action depending on a check on that return value
    • the status would not be changed by the conditional action (unlike what we see in the log: succeeded (0) and failed (0) at the same time).

    Definition of my service:

    check program CROC with path /home/kirchse/dev/monit/script/check_croc.sh
        with timeout 15 seconds  
        if status = 0 then exec "/bin/bash -c 'custom_up.sh'  "
        if status = 1 for 5 cycles then exec "/bin/bash -c 'custom_down.sh' "
    alert team@mydomain
    

    Verbose log

    Adding 'allow localhost' -- host resolved to [::1]
    Adding 'allow localhost' -- host resolved to [::ffff:127.0.0.1]
    Adding 'allow mymachine' -- host resolved to [::ffff:123.456.78.90]
    Runtime constants:
     Control file       = /home/kirchse/dev/monit/etc/monitrc_debug
     Log file           = /home/kirchse/dev/monit/log/monit.log
     Pid file           = /home/kirchse/dev/monit/var/monit.pid
     Id file            = /home/kirchse/dev/monit/var/monit.id
     State file         = /home/kirchse/dev/monit/var/monit.state
     Debug              = True
     Log                = True
     Use syslog         = False
     Is Daemon          = True
     Use process engine = True
     Limits             = {
                        =   programOutput:     512 B
                        =   sendExpectBuffer:  256 B
                        =   fileContentBuffer: 512 B
                        =   httpContentBuffer: 1 MB
                        =   networkTimeout:    5 s
                        =   programTimeout:    5 m
                        =   stopTimeout:       30 s
                        =   startTimeout:      30 s
                        =   restartTimeout:    30 s
                        = }
     On reboot          = start
     Poll time          = 10 seconds with start delay 0 seconds
     Mail server(s)     = localhost:25 with timeout 30 s
     Mail from          = Automation <toolsteam@mydomain>
     Mail subject       = monit alert --  $event $service
     Mail message       = service: $service - ..(truncated)
     Start monit httpd  = True
     httpd bind address = Any/All
     httpd portnumber   = 2812
     httpd signature    = Enabled
     httpd auth. style  = Host/Net allow list
    
    The service list contains the following entries:
    
    Program Name          = CROC
     Path                 = /home/kirchse/dev/monit/script/check_croc.sh
     Monitoring mode      = active
     On reboot            = start
     Program timeout      = terminate the program if not finished within 15 s
     Status               = if exit value = 1 for 5 cycles then exec '/bin/bash -c custom_ok.sh'
     Status               = if exit value = 0 then exec '/bin/bash -c custom_nok.sh'
     Alert mail to        = team@mydomain
       Alert on           = All events
    
    System Name           = mymachine
     Monitoring mode      = active
     On reboot            = start
    
    -------------------------------------------------------------------------------
    pidfile '/home/kirchse/dev/monit/var/monit.pid' does not exist
    Starting Monit 5.25.3 daemon with http interface at [*]:2812
    Starting Monit HTTP server at [*]:2812
    Monit HTTP server started
    'mymachine' Monit 5.25.3 started
    
    Cannot open proc file '/proc/1/io' -- Permission denied
    Cannot read proc file '/proc/1/attr/current' -- Invalid argument
    [ skipping /proc access lines because not root ]
    'CROC' program started
    Cannot open proc file '/proc/1/io' -- Permission denied
    Cannot read proc file '/proc/1/attr/current' -- Invalid argument
    [ skipping /proc access lines because not root ]
    'CROC' status succeeded (0) -- no output
    'CROC' status failed (0) -- no output
    -------------------------------------------------------------------------------
        /home/kirchse/dev/monit/monit-5.25.3/monit() [0x42581a]
        /home/kirchse/dev/monit/monit-5.25.3/monit(LogError+0xd3) [0x425e2f]
        /home/kirchse/dev/monit/monit-5.25.3/monit() [0x420887]
        /home/kirchse/dev/monit/monit-5.25.3/monit(Event_post+0x47e) [0x420e01]
        /home/kirchse/dev/monit/monit-5.25.3/monit(check_program+0x3f2) [0x43d8a7]
        /home/kirchse/dev/monit/monit-5.25.3/monit(validate+0x105) [0x43bf7f]
        /home/kirchse/dev/monit/monit-5.25.3/monit() [0x41c58d]
        /home/kirchse/dev/monit/monit-5.25.3/monit() [0x41bbaa]
        /home/kirchse/dev/monit/monit-5.25.3/monit(main+0x7e) [0x41b5b6]
        /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f1514416413]
        /home/kirchse/dev/monit/monit-5.25.3/monit(_start+0x2e) [0x40d06e]
    -------------------------------------------------------------------------------
    Sending Status failed notification to team@mydomain
    Trying to send mail via localhost:25
    'CROC' exec: '/bin/bash -c echo OK'
    'CROC' program started
    OK
    
  15. Lutz Mader

    Hello Sébastien,
    Maybe I'm wrong, but I use similar scripts to report problems to a central information system. Based on your sample, a useful check looks like this:

    check program CROC with path /home/kirchse/dev/monit/script/check_croc.sh
        with timeout 15 seconds  
        if status != 0 then exec "/bin/bash -c 'custom_down.sh'"
           else if succeeded then exec "/bin/bash -c 'custom_up.sh'"
        if status = 1 for 5 cycles then exec "/bin/bash -c 'custom_down.sh'"
    alert team@mydomain
    

    Unfortunately, Monit knows "bad" alerts only; there is no way to define "good" alerts directly.
    But with the "else if succeeded then" statement I can define a "good" alert indirectly.

    All return codes not equal to zero are “bad” alerts (exec custom_down.sh), but you get a “good” alert for zero as well (exec custom_up.sh).

    A suggestion only,
    Lutz
