monit reports wrong status

dmitry babitsky reporter

attached monit status bug.jpg

2019-01-23T19:39:16+00:00

Tildeslash repo owner

The test works as designed:

The "if changed" test triggers any time the value changes. The new value becomes the new baseline: the test turns to "OK" in the next cycle and remains OK until the value changes again.

In your case the "if changed" test should trigger when the status changes from 0->1. If the value is still 1 in the next cycle, the status is set to OK (as the value remains 1 => no change occurred). When the value changes from 1 to something else, the test will match again.
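
The baseline behaviour described in this comment can be modelled in a few lines of shell. This is only a sketch of the semantics as explained here, not Monit's code; the function name `changed_test` and the variable `baseline` are made up for illustration.

```shell
#!/bin/sh
# Model of an "if changed" status test: the first value only sets the
# baseline; afterwards the test matches whenever the value differs from
# the baseline, and the new value then becomes the baseline.
baseline=""

changed_test() {
    # $1 = exit status observed in this cycle
    if [ -z "$baseline" ]; then
        baseline=$1
        echo "initialized ($1)"
    elif [ "$1" -ne "$baseline" ]; then
        echo "changed ($baseline -> $1)"
        baseline=$1
    else
        echo "ok ($1)"
    fi
}

# Four cycles with exit statuses 0, 1, 1, 2
for status in 0 1 1 2; do
    changed_test "$status"
done
```

Run as written, the loop prints `initialized (0)`, `changed (0 -> 1)`, `ok (1)` and `changed (1 -> 2)`: the test matches only in the cycle where the value moves away from the baseline.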

2019-01-29T11:47:53+00:00

Tildeslash repo owner

changed status to closed

2019-01-29T11:52:15+00:00

dmitry babitsky reporter

changed status to open

The test is not "IF Changed". The test is "IF status != 100".

Please look at the picture.

2019-01-29T13:21:11+00:00

Tildeslash repo owner

Can you please run Monit in debug mode and attach the output?

1.) stop monit

2.) start it in debug mode: monit -vI

2019-01-29T14:10:27+00:00

dmitry babitsky reporter

attached monit wrong status.png

2019-01-29T16:58:11+00:00

dmitry babitsky reporter

This CHECK PROGRAM generates random successes and failures, using bash's $RANDOM to trigger divide by 0.

As you can see, in case of SUCCESS, MONIT_EVENT still shows 'Status failed'.

In fact, it always shows this status. The UI paints accordingly.

Relevant configuration:
CHECK PROGRAM NodeManager WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
  IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"
  REPEAT EVERY 1 CYCLES

  IF STATUS !=0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"
  REPEAT EVERY 1 CYCLES

[dab@d129668-005:/home/dab] /home/dab/scripts/monit -c /home/dab/UserDirs/dba/dev/utils/monit.conf -v | egrep -v 'proc file|system statistic|M/Monit'
M/Monit enabled but no httpd allowed -- please add 'set httpd' statement
Runtime constants:
 Control file       = /home/dab/UserDirs/dba/dev/utils/monit.conf
 Log file           = (not defined)
 Pid file           = /home/dab/.monit.pid
 Id file            = /local/data/scratch/monit.id
 State file         = /local/data/scratch/monit.state
 Debug              = True
 Log                = False
 Use syslog         = False
 Is Daemon          = True
 Use process engine = True
 Limits             = {
                    =   programOutput:     512 B
                    =   sendExpectBuffer:  256 B
                    =   fileContentBuffer: 512 B
                    =   httpContentBuffer: 1 kB
                    =   networkTimeout:    5 s
                    =   programTimeout:    30 s
                    =   stopTimeout:       10 s
                    =   startTimeout:      10 s
                    =   restartTimeout:    10 s
                    = }
 On reboot          = start
 Poll time          = 3 seconds with start delay 0 seconds
 Mail from          = Monit Support <monit@foo.bar>
 Mail reply to      = support@domain.com
 Mail subject       = $SERVICE $EVENT at $DATE
 Mail message       = Monit $ACTION $SERVI..(truncated)
 Start monit httpd  = False

The service list contains the following entries:

System Name           = d341369-005
 Monitoring mode      = active
 On reboot            = start
 Every                = Check service every 1 cycles

Program Name          = NodeManager
 Path                 = /bin/bash -c echo $(( 1/($RANDOM%2) ))
 Monitoring mode      = active
 On reboot            = start
 Program timeout      = terminate the program if not finished within 30 s
 Status               = if exit value != 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo FAILURE;echo)' repeat every 1 cycle(s)
 Status               = if exit value = 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)' repeat every 1 cycle(s)

-------------------------------------------------------------------------------
pidfile '/home/dab/.monit.pid' does not exist
Starting Monit 5.25.3 daemon
'd341369-005' Monit 5.25.3 started
'NodeManager' program started
'NodeManager' status succeeded (0) -- 1
'NodeManager' status failed (0) -- 1
'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
MONIT_PROGRAM_STATUS=0
MONIT_DATE=Tue, 29 Jan 2019 11:47:44
MONIT_HOST=d341369-005
MONIT_EVENT=Status failed
MONIT_SERVICE=NodeManager
MONIT_DESCRIPTION=status failed (0) -- 1
SUCCESS

'NodeManager' program started
'NodeManager' status failed (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo FAILURE;echo)'
'NodeManager' status succeeded (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
'NodeManager' program started
MONIT_PROGRAM_STATUS=1
MONIT_DATE=Tue, 29 Jan 2019 11:47:47
MONIT_HOST=d341369-005
MONIT_EVENT=Status failed
MONIT_SERVICE=NodeManager
MONIT_DESCRIPTION=status failed (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
FAILURE

'NodeManager' status succeeded (0) -- 1
'NodeManager' status failed (0) -- 1
'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
MONIT_PROGRAM_STATUS=0
MONIT_DATE=Tue, 29 Jan 2019 11:47:50
MONIT_HOST=d341369-005
MONIT_EVENT=Status failed
MONIT_SERVICE=NodeManager
MONIT_DESCRIPTION=status failed (0) -- 1
SUCCESS

'NodeManager' program started
'NodeManager' status succeeded (0) -- 1
'NodeManager' status failed (0) -- 1
'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
MONIT_PROGRAM_STATUS=0
MONIT_DATE=Tue, 29 Jan 2019 11:47:53
MONIT_HOST=d341369-005
MONIT_EVENT=Status failed
MONIT_SERVICE=NodeManager
MONIT_DESCRIPTION=status failed (0) -- 1
SUCCESS

2019-01-29T17:10:11+00:00

Tildeslash repo owner

The problem in the NodeManager example is a configuration issue: each "if status" test is a standalone statement, not an "if-else" branch.

Monit thus gets the status and evaluates each rule independently. If the first rule matches, it sets the error state; then the second rule is evaluated and resets the error (as it is the negation of the first rule).

If you need to have both tests without conflict, you should split the configuration into two services:
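
The independent evaluation described in this reply can be modelled with a small shell sketch. It is an illustration only, not Monit's implementation: the function names `eval_rule` and `evaluate_rules` are hypothetical, and the two rules mirror the NodeManager configuration in the order Monit lists them in the debug output.

```shell
#!/bin/sh
# Model of how each "if status" rule is evaluated on its own.
# A rule that matches posts "failed"; a rule that does not match posts
# "succeeded". There is no if/else relationship between the rules.

eval_rule() {
    # $1 = operator ("=" or "!="), $2 = rule value, $3 = program exit status
    case "$1" in
        "=")  [ "$3" -eq "$2" ] ;;
        "!=") [ "$3" -ne "$2" ] ;;
        *)    echo "unsupported operator: $1" >&2; return 2 ;;
    esac
}

evaluate_rules() {
    # $1 = program exit status
    for op in "!=" "="; do
        if eval_rule "$op" 0 "$1"; then
            echo "rule 'status $op 0': status failed ($1)"
        else
            echo "rule 'status $op 0': status succeeded ($1)"
        fi
    done
}

evaluate_rules 0
evaluate_rules 1
```

For exit status 0 the sketch prints a "succeeded" line followed by a "failed" line, the same pairing that appears in the debug log (`status succeeded (0)` then `status failed (0)`); for exit status 1 the order is reversed.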

CHECK PROGRAM NodeManager_zero WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
         IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"

CHECK PROGRAM NodeManager_nonzero WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
         IF STATUS !=0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"

2019-01-29T20:49:16+00:00

dmitry babitsky reporter

This will show two rows in the UI instead of one. It will also probably show 2 rows per host per Test in the M/Monit UI. Is this correct?

We are evaluating M/Monit for company-wide use. That's a lot of hosts and checks, so creating a separate config entry for each potential return code doesn't sound feasible.

I also tried: IF STATUS !=100, to handle both 0s and 1s, but this also produced the same result.

My goal is to handle different statuses of the same check differently. What's the best way to accomplish this?

Thanks for your help.

2019-01-29T23:21:38+00:00

dmitry babitsky reporter

Actually, forget multiple tests, your suggestion doesn't work even in the simplest single test. Note how it says 'status failed' on success.

CHECK PROGRAM TestSuccess WITH PATH "/bin/true" 
       IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'" 
       REPEAT EVERY 1 CYCLES

Program Name          = TestSuccess
 Path                 = /bin/true
 Monitoring mode      = active
 On reboot            = start
 Program timeout      = terminate the program if not finished within 5 s
 Status               = if exit value = 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)' repeat every 1 cycle(s)
 Every                = Check service every 1 cycles

Starting Monit 5.25.3 daemon
'd341369-005' Monit 5.25.3 started
'TestSuccess' program started
'TestSuccess' status failed (0) -- no output
'TestSuccess' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
'TestSuccess' program started
MONIT_PROGRAM_STATUS=0
MONIT_DATE=Wed, 30 Jan 2019 10:15:23
MONIT_HOST=d341369-005
MONIT_EVENT=Status failed
MONIT_SERVICE=TestSuccess
MONIT_DESCRIPTION=status failed (0) -- no output
SUCCESS

2019-01-30T17:49:57+00:00

Henning Bopp

I think that Monit "thinks" differently: if you define an event and it gets triggered, it is meant to be "a failure". So every IF results in a failure state if it evaluates to true:

In your example, IF STATUS = 0 evaluates to true ==> Monit assumes the normal situation is the opposite of this condition ==> failure!

2019-01-30T21:26:51+00:00

dmitry babitsky reporter

Thanks Henning. It sounds like a bug to me though, not sure if in the implementation or in design. Monit's own documentation says "By convention, 0 means the program exited normally."

2019-01-30T23:08:59+00:00

Henning Bopp

Monit's own documentation says "By convention, 0 means the program exited normally."

sure, but that refers to the exit status of the program.

Think of monit as an incident reporting system. So you specify incidents. In your case it is: My incident happens, if my program exits with 0. Basically that means: My program should fail. So the monit internal status is failed, because the execution succeeded.

2019-01-31T20:15:14+00:00

dmitry babitsky reporter

thanks again Henning. I understand.

I think that it's mightily confusing, and at a minimum requires clarified documentation.

Looks like this achieves the desired effect:

#CANNOT do IF STATUS = 0, MUST always compare to !0, and ELSE IF SUCCEEDED

CHECK PROGRAM TestFailure WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'" WITH TIMEOUT 5 SECONDS EVERY 1 CYCLES
        IF STATUS != 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"     REPEAT EVERY 1 CYCLES
        ELSE IF SUCCEEDED THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"  REPEAT EVERY 1 CYCLES

2019-01-31T22:32:40+00:00

dmitry babitsky reporter

Update: for the config shown above, consecutive successes do not result in the handler being executed. Consecutive failures do. I see that this is consistent with the docs. So, I’m not out of the woods yet. My goal is to have the handler executed ALWAYS, success or failure, with the MONIT_ env variables set accordingly.
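
The behaviour reported in this update can be modelled as a short simulation. This is a sketch of what was observed in this thread, not of Monit's internals: the function name `simulate` is hypothetical, and the initial OK state is an assumption based on the reports here that no success handler fires on startup.

```shell
#!/bin/sh
# Model of the behaviour reported for
#   IF STATUS != 0 THEN EXEC <failure handler> REPEAT EVERY 1 CYCLES
#   ELSE IF SUCCEEDED THEN EXEC <success handler>
# The failure handler runs on every failing cycle; the success handler
# runs only in the cycle where the test recovers.
simulate() {
    # arguments: exit status of the checked program, one per cycle
    prev_failed=0            # assumption: the test starts in the OK state
    cycle=0
    for status in "$@"; do
        cycle=$((cycle + 1))
        if [ "$status" -ne 0 ]; then
            echo "cycle $cycle: FAILURE handler"
            prev_failed=1
        elif [ "$prev_failed" -eq 1 ]; then
            echo "cycle $cycle: SUCCESS handler"
            prev_failed=0
        else
            echo "cycle $cycle: no handler"
        fi
    done
}

simulate 0 0 1 1 0 0
```

With the sequence `0 0 1 1 0 0`, only cycles 3 and 4 run the failure handler and only cycle 5 runs the success handler; cycles 1, 2 and 6 run nothing, which is why a SUCCESS timestamp driven by this rule would go stale while the service stays healthy.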

The larger goal is to have an entry in my database that shows an up-to-date status of all monitored services.

If the SUCCESS timestamp is not updated, it could mean that my service is still good, or that the Monit agent is dead, or that it no longer monitors that service for whatever reason. That is why I want continuous heartbeats for each of my services.

How to achieve this?

Thanks.

@boppy @tildeslash
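
In the debug output earlier in this thread, MONIT_PROGRAM_STATUS carried the program's real exit status even when MONIT_EVENT read 'Status failed', and a rule with REPEAT EVERY 1 CYCLES ran its handler in every matching cycle. One possible approach is therefore a handler that ignores MONIT_EVENT and derives health from MONIT_PROGRAM_STATUS. The following is a minimal sketch under those assumptions: the function name `heartbeat_line`, the variable `HEARTBEAT_FILE`, its default path and the record layout are all hypothetical.

```shell
#!/bin/sh
# Hypothetical heartbeat handler for a Monit EXEC action.
# Health is derived from MONIT_PROGRAM_STATUS (the checked program's exit
# status), not from MONIT_EVENT, which reads "Status failed" for any rule
# that matched.
heartbeat_line() {
    # A missing MONIT_PROGRAM_STATUS is treated as unhealthy.
    if [ "${MONIT_PROGRAM_STATUS:-1}" -eq 0 ]; then
        state="HEALTHY"
    else
        state="UNHEALTHY"
    fi
    printf '%s|%s|%s|%s\n' \
        "${MONIT_HOST:-unknown}" "${MONIT_SERVICE:-unknown}" \
        "$state" "${MONIT_DATE:-unknown}"
}

# HEARTBEAT_FILE is an assumed destination; the append could be replaced
# by a database insert or a service-registry call.
heartbeat_line >> "${HEARTBEAT_FILE:-/tmp/monit-heartbeat.log}"
```

The service would still be displayed as failed whenever the rule that calls this handler matches, so the sketch addresses only the record written by the handler, not the status shown in the UI.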

2019-02-04T16:16:28+00:00

Clive Lawrence

I'm adding a comment to say that I have the same use case and "issue" as Dmitry. I have a healthcheck script that returns either a zero exit code (indicating a successful healthcheck) or a non-zero exit code for a healthcheck failure. I want Monit to run one script to set a "Healthy" status in a service registry every time the exit code is zero, and another script to exec when the exit code is non-zero.

I've split this into two services as above, with the outcome being that Monit considers one of the two monitors to be in a "failure" state, and it executes my "healthy_update" script each time, which is exactly what I want except for the web status page showing it as "status failed".

I tried using the "else if succeeded" logic to execute the "healthy" script but that only triggers on a change of state. It isn't triggered when Monit first starts or reloads, and I need at least one "healthy" update to be fired in to the Service Registry to update the initial "unhealthy" state that is the default on instance registration in the service registry.

It would be great if there was a way to tell Monit that a script was actually an "if success" test, rather than the "if failed", or to have a way to specify in a "check program" type of monitor that a particular exit code is actually to be considered a "success" but still take the action.

On a related note I had to override the "alert mail@address not {status}" so I don't get an email every 30 seconds when Monit thinks it's detected a failure and runs my "healthy" script.

2019-03-28T12:23:47+00:00

Sébastien Kirche

I second Dmitry's questions about the behavior of program test actions. In my case, I need to monitor the change of state of a web service (that uses NTLM authentication, BTW) on a different server. So I made a custom script that returns 0 or 1, and this is correctly interpreted by Monit.

But I also need another specific notification in addition to the one by mail. I therefore defined 2 additional EXEC actions based on the program's returned status, and that part is broken.

This is counter-intuitive given what the manual describes:

Multiple status tests can be used, for example:

check program hwtest with path /usr/local/bin/hwtest.sh
      with timeout 500 seconds
      if status = 1 then alert
      if status = 3 for 5 cycles then exec "/usr/local/bin/emergency.sh"

But it is consistent with the code, which sets the status of the service to "failed" as soon as any condition matches (wait, what?) (see validate.c:1632):

                // Evaluate program's exit status against our status checks.
                const char *output = StringBuffer_length(s->program->inprogressOutput) ? StringBuffer_toString(s->program->inprogressOutput) : "no output";
                for (Status_T status = s->statuslist; status; status = status->next) {
                        if (status->operator == Operator_Changed) {
                                if (status->initialized) {
                                        if (Util_evalQExpression(status->operator, s->program->exitStatus, status->return_value)) {
                                                Event_post(s, Event_Status, State_Changed, status->action, "status changed (%d -> %d) -- %s", status->return_value, s->program->exitStatus, output);
                                                status->return_value = s->program->exitStatus;
                                        } else {
                                                Event_post(s, Event_Status, State_ChangedNot, status->action, "status didn't change (%d) -- %s", s->program->exitStatus, output);
                                        }
                                } else {
                                        status->initialized = true;
                                        status->return_value = s->program->exitStatus;
                                }
                        } else {
                                if (Util_evalQExpression(status->operator, s->program->exitStatus, status->return_value)) {
/* there ===> */                        rv = State_Failed;
                                        Event_post(s, Event_Status, State_Failed, status->action, "status failed (%d) -- %s", s->program->exitStatus, output);
                                } else {
                                        Event_post(s, Event_Status, State_Succeeded, status->action, "status succeeded (%d) -- %s", s->program->exitStatus, output);
                                }
                        }
                }

It would make sense to me that:

1.) the status of a service is deduced from its returned value
2.) we could trigger an action depending on a check on that return value
3.) the status won't be changed by the conditional action (unlike what we see in the log: succeeded (0) and failed (0) at the same time).

Definition of my service:

check program CROC with path /home/kirchse/dev/monit/script/check_croc.sh
    with timeout 15 seconds  
    if status = 0 then exec "/bin/bash -c 'custom_up.sh'  "
    if status = 1 for 5 cycles then exec "/bin/bash -c 'custom_down.sh' "
alert team@mydomain

Verbose log

Adding 'allow localhost' -- host resolved to [::1]
Adding 'allow localhost' -- host resolved to [::ffff:127.0.0.1]
Adding 'allow mymachine' -- host resolved to [::ffff:123.456.78.90]
Runtime constants:
 Control file       = /home/kirchse/dev/monit/etc/monitrc_debug
 Log file           = /home/kirchse/dev/monit/log/monit.log
 Pid file           = /home/kirchse/dev/monit/var/monit.pid
 Id file            = /home/kirchse/dev/monit/var/monit.id
 State file         = /home/kirchse/dev/monit/var/monit.state
 Debug              = True
 Log                = True
 Use syslog         = False
 Is Daemon          = True
 Use process engine = True
 Limits             = {
                    =   programOutput:     512 B
                    =   sendExpectBuffer:  256 B
                    =   fileContentBuffer: 512 B
                    =   httpContentBuffer: 1 MB
                    =   networkTimeout:    5 s
                    =   programTimeout:    5 m
                    =   stopTimeout:       30 s
                    =   startTimeout:      30 s
                    =   restartTimeout:    30 s
                    = }
 On reboot          = start
 Poll time          = 10 seconds with start delay 0 seconds
 Mail server(s)     = localhost:25 with timeout 30 s
 Mail from          = Automation <toolsteam@mydomain>
 Mail subject       = monit alert --  $event $service
 Mail message       = service: $service - ..(truncated)
 Start monit httpd  = True
 httpd bind address = Any/All
 httpd portnumber   = 2812
 httpd signature    = Enabled
 httpd auth. style  = Host/Net allow list

The service list contains the following entries:

Program Name          = CROC
 Path                 = /home/kirchse/dev/monit/script/check_croc.sh
 Monitoring mode      = active
 On reboot            = start
 Program timeout      = terminate the program if not finished within 15 s
 Status               = if exit value = 1 for 5 cycles then exec '/bin/bash -c custom_ok.sh'
 Status               = if exit value = 0 then exec '/bin/bash -c custom_nok.sh'
 Alert mail to        = team@mydomain
   Alert on           = All events

System Name           = mymachine
 Monitoring mode      = active
 On reboot            = start

-------------------------------------------------------------------------------
pidfile '/home/kirchse/dev/monit/var/monit.pid' does not exist
Starting Monit 5.25.3 daemon with http interface at [*]:2812
Starting Monit HTTP server at [*]:2812
Monit HTTP server started
'mymachine' Monit 5.25.3 started

Cannot open proc file '/proc/1/io' -- Permission denied
Cannot read proc file '/proc/1/attr/current' -- Invalid argument
[ skipping /proc access lines because not root ]
'CROC' program started
Cannot open proc file '/proc/1/io' -- Permission denied
Cannot read proc file '/proc/1/attr/current' -- Invalid argument
[ skipping /proc access lines because not root ]
'CROC' status succeeded (0) -- no output
'CROC' status failed (0) -- no output
-------------------------------------------------------------------------------
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x42581a]
    /home/kirchse/dev/monit/monit-5.25.3/monit(LogError+0xd3) [0x425e2f]
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x420887]
    /home/kirchse/dev/monit/monit-5.25.3/monit(Event_post+0x47e) [0x420e01]
    /home/kirchse/dev/monit/monit-5.25.3/monit(check_program+0x3f2) [0x43d8a7]
    /home/kirchse/dev/monit/monit-5.25.3/monit(validate+0x105) [0x43bf7f]
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x41c58d]
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x41bbaa]
    /home/kirchse/dev/monit/monit-5.25.3/monit(main+0x7e) [0x41b5b6]
    /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f1514416413]
    /home/kirchse/dev/monit/monit-5.25.3/monit(_start+0x2e) [0x40d06e]
-------------------------------------------------------------------------------
Sending Status failed notification to team@mydomain
Trying to send mail via localhost:25
'CROC' exec: '/bin/bash -c echo OK'
'CROC' program started
OK

2019-04-08T16:27:08+00:00

Lutz Mader

Hello Sébastien,
Maybe I'm wrong, but I use similar scripts to report problems via a central information system. Based on your sample, a useful check looks like this one.

check program CROC with path /home/kirchse/dev/monit/script/check_croc.sh
    with timeout 15 seconds  
    if status != 0 then exec "/bin/bash -c 'custom_down.sh'"
       else if succeeded then exec "/bin/bash -c 'custom_up.sh'"
    if status = 1 for 5 cycles then exec "/bin/bash -c 'custom_down.sh'"
alert team@mydomain

Unfortunately, Monit knows only "bad" alerts; there is no way to define "good" alerts directly.
But with the "else if succeeded then" statement I can define a "good" alert indirectly.

All return codes not equal to zero are “bad” alerts (exec custom_down.sh), but you get a “good” alert for zero as well (exec custom_up.sh).

A suggestion only,
Lutz

2019-06-10T09:38:21+00:00
