- attached monit status bug.jpg
monit reports wrong status
monit version 5.25.3 (this form doesn't have this choice)
The program returns 0 (success), but monit considers it a failure, and vice versa.
Please see the picture, it's worth a 1000 words.
Also note that I simplified the config compared to what's shown in the picture to this:
IF STATUS != 100 THEN EXEC ...
Comments (18)
-
reporter -
repo owner The test works as designed:
The "if changed" test will trigger any time the value changes. The new value becomes the new baseline: the test will turn to "OK" in the next cycle and remain OK until the value changes again.
In your case the "if changed" test should trigger when the status changes from 0 to 1. If the value is still 1 in the next cycle, the status is changed to OK (as the value remains 1 => no change occurred). When the value changes from 1 to something else, the test will match again.
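The baseline behaviour described above can be sketched in a few lines of Python. This is only a model of the explanation, not Monit's code; the function name and the event labels are made up for illustration, and the assumption that the first cycle merely records a baseline is taken from the `initialized` flag in the validate.c excerpt quoted later in this thread.

```python
def if_changed_events(exit_codes):
    """Model of an "if changed" rule over a sequence of exit codes (illustrative only)."""
    baseline = None  # no baseline until the first cycle has run
    events = []
    for code in exit_codes:
        if baseline is None:
            baseline = code              # first cycle only records the value
            events.append("initialized")
        elif code != baseline:
            events.append("changed")     # rule matches, the action would run
            baseline = code              # the new value becomes the baseline
        else:
            events.append("ok")          # same value as the previous cycle
    return events


print(if_changed_events([0, 1, 1, 0]))
# ['initialized', 'changed', 'ok', 'changed']
```

The second `1` yields "ok" because the baseline was already moved to 1 by the previous cycle.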
-
repo owner - changed status to closed
-
reporter - changed status to open
The test is not "IF Changed". The test is "IF status != 100".
Please look at the picture.
-
repo owner Can you please run monit in debug mode and attach the output?
1.) stop monit
2.) start it in debug mode: monit -vI
-
reporter - attached monit wrong status.png
-
reporter This CHECK PROGRAM generates random successes and failures, using bash's $RANDOM to trigger divide by 0.
As you see, in case of SUCCESS, MONIT_EVENT still shows 'Status failed'
In fact, it always shows this status. The UI paints accordingly.
Relevant configuration:

    CHECK PROGRAM NodeManager WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
        IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"
            REPEAT EVERY 1 CYCLES

        IF STATUS !=0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"
            REPEAT EVERY 1 CYCLES

    [dab@d129668-005:/home/dab] /home/dab/scripts/monit -c /home/dab/UserDirs/dba/dev/utils/monit.conf -v | egrep -v 'proc file|system statistic|M/Monit'
    M/Monit enabled but no httpd allowed -- please add 'set httpd' statement
    Runtime constants:
     Control file = /home/dab/UserDirs/dba/dev/utils/monit.conf
     Log file = (not defined)
     Pid file = /home/dab/.monit.pid
     Id file = /local/data/scratch/monit.id
     State file = /local/data/scratch/monit.state
     Debug = True
     Log = False
     Use syslog = False
     Is Daemon = True
     Use process engine = True
     Limits = {
       = programOutput: 512 B
       = sendExpectBuffer: 256 B
       = fileContentBuffer: 512 B
       = httpContentBuffer: 1 kB
       = networkTimeout: 5 s
       = programTimeout: 30 s
       = stopTimeout: 10 s
       = startTimeout: 10 s
       = restartTimeout: 10 s
       = }
     On reboot = start
     Poll time = 3 seconds with start delay 0 seconds
     Mail from = Monit Support <monit@foo.bar>
     Mail reply to = support@domain.com
     Mail subject = $SERVICE $EVENT at $DATE
     Mail message = Monit $ACTION $SERVI..(truncated)
     Start monit httpd = False

    The service list contains the following entries:

    System Name = d341369-005
     Monitoring mode = active
     On reboot = start
     Every = Check service every 1 cycles

    Program Name = NodeManager
     Path = /bin/bash -c echo $(( 1/($RANDOM%2) ))
     Monitoring mode = active
     On reboot = start
     Program timeout = terminate the program if not finished within 30 s
     Status = if exit value != 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo FAILURE;echo)' repeat every 1 cycle(s)
     Status = if exit value = 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)' repeat every 1 cycle(s)

    -------------------------------------------------------------------------------
    pidfile '/home/dab/.monit.pid' does not exist
    Starting Monit 5.25.3 daemon
    'd341369-005' Monit 5.25.3 started
    'NodeManager' program started
    'NodeManager' status succeeded (0) -- 1
    'NodeManager' status failed (0) -- 1
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Tue, 29 Jan 2019 11:47:44
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (0) -- 1
    SUCCESS
    'NodeManager' program started
    'NodeManager' status failed (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo FAILURE;echo)'
    'NodeManager' status succeeded (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
    'NodeManager' program started
    MONIT_PROGRAM_STATUS=1
    MONIT_DATE=Tue, 29 Jan 2019 11:47:47
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (1) -- /bin/bash: 1/(5532%2) : division by 0 (error token is ") ")
    FAILURE
    'NodeManager' status succeeded (0) -- 1
    'NodeManager' status failed (0) -- 1
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Tue, 29 Jan 2019 11:47:50
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (0) -- 1
    SUCCESS
    'NodeManager' program started
    'NodeManager' status succeeded (0) -- 1
    'NodeManager' status failed (0) -- 1
    'NodeManager' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Tue, 29 Jan 2019 11:47:53
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=NodeManager
    MONIT_DESCRIPTION=status failed (0) -- 1
    SUCCESS
-
repo owner The NodeManager example problem is a configuration issue: each "if status" test is a standalone statement, not an "if-else" branch.
Monit thus gets the status and evaluates each rule independently. If the first rule matches, it sets the error state; then the second is evaluated and resets the error (as it's the negation of the first rule).
If you need to have both tests without conflict, you should split the configuration into two services:

    CHECK PROGRAM NodeManager_zero WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
        IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"

    CHECK PROGRAM NodeManager_nonzero WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
        IF STATUS !=0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"
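To make the "each rule is evaluated independently" point concrete, here is a rough Python model of one cycle. It is a sketch based on the debug log above and on the validate.c excerpt quoted further down, not Monit's implementation; `evaluate_rules`, `OPS` and the returned flag are hypothetical names. The rule order used in the example mirrors the order in which the two "Status =" lines appear in the service list.

```python
import operator

# Only the two comparison operators used in this thread are modelled.
OPS = {"=": operator.eq, "!=": operator.ne}


def evaluate_rules(exit_code, rules):
    """Evaluate every (operator, value) rule against one exit code, in order.

    Returns (events, any_rule_matched). A matching rule is reported as
    "status failed", a non-matching one as "status succeeded".
    """
    events = []
    any_rule_matched = False
    for op, value in rules:
        if OPS[op](exit_code, value):
            events.append(f"status failed ({exit_code})")
            any_rule_matched = True
        else:
            events.append(f"status succeeded ({exit_code})")
    return events, any_rule_matched


rules = [("!=", 0), ("=", 0)]  # the two NodeManager tests
print(evaluate_rules(0, rules))
# (['status succeeded (0)', 'status failed (0)'], True)
print(evaluate_rules(1, rules))
# (['status failed (1)', 'status succeeded (1)'], True)
```

Because the two rules are negations of each other, every cycle produces one "failed" and one "succeeded" event for the same exit code, which is the pattern visible in the log.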
-
reporter This will show two rows in the UI instead of one. It will also probably show 2 rows per host per Test in the M/Monit UI. Is this correct?
We are evaluating M/Monit for company-wide use. That's a lot of hosts and checks, so creating a separate config entry for each potential return code doesn't sound feasible.
I also tried: IF STATUS !=100, to handle both 0s and 1s, but this also produced the same result.
My goal is to handle different statuses of the same check differently. What's the best way to accomplish this?
Thanks for your help.
-
reporter Actually, forget multiple tests, your suggestion doesn't work even in the simplest single test. Note how it says 'status failed' on success.

    CHECK PROGRAM TestSuccess WITH PATH "/bin/true"
        IF STATUS = 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"
            REPEAT EVERY 1 CYCLES

    Program Name = TestSuccess
     Path = /bin/true
     Monitoring mode = active
     On reboot = start
     Program timeout = terminate the program if not finished within 5 s
     Status = if exit value = 0 then exec '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)' repeat every 1 cycle(s)
     Every = Check service every 1 cycles

    Starting Monit 5.25.3 daemon
    'd341369-005' Monit 5.25.3 started
    'TestSuccess' program started
    'TestSuccess' status failed (0) -- no output
    'TestSuccess' exec: '/bin/bash -c /bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'
    'TestSuccess' program started
    MONIT_PROGRAM_STATUS=0
    MONIT_DATE=Wed, 30 Jan 2019 10:15:23
    MONIT_HOST=d341369-005
    MONIT_EVENT=Status failed
    MONIT_SERVICE=TestSuccess
    MONIT_DESCRIPTION=status failed (0) -- no output
    SUCCESS
-
I think that Monit "thinks" differently: if you define an event and it gets triggered, it is meant to be "a failure". So every IF results in a failure state if it evaluates to true. In your example, IF STATUS = 0 evaluates to true ==> so the situation should be different to this assumption ==> failure!
-
reporter Thanks Henning. It sounds like a bug to me though, not sure if in the implementation or in design. Monit's own documentation says "By convention, 0 means the program exited normally."
-
Monit's own documentation says "By convention, 0 means the program exited normally."
sure, but that refers to the exit status of the program.
Think of monit as an incident reporting system. So you specify incidents. In your case it is: My incident happens, if my program exits with 0. Basically that means: My program should fail. So the monit internal status is failed, because the execution succeeded.
-
reporter thanks again Henning. I understand.
I think that it's mightily confusing, and at a minimum requires clarified documentation.
Looks like this achieves the desired effect:
    #CANNOT do IF STATUS = 0, MUST always compare to !0, and ELSE IF SUCCEEDED
    CHECK PROGRAM TestFailure WITH PATH "/bin/bash -c 'echo $(( 1/($RANDOM%2) ))'"
        WITH TIMEOUT 5 SECONDS
        EVERY 1 CYCLES
        IF STATUS != 0 THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo FAILURE;echo)'"
            REPEAT EVERY 1 CYCLES
        ELSE IF SUCCEEDED THEN EXEC "/bin/bash -c '/bin/env | grep MONIT| cat - <(echo SUCCESS;echo)'"
            REPEAT EVERY 1 CYCLES
-
reporter Update: for the config shown above, consecutive successes do not result in the handler being executed. Consecutive failures do. I see that this is consistent with the docs. So, I’m not out of the woods yet. My goal is to have the handler executed ALWAYS, success or failure, with the MONIT_ env variables set accordingly.
The larger goal is to have an entry in my database that shows an up-to-date status of all monitored services.
If SUCCESS timestamp is not updated, it could mean that my service is still good, or that monit agent is dead, or that it no longer monitors that service for whatever reason. Which is why I want continuous heartbeats for each of my services.
How to achieve this?
Thanks.
@boppy @tildeslash
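One possible workaround, not proposed by anyone in this thread and not tested against Monit, is to move the heartbeat into the checked program itself: a small wrapper runs the real health check, records the exit code and a timestamp on every run, and then returns the same exit code so the status tests keep working. The sketch below is hypothetical throughout: `run_with_heartbeat`, the JSON layout and the heartbeat file are placeholders for whatever the database update would be.

```python
import json
import subprocess
import time


def run_with_heartbeat(command, heartbeat_path):
    """Run the health check, record its result on every run, return its exit code.

    command is an argument list such as ["/path/to/check.sh"] (hypothetical).
    heartbeat_path is a file standing in for the real status database.
    """
    result = subprocess.run(command)
    record = {
        "command": command,
        "status": result.returncode,   # exit code of the real check
        "timestamp": time.time(),      # updated on success and on failure
    }
    with open(heartbeat_path, "w") as fh:
        json.dump(record, fh)
    return result.returncode
```

A thin entry point that calls `sys.exit(run_with_heartbeat(...))` could then serve as the CHECK PROGRAM path. The timestamp would be refreshed each time the check runs, regardless of which status rules match, so a stale timestamp would indicate that the check is no longer being run at all.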
-
I'm adding a comment to say that I have the same use case and "issue" as Dmitry. I have a healthcheck script that returns either a zero exit code (indicating a successful healthcheck) or a non-zero exit code for a healthcheck failure. I want Monit to run one script to set a "Healthy" status in a service registry every time the exit code is zero, and another script to exec when the exit code is non-zero.
I've split this into two services as above, with the outcome being that Monit considers one of the two monitors to be in a "failure" state, and it executes my "healthy_update" script each time, which is exactly what I want except for the web status page showing it as "status failed".
I tried using the "else if succeeded" logic to execute the "healthy" script, but that only triggers on a change of state. It isn't triggered when Monit first starts or reloads, and I need at least one "healthy" update to be fired into the service registry to update the initial "unhealthy" state that is the default on instance registration in the service registry.
It would be great if there was a way to tell Monit that a script was actually an "if success" test, rather than an "if failed" test, or to have a way to specify in a "check program" type of monitor that a particular exit code is actually to be considered a "success" but still take the action.
On a related note, I had to override the "alert mail@address not {status}" so I don't get an email every 30 seconds when Monit thinks it has detected a failure and runs my "healthy" script.
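The behaviour reported here and in the reporter's update (failure handler on every failing cycle, success handler only on recovery, nothing on startup or on consecutive successes) can be modelled as a small state machine. This Python sketch reflects only what was observed in this thread for the "IF STATUS != 0 ... REPEAT EVERY 1 CYCLES ELSE IF SUCCEEDED" config; it is not Monit's logic, and the function name and labels are invented.

```python
def handlers_run(exit_codes):
    """Return which handler would run on each cycle: "failure", "success" or None."""
    in_error = False  # assumption: no error is recorded when Monit starts
    runs = []
    for code in exit_codes:
        if code != 0:
            in_error = True
            runs.append("failure")   # repeated on every failing cycle
        elif in_error:
            in_error = False
            runs.append("success")   # recovery: failed -> succeeded transition
        else:
            runs.append(None)        # still succeeded: no state change, no handler
    return runs


print(handlers_run([0, 0, 1, 1, 0, 0]))
# [None, None, 'failure', 'failure', 'success', None]
```

The two leading `None` entries correspond to the missing initial "healthy" update described above.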
-
I second Dmitry's questions about the behavior of program test actions. In my case, I need to monitor the change of state of a web service (that uses NTLM authentication, BTW) on a different server. So I made a custom script that returns 0 or 1, and this is correctly interpreted by monit.
But I also need another specific notification in addition to the one by mail. I therefore defined 2 additional EXEC actions based on the status returned by the program, and that part is broken. It is counter-intuitive with regard to what is described in the manual:
Multiple status tests can be used, for example:

    check program hwtest with path /usr/local/bin/hwtest.sh with timeout 500 seconds
        if status = 1 then alert
        if status = 3 for 5 cycles then exec "/usr/local/bin/emergency.sh"

But it is consistent with the code, which considers the status of the service as "failed" on the first condition that is verified (wait, what?) (see validate.c:1632):
    // Evaluate program's exit status against our status checks.
    const char *output = StringBuffer_length(s->program->inprogressOutput) ? StringBuffer_toString(s->program->inprogressOutput) : "no output";
    for (Status_T status = s->statuslist; status; status = status->next) {
        if (status->operator == Operator_Changed) {
            if (status->initialized) {
                if (Util_evalQExpression(status->operator, s->program->exitStatus, status->return_value)) {
                    Event_post(s, Event_Status, State_Changed, status->action, "status changed (%d -> %d) -- %s", status->return_value, s->program->exitStatus, output);
                    status->return_value = s->program->exitStatus;
                } else {
                    Event_post(s, Event_Status, State_ChangedNot, status->action, "status didn't change (%d) -- %s", s->program->exitStatus, output);
                }
            } else {
                status->initialized = true;
                status->return_value = s->program->exitStatus;
            }
        } else {
            if (Util_evalQExpression(status->operator, s->program->exitStatus, status->return_value)) {
                /* there ===> */ rv = State_Failed;
                Event_post(s, Event_Status, State_Failed, status->action, "status failed (%d) -- %s", s->program->exitStatus, output);
            } else {
                Event_post(s, Event_Status, State_Succeeded, status->action, "status succeeded (%d) -- %s", s->program->exitStatus, output);
            }
        }
    }
It would make sense to me that
- the status of a service is deduced from its returned value
- we could trigger an action depending on a check on that return value
- the status won't be changed by the conditional action (unlike what we can see in the log: succeeded (0) and failed (0) at the same time).
Definition of my service:
    check program CROC with path /home/kirchse/dev/monit/script/check_croc.sh
        with timeout 15 seconds
        if status = 0 then exec "/bin/bash -c 'custom_up.sh' "
        if status = 1 for 5 cycles then exec "/bin/bash -c 'custom_down.sh' "
        alert team@mydomain
Verbose log
    Adding 'allow localhost' -- host resolved to [::1]
    Adding 'allow localhost' -- host resolved to [::ffff:127.0.0.1]
    Adding 'allow mymachine' -- host resolved to [::ffff:123.456.78.90]
    Runtime constants:
     Control file = /home/kirchse/dev/monit/etc/monitrc_debug
     Log file = /home/kirchse/dev/monit/log/monit.log
     Pid file = /home/kirchse/dev/monit/var/monit.pid
     Id file = /home/kirchse/dev/monit/var/monit.id
     State file = /home/kirchse/dev/monit/var/monit.state
     Debug = True
     Log = True
     Use syslog = False
     Is Daemon = True
     Use process engine = True
     Limits = {
       = programOutput: 512 B
       = sendExpectBuffer: 256 B
       = fileContentBuffer: 512 B
       = httpContentBuffer: 1 MB
       = networkTimeout: 5 s
       = programTimeout: 5 m
       = stopTimeout: 30 s
       = startTimeout: 30 s
       = restartTimeout: 30 s
       = }
     On reboot = start
     Poll time = 10 seconds with start delay 0 seconds
     Mail server(s) = localhost:25 with timeout 30 s
     Mail from = Automation <toolsteam@mydomain>
     Mail subject = monit alert -- $event $service
     Mail message = service: $service - ..(truncated)
     Start monit httpd = True
     httpd bind address = Any/All
     httpd portnumber = 2812
     httpd signature = Enabled
     httpd auth. style = Host/Net allow list

    The service list contains the following entries:

    Program Name = CROC
     Path = /home/kirchse/dev/monit/script/check_croc.sh
     Monitoring mode = active
     On reboot = start
     Program timeout = terminate the program if not finished within 15 s
     Status = if exit value = 1 for 5 cycles then exec '/bin/bash -c custom_ok.sh'
     Status = if exit value = 0 then exec '/bin/bash -c custom_nok.sh'
     Alert mail to = team@mydomain
     Alert on = All events

    System Name = mymachine
     Monitoring mode = active
     On reboot = start

    -------------------------------------------------------------------------------
    pidfile '/home/kirchse/dev/monit/var/monit.pid' does not exist
    Starting Monit 5.25.3 daemon with http interface at [*]:2812
    Starting Monit HTTP server at [*]:2812
    Monit HTTP server started
    'mymachine' Monit 5.25.3 started
    Cannot open proc file '/proc/1/io' -- Permission denied
    Cannot read proc file '/proc/1/attr/current' -- Invalid argument
    [ skippings /proc accessing lines because not root ]
    'CROC' program started
    Cannot open proc file '/proc/1/io' -- Permission denied
    Cannot read proc file '/proc/1/attr/current' -- Invalid argument
    [ skippings /proc accessing lines because not root ]
    'CROC' status succeeded (0) -- no output
    'CROC' status failed (0) -- no output
    -------------------------------------------------------------------------------
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x42581a]
    /home/kirchse/dev/monit/monit-5.25.3/monit(LogError+0xd3) [0x425e2f]
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x420887]
    /home/kirchse/dev/monit/monit-5.25.3/monit(Event_post+0x47e) [0x420e01]
    /home/kirchse/dev/monit/monit-5.25.3/monit(check_program+0x3f2) [0x43d8a7]
    /home/kirchse/dev/monit/monit-5.25.3/monit(validate+0x105) [0x43bf7f]
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x41c58d]
    /home/kirchse/dev/monit/monit-5.25.3/monit() [0x41bbaa]
    /home/kirchse/dev/monit/monit-5.25.3/monit(main+0x7e) [0x41b5b6]
    /lib64/libc.so.6(__libc_start_main+0xf3) [0x7f1514416413]
    /home/kirchse/dev/monit/monit-5.25.3/monit(_start+0x2e) [0x40d06e]
    -------------------------------------------------------------------------------
    Sending Status failed notification to team@mydomain
    Trying to send mail via localhost:25
    'CROC' exec: '/bin/bash -c echo OK'
    'CROC' program started
    OK
-
Hello Sébastien,
maybe I'm wrong, but I use similar scripts to report problems via a central information system. Based on your sample, a useful check would look like this one:

    check program CROC with path /home/kirchse/dev/monit/script/check_croc.sh
        with timeout 15 seconds
        if status != 0 then exec "/bin/bash -c 'custom_down.sh'"
        else if succeeded then exec "/bin/bash -c 'custom_up.sh'"
        if status = 1 for 5 cycles then exec "/bin/bash -c 'custom_down.sh'"
        alert team@mydomain

Unfortunately, Monit knows "bad" alerts only; there is no way to define "good" alerts directly.
But with the "else if succeeded then" statement I define a "good" alert indirectly. All return codes not equal to zero are “bad” alerts (exec custom_down.sh), but you get a “good” alert for zero as well (exec custom_up.sh).
A suggestion only,
Lutz