monit 5.17.1 - solaris 11 bug with http check
On Solaris we have enabled an HTTP port check. Using -vv we see that the connection is refused, but the test succeeds anyway.
[EDT Oct 19 09:04:49] debug : Socket test failed for [127.0.0.1]:4242 -- Connection refused
[EDT Oct 19 09:04:49] debug : 'logchipper' succeeded testing protocol [HTTP] at [localhost]:4242/health [TCP/IP] [response time 0.894 ms]
Comments (19)
-
repo owner Please add full monit configuration for 'logchipper' service.
-
reporter Config file
check process logchipper pidfile /opt/inf/var/run/logchipper/logchipper.pid
  start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
  if failed port 4242 protocol http request /health then restart
  if 3 restarts within 6 cycles then unmonitor
  if memory greater than 500 MB for 5 cycles then restart
-
reporter I tested this on Red Hat EL6 (Linux) and AIX 1.7 with the same configuration and it worked fine. Only Solaris shows the issue.
-
repo owner Tested with monit 5.17.1 and monit 5.19.0 on Solaris 11.3 with the following configuration:
set daemon 5
set httpd port 2812
  allow localhost
check system $HOST
check process apache matching "httpd"
  if failed port 4242 protocol http request /health then restart
Works fine:
'apache' process is running with pid 1058
'apache' zombie check succeeded
Socket test failed for [::1]:4242 -- Connection refused
Socket test failed for [127.0.0.1]:4242 -- Connection refused
'apache' failed protocol test [HTTP] at [localhost]:4242/health [TCP/IP] -- Connection refused
'apache' trying to restart
'apache' stop skipped -- method not defined
'apache' start method not defined
'apache' monitoring enabled
'apache' process is running with pid 1058
'apache' zombie check succeeded
How was the binary created? Was it compiled from source, or is it the pre-compiled binary from our site (https://mmonit.com/monit/dist/binary/)?
Please can you test with monit 5.19.0?
If you don't specify the host explicitly, monit will try to connect to every address that localhost resolves to, usually IPv4 "127.0.0.1" and IPv6 "::1". Is it possible that your port 4242 responds on the IPv6 interface? (The log contains an entry only for IPv4, followed by success.) You can check this, for example, using pfiles:
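The per-address behaviour described above can be reproduced outside monit with a small probe. This is only a rough sketch in Python, not monit's code: the function names `probe_all_addresses` and `port_is_reachable` are made up for illustration, and the "any address succeeds" rule is an assumption based on the description above.

```python
import socket

def probe_all_addresses(host, port, timeout=2.0):
    """Try a TCP connect to every address `host` resolves to.

    Returns a list of (address, error) pairs; error is None on success.
    """
    results = []
    seen = set()
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        if sockaddr in seen:
            continue  # skip duplicate resolver entries
        seen.add(sockaddr)
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
            results.append((sockaddr[0], None))
        except OSError as exc:
            # Connection refused, unsupported address family, timeout, ...
            results.append((sockaddr[0], str(exc)))
    return results

def port_is_reachable(host, port):
    """Assumed rule: the check passes if any resolved address accepts."""
    return any(err is None for _addr, err in probe_all_addresses(host, port))
```

Calling `probe_all_addresses("localhost", 4242)` on the affected host would show whether "::1" is tried at all and which address, if any, accepts the connection.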
pfiles /proc/* | grep AF_INET | grep 4242
-
reporter I believe we used the binary distribution, 99% sure but need to check with a few folks to confirm (they are out at the moment).
pfiles /proc/* 2>&1 | grep AF_INET | grep 4242 | grep -v permission
sockname: AF_INET 0.0.0.0 port: 4242
-
reporter I will download 5.19 and give it a try.
-
reporter I downloaded your 5.19 binary and gave it a whirl. I still see the same problem:
'logchipper' process is running with pid 168
'logchipper' zombie check succeeded
'logchipper' mem amount check succeeded [current mem amount=1.5 MB]
Socket test failed for [127.0.0.1]:4242 -- Connection refused
'logchipper' succeeded testing protocol [HTTP] at [localhost]:4242/health [TCP/IP] [response time 3.827 s]
'logchipper' connection succeeded to [localhost]:4242/health [TCP/IP]
-
repo owner Please can you take a tcpdump for port 4242 during the monit test and send it to support@mmonit.com?
-
reporter Looking at your code, it appears that the exception thrown on the Socket test failure isn't propagating back correctly on Sun. You aren't falling into the catch section in validate.c in _checkConnection(), lines 151-158.
I haven't put this into the debugger, but the call that logs "succeeded testing protocol" is just below the Socket_test() call.
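To make the expected control flow concrete, here is a rough sketch in Python. It is not monit's code (monit is written in C), and `SocketTestError`, `check_connection`, and the callback parameters are hypothetical names. The point is only that a failed socket test must skip the success log and reach the failure handler.

```python
class SocketTestError(Exception):
    """Hypothetical stand-in for the failure raised by the socket test."""

def check_connection(socket_test, on_failure, log):
    """Run one protocol test and report whether it succeeded."""
    try:
        socket_test()
    except SocketTestError as exc:
        # Expected path when the connection is refused:
        # log the failure and trigger the configured action (e.g. restart).
        log(f"failed protocol test -- {exc}")
        on_failure()
        return False
    # Only reached when socket_test() returned without raising.
    log("succeeded testing protocol")
    return True
```

The logs in this report look as if the success line is reached even though the socket test failed, which is consistent with the failure not arriving in the handler.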
-
reporter I don't have root access on the box. I will try and see if I can get it but it isn't that easy.
-
reporter Just a note that connection refused is the correct error. I am using a bad / reused pid. What needs to happen is that if this test fails, the program should be restarted. monit is correctly detecting that the port can't be accessed, but on Solaris (only) the error doesn't result in a restart.
-
repo owner Which Solaris 11 SRU version is it?
pkg info entire | grep Version
-
reporter pkg info entire | grep Version
Version: 0.5.11 (Oracle Solaris 11.2.4.6.0)
-
reporter The problem we are trying to address is that we have seen cases where, when a machine crashes and reboots, the pid files are still there at startup. The pid may be reused by another task. When monit does the pid check it succeeds, as the task is alive. However, it is not the task that we want (e.g. not logchipper but some other task). We don't have a means (via monit) to confirm whether the executable associated with the pid is logchipper or not. Therefore we are using the port health check, which is failing as expected but not doing the restart.
It might be useful to augment monit to have a check for a pid and a task name - as a one time check during startup or run rarely (say once an hour).
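As an illustration of the pid-plus-name check suggested above, here is a rough sketch in Python. It is not a monit feature or API: `ps_command_name` and `pid_belongs_to` are hypothetical helpers, and the sketch assumes a `ps` that accepts the POSIX options `-p` and `-o comm=`. Some systems truncate the reported command name, so a long name may need a prefix comparison instead.

```python
import os
import subprocess

def ps_command_name(pid):
    """Return the command name `ps` reports for pid, or None if there is no such process."""
    proc = subprocess.run(["ps", "-p", str(pid), "-o", "comm="],
                          capture_output=True, text=True)
    name = proc.stdout.strip()
    if proc.returncode != 0 or not name:
        return None
    # Some systems print a full path; compare on the base name only.
    return os.path.basename(name)

def pid_belongs_to(pid, expected_name, lookup=ps_command_name):
    """True only if pid is alive and its command name equals expected_name."""
    return lookup(pid) == expected_name
```

A startup script could read the pid from the pidfile, call `pid_belongs_to(pid, "logchipper")`, and delete the stale pidfile when the answer is False.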
-
repo owner We're unable to reproduce the problem. Is it possible to get remote access to the host that has this problem, or can you provide a VMware/VirtualBox image of the test system where it fails?
-
reporter No, unfortunately we can't do that. Are you still interested in the tcpdump?
-
repo owner Yes, please send the tcpdump.
-
repo owner - changed status to resolved
Fixed: Issue #487: Solaris on SPARC: Monit doesn't trigger an event if protocol test failed. → <<cset c3b194e80989>>