#487 - monit 5.17.1 - solaris 11 bug with http check

Issue #487 resolved

Scott Halstead created an issue 2016-10-19

On Solaris we have enabled an HTTP port check. Running with -vv, we see that the connection is refused but the test succeeds anyway.

[EDT Oct 19 09:04:49] debug    : Socket test failed for [127.0.0.1]:4242 -- Connection refused
[EDT Oct 19 09:04:49] debug    : 'logchipper' succeeded testing protocol [HTTP] at [localhost]:4242/health [TCP/IP] [response time 0.894 ms]

Comments (19)

Tildeslash repo owner
- edited description
- assigned issue to Tildeslash
- 2016-10-19T14:49:34+00:00
Tildeslash repo owner
Please add full monit configuration for 'logchipper' service.
- 2016-10-19T14:50:03+00:00

Scott Halstead reporter

Config file

check process logchipper pidfile /opt/inf/var/run/logchipper/logchipper.pid
  start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
  if failed port 4242 protocol http request /health then restart
  if 3 restarts within 6 cycles then unmonitor
  if memory greater than 500 MB for 5 cycles then restart

2016-10-19T14:55:02+00:00

Scott Halstead reporter
I tested this on Linux (Red Hat EL6) and on AIX 1.7 with the same configuration and it worked fine. Only Solaris is showing the issue.
- 2016-10-19T14:56:15+00:00

Tildeslash repo owner

Tested with monit 5.17.1 and monit 5.19.0 on Solaris 11.3 with the following configuration:

set daemon 5

set httpd port 2812 allow localhost

check system $HOST

check process apache matching "httpd"
    if failed port 4242 protocol http request /health then restart

Works fine:

'apache' process is running with pid 1058
'apache' zombie check succeeded
Socket test failed for [::1]:4242 -- Connection refused
Socket test failed for [127.0.0.1]:4242 -- Connection refused
'apache' failed protocol test [HTTP] at [localhost]:4242/health [TCP/IP] -- Connection refused
'apache' trying to restart
'apache' stop skipped -- method not defined
'apache' start method not defined
'apache' monitoring enabled
'apache' process is running with pid 1058
'apache' zombie check succeeded

How was the binary created: compiled from source, or using the pre-compiled binary from our site (https://mmonit.com/monit/dist/binary/)?

Please can you test with monit 5.19.0?

If you don't specify the host explicitly, monit will try to connect to every address that localhost resolves to, usually IPv4 "127.0.0.1" and IPv6 "::1". Is it possible that your port 4242 responds on the IPv6 interface? (The log contains an entry only for IPv4, followed by success.) You can check this, for example, using pfiles:

pfiles /proc/* | grep AF_INET | grep 4242

2016-10-19T17:48:27+00:00
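
The address handling described in the comment above can be sketched in a few lines of Python. This is an illustration of the idea only, not monit's implementation (monit is written in C): the function names `resolve_all`, `tcp_connect`, and `check_port` are hypothetical, and the 5-second timeout is an arbitrary assumption. The sketch resolves a host name to all of its TCP addresses, tries each in turn, and reports failure only when every address refuses.

```python
import socket

def resolve_all(host, port):
    """Return every (family, sockaddr) pair that host resolves to for TCP."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return [(family, sockaddr) for family, _type, _proto, _canon, sockaddr in infos]

def tcp_connect(family, sockaddr, timeout=5.0):
    """Try one TCP connection; return None on success or the error text."""
    try:
        with socket.socket(family, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)  # assumed value, for illustration only
            sock.connect(sockaddr)
        return None
    except OSError as exc:
        return str(exc)

def check_port(candidates, connect=tcp_connect):
    """Succeed if any candidate accepts; otherwise report every failure."""
    failures = []
    for family, sockaddr in candidates:
        error = connect(family, sockaddr)
        if error is None:
            return True, sockaddr, failures
        failures.append((sockaddr, error))
    return False, None, failures
```

Calling `check_port(resolve_all("localhost", 4242))` would exercise both the IPv6 and IPv4 loopback addresses on a host where localhost resolves to both. The overall result is a failure only when every candidate address refuses the connection, which is the case where a `then restart` action would be expected to fire.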

Scott Halstead reporter

I believe we used the binary distribution, 99% sure but need to check with a few folks to confirm (they are out at the moment).

pfiles /proc/* 2>&1 | grep AF_INET | grep 4242 | grep -v permission
        sockname: AF_INET 0.0.0.0  port: 4242

2016-10-19T17:53:55+00:00

Scott Halstead reporter
I will download 5.19 and give it a try.
- 2016-10-19T17:55:23+00:00

Scott Halstead reporter

I downloaded your 5.19 binary and gave it a whirl. I still see the same problem:

'logchipper' process is running with pid 168
'logchipper' zombie check succeeded
'logchipper' mem amount check succeeded [current mem amount=1.5 MB]
Socket test failed for [127.0.0.1]:4242 -- Connection refused
'logchipper' succeeded testing protocol [HTTP] at [localhost]:4242/health [TCP/IP] [response time 3.827 s]
'logchipper' connection succeeded to [localhost]:4242/health [TCP/IP]

2016-10-19T18:13:35+00:00

Tildeslash repo owner
Please can you capture a tcpdump for port 4242 during the monit test and send it to support@mmonit.com?
- 2016-10-19T18:19:16+00:00
Scott Halstead reporter
Looking at your code, it appears that the exception thrown on the Socket test failure isn't propagating back correctly on Sun. You aren't falling into the catch section in validate.c in _checkConnection(), lines 151-158.

I haven't put this into the debugger, but the printf that logs "succeeded testing protocol" is just below the Socket_test() call.
- 2016-10-19T18:21:34+00:00
Scott Halstead reporter
I don't have root access on the box. I will try and see if I can get it but it isn't that easy.
- 2016-10-19T18:36:34+00:00
Scott Halstead reporter
Just a note that connection refused is the correct error: I am using a bad / reused pid. What needs to happen is that if this test fails, the program should be restarted. monit correctly detects that the port can't be accessed, but on Solaris (only) the error doesn't result in a restart.
- 2016-10-19T18:45:42+00:00
Tildeslash repo owner
Which Solaris 11 SRU version is it?
```
pkg info entire | grep Version
```
- 2016-10-20T09:50:39+00:00

Scott Halstead reporter

 pkg info entire | grep Version
       Version: 0.5.11 (Oracle Solaris 11.2.4.6.0)

2016-10-20T11:21:25+00:00

Scott Halstead reporter
The problem we are trying to address is that when a machine crashes and reboots, the old pid files are still there at startup, and the pid may have been reused by another task. When monit does the pid check, it succeeds because the task is alive. However, it is not the task that we want (e.g. not logchipper but some other task). We don't have a means (via monit) to confirm whether the executable associated with the pid is logchipper. Therefore we are using the port health check, which is failing as expected but not triggering the restart.

It might be useful to augment monit with a check for a pid and a task name, either as a one-time check during startup or run rarely (say once an hour).
- 2016-10-20T11:30:13+00:00
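
The pid-plus-task-name check proposed in the comment above could look roughly like the following Python sketch. It is not a monit feature and not code from this thread: `command_name` and `pidfile_is_stale` are hypothetical helpers, and the sketch assumes a POSIX-style `ps` that accepts `-p <pid> -o comm=`. Some platforms truncate the reported command name, so the comparison may need adjusting there.

```python
import os
import subprocess

def command_name(pid, run=subprocess.run):
    """Return the command name `ps` reports for pid, or None if no such process."""
    result = run(["ps", "-p", str(pid), "-o", "comm="],
                 capture_output=True, text=True)
    name = result.stdout.strip()
    if result.returncode != 0 or not name:
        return None
    # Some systems print a full path; compare on the final component only.
    return os.path.basename(name)

def pidfile_is_stale(pidfile, expected, lookup=command_name):
    """True when the pidfile is missing, unparsable, or names a different program."""
    try:
        with open(pidfile) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return True
    # A live pid that belongs to some other program counts as stale.
    return lookup(pid) != expected
```

A wrapper script could call `pidfile_is_stale("/opt/inf/var/run/logchipper/logchipper.pid", "logchipper")` at boot and remove the pid file when it returns `True`, so that a reused pid is not mistaken for the real service.
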
Tildeslash repo owner
We're unable to reproduce the problem. Is it possible to get remote access to the host that has this problem, or can you provide a VMware/VirtualBox image of the test system where it fails?
- 2016-10-20T19:23:54+00:00
Scott Halstead reporter
No, unfortunately we can't do that. Are you still interested in the tcpdump?
- 2016-10-20T19:47:29+00:00
Tildeslash repo owner
Yes, please send the tcpdump.
- 2016-10-21T19:42:43+00:00
Tildeslash repo owner
- changed status to resolved
Fixed: Issue #487: Solaris on SPARC: Monit doesn't trigger an event if protocol test failed.

→ <<cset c3b194e80989>>
- 2017-04-17T12:36:18+00:00

Assignee: Tildeslash

Type: bug

Priority: major

Status: resolved

Component: Monit

Version: 5.17.1
