if failed ping4 hangs up monit

Issue #226 resolved

Former user created an issue 2015-07-14

Symptom: The "data collected" counter freezes for every monitored item. The HTTP daemon keeps working, but no new checks are performed. New commands are still accepted (via HTTP or the command line) but are not executed. After monit start, the summary shows "initializing" forever.

Cause: A rule of the form check host router with address "192.168.0.x", if failed ping4 then alert. The moment 192.168.0.x goes offline, the next status change hangs. Neither monit reload nor service monit restart helps; monit hangs on initializing the router check.

Workaround: Change address "192.168.0.x" to a hostname. It looks like monit does not like the ICMP "destination host unreachable" responses from the router. Changing to a hostname gives an "unknown host" message, which works.

Environment: Android armhf CPU, Debian chroot with IPv6. The problem appeared after moving from v5.12 to v5.13.

Comments (31)

Tildeslash repo owner
We're unable to replicate the issue, using the following configuration:
```
check host nonexistenthost with address 192.168.1.100
    if failed ping4 then alert
```
Debug output:
```
Ping response for 192.168.1.100 1/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.100 2/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.100 3/3 timed out -- no response within 5 seconds
'nonexistenthost' ping test failed
```
... and the monitoring continues normally after the error is reported

Could you please run monit in verbose mode (using the "-vI" options) and provide the output? Please also collect a network trace of the ICMP messages between the monit host and the target machine.
- 2015-08-27T18:25:08+00:00
Tildeslash repo owner
- changed status to on hold
unable to reproduce the problem, waiting for data
- 2015-08-31T07:24:00+00:00
Andy Miller
- attached debug.log
- attached www.png
- attached console.png
- 2015-09-05T11:10:55+00:00
Andy Miller
Sorry for the delay. I installed on a new box (a different Android version, different hardware) and moved to v5.14. The problem is still there and can be reproduced by unplugging the Ethernet cable from the target box. The next time monit runs the test: bang. I ran tcpdump -ni wlan0 'dst 192.168.0.40 and icmp', but this only shows that monit got stuck after the second of the three (default) pings to the target:

...726247 IP 192.168.0.30 > 192.168.0.40: ICMP echo request, id 9231, seq 0, length 72
....730562 IP 192.168.0.30 > 192.168.0.40: ICMP echo request, id 9231, seq 1, length 72

I still have the test box handy. If you need a full packet trace, could you please tell me which tcpdump parameters you want? Thank you very much for your help.
- 2015-09-05T11:18:47+00:00
Tildeslash repo owner
- changed status to open
Reopened; to be investigated and fixed for version 5.15
- 2015-09-05T12:28:01+00:00

Tildeslash repo owner

Thanks for the update.

We're still unable to replicate the problem.

One machine pings another box (link up on both sides):

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.008851s
'raspberry' ping test succeeded [response time 0.009s]

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.002507s
'raspberry' ping test succeeded [response time 0.003s]

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.023783s
'raspberry' ping test succeeded [response time 0.024s]

cable unplugged from the target box, left monit to repeat the test three times (each time with 3 retries with 5 seconds timeout per retry):

Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
'raspberry' ping test failed
'raspberry' icmp ping failed, skipping any port connection tests

Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
'raspberry' ping test failed
'raspberry' icmp ping failed, skipping any port connection tests

Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
'raspberry' ping test failed
'raspberry' icmp ping failed, skipping any port connection tests

connected the cable back to the target box, ping succeeded:

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=2.353867s
'raspberry' ping test succeeded [response time 2.354s]

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.005074s
'raspberry' ping test succeeded [response time 0.005s]

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.004035s
'raspberry' ping test succeeded [response time 0.004s]

Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.097907s
'raspberry' ping test succeeded [response time 0.098s]

turned off the link on the source box (the errors are immediate this time - monit doesn't wait for ping timeout and returns error immediately as there is no link):

Ping request for 192.168.1.12 1/3 failed -- No route to host
Ping request for 192.168.1.12 2/3 failed -- No route to host
Ping request for 192.168.1.12 3/3 failed -- No route to host
'raspberry' ping test failed
'raspberry' icmp ping failed, skipping any port connection tests

We'll continue testing; if we can't find the root cause, we'll prepare a debug version for you.

- 2015-09-06T10:27:53+00:00

Andy Miller
Thank you very much for your effort. Maybe it's really my environment. The target I ping does not matter. I've tried an Intel IA64 Debian box, an embedded Linux running in a router, and even local services like nmbd (if failed host localhost port 137 type UDP retry 3 then restart). Everything except TCP gets stuck. The source (the monit host) is always a chrooted jessie (Debian) running on different hardware, such as an HP Touchpad (kernel v3.0.101, CyanogenMod) or an HP Slate7 (kernel v3.0.8, ICS). What they have in common: jessie (armhf) and IPv6. I want to mention again that monit v5.12 worked without any problem. But rather than going back, I moved all IP addresses to hostnames and replaced all UDP tests with TCP where possible, or commented them out. So I'm still very happy with this workaround. Monit is such a great product. I cannot live without it!
- 2015-09-06T12:22:22+00:00
Tildeslash repo owner
Thanks for the update. If the problem goes away when UDP is not used, it is more likely related to the UDP communication.

Please can you gather the following data?:

1.) attach strace to monit:
```
sudo strace -p <monit's PID> -s 256 -o /tmp/monit.strace
```
2.) trigger the problem

3.) once the problem has started, keep the strace running for about 5 more minutes, then stop it (^C) and send the /tmp/monit.strace file to support@mmonit.com
- 2015-09-07T09:59:34+00:00
Andy Miller
Done. Thank you. Just to make sure: the problem goes away when ICMP AND UDP are disabled. The trace data contains the ICMP ping 192.168.0.40 sample.
- 2015-09-07T12:27:37+00:00
Tildeslash repo owner
- changed status to resolved
Fix Issue ~~#226~~ : Monit hung during ping test

→ <<cset c5f50df14009>>
- 2015-09-07T18:16:57+00:00
Tildeslash repo owner
Thanks for the data; the problem is fixed.

Could you please test a patched monit-5.14, which is available here: https://mmonit.com/tmp/monit-5.14-p1.tar.gz

To compile:
```
tar -xzf monit-5.14-p1.tar.gz
cd monit-5.14-p1
./configure
make
make install # note: optional, alternatively you can run monit from the current directory
```
- 2015-09-07T18:24:32+00:00
Andy Miller
Thanks for the test version. Got it compiled. Monit -V says version 5.14.p1. Unfortunately the problem is not fixed yet. I'm rebooting, just in case, and will do another strace. Thanks for reading.

...strace sent.
- 2015-09-07T20:04:45+00:00
Tildeslash repo owner
Thanks for the update and testing. The problem is that poll() is called with a crazy timeout value:
```
poll([{fd=5, events=POLLIN}], 1, 1452059047)    # note: the last argument is the timeout in ms; by default it should be at most 5000 ms
```
One possible root cause was the system time jumping backwards (fixed in the provided 5.14-p1), but since that didn't fix the problem, it looks like either some memory corruption or an overflow ... both are strange. The value "1452059047" looks like a Unix timestamp, so it seems that the timeout argument was somehow corrupted. I will continue with the analysis tomorrow.
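To put the value from the strace in perspective, here is a small sketch (not monit's code) that reads 1452059047 both ways: as a poll() timeout in milliseconds and as seconds since the Unix epoch. The function names are illustrative only.

```python
from datetime import datetime, timezone

SUSPECT_TIMEOUT = 1452059047   # third argument of poll() seen in the strace
DEFAULT_LIMIT_MS = 5000        # default ping timeout mentioned above (5 s)


def timeout_in_days(timeout_ms: int) -> float:
    """Convert a poll() timeout given in milliseconds to days."""
    return timeout_ms / (1000 * 60 * 60 * 24)


def as_utc_timestamp(seconds: int) -> datetime:
    """Interpret the same number as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    print(f"as poll() timeout: {timeout_in_days(SUSPECT_TIMEOUT):.1f} days")
    print(f"as Unix timestamp: {as_utc_timestamp(SUSPECT_TIMEOUT).isoformat()}")
    print(f"within default limit: {SUSPECT_TIMEOUT <= DEFAULT_LIMIT_MS}")
```

Read as milliseconds, the value is a wait of more than 16 days, which from the outside is indistinguishable from a hang.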
- 2015-09-07T20:36:41+00:00
Tildeslash repo owner
- changed status to open
- 2015-09-07T20:36:51+00:00
Andy Miller
Memory overflow? The only box which I never changed was the Wi-Fi router. I'll try another one and report back.
- 2015-09-08T08:25:28+00:00
Tildeslash repo owner
I meant a data type overflow (not system memory) ... your machine is most probably fine, so there is no need for more tests now. I'm trying to reproduce the problem on ARM, as it could be specific to that environment. I think the corruption could be caused by the way the ARM compiler handles the local variable declarations, probably reusing their space, but that is just speculation. I'll update as soon as I have more details.
- 2015-09-08T08:36:10+00:00
Tildeslash repo owner
I'm not able to reproduce it on ARM (Raspberry Pi). Could you also test the pre-compiled binary to see whether it could be a compiler issue?

You can get it here: https://mmonit.com/monit/dist/binary/5.14/monit-5.14-linux-arm.tar.gz
- 2015-09-08T10:55:57+00:00
Andy Miller
Unfortunately no compiler issue. Stuck as before.
- 2015-09-08T11:45:18+00:00
Tildeslash repo owner
I have cleaned up the ping implementation. Could you please test the following version: https://mmonit.com/tmp/monit-5.14-p2.tar.gz
- 2015-09-08T16:10:14+00:00
Andy Miller
First impression: something changed!!! Second impression: ALL GOOD. Working!!! Congratulations. Ping tests are working beautifully again. Thank you very much, Sir!
- 2015-09-08T16:39:07+00:00
Tildeslash repo owner
- changed status to resolved
- 2015-09-11T09:54:35+00:00
Tildeslash repo owner
- changed status to open
It seems that the problem was not solved: a new issue with similar symptoms was opened (issue ~~#251~~ + follow-up comment in ~~#252~~) ... waiting for data from the ~~#251~~ author
- 2015-09-18T16:43:06+00:00
monituser
I re-tested and found that the bug from ~~#251~~/~~#252~~ only occurs if you have the ping count set to 1.

If the ping count isn't defined (i.e. the default of 3), or another value is set, the patch above works as expected.
- 2015-09-18T18:13:33+00:00
Tildeslash repo owner
I'm not able to reproduce the issue, using the following configuration:
```
set daemon 5
set httpd port 2812 allow 127.0.0.1

check host foobar with address 192.168.1.11
    if failed ping count 1 then alert
```
Output:
```
Ping response for 192.168.1.11 1/1 timed out -- no response within 5 seconds
'foobar' ping test failed
'foobar' icmp ping failed, skipping any port connection tests
```
Please can you send the following data?:
1. monit log
2. monit configuration
3. full output of monit in debug mode: monit -vI
4. strace output ... keep strace running until monit freezes, then keep it running for about 5 more minutes and stop it with ^C (send the /tmp/monit.strace file to support@mmonit.com)
  
  sudo strace -p <monit's PID> -s 256 -o /tmp/monit.strace
- 2015-09-18T18:59:37+00:00
Tildeslash repo owner
Update for the ping with "count 1": the problem really is different from issue ~~#226~~. I tested it further: when the ping count was 1, monit wasn't happy with the ping response because the sequence id sanity check failed (due to a bug), so it waited up to the timeout (5 s by default) for a better response, which would never come. Monit thus didn't hang, but each ping test was delayed by the timeout. If you have 50 hosts, as mentioned, the test progress and service initialization will be very slow.

I have prepared a patched 5.14, which includes fixes for issues ~~#226~~ and ~~#251~~. @monituser, could you please test it? https://mmonit.com/tmp/monit-5.14-p4.tar.gz
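A rough back-of-the-envelope sketch of why "count 1" looked like a hang with many hosts. It assumes the ping tests run one after another and that each one waits for the full timeout, as described above for the unpatched version; the function name is made up for illustration.

```python
def worst_case_delay_s(hosts: int, timeout_s: float) -> float:
    """Upper bound on time spent waiting if every ping test
    blocks for the full timeout and tests run sequentially (assumption)."""
    return hosts * timeout_s


if __name__ == "__main__":
    # 50 hosts and the default 5 s timeout, as in the comment above.
    print(worst_case_delay_s(50, 5), "seconds per monitoring cycle")
```

With 50 hosts and the default timeout that is up to 250 seconds per cycle, far longer than a short poll interval such as "set daemon 5".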
- 2015-09-19T11:22:07+00:00
Tildeslash repo owner
- changed status to resolved
monit 5.15-beta with fixes for both problems was released (https://mmonit.com/monit/dist/monit-5.15-beta.tar.gz)
- 2015-10-12T10:51:18+00:00
Steven Christensen
Hello, I am using Monit 5.15 (not beta) on a Raspberry Pi and I appear to be having this very problem. It exhibits the same symptoms - monitoring does not go past the ping test if the host is not present.

I have linked to the relevant debug files - can someone see if this is the same problem? I don't grok strace...

monitrc - http://pastebin.com/nTcNw4CU

monit debug output (stdout) - http://pastebin.com/i8EzVRSs

monit log output - http://pastebin.com/iQxCPQvX

monit strace output - http://pastebin.com/FAby8L9Z
- 2015-11-12T20:56:43+00:00

Steven Christensen

I think I see suspicious poll timeouts at the end of the strace data:

poll([{fd=6, events=POLLIN}], 1, 4000)  = 1 ([{fd=6, revents=POLLIN}])
gettimeofday({1447360948, 803242}, NULL) = 0
recvfrom(6, "...data...", 1500, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.0.1")}, [16]) = 120
poll([{fd=6, events=POLLIN}], 1, -2993190) = 1 ([{fd=6, revents=POLLIN}])
gettimeofday({1447360972, 133159}, NULL) = 0
recvfrom(6, "...data...", 1500, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.0.1")}, [16]) = 76
poll([{fd=6, events=POLLIN}], 1, -26323107 <unfinished ...>

Of the three poll activities, only the first one looks like it has a reasonable timeout (4000, which matches the 4 seconds specified in the monitrc file). The other two look suspicious / wrong...
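The numbers in the strace are at least consistent with each other: the gap between the two gettimeofday() calls, expressed in microseconds, equals the difference between the two negative timeouts. The sketch below checks that arithmetic and shows one common way to keep a poll() timeout in range by clamping it. It is an illustration, not monit's code; `clamp_timeout_ms` is a hypothetical helper, and reading the negative values as "remaining time in microseconds" is only an inference from the numbers.

```python
# Values copied from the strace output above.
T1 = (1447360948, 803242)   # first gettimeofday(): (seconds, microseconds)
T2 = (1447360972, 133159)   # second gettimeofday()
SECOND_TIMEOUT = -2993190   # timeout passed to the second poll()
THIRD_TIMEOUT = -26323107   # timeout passed to the third poll()


def elapsed_us(start, end):
    """Microseconds elapsed between two (seconds, microseconds) pairs."""
    return (end[0] - start[0]) * 1_000_000 + (end[1] - start[1])


def clamp_timeout_ms(remaining_ms, limit_ms):
    """Hypothetical helper: keep a poll() timeout within [0, limit_ms].

    On Linux a negative poll() timeout means "wait indefinitely", so an
    expired deadline should become 0 (return at once), never a negative value.
    """
    return max(0, min(remaining_ms, limit_ms))


if __name__ == "__main__":
    gap = elapsed_us(T1, T2)
    print("elapsed between gettimeofday() calls (us):", gap)
    print("second timeout minus elapsed:", SECOND_TIMEOUT - gap)
    print("matches third timeout:", SECOND_TIMEOUT - gap == THIRD_TIMEOUT)
    print("clamped third timeout (ms):", clamp_timeout_ms(THIRD_TIMEOUT // 1000, 4000))
```

Because a negative timeout blocks until data arrives, the third poll() would never return on its own if the target stays silent, which matches the observed behaviour.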

- 2015-11-12T21:20:30+00:00

Tildeslash repo owner
For @chrissv: It seems that the cast from 'long long' to 'int' didn't work as expected on the Raspberry Pi. You can get the fixed version here: https://mmonit.com/tmp/monit-5.15_p1.tar.gz

Compilation:
```
tar -xzf monit-5.15_p1.tar.gz
cd monit-5.15_p1
./configure
make
```
Please can you test?
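As background on the 'long long' to 'int' remark, here is a minimal Python sketch of what a narrowing conversion to 32 bits typically does: only the low 32 bits survive, so a large 64-bit time value can turn into an unrelated, possibly negative, number. Which expression was actually cast in monit is not stated in this thread; the sample value (a microsecond timestamp built from the strace above) and the helpers `to_int32` and `fits_int32` are assumptions for illustration.

```python
import ctypes

INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def to_int32(value: int) -> int:
    """Mimic a C cast to a 32-bit signed int: keep only the low 32 bits."""
    return ctypes.c_int32(value).value


def fits_int32(value: int) -> bool:
    """True if the value survives the narrowing cast unchanged."""
    return INT32_MIN <= value <= INT32_MAX


if __name__ == "__main__":
    # Hypothetical 64-bit value: a timestamp expressed in microseconds.
    now_us = 1447360948 * 1_000_000 + 803242
    print("64-bit value:     ", now_us)
    print("fits in 32 bits:  ", fits_int32(now_us))
    print("after 32-bit cast:", to_int32(now_us))
    # A small value such as a 4000 ms timeout is unaffected by the cast.
    print("4000 after cast:  ", to_int32(4000))
```

The usual defence is to do the subtraction and range check in the wide type and narrow only a value already known to fit.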
- 2015-11-13T14:10:27+00:00
Steven Christensen
I downloaded monit 5.15_p1 and installed it on my Raspberry Pi, and it appears to be functioning correctly. It got past the failure to ping correctly. Thank you for your quick response - I would call this fixed.
- 2015-11-16T14:36:51+00:00
Tildeslash repo owner
- removed version
Removing version: 5.14 (automated comment)
- 2016-06-19T18:47:48+00:00

Assignee: –

Type: bug

Priority: trivial

Status: resolved

Component: Monit

Version: –

Votes: 0

Watchers: 4