if failed ping4 hangs up monit
Symptom: The "Data collected" counter freezes for every monitored item. The http daemon is still working, but no new checks are performed. New commands are still accepted (http or command line) but not executed. Running monit start and then monit summary shows "initializing" forever.
Cause: a check host router with address "192.168.0.x" entry with an if failed ping4 then alert rule. The moment 192.168.0.x went offline, the next status change hangs. Neither monit reload nor service monit restart helps; it hangs on router initializing...
Workaround: change address "192.168.0.x" to a hostname. It looks like monit does not like ICMP "destination host unreachable" responses from the router. Changing to a hostname gives an "unknown host" message, which works.
Environment: Android armhf CPU, Debian chroot + IPv6. Got this problem after moving from v5.12 to v5.13.
Comments (31)
-
repo owner -
repo owner - changed status to on hold
unable to reproduce the problem, waiting for data
-
- attached debug.log
- attached www.png
- attached console.png
-
Sorry for the delay. I installed on a new box (other Droid version, different HW) and moved to v5.14. The problem is still there and can be reproduced by unplugging the ethernet cable from the target box. The next time monit does the test - bang. I ran tcpdump -ni wlan0 'dst 192.168.0.40 and icmp', but this only shows that monit got stuck after the second of the three (default) pings to the target. Like:
    ...726247 IP 192.168.0.30 > 192.168.0.40: ICMP echo request, id 9231, seq 0, length 72
    ....730562 IP 192.168.0.30 > 192.168.0.40: ICMP echo request, id 9231, seq 1, length 72
I still have the test box handy. If you need a full packet trace, could you please tell me which tcpdump parameters you want? Thank you very much for your help.
-
repo owner - changed status to open
Reopen and to be investigated/fixed for version 5.15
-
repo owner Thanks for update.
We're still unable to replicate the problem.
One machine pings another box (link up on both sides):
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.008851s
    'raspberry' ping test succeeded [response time 0.009s]
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.002507s
    'raspberry' ping test succeeded [response time 0.003s]
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.023783s
    'raspberry' ping test succeeded [response time 0.024s]
Cable unplugged from the target box; monit was left to repeat the test three times (each time with 3 retries and a 5-second timeout per retry):
    Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
    Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
    Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
Connected the cable back to the target box, ping succeeded:
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=2.353867s
    'raspberry' ping test succeeded [response time 2.354s]
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.005074s
    'raspberry' ping test succeeded [response time 0.005s]
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.004035s
    'raspberry' ping test succeeded [response time 0.004s]
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.097907s
    'raspberry' ping test succeeded [response time 0.098s]
Turned off the link on the source box (the errors are immediate this time - monit doesn't wait for the ping timeout and returns an error immediately, as there is no link):
    Ping request for 192.168.1.12 1/3 failed -- No route to host
    Ping request for 192.168.1.12 2/3 failed -- No route to host
    Ping request for 192.168.1.12 3/3 failed -- No route to host
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
We'll continue testing; if we can't find the root cause, we'll prepare a debug version for you.
-
Thank you very much for your effort. Maybe it's really my environment. The target I ping does not matter. I've tried with an Intel IA64 Debian box, an embedded Linux running in a router, and even local services like nmbd (if failed host localhost port 137 type UDP retry 3 then restart). Everything except TCP gets stuck. And the source (monit host) is always a chrooted jessie (Debian) running on different hardware like HP Touchpad (Kernel v3.0.101 cyanogenmod) or HP Slate7 (Kernel v3.0.8 ICS). What's in common: jessie (armhf) and IPv6. I want to mention again that monit v5.12 worked without any problem. But rather than going back, I moved all IP addresses to hostnames and replaced all UDP tests with TCP where possible or commented them out. So I'm still very happy with this workaround. Monit is such a great product - I cannot live without it!
-
repo owner Thanks for update. If the problem goes away when UDP is not used, it is more likely related to the UDP communication.
Please can you gather the following data?:
1.) attach strace to monit:
    sudo strace -p <monit's PID> -s 256 -o /tmp/monit.strace
2.) trigger the problem
3.) once the problem has started, keep strace running for about 5 more minutes, then stop it (^C) and send the /tmp/monit.strace file to support@mmonit.com
-
Done. Thank you. Just to make sure: the problem goes away when ICMP AND UDP are disabled. The trace data contains the ICMP ping 192.168.0.40 sample.
-
repo owner - changed status to resolved
Fix Issue #226: Monit hung during ping test → <<cset c5f50df14009>>
-
repo owner Thanks for data, the problem is fixed.
Please can you test a patched monit-5.14, which is available here?: https://mmonit.com/tmp/monit-5.14-p1.tar.gz
To compile:
    tar -xzf monit-5.14-p1.tar.gz
    cd monit-5.14-p1
    ./configure
    make
    make install # note: optional, alternatively you can run monit from the current directory
-
Thanks for the test version. Got it compiled. Monit -V says version 5.14.p1. Unfortunately the problem is not fixed yet. I'm rebooting, just in case, and will do another strace. Thanks for reading.
...strace sent.
-
repo owner Thanks for update and testing. The problem is that poll() is called with a crazy timeout value:
    poll([{fd=5, events=POLLIN}], 1, 1452059047) # note: the last argument should be the timeout in [ms], less than or equal to 5000 ms by default
One possible root cause was the time jumping back (fixed in the provided 5.14-p1), but since that didn't fix the problem, it looks like either memory corruption or an overflow ... both are strange. The value "1452059047" looks like a Unix timestamp, so it seems that the timeout argument was somehow corrupted - will continue with the analysis tomorrow.
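The timestamp suspicion can be checked quickly. The following is a minimal Python sketch, not monit code: it interprets the traced value as seconds since the Unix epoch and compares it with the 5000 ms bound mentioned above. The helper name looks_like_unix_timestamp and its year window are assumptions made up for illustration.

```python
from datetime import datetime, timezone

SUSPECT_TIMEOUT = 1452059047    # third argument to poll() seen in the strace
DEFAULT_PING_TIMEOUT_MS = 5000  # default upper bound mentioned in the comment


def as_utc(seconds: int) -> datetime:
    """Interpret an integer as seconds since the Unix epoch (UTC)."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def looks_like_unix_timestamp(value: int, first_year: int = 2000, last_year: int = 2037) -> bool:
    """Hypothetical heuristic: does the value decode to a plausible recent date?"""
    try:
        return first_year <= as_utc(value).year <= last_year
    except (OverflowError, OSError, ValueError):
        return False


if __name__ == "__main__":
    print(as_utc(SUSPECT_TIMEOUT).isoformat())
    print(looks_like_unix_timestamp(SUSPECT_TIMEOUT))
    print(SUSPECT_TIMEOUT > DEFAULT_PING_TIMEOUT_MS)
```

Read as seconds since the epoch, the value falls on 6 January 2016 (UTC). Read as a poll() timeout in milliseconds, it is roughly 16.8 days, which in practice looks like a hang.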
-
repo owner - changed status to open
-
Memory overflow? The only box which I never changed was the WIFI router. I'll try another one and report back.
-
repo owner I meant data type overflow (not system memory) ... your machine is most probably fine, no need for more tests now. I'm trying to reproduce the problem on ARM now, as it seems it could be a problem specific to that environment. I think the corruption could be caused by the local variable declarations vs. the ARM compiler, which probably reused the space, but that is just speculation. I'll update as soon as I have more details.
-
repo owner I'm not able to reproduce it on ARM (Raspberry Pi). Please can you also test the pre-compiled binary to see whether it could be a compiler issue?
You can get it here: https://mmonit.com/monit/dist/binary/5.14/monit-5.14-linux-arm.tar.gz
-
Unfortunately no compiler issue. Stuck as before.
-
repo owner I have cleaned up the ping implementation, please can you test the following version?: https://mmonit.com/tmp/monit-5.14-p2.tar.gz
-
First impression. Something changed!!! Second impression. ALL GOOD. Working!!! Congratulations. Ping tests are working beautifully again. Thank you very much Sir!
-
repo owner - changed status to resolved
-
repo owner - changed status to open
-
-
repo owner I'm not able to reproduce the issue, using the following configuration:
    set daemon 5
    set httpd port 2812
        allow 127.0.0.1
    check host foobar with address 192.168.1.11
        if failed ping count 1 then alert
Output:
    Ping response for 192.168.1.11 1/1 timed out -- no response within 5 seconds
    'foobar' ping test failed
    'foobar' icmp ping failed, skipping any port connection tests
Please can you send the following data?:
- monit log
- monit configuration
- full output of monit in debug mode: monit -vI
- strace output ... keep strace running until monit freezes, then keep it running for about 5 more minutes and break with ^C (send the /tmp/monit.strace file to support@mmonit.com):
    sudo strace -p <monit's PID> -s 256 -o /tmp/monit.strace
-
repo owner Update for the ping with "count 1": the problem really is different than issue #226. I tested it more: when the ping count was 1, monit wasn't happy with the ping response because the sequence id sanity check failed (due to a bug), and it waited up to the timeout (5 s by default) for a better response, which never comes. Monit thus didn't hang, but was suspended until the timeout for each ping test. If you have 50 hosts as mentioned, the test progress and service initialization will be very slow.
I have prepared a patched 5.14, which includes fixes for issues #226 and #251. @monituser, can you please test it?: https://mmonit.com/tmp/monit-5.14-p4.tar.gz
-
repo owner - changed status to resolved
monit 5.15-beta with fixes for both problems was released (https://mmonit.com/monit/dist/monit-5.15-beta.tar.gz)
-
Hello, I am using Monit 5.15 (not beta) on a Raspberry Pi and I appear to be having this very problem. It exhibits the same symptoms - monitoring does not go past the ping test if the host is not present.
I have linked to the relevant debug files - can someone see if this is the same problem? I don't grok strace...
monitrc - http://pastebin.com/nTcNw4CU
monit debug output (stdout) - http://pastebin.com/i8EzVRSs
monit log output - http://pastebin.com/iQxCPQvX
monit strace output - http://pastebin.com/FAby8L9Z
-
I think I see suspicious poll timeouts at the end of the strace data:
    poll([{fd=6, events=POLLIN}], 1, 4000) = 1 ([{fd=6, revents=POLLIN}])
    gettimeofday({1447360948, 803242}, NULL) = 0
    recvfrom(6, "...data...", 1500, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.0.1")}, [16]) = 120
    poll([{fd=6, events=POLLIN}], 1, -2993190) = 1 ([{fd=6, revents=POLLIN}])
    gettimeofday({1447360972, 133159}, NULL) = 0
    recvfrom(6, "...data...", 1500, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.0.1")}, [16]) = 76
    poll([{fd=6, events=POLLIN}], 1, -26323107 <unfinished ...>
Of the three poll activities, only the first one looks like it has a reasonable timeout (4000, which matches the 4 seconds specified in the monitrc file). The other two look suspicious / wrong...
-
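The numbers in the strace excerpt above can be cross-checked with a little arithmetic. The gap between the two negative timeouts equals the wall-clock time between the two gettimeofday results, expressed in microseconds. On Linux, a negative poll() timeout means "wait indefinitely", which would match the observed freeze. The sketch below is plain Python, not monit code. It assumes the traced value was computed as (deadline - now) in microseconds, and remaining_timeout_ms is a hypothetical helper showing one way to clamp such a value.

```python
def remaining_timeout_ms(deadline_us: int, now_us: int, max_ms: int) -> int:
    """Return the time left until deadline_us in milliseconds, clamped to [0, max_ms]."""
    remaining_ms = (deadline_us - now_us) // 1000  # microseconds -> milliseconds
    return max(0, min(remaining_ms, max_ms))


# Values copied from the strace excerpt (seconds and microseconds from gettimeofday).
T1_US = 1447360948 * 1_000_000 + 803242
T2_US = 1447360972 * 1_000_000 + 133159
SECOND_POLL_TIMEOUT = -2993190
THIRD_POLL_TIMEOUT = -26323107

# Assumption: the traced timeout was computed as (deadline - now) in microseconds.
DEADLINE_US = T1_US + SECOND_POLL_TIMEOUT

if __name__ == "__main__":
    # The drift between the two bad timeouts equals the wall-clock time that passed.
    print(T2_US - T1_US, SECOND_POLL_TIMEOUT - THIRD_POLL_TIMEOUT)
    # With clamping, an expired deadline yields 0, so poll() would return at once
    # instead of blocking on a negative value.
    print(remaining_timeout_ms(DEADLINE_US, T2_US, max_ms=4000))
```

Clamping to zero once the deadline has passed is only one possible design choice. The point is that the value handed to poll() should never go negative unless an infinite wait is intended.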
repo owner For @chrissv: It seems that the casting from 'long long' to 'int' didn't work as expected on Raspberry Pi, you can get the fixed version here: https://mmonit.com/tmp/monit-5.15_p1.tar.gz
Compilation:
    tar -xzf monit-5.15_p1.tar.gz
    cd monit-5.15_p1
    ./configure
    make
Please can you test?
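For readers unfamiliar with narrowing casts, here is a small Python sketch of what can happen when a 64-bit value is stored in a 32-bit int. It only emulates the usual wrap-around behaviour via ctypes. It does not reproduce monit's code, and both function names are hypothetical. In C, converting an out-of-range value to a signed int is implementation-defined, so the wrap shown here is the common outcome, not a guarantee.

```python
import ctypes


def cast_to_c_int(value: int) -> int:
    """Emulate a wrapping C cast from long long to a 32-bit int (low 32 bits kept)."""
    return ctypes.c_int32(value).value


def checked_timeout_ms(value: int, max_ms: int = 5000) -> int:
    """Hypothetical guard: bring a 64-bit millisecond value into [0, max_ms] before narrowing."""
    return cast_to_c_int(max(0, min(value, max_ms)))


if __name__ == "__main__":
    wide = 2**32 + 4000          # a 64-bit value that does not fit in 32 bits
    print(cast_to_c_int(wide))   # only the low 32 bits survive
    print(cast_to_c_int(2**31))  # the sign bit gets set, so the result is negative
    print(checked_timeout_ms(wide))
```

Range-checking the wide value first, and narrowing only afterwards, keeps the result inside the interval the caller expects.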
-
I downloaded monit 5.15_p1 and installed it on my Raspberry Pi, and it appears to be functioning correctly. It got past the failure to ping correctly. Thank you for your quick response - I would call this fixed.
-
repo owner - removed version
Removing version: 5.14 (automated comment)
We're unable to replicate the issue, using the following configuration:
Debug output:
... and the monitoring continues normally after the error is reported
Please can you run monit in verbose mode (using "-vI" options) and provide output? Please also collect a network trace of ICMP messages between monit host and the target machine.