Monit does not close pipes between parent and children
Dear Monit team,
After running Monit for half an hour with some command checks, I noticed that Monit "forgets" to close the pipes it creates for its child processes. A simple lsof proved my point:
$ lsof | grep monit | grep pipe
monit 14615 root 5r FIFO 0,8 0t0 4207282 pipe
monit 14615 root 7r FIFO 0,8 0t0 4208207 pipe
monit 14615 root 8w FIFO 0,8 0t0 4209883 pipe
monit 14615 root 9w FIFO 0,8 0t0 4207283 pipe
monit 14615 root 10r FIFO 0,8 0t0 4211080 pipe
monit 14615 root 11w FIFO 0,8 0t0 4207284 pipe
monit 14615 root 12w FIFO 0,8 0t0 4212244 pipe
monit 14615 root 13r FIFO 0,8 0t0 4213451 pipe
monit 14615 root 14r FIFO 0,8 0t0 4214344 pipe
monit 14615 root 15w FIFO 0,8 0t0 4215194 pipe
monit 14615 root 16r FIFO 0,8 0t0 4216460 pipe
monit 14615 root 17r FIFO 0,8 0t0 4218540 pipe
monit 14615 root 18w FIFO 0,8 0t0 4208208 pipe
monit 14615 root 20w FIFO 0,8 0t0 4208209 pipe
monit 14615 root 21w FIFO 0,8 0t0 4209884 pipe
monit 14615 root 22r FIFO 0,8 0t0 4219104 pipe
monit 14615 root 23w FIFO 0,8 0t0 4209885 pipe
monit 14615 root 24w FIFO 0,8 0t0 4211081 pipe
monit 14615 root 25r FIFO 0,8 0t0 4219105 pipe
monit 14615 root 26w FIFO 0,8 0t0 4211082 pipe
monit 14615 root 27w FIFO 0,8 0t0 4212245 pipe
monit 14615 root 28r FIFO 0,8 0t0 4218690 pipe
monit 14615 root 29w FIFO 0,8 0t0 4212246 pipe
monit 14615 root 30w FIFO 0,8 0t0 4213452 pipe
monit 14615 root 31r FIFO 0,8 0t0 4218691 pipe
monit 14615 root 32w FIFO 0,8 0t0 4213453 pipe
monit 14615 root 33w FIFO 0,8 0t0 4214345 pipe
monit 14615 root 35w FIFO 0,8 0t0 4214346 pipe
monit 14615 root 36w FIFO 0,8 0t0 4215195 pipe
monit 14615 root 37r FIFO 0,8 0t0 4219106 pipe
monit 14615 root 38w FIFO 0,8 0t0 4215196 pipe
monit 14615 root 39w FIFO 0,8 0t0 4216461 pipe
monit 14615 root 40r FIFO 0,8 0t0 4218692 pipe
monit 14615 root 41w FIFO 0,8 0t0 4216462 pipe
monit 14615 root 42w FIFO 0,8 0t0 4218541 pipe
monit 14615 root 44w FIFO 0,8 0t0 4218542 pipe
Actually, this was not the first time. We had been running Monit with the same configuration for a few days when we realized that it had exhausted the limit on the number of open file descriptors, so the kernel stopped it from opening new ones...
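For anyone who wants to track the pipe count over time without lsof, here is a minimal sketch. It assumes Linux with /proc mounted and enough privileges to read the target process's fd directory; the function name count_pipe_fds is made up for illustration and is not part of Monit.

```python
import os

def count_pipe_fds(pid):
    """Count descriptors of `pid` that point at pipes (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink; pipes show up as "pipe:[inode]".
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("pipe:"):
            count += 1
    return count
```

Calling count_pipe_fds with Monit's PID (14615 in the listing above) once per poll cycle should show whether the number keeps growing.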
The number of pipes was always a multiple of 3, so they appear to be the pipes connected to each child's stdin, stdout, and stderr. Would you mind taking a look and making sure that the pipes are closed on the parent side? The funny thing is that this is not even a consequence of the zombie process issue. (When I ran lsof, Monit had only 2 zombie children despite the 39 file descriptors open for pipes.)
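To illustrate what "closing the pipes on the parent side" means, here is a minimal sketch in Python. It is not Monit's code (Monit is written in C), and the names open_fd_count and run_check are hypothetical. It assumes a POSIX system.

```python
import os
import subprocess

def open_fd_count(limit=1024):
    """Count this process's open descriptors by probing each fd number."""
    count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # not an open descriptor
        count += 1
    return count

def run_check(argv):
    """Run one check with piped stdin/stdout/stderr and release all three pipes."""
    proc = subprocess.Popen(
        argv,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    try:
        # communicate() closes stdin, drains stdout/stderr, and reaps the child.
        out, err = proc.communicate()
    finally:
        # Make sure no parent-side pipe end survives, even after an exception.
        for stream in (proc.stdin, proc.stdout, proc.stderr):
            if stream is not None and not stream.closed:
                stream.close()
    return proc.returncode, out, err
```

If every check is run this way, open_fd_count() should return the same value before and after any number of checks. A count that grows by 3 per spawn would match the pattern in the lsof output above.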
The issue is reproducible: it occurs every time we restart Monit and wait a bit for it to accumulate this immense number of open file descriptors. Just to give you some context, we are running two types of command checks:
- Shell scripts that gather some custom (usually boolean) metric about another process. We use check program name with path "/path/to/our/metric_script.sh TARGET_PROCESS_NAME ..." in the monitrc.
- Shell scripts that wrap Java applications. The wrappers invoke the java binary like "exec -a SOME_NAME java -jar ...". Let's say we have a wrapper OUR_JAVA_APP.jvm, then we would write:
  check process OUR_JAVA_APP.jvm matching "OUR_JAVA_APP.jvm"
    start program = "/bin/bash -c '/opt/OUR_JAVA_APP.jvm &'"
    ...
Also, I suspect it only affects "check command" scenarios, but I'm not entirely sure. I tried with Monit version 5.12.2 and the latest 5.15 beta.
Could you please fix this? We are trying to use Monit in production on several machines and this bug basically stops us from using it in the long run.
Best regards, Gyuri
Comments (10)
-
repo owner -
I left Monit running overnight and ended up with 1221 open file descriptors. However, the poll cycle in my configuration is extremely short (3 seconds), which suggests that Monit did not leave 3 file descriptors open every single time it spawned a child process. Maybe it only occurs under some particular circumstances; I will try to figure that out as soon as I get some time. I can give you some more input privately if you need it to make progress on this issue. Thank you very much, guys.
-
repo owner If you have more information that can help us reproduce and understand the issue that would be appreciated. Please post here, if possible, so we can keep the information in one place.
-
repo owner Hello Gyuri, please can you send your monit log to support@mmonit.com?
Best regards, The Monit team
-
repo owner One more note: could you please provide more details about the system where Monit is running? Is it a real machine, a virtual one, or a container (Docker, etc.)?
(you can send the information to support@mmonit.com if you don't want to disclose it here)
-
Probably no help, but we (AstLinux) have noticed similar failures on 512 MB RAM, 500 MHz AMD Geode single-core-CPU boards with Linux 3.2. It can often take a few days or a week before the kernel starts killing processes and locks up.
We have spent a lot of time testing this and are quite confident it was Monit related, but we gave up trying to find the exact cause.
Interestingly, 2 GB RAM, 1.8 GHz Intel Atom multi-core-CPU boards do not seem to suffer from the same effect.
Might not be related in any way, but thought I would jot a note.
-
Addendum: we performed a quick test with the "512 MB RAM, 500 MHz AMD Geode" board. Monit started with 6 pipes and after one day still has 6 pipes, so it seems this is unrelated to the topic at hand.
-
repo owner - changed version to 5.15
-
repo owner - changed status to resolved
Fix Issue #261: Monit leaked file descriptors if program execution failed (fixed in libmonit: https://bitbucket.org/tildeslash/libmonit/commits/271d3e4bc705/) → <<cset a2aa9fb13077>>
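The commit message describes a leak on the path where program execution fails. The following is a hedged Python sketch of that class of bug and its remedy. It is not the libmonit change itself, the function name spawn_with_pipes is hypothetical, and a POSIX system is assumed.

```python
import os
import subprocess

def spawn_with_pipes(argv):
    """Start argv with hand-made pipes; close every pipe end if the start fails."""
    stdin_r, stdin_w = os.pipe()
    stdout_r, stdout_w = os.pipe()
    stderr_r, stderr_w = os.pipe()
    try:
        proc = subprocess.Popen(argv, stdin=stdin_r, stdout=stdout_w, stderr=stderr_w)
    except OSError:
        # Execution failed: no child will ever use these pipes, so the
        # parent has to close all six ends itself or they leak.
        for fd in (stdin_r, stdin_w, stdout_r, stdout_w, stderr_r, stderr_w):
            os.close(fd)
        raise
    # Success: the child holds its own copies, so the parent drops the child-side ends.
    for fd in (stdin_r, stdout_w, stderr_w):
        os.close(fd)
    # The caller owns the remaining three ends and must close them after use.
    return proc, stdin_w, stdout_r, stderr_r
```

Without the except branch, each failed start would leave six descriptors open in the parent. Whether Monit leaked all pipe ends or only some of them in this situation is not stated in the thread.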
-
repo owner - removed version
Removing version: 5.15 (automated comment)