Monit does not close pipes between parent and children

Issue #261 resolved
Anonymous created an issue

Dear Monit team,

After running Monit for half an hour with some command checks, I realized that Monit "forgets" to close the pipes it creates for its child processes. A simple lsof proved my point:

$ lsof | grep monit | grep pipe
monit     14615    root    5r     FIFO                0,8      0t0    4207282 pipe
monit     14615    root    7r     FIFO                0,8      0t0    4208207 pipe
monit     14615    root    8w     FIFO                0,8      0t0    4209883 pipe
monit     14615    root    9w     FIFO                0,8      0t0    4207283 pipe
monit     14615    root   10r     FIFO                0,8      0t0    4211080 pipe
monit     14615    root   11w     FIFO                0,8      0t0    4207284 pipe
monit     14615    root   12w     FIFO                0,8      0t0    4212244 pipe
monit     14615    root   13r     FIFO                0,8      0t0    4213451 pipe
monit     14615    root   14r     FIFO                0,8      0t0    4214344 pipe
monit     14615    root   15w     FIFO                0,8      0t0    4215194 pipe
monit     14615    root   16r     FIFO                0,8      0t0    4216460 pipe
monit     14615    root   17r     FIFO                0,8      0t0    4218540 pipe
monit     14615    root   18w     FIFO                0,8      0t0    4208208 pipe
monit     14615    root   20w     FIFO                0,8      0t0    4208209 pipe
monit     14615    root   21w     FIFO                0,8      0t0    4209884 pipe
monit     14615    root   22r     FIFO                0,8      0t0    4219104 pipe
monit     14615    root   23w     FIFO                0,8      0t0    4209885 pipe
monit     14615    root   24w     FIFO                0,8      0t0    4211081 pipe
monit     14615    root   25r     FIFO                0,8      0t0    4219105 pipe
monit     14615    root   26w     FIFO                0,8      0t0    4211082 pipe
monit     14615    root   27w     FIFO                0,8      0t0    4212245 pipe
monit     14615    root   28r     FIFO                0,8      0t0    4218690 pipe
monit     14615    root   29w     FIFO                0,8      0t0    4212246 pipe
monit     14615    root   30w     FIFO                0,8      0t0    4213452 pipe
monit     14615    root   31r     FIFO                0,8      0t0    4218691 pipe
monit     14615    root   32w     FIFO                0,8      0t0    4213453 pipe
monit     14615    root   33w     FIFO                0,8      0t0    4214345 pipe
monit     14615    root   35w     FIFO                0,8      0t0    4214346 pipe
monit     14615    root   36w     FIFO                0,8      0t0    4215195 pipe
monit     14615    root   37r     FIFO                0,8      0t0    4219106 pipe
monit     14615    root   38w     FIFO                0,8      0t0    4215196 pipe
monit     14615    root   39w     FIFO                0,8      0t0    4216461 pipe
monit     14615    root   40r     FIFO                0,8      0t0    4218692 pipe
monit     14615    root   41w     FIFO                0,8      0t0    4216462 pipe
monit     14615    root   42w     FIFO                0,8      0t0    4218541 pipe
monit     14615    root   44w     FIFO                0,8      0t0    4218542 pipe
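
Where lsof is not available, roughly the same view can be taken from /proc on Linux. The sketch below is only an illustration: the helper names `pipe_fds` and `unpaired_inodes` are made up for this example, and it assumes a Linux /proc filesystem and permission to read the target's fd directory. It lists the pipe descriptors a process holds and reports pipe inodes that appear only once, which is the situation in the listing above, where every inode shows up on a single descriptor.

```python
import os
from collections import Counter


def pipe_fds(pid="self"):
    """Return {fd: inode} for every pipe descriptor held by `pid` (Linux /proc only)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.startswith("pipe:["):
            # Link targets look like "pipe:[4207282]"
            result[int(name)] = int(target[len("pipe:["):-1])
    return result


def unpaired_inodes(fds):
    """Inodes seen on one descriptor only: the other end is not open in this process."""
    counts = Counter(fds.values())
    return sorted(inode for inode, n in counts.items() if n == 1)


if __name__ == "__main__":
    fds = pipe_fds()  # pass a PID, e.g. pipe_fds(14615), to inspect another process
    print(len(fds), "pipe descriptors,", len(unpaired_inodes(fds)), "inodes held on one end only")
```

Running it periodically against the Monit PID and comparing the counts would show whether the number of pipes keeps growing.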

Actually, this was not the first time. We had been running Monit with the same configuration for a few days and found that it had exhausted its limit on open file descriptors, so the kernel stopped it from opening new ones...
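
How close a process is to that limit can be checked without waiting for it to fail. This is a minimal sketch, assuming Linux and a readable /proc entry for the process; `open_fd_count` and `open_files_limit` are hypothetical helper names, not part of Monit or any library.

```python
import os


def open_fd_count(pid="self"):
    """Number of descriptors currently open in `pid` (Linux /proc only)."""
    # For pid="self" the count includes the descriptor used to scan the directory.
    return len(os.listdir(f"/proc/{pid}/fd"))


def open_files_limit(pid="self"):
    """Soft 'Max open files' limit of `pid`, or None if unlimited or not found."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                soft = line.split()[3]  # columns: Max open files <soft> <hard> files
                return None if soft == "unlimited" else int(soft)
    return None


if __name__ == "__main__":
    print(open_fd_count(), "descriptors open, soft limit", open_files_limit())
```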

The number of pipes was always a multiple of 3, so it seems that these correspond to the child's stdin, stdout, and stderr file descriptors. Would you mind taking a look and making sure that the pipes are closed on the parent side? The funny thing is that it is not even a consequence of the zombie process issue. (When I ran lsof, Monit had only 2 zombie children despite the 39 file descriptors open for pipes.)
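
To make "closing the pipes on the parent side" concrete, here is a small Python sketch of the general three-pipe pattern. It is not Monit's code (Monit is written in C), and `run_with_pipes` is a hypothetical name; it only shows which descriptors the parent has to release once the child has been started.

```python
import os
import subprocess


def _read_all(fd):
    """Read from fd until EOF."""
    chunks = []
    while True:
        chunk = os.read(fd, 4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def run_with_pipes(argv, input_bytes=b""):
    """Run argv with explicit stdin/stdout/stderr pipes and leak no descriptors."""
    in_r, in_w = os.pipe()    # child's stdin
    out_r, out_w = os.pipe()  # child's stdout
    err_r, err_w = os.pipe()  # child's stderr
    child_ends = [in_r, out_w, err_w]
    parent_ends = [in_w, out_r, err_r]
    try:
        proc = subprocess.Popen(argv, stdin=in_r, stdout=out_w, stderr=err_w)
        # The child holds its own copies now, so the parent drops the child-side
        # ends. Without this, the reads below would never see EOF.
        while child_ends:
            os.close(child_ends.pop())
        os.write(in_w, input_bytes)  # fine for small inputs only
        os.close(in_w)               # signals EOF on the child's stdin
        parent_ends.remove(in_w)
        # Sequential reads are enough for small outputs; large stderr output
        # would need select/poll to avoid blocking.
        out = _read_all(out_r)
        err = _read_all(err_r)
        return proc.wait(), out, err
    finally:
        # Whatever is still open on either side is released here, also on errors.
        for fd in child_ends + parent_ends:
            os.close(fd)


if __name__ == "__main__":
    print(run_with_pipes(["/bin/sh", "-c", "cat; echo warn >&2"], b"hello"))
```

If either close step is skipped, each spawn leaves descriptors behind in the parent, which would match the multiples of 3 described above.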

The issue is reproducible: it occurs every time we restart Monit and wait a bit for it to accumulate this immense number of open file descriptors. Just to give you some context, we are running two types of command checks.

  1. Shell scripts that gather a custom (usually boolean) metric about another process. We use check program name with path "/path/to/our/metric_script.sh TARGET_PROCESS_NAME ..." in the monitrc.

  2. Shell scripts that wrap Java applications. The wrappers invoke the java binary like "exec -a SOME_NAME java -jar ...". Let's say we have a wrapper OUR_JAVA_APP.jvm, then we would write:

    check process OUR_JAVA_APP.jvm matching "OUR_JAVA_APP.jvm"
    start program = "/bin/bash -c '/opt/OUR_JAVA_APP.jvm &'"
    ...

Also, I suspect it only affects "check command" scenarios, but I'm not entirely sure. I tried with Monit version 5.12.2 and the latest 5.15 beta.

Could you please fix this? We are trying to use Monit in production on several machines and this bug basically stops us from using it in the long run.

Best regards,
Gyuri

Comments (10)

  1. Gyorgy Demarcsek

    I left Monit running overnight and ended up with 1221 open file descriptors. However, the poll cycle in my configuration is extremely short (3 seconds), which suggests that Monit did not leave 3 file descriptors open every single time it spawned a child process. Maybe it only occurs under particular circumstances; I will try to figure that out as soon as I have some time. I can give you more input privately if you need it to make progress on this issue. Thank you very much, guys.

  2. Tildeslash repo owner

    If you have more information that can help us reproduce and understand the issue that would be appreciated. Please post here, if possible, so we can keep the information in one place.

  3. Tildeslash repo owner

    One more note: could you please provide more details about the system where Monit is running? Is it a real machine, a virtual one, or a container (Docker, etc.)?

    (you can send the information to support@mmonit.com if you don't want to disclose it here)

  4. Lonnie Abelbeck

    Probably no help, but we (AstLinux) have noticed similar failures on 512 MB RAM, 500 MHz AMD Geode single-core boards with Linux 3.2. It can often take a few days or a week before the kernel starts killing processes and locks up.

    We have spent a lot of time testing this and are quite confident it was Monit related, but we gave up trying to find the exact cause.

    Interestingly, 2 GB RAM, 1.8 GHz Intel Atom multi-core boards do not seem to suffer from the same effect.

    Might not be related in any way, but thought I would jot a note.

  5. Lonnie Abelbeck

    Addendum: we performed a quick test with the "512 MB RAM, 500 MHz AMD Geode" board. Monit started with 6 pipes and after one day still had 6 pipes, so it seems this is unrelated to the topic at hand.
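
The figures from the first comment (1221 open descriptors, a 3-second poll cycle) can be put side by side with a quick calculation. The length of "the night" is not stated, so the 8 hours below is purely an assumption, and the calculation treats all 1221 descriptors as leaked pipe ends, which gives an upper bound.

```python
POLL_SECONDS = 3    # poll cycle reported in the first comment
OPEN_FDS = 1221     # open descriptors reported in the first comment
FDS_PER_LEAK = 3    # stdin, stdout, stderr, as suggested in the report
ASSUMED_HOURS = 8   # ASSUMPTION: the actual duration is not given

cycles = ASSUMED_HOURS * 3600 // POLL_SECONDS
leaky_spawns = OPEN_FDS // FDS_PER_LEAK
fds_if_every_cycle_leaked = cycles * FDS_PER_LEAK

print("poll cycles:", cycles)
print("spawns that would account for the open descriptors:", leaky_spawns)
print("descriptors expected if every cycle leaked once:", fds_if_every_cycle_leaked)
print(f"share of cycles that leaked, at most: {leaky_spawns / cycles:.1%}")
```

Under that assumption only a small fraction of cycles would have leaked, which is consistent with the guess that the leak depends on particular circumstances rather than happening on every spawn.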
