Monit (5.27.1) fails to test the connection to the clamd socket (possibly because the process uptime is missing) inside a Debian 10 LXD container

Issue #963 resolved
Andrzej Perczyk created an issue

I am running Monit (5.27.1) on Debian 10 in an unprivileged LXD container.

After several days of operation, the clamav-daemon socket communication test reported an error:

failed protocol test [CLAMAV] at /var/run/clamav/clamd.ctl -- CLAMAV: PONG read error -- Resource temporarily unavailable

In this situation Monit restarted clamav-daemon, but still did not register the restored communication, even though the socket responded correctly.
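For context, Monit's CLAMAV protocol test essentially sends a PING command over the unix socket and expects a PONG reply. The standalone reproduction below is a sketch of that check, not Monit's code; the function name `clamd_ping` and its parameters are my own naming.

```python
import socket

def clamd_ping(socket_path, timeout=5.0):
    """Send PING to a clamd-style unix socket and expect PONG back
    (a sketch of what Monit's CLAMAV protocol test does)."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect(socket_path)
        s.sendall(b"PING\n")
        reply = s.recv(32)
    return reply.strip() == b"PONG"
```

Running this against /var/run/clamav/clamd.ctl confirms whether the daemon itself answers, independently of Monit's internal gating.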

Monit status shows (today’s examples):

Process 'clamav-daemon'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          2441018
  parent pid                   1
  uid                          106
  effective uid                106
  gid                          109
  uptime                       -
  threads                      2
  children                     0
  cpu                          0.0%
  cpu total                    0.0%
  memory                       19.0% [1.1 GB]
  memory total                 19.0% [1.1 GB]
  security attribute           unconfined
  filedescriptors              9 [0.9% of 1024 limit]
  total filedescriptors        9
  read bytes                   0.6 B/s [1.2 GB total]
  disk read bytes              0 B/s [127.9 MB total]
  disk read operations         0.0 reads/s [79728 reads total]
  write bytes                  2.7 B/s [1.1 MB total]
  disk write bytes             34.1 B/s [3.9 MB total]
  disk write operations        0.0 writes/s [2731 writes total]
  unix socket response time    -
  data collected               Wed, 24 Mar 2021 17:01:18

Note the missing values for the process uptime and the unix socket response time.

When running Monit with the “-vvI” parameters, there are messages about the connection test being paused while waiting for the process to start:

'clamav-daemon' process is running with pid 2441018
'clamav-daemon' zombie check succeeded
'clamav-daemon' connection test paused for 30 s while the process is starting

This message is repeated every 120 seconds (the configured daemon poll interval).

It looks as if Monit could not determine how much time had passed since the service was started, and kept postponing the socket connection test.

I can get the system uptime from /proc/uptime:

# cat /proc/uptime 
12279771.42 12157565.71

The process stats are also shown correctly:

# cat /proc/2441018/stat
2441018 (clamd) S 1 2441018 2441018 0 -1 4194368 904957 0 10 0 2672 155 0 0 20 0 3 0 1254872419 1453367296 277755 18446744073709551615 94010525196288 94010525304089 140724476752176 0 0 0 2146542132 0 22531 0 0 0 17 1 0 0 0 0 0 94010525374832 94010525401136 94010546954240 140724476753893 140724476753963 140724476753963 140724476755944 0
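A monitor can derive a process's uptime from exactly these two files: /proc/uptime gives seconds since boot, and field 22 of /proc/[pid]/stat (starttime) gives the process start in clock ticks since boot. The helper below sketches that calculation (it is my own illustration, not Monit's source):

```python
import os

def process_uptime(pid):
    """Approximate a process's uptime in seconds from /proc
    (a sketch of the calculation, not Monit's actual code)."""
    with open("/proc/uptime") as f:
        system_uptime = float(f.read().split()[0])
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name (field 2) may contain spaces, so split only
    # after the closing ')'; field 22 (starttime) is then index 19.
    fields = stat.rsplit(")", 1)[1].split()
    starttime_ticks = int(fields[19])
    hz = os.sysconf("SC_CLK_TCK")  # ticks per second, usually 100
    return system_uptime - starttime_ticks / hz
```

If the container reports a boot time inconsistent with /proc/uptime, the same subtraction done against wall-clock start times can come out negative, which would match the missing uptime shown above.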

Unfortunately, this situation recurs from time to time, and each time the message about the connection to the socket being restored appears 5 days after the problem was detected.

Other processes configured in Monit show proper uptime values.

Check configuration:

check process clamav-daemon with pidfile /var/run/clamav/
  start program = "/etc/init.d/clamav-daemon start"
  stop program = "/etc/init.d/clamav-daemon stop"
  if failed unixsocket /var/run/clamav/clamd.ctl protocol clamav then restart
  if 3 restarts for 3 cycles then alert

Comments (4)

  1. Tildeslash repo owner

    Hello Andrzej,

please can you attach a Monit log? (You can enable it via the “set logfile” statement if it is not present already.)

Checking the source code, it is possible that this is an LXD bug: it seems LXD may report a boot time that lies in the future compared to the current time (hence the uptime stays in the initializing state).

  2. Andrzej Perczyk reporter


    I have attached monit.log.

I have noticed that there is a difference between the uptime shown by monit status and by the uptime command:

    System 'smtp-server'
      status                       OK
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      load average                 [0.96] [1.62] [1.95]
      cpu                          0.1%usr 0.0%sys 0.0%nice 0.0%iowait 0.0%hardirq 0.0%softirq 0.0%steal 0.0%guest 0.0%guestnice 
      memory usage                 1.7 GB [30.9%]
      swap usage                   0 B [0.0%]
      uptime                       148d 14h 14m
      boot time                    Wed, 28 Oct 2020 17:33:16
      filedescriptors              81088 [0.0% of 9223372036854775807 limit]
      data collected               Fri, 26 Mar 2021 07:47:52

    root@smtp-server:~# uptime
    07:47:56 up 143 days, 17:47,  1 user,  load average: 0,89, 1,60, 1,94

There is a difference of roughly 5 days (148d 14h vs. 143d 18h), which could explain why, after a socket connection test failure, the connection test success message almost always appears about 5 days later.
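The skew can be made visible by deriving the boot time two ways. The helper below is a sketch (the name `boot_times` is my own): on a healthy host both values agree to within a second or so, while in an affected container the btime line in /proc/stat (which may be inherited from the host) disagrees with the lxcfs-virtualized /proc/uptime.

```python
import time

def boot_times():
    """Derive the boot time two ways and return both (sketch).
    A large difference indicates a container whose boot time is
    not consistent with its virtualized uptime."""
    with open("/proc/uptime") as f:
        uptime = float(f.read().split()[0])
    boot_from_uptime = time.time() - uptime
    with open("/proc/stat") as f:
        btime = next(int(line.split()[1]) for line in f
                     if line.startswith("btime "))
    return boot_from_uptime, btime
```

On this system the two values should differ by roughly the 5 days seen above.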

Additional information:

    • We do not use snap for LXD; we build the LXD package from source.
      Currently we use LXD 4.0.4-3.
    • We also build the lxcfs package from the original sources (without any changes).
      The LXCFS version we currently use is 4.0.6-2.

  3. Andrzej Perczyk reporter

  4. Tildeslash repo owner

Thanks for the data. It is indeed related to the boot time; the problem was already fixed in Monit 5.27.2. Snippet from the changelog:

Fixed: LXC container: Monit may ignore the "start delay" option of the "set daemon" statement when the container was rebooted, while the host was not rebooted (the LXC container's boot time is not virtualized - it is inherited from the host).
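The behaviour reported above is the gating step going wrong: Monit pauses connection tests while a freshly (re)started process is still within its start delay, and a skewed boot time makes the computed process uptime unreliable. The snippet below is a hypothetical illustration of that decision, not Monit's source; `connection_test_due` and its parameters are my own naming.

```python
def connection_test_due(process_uptime, start_delay=30):
    """Decide whether the connection test should run (sketch).
    An unknown or negative uptime (as seen with a skewed container
    boot time) must not keep the test paused indefinitely."""
    if process_uptime is None or process_uptime < 0:
        return True  # uptime unreliable: run the test rather than wait forever
    return process_uptime >= start_delay
```

With such a guard, a process whose uptime cannot be determined is tested immediately instead of being postponed for days.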
