postgresql monitoring gives spurious errors

Issue #574 duplicate
Former user created an issue

Monitoring a PostgreSQL database with Monit produces about one error per day. The database is not restarted, and there are no entries in the database event log or anywhere else. Apart from Monit, no other program has any problem connecting to the database (one job starts every five minutes; others connect approximately once every two days).

A dummy database for root was created, but the error messages occurred before that as well.

Configuration entry in the monitrc file (mostly default settings, e.g. checks every 30 seconds):

check process postgresql with pidfile /var/postgres/data/postmaster.pid
    if does not exist then alert
    if failed unixsocket /tmp/.s.PGSQL.5432 protocol pgsql then alert

Error message is

Connection failed Service postgresql

Date:        Fri, 10 Mar 2017 22:47:24
Action:      alert
Host:        ********
Description: failed protocol test [PGSQL] at /tmp/.s.PGSQL.5432 -- PGSQL: error receiving data -- Resource temporarily unavailable

directly followed (next run) by

Connection succeeded Service postgresql

Date:        Fri, 10 Mar 2017 22:47:55
Action:      alert
Host:        ********
Description: connection succeeded to /tmp/.s.PGSQL.5432

System: FreeBSD 10.3-RELEASE-p11 (generic)
Database: postgresql95-server-9.5.5_1
Monit: monit-5.20.0 (list only offered up to 5.19)

System load is around 1.0 (on a machine with 8 CPUs, 16 with HT). The DB backup runs between 1:00 and 2:00 in the morning (monitoring usually works fine during that time and gives no errors).

I could of course increase the number of checks to avoid this problem, but what I actually wonder is why the connection test fails at all. Note: there are no messages about failed connections or anything else in the database logs.

Comments (6)

  1. Tildeslash repo owner

    The "Resource temporarily unavailable" error is reported when the read times out. The message is confusing; we have fixed this in the upcoming Monit release (5.22.0).

    If the error is spurious and you want to ignore it, you can modify the test to wait for more errors before sending an alert, for example to alert only if the error persists for 3 cycles:

    if failed unixsocket /tmp/.s.PGSQL.5432 protocol pgsql for 3 cycles then alert
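
    As a rough sketch, this is how that line might sit in the full check from the original report. The commented-out alternative uses the "N times within M cycles" form of the same statement; whether it suits this setup better is an assumption, not something established in this thread:

    ```
    check process postgresql with pidfile /var/postgres/data/postmaster.pid
        if does not exist then alert
        # alert only after three consecutive failed cycles
        if failed unixsocket /tmp/.s.PGSQL.5432 protocol pgsql for 3 cycles then alert
        # alternative (assumption): tolerate isolated failures within a window
        # if failed unixsocket /tmp/.s.PGSQL.5432 protocol pgsql for 2 times within 3 cycles then alert
    ```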
    
  2. Holger Kipp

    I'm not convinced that a read timeout for a service is not an issue. Is it possible to

    - get more information (e.g. the timeout value, or debugging information about the connection attempt, to see where the problem lies)
    - set the Monit timeout value to some defaults per test (there are no timeout issues or connection problems with other programs, so...)

    IMHO the problem is not that the message text is incorrect; the problem is that Monit reports a problem when there is none. So something seems to be flawed elsewhere, and I'd like to understand what is really going on here.

    Help appreciated.

  3. Tildeslash repo owner

    The default timeout is 5 seconds. You can override it either on a per-connection-test basis:

    if failed unixsocket /tmp/.s.PGSQL.5432 protocol pgsql with timeout 10 seconds for 3 cycles then alert
    

    or using "set limits" globally (https://mmonit.com/monit/documentation/monit.html#LIMITS).
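
    A minimal sketch of the global variant, assuming the `networkTimeout` option described in the LIMITS section linked above is available in your Monit version (the 10-second value simply mirrors the per-test example and is not a recommendation):

    ```
    # global default for network I/O timeouts; per-test "with timeout" still overrides it
    set limits {
        networkTimeout: 10 seconds
    }
    ```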

    You can run Monit in debug mode using the "-v" option.

    It could be useful to get a network trace and analyse the timeframe in which the alert was triggered (it will be necessary to switch from the unix socket to TCP).
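
    For the trace, the check could be switched to TCP roughly as follows. This is a sketch that assumes PostgreSQL is also listening on 127.0.0.1 port 5432, which is not stated in the report:

    ```
    check process postgresql with pidfile /var/postgres/data/postmaster.pid
        if does not exist then alert
        # TCP instead of the unix socket, so the traffic can be captured
        # (host and port are assumptions about the local setup)
        if failed host 127.0.0.1 port 5432 protocol pgsql then alert
    ```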

  4. Roland Pihlakas

    Monit version 5.26.0 with Postgres 13.1-1.pgdg20.04+1 (Ubuntu) still seems to have this problem.
    Monit version 5.16 with Postgres 9.6.18 does not have the problem.
    NB: The problem manifests only with the unix socket, not with the TCP port.
    "with timeout 10 seconds" does not help.
    The message is: "'postgres' failed protocol test [PGSQL] at /var/run/postgresql/.s.PGSQL.5432 -- PGSQL: connection terminator write error -- Broken pipe"
