How to monitor a PostgreSQL replication

Written by - 3 comments

Published on - last updated on June 15th 2021 - Listed in DB Postgres Database Monitoring Icinga Nagios


There are multiple ways of monitoring a working master-slave-replication on PostgreSQL servers.

Using PSQL

First of all there is of course the replication status which can be read directly from the master PostgreSQL server:

postgres@dbmaster:~$ psql -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+------------------------------
pid              | 13014
usesysid         | 16387
usename          | replica
application_name | dbslave
client_addr      | 10.10.10.11
client_hostname  |
client_port      | 48596
backend_start    | 2017-07-26 13:07:00.617621+00
backend_xmin     |
state            | streaming
sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290
sync_priority    | 1
sync_state       | sync

This information can only be read on the master. If you try that on the slave (hot_standby = on), you don't get to see anything:

postgres@dbslave:~$ psql -x -c "select * from pg_stat_replication;"
(0 rows)

Obviously most important information here is the sync_state:

postgres@dbmaster:~$ psql -x -c "select sync_state from pg_stat_replication;"
-[ RECORD 1 ]----
sync_state | sync

Possible values of sync_state are:

  • async: This standby server is asynchronous -> CRITICAL!
  • potential: This standby server is asynchronous, but can potentially become synchronous if one of current synchronous ones fails -> WARNING
  • sync: This standby server is synchronous -> OK
  • quorum: This standby server is considered as a candidate for quorum standbys -> OK

Other important values are the different "locations":

sent_location    | 0/6000290
write_location   | 0/6000290
flush_location   | 0/6000290
replay_location  | 0/6000290

From the documentation:

  • sent_location: Last write-ahead log location sent on this connection
  • write_location: Last write-ahead log location written to disk by this standby server
  • flush_location: Last write-ahead log location flushed to disk by this standby server
  • replay_location: Last write-ahead log location replayed into the database on this standby server

This basically shows where the slave server is. If all values are the same it is caught up 100%.

Using monitoring plugin check_postgres

Note: There are different monitoring plugins with the name "check_postgres". Make sure you use check_postgresl.pl!

The monitoring plugin check_postgres also features a replication check (hot_standby_delay). The trick is to correctly understand this check. Using the hot_standby_delay check, the plugin connects to both the master and slave and compares the replay delay and receive delay to the given warning and critical thresholds. In order to connect to both the master and the slave, the pg_hba.conf must be adapted accordingly.

On the master (IP 10.10.10.10) I added the following lines:

# Monitoring
host    all             monitoring      127.0.0.1/32          md5
host    all             monitoring      10.10.10.11/32        md5

On the slave (IP 10.10.10.11) I added the following lines:

# Monitoring
host    all             monitoring      127.0.0.1/32          md5

The plugin will be executed on the slave server ergo there the monitoring line for localhost is enough.

To not use the db password with the plugin (the password would show up in cleartext in the process list), I created a .pgpass file for the nagios user (under which this plugin will run). This file contains two entries; first for the localhost connection and secondly for the remote connection to the master server:

nagios@dbslave:~$ whoami
nagios

nagios@dbslave:~$ ls -la .pgpass
-rw------- 1 nagios nagios 94 Jul 26 15:25 .pgpass

nagios@dbslave:~$ cat .pgpass
localhost:5432:*:monitoring:mysupersecretpassword
dbmaster:5432:*:monitoring:mysupersecretpassword

Make sure the .pgpass file has correct permissions (chmod 0600), otherwise it won't be used for psql commands!

Now the plugin can be executed with the hot_standby_delay check:

nagios@dbslave:~$ /usr/lib/nagios/plugins/check_postgres.pl -H localhost,dbmaster -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
POSTGRES_HOT_STANDBY_DELAY OK: DB "mydb" (host:localhost) 0 and 432 seconds | time=0.05s replay_delay=0;60;600  receive-delay=0;60;600 time_delay=432;

Note the -H parameter uses two hostnames (comma separated). The plugin will connect to both localhost and the dbmaster host using the SQL user "monitoring" (password will automatically be read from .pgpass file). I set a delay warning to 60 seconds, a critical delay to 600 seconds (10 minutes).


Add a comment

Show form to leave a comment

Comments (newest first)

ck from Switzerland wrote on Aug 28th, 2017:

Majales, I wrote the answer in the Github issue. For sake of completeness here again:

I went through the source code and time_delay is $time_delta:

if ($version >= 9.1) {
$db->{perf} .= sprintf ' %s=%s;%s;%s',
perfname(msg('hs-time-delay')), $time_delta, $wtime, $ctime;
}



time_delta is read from xlog positions (line 5110 and below). So if you run the SQL command you see these "seconds":

postgres=# SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay , COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;
receive | replay | seconds
------------+------------+---------
2/266055D8 | 2/266055D8 | 151763
(1 row)



So seconds is defined by the function pg_last_xact_replay_timestamp(). The value represents the "time passed since last sync operation" (it doesn't mean that the servers are out of sync then).


ck from Switzerland wrote on Aug 19th, 2017:

Hi Majales. This is indeed a very good question and I don't know either. I found this issue on the check_postgres Github repository: https://github.com/bucardo/check_postgres/issues/119. The OP asks the same question as you. This Github issue is probably the best way to get a correct answer.
Update: Oh!! I just saw that YOU are the OP of that Github issue. Loop :D


Majales from wrote on Aug 18th, 2017:

What actualy means that time_delay=432? I can probably understand replay_delay or receive-delay. time should be check execution time, but what is time_delay?


RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   OpenSearch   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder