There are multiple ways of monitoring a working master-slave-replication on PostgreSQL servers.
First of all there is of course the replication status which can be read directly from the master PostgreSQL server:
postgres@dbmaster:~$ psql -x -c "select * from pg_stat_replication;"
-[ RECORD 1 ]----+------------------------------
pid | 13014
usesysid | 16387
usename | replica
application_name | dbslave
client_addr | 10.10.10.11
client_hostname |
client_port | 48596
backend_start | 2017-07-26 13:07:00.617621+00
backend_xmin |
state | streaming
sent_location | 0/6000290
write_location | 0/6000290
flush_location | 0/6000290
replay_location | 0/6000290
sync_priority | 1
sync_state | sync
This information can only be read on the master. If you try that on the slave (hot_standby = on), you don't get to see anything:
postgres@dbslave:~$ psql -x -c "select * from pg_stat_replication;"
(0 rows)
Obviously most important information here is the sync_state:
postgres@dbmaster:~$ psql -x -c "select sync_state from pg_stat_replication;"
-[ RECORD 1 ]----
sync_state | sync
Possible values of sync_state are:
Other important values are the different "locations":
sent_location | 0/6000290
write_location | 0/6000290
flush_location | 0/6000290
replay_location | 0/6000290
From the documentation:
This basically shows where the slave server is. If all values are the same it is caught up 100%.
Note: There are different monitoring plugins with the name "check_postgres". Make sure you use check_postgresl.pl!
The monitoring plugin check_postgres also features a replication check (hot_standby_delay). The trick is to correctly understand this check. Using the hot_standby_delay check, the plugin connects to both the master and slave and compares the replay delay and receive delay to the given warning and critical thresholds. In order to connect to both the master and the slave, the pg_hba.conf must be adapted accordingly.
On the master (IP 10.10.10.10) I added the following lines:
# Monitoring
host all monitoring 127.0.0.1/32 md5
host all monitoring 10.10.10.11/32 md5
On the slave (IP 10.10.10.11) I added the following lines:
# Monitoring
host all monitoring 127.0.0.1/32 md5
The plugin will be executed on the slave server ergo there the monitoring line for localhost is enough.
To not use the db password with the plugin (the password would show up in cleartext in the process list), I created a .pgpass file for the nagios user (under which this plugin will run). This file contains two entries; first for the localhost connection and secondly for the remote connection to the master server:
nagios@dbslave:~$ whoami
nagios
nagios@dbslave:~$ ls -la .pgpass
-rw------- 1 nagios nagios 94 Jul 26 15:25 .pgpass
nagios@dbslave:~$ cat .pgpass
localhost:5432:*:monitoring:mysupersecretpassword
dbmaster:5432:*:monitoring:mysupersecretpassword
Make sure the .pgpass file has correct permissions (chmod 0600), otherwise it won't be used for psql commands!
Now the plugin can be executed with the hot_standby_delay check:
nagios@dbslave:~$ /usr/lib/nagios/plugins/check_postgres.pl -H localhost,dbmaster -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
POSTGRES_HOT_STANDBY_DELAY OK: DB "mydb" (host:localhost) 0 and 432 seconds | time=0.05s replay_delay=0;60;600 receive-delay=0;60;600 time_delay=432;
Note the -H parameter uses two hostnames (comma separated). The plugin will connect to both localhost and the dbmaster host using the SQL user "monitoring" (password will automatically be read from .pgpass file). I set a delay warning to 60 seconds, a critical delay to 600 seconds (10 minutes).
ck from Switzerland wrote on Aug 28th, 2017:
Majales, I wrote the answer in the Github issue. For sake of completeness here again:
I went through the source code and time_delay is $time_delta:
if ($version >= 9.1) {
$db->{perf} .= sprintf ' %s=%s;%s;%s',
perfname(msg('hs-time-delay')), $time_delta, $wtime, $ctime;
}
postgres=# SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay , COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;
receive | replay | seconds
------------+------------+---------
2/266055D8 | 2/266055D8 | 151763
(1 row)
ck from Switzerland wrote on Aug 19th, 2017:
Hi Majales. This is indeed a very good question and I don't know either. I found this issue on the check_postgres Github repository: https://github.com/bucardo/check_postgres/issues/119. The OP asks the same question as you. This Github issue is probably the best way to get a correct answer.
Update: Oh!! I just saw that YOU are the OP of that Github issue. Loop :D
Majales from wrote on Aug 18th, 2017:
What actualy means that time_delay=432? I can probably understand replay_delay or receive-delay. time should be check execution time, but what is time_delay?
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder