MySQL replication not working - but in SHOW SLAVE STATUS everything is OK

Written by - 0 comments

Published on - Listed in MySQL Database Solaris Nagios Monitoring


A strange problem has hit me recently where a MySQL replication on Solaris zones failed and the slave did not get any new log files from the master anymore.

The slave is of course being monitored (with the Nagios plugin check_mysql_slavestatus.sh) but everything was always OK... until it suddenly became CRITICAL because of the following error:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'

What happened? It seems that for a couple of days, the replication silently failed and the master and slave didn't communicate correctly with each other anymore. While the master continued to update its binary log files, the slave did not retrieve the changed binary logs from the master. However there was no error indicated in the SHOW SLAVE STATUS output:

mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.17.20.100
                  Master_User: replica
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.000408
          Read_Master_Log_Pos: 24547311
               Relay_Log_File: relay-log.000330
                Relay_Log_Pos: 4
        Relay_Master_Log_File: bin.000408
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]
          Exec_Master_Log_Pos: 24547311
              Relay_Log_Space: 120
[...]
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
[...]
1 row in set (0.00 sec)

check_mysql_slavestatus reads all these values, and because everything seems to be OK according to the 'show slave status' output, no issues were found.

But the non-working synchronisation could easily be checked, by doing a simple write operation on the master and check the result on the slave. Here I create a new database on the master and then check for it appearing on the slave:

#On MASTER:
mysql> create database claudiotest;
Query OK, 1 row affected (0.02 sec)

mysql> show master status;
+------------+----------+--------------+------------------+-------------------+
| File       | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------+----------+--------------+------------------+-------------------+
| bin.000408 | 47461189 |              |                  |                   |
+------------+----------+--------------+------------------+-------------------+

#On SLAVE nothing was done
[root@slave ~]# ll /var/lib/mysql/ | grep claudio
[root@slave ~]# mysql -e "show databases" | grep claudio

#... and nothing moved either!
mysql> show slave status\G;
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.17.20.100
                  Master_User: replica
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.000408
          Read_Master_Log_Pos: 24547311
               Relay_Log_File: relay-log.000331
                Relay_Log_Pos: 4
        Relay_Master_Log_File: bin.000408
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]

So although everything seems to be in order according to the slave status output, nothing was actually done. The slave didn't even get the relevant information from the master, that the master log file position has changed.
This particular MySQL (5.6) replication runs on two virtual Solaris servers (zones), each with two virtual nics. The replication happens over the secondary interface (backend). Now I strongly suspect a networking issue/bug of some kind of the operating system, although telnet and ping show a correct communication between master and slave. A restart of the MySQL server on the slave didn't help either.

I finally got the replication working again, by using the primary network interface of the zone.
To catch such replication/connectivity issues, I have modified check_mysql_slavestatus with a new check type. The change will be published soon.


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Observability   Office   OpenSearch   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder