MySQL replication not working - but in SHOW SLAVE STATUS everything is OK

A strange problem hit me recently: a MySQL replication on Solaris zones failed, and the slave no longer received any new binary log events from the master.

The slave is of course being monitored (with the Nagios plugin check_mysql_slavestatus.sh), but everything was always OK... until it suddenly became CRITICAL because of the following error:

Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
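
For the record: error 1236 with this message usually means the slave is requesting a binary log file that no longer appears in the master's binlog index, for example because it was purged in the meantime. A quick way to verify is to compare the file the slave asks for with the logs the master still has (a sketch):

# On the MASTER: list the binary logs that still exist
mysql> SHOW BINARY LOGS;

# On the SLAVE: see which file is being requested
mysql> SHOW SLAVE STATUS\G
# -> compare Master_Log_File against the first file listed on the master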

What happened? It seems that for a couple of days, the replication had silently failed and master and slave were no longer communicating correctly with each other. While the master continued to update its binary log files, the slave did not retrieve the changed binary logs from the master. However, no error was indicated in the SHOW SLAVE STATUS output:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.17.20.100
                  Master_User: replica
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.000408
          Read_Master_Log_Pos: 24547311
               Relay_Log_File: relay-log.000330
                Relay_Log_Pos: 4
        Relay_Master_Log_File: bin.000408
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]
          Exec_Master_Log_Pos: 24547311
              Relay_Log_Space: 120
[...]
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
[...]
1 row in set (0.00 sec)

check_mysql_slavestatus reads all these values, and because everything seemed to be OK according to the 'show slave status' output, no issues were found.
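
In essence, the plugin's logic boils down to something like this simplified sketch (host and credentials are placeholders, not the plugin's actual code):

# Simplified sketch of the status check, not the actual plugin code
STATUS=$(mysql -h slavehost -u monitor -pMySecret -e "SHOW SLAVE STATUS\G")
IO=$(echo "$STATUS" | awk '/Slave_IO_Running:/ {print $2}')
SQL=$(echo "$STATUS" | awk '/Slave_SQL_Running:/ {print $2}')
if [ "$IO" = "Yes" ] && [ "$SQL" = "Yes" ]; then
  echo "OK - slave threads running"
else
  echo "CRITICAL - replication threads not running"
fi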

But the non-working synchronisation could easily be checked by doing a simple write operation on the master and checking the result on the slave. Here I create a new database on the master and then check whether it appears on the slave:

#On MASTER:
mysql> create database claudiotest;
Query OK, 1 row affected (0.02 sec)

mysql> show master status;
+------------+----------+--------------+------------------+-------------------+
| File       | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------+----------+--------------+------------------+-------------------+
| bin.000408 | 47461189 |              |                  |                   |
+------------+----------+--------------+------------------+-------------------+

#On SLAVE: nothing arrived
[root@slave ~]# ll /var/lib/mysql/ | grep claudio
[root@slave ~]# mysql -e "show databases" | grep claudio

#... and nothing moved either!
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 172.17.20.100
                  Master_User: replica
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: bin.000408
          Read_Master_Log_Pos: 24547311
               Relay_Log_File: relay-log.000331
                Relay_Log_Pos: 4
        Relay_Master_Log_File: bin.000408
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
[...]

So although everything seems to be in order according to the slave status output, nothing actually happened. The slave didn't even receive the information from the master that the master's log file position had changed.
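
This manual test can also be automated as a kind of canary check. Here's a minimal sketch, assuming a dedicated monitoring.heartbeat table (id INT PRIMARY KEY, ts BIGINT) exists on the master and is included in the replication; hosts and credentials are placeholders, and the same idea is implemented much more thoroughly by tools like pt-heartbeat:

#!/bin/bash
# Canary check sketch: write a timestamp on the master,
# then verify it arrives on the slave.
MASTER=172.17.20.100
SLAVE=172.17.20.101           # slave address assumed
TS=$(date +%s)
mysql -h $MASTER -u monitor -pMySecret \
  -e "REPLACE INTO monitoring.heartbeat (id, ts) VALUES (1, $TS)"
sleep 5                       # give replication a moment
SEEN=$(mysql -h $SLAVE -u monitor -pMySecret -N \
  -e "SELECT ts FROM monitoring.heartbeat WHERE id = 1")
if [ "$SEEN" = "$TS" ]; then
  echo "OK - write arrived on slave"
else
  echo "CRITICAL - slave did not receive the write"
fi
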
This particular MySQL (5.6) replication runs on two virtual Solaris servers (zones), each with two virtual NICs. The replication happens over the secondary (backend) interface. I strongly suspect some kind of networking issue/bug in the operating system, although telnet and ping showed correct communication between master and slave. A restart of the MySQL server on the slave didn't help either.
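
One more thing worth a look in such a situation is the state of the replication TCP connection itself, since ping and telnet only prove that new connections work, not that the long-lived slave connection is still healthy:

# On the SLAVE: is the connection to the master's port 3306 still there?
[root@slave ~]# netstat -an | grep 172.17.20.100 | grep 3306
# A connection can show up as ESTABLISHED and still be half-dead,
# e.g. when packets are silently dropped somewhere in between.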

I finally got the replication working again by using the primary network interface of the zone.
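
For completeness, repointing the slave looked roughly like this (the primary interface IP is a placeholder; note that changing MASTER_HOST implicitly resets the binary log coordinates, so they have to be given again, here with the values from the slave status above):

mysql> STOP SLAVE;
mysql> CHANGE MASTER TO
    ->   MASTER_HOST='10.161.x.x',    -- primary interface IP (placeholder)
    ->   MASTER_LOG_FILE='bin.000408',
    ->   MASTER_LOG_POS=24547311;
mysql> START SLAVE;
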
To catch such replication/connectivity issues, I have modified check_mysql_slavestatus with a new check type. The change will be published soon.

