Comparing Galera cluster wsrep sst methods rsync vs. mariabackup


Last updated on December 22nd 2020. Listed in: Database, MySQL, MariaDB, Galera


In the early years of Galera, only a few state snapshot transfer (SST) methods, the mechanisms used for a full "cluster sync" of a node, were available. The method is selected with the wsrep_sst_method configuration parameter. "xtrabackup" seemed to be the way to go at first, but after we ran into xtrabackup-related issues following a (minor) MariaDB 10.0 upgrade, we switched to the "rsync" method.

The downside of the rsync method: it locks the donor node, not just for write operations but also for read operations. In a two-node cluster in a testing environment, this results in complete cluster downtime, because one node is joining and the only other node is the blocked donor. In a three-node production cluster (there should always be at least three nodes), the impact depends on how the applications access the cluster. If they use their own local load balancing or failover mechanism, a situation might arise where the primary DB node still listens on tcp/3306 but, being the current donor, no longer answers queries (they queue up). To do a full SST to a cluster node, you have to select a standby node as donor and make sure no application accesses this donor node (or the node which is joining the cluster). In general, there is a lot to consider and mistakes happen quickly, leading to downtime.
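The "still listens on tcp/3306 but does not answer" situation is usually caught by a health check that looks at the Galera node state instead of the open port. The following is only a minimal sketch: node_usable and current_state are hypothetical helper names, and it assumes the mysql client can log in without further options (for example through a ~/.my.cnf). The status variable wsrep_local_state_comment itself is provided by Galera and reports values such as Synced, Donor/Desynced, Joining and Joined.

```shell
#!/bin/sh
# node_usable STATE [ALLOW_DONOR]
# Decide whether a node should receive application traffic, based on the
# value of wsrep_local_state_comment. Hypothetical helper, not part of MariaDB.
# ALLOW_DONOR=yes is meant for non-blocking SST methods, where a donor
# keeps answering queries; the default (no) fits a blocking method like rsync.
node_usable() {
    state=$1
    allow_donor=${2:-no}
    case "$state" in
        Synced)         return 0 ;;
        Donor/Desynced) [ "$allow_donor" = "yes" ] ;;
        *)              return 1 ;;   # Joining, Joined, empty, unknown
    esac
}

# current_state: read the state from the local node.
# Assumption: the mysql client authenticates without extra options.
current_state() {
    mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" | awk '{print $2}'
}
```

A load balancer check script could then end with something like node_usable "$(current_state)" || exit 1, so that a donor is taken out of rotation before queries start to queue up.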

Since MariaDB 10.1.26 and 10.2.10 a new sst method is available: mariabackup. This method is based on the xtrabackup-v2 method and, according to the documentation, does not lock the donor node:

"The mariabackup SST method uses the Mariabackup utility for performing SSTs. It is one of the methods that does not block the donor node."

While upgrading a 2-node test cluster from MariaDB 10.0 to 10.1 (part of a multi-version upgrade task), the new wsrep_sst_method was tested to see if it really keeps the applications running, even when a full SST needs to be performed.

SST with rsync

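After one node (node02) was upgraded from 10.0.38 to 10.1.41, it was time to upgrade the remaining node (node01, shown as mysql01 in the shell prompts below). This was the moment to test a full SST with the rsync method. To force a full SST, the data directory on node01 was emptied, keeping only the mysql system database: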

root@mysql01:~# mv /var/lib/mysql/mysql /tmp/
root@mysql01:~# rm -rf /var/lib/mysql/*
root@mysql01:~# mv /tmp/mysql /var/lib/mysql/

Hammer time!

root@mysql01:~# systemctl start mariadb

The rsync process could be seen in the process list and, as expected, the applications using node02 (or in general any node in this two-node cluster) started to fail. Monitoring confirmed that write operations were failing on both cluster nodes.

Once the rsync process was completed and the Galera cluster was in sync again, monitoring confirmed both nodes were working correctly again and recovery notifications arrived for the applications using the test cluster.

SST with mariabackup

Note: To use mariabackup as sst method, the package mariadb-backup-[version] must first be installed. For MariaDB 10.1, this would be:

root@mysql01:~# apt-get install mariadb-backup-10.1

As before, node01 was used again and data was completely removed:

root@mysql01:~# systemctl stop mariadb
root@mysql01:~# mv /var/lib/mysql/mysql /tmp/
root@mysql01:~# rm -rf /var/lib/mysql/*
root@mysql01:~# mv /tmp/mysql /var/lib/mysql/

The wsrep_sst_method was changed from rsync to mariabackup:

root@mysql01:~# cat /etc/mysql/conf.d/galera.cnf | grep sst_method
wsrep_sst_method=mariabackup
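For repeated tests it can be handy to script the switch between methods. The helper below is a sketch only: set_sst_method is a hypothetical name, and it merely rewrites an existing wsrep_sst_method line in the given file (it does not add one if the line is missing).

```shell
#!/bin/sh
# set_sst_method FILE METHOD
# Rewrite an existing wsrep_sst_method line in FILE to use METHOD.
# Hypothetical helper; a missing line is left missing.
set_sst_method() {
    file=$1
    method=$2
    tmp=$(mktemp) || return 1
    if sed "s/^wsrep_sst_method[[:space:]]*=.*/wsrep_sst_method=${method}/" "$file" > "$tmp"; then
        # Writing through cat keeps owner and permissions of the original file.
        cat "$tmp" > "$file"
        rc=$?
    else
        rc=1
    fi
    rm -f "$tmp"
    return "$rc"
}
```

With the configuration file used here, the call would be set_sst_method /etc/mysql/conf.d/galera.cnf mariabackup, followed by the same grep as above to verify the result.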

Hammer time, again!

root@mysql01:~# systemctl start mariadb

The sync process started and by looking at the process list, the details could be seen:

mysql    31027  0.0  0.2 2308944 47072 ?       Ssl  16:15   0:00 /usr/sbin/mysqld --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1
mysql    31035  0.0  0.0   4628   772 ?        S    16:15   0:00  \_ sh -c wsrep_sst_mariabackup --role 'joiner' --address '192.168.253.81' --datadir '/var/lib/mysql/'   --parent '31027' --binlog '/var/log/mysql/mariadb-bin' --binlog-index '/var/log/mysql/mariadb-bin.index'
mysql    31036  0.0  0.0  13384  3644 ?        S    16:15   0:00      \_ /bin/bash -ue /usr//bin/wsrep_sst_mariabackup --role joiner --address 192.168.253.81 --datadir /var/lib/mysql/ --parent 31027 --binlog /var/log/mysql/mariadb-bin --binlog-index /var/log/mysql/mariadb-bin.index
mysql    31282  0.0  0.0  13280  2072 ?        S    16:15   0:00          \_ /bin/bash -ue /usr//bin/wsrep_sst_mariabackup --role joiner --address 192.168.253.81 --datadir /var/lib/mysql/ --parent 31027 --binlog /var/log/mysql/mariadb-bin --binlog-index /var/log/mysql/mariadb-bin.index
mysql    31284  0.0  0.0  26060  1404 ?        S    16:15   0:00          |   \_ logger -p daemon err -t -wsrep-sst-joiner
mysql    31325  0.0  0.0  13384  2436 ?        S    16:15   0:00          \_ /bin/bash -ue /usr//bin/wsrep_sst_mariabackup --role joiner --address 192.168.253.81 --datadir /var/lib/mysql/ --parent 31027 --binlog /var/log/mysql/mariadb-bin --binlog-index /var/log/mysql/mariadb-bin.index
mysql    31329  0.0  0.0  24824  1972 ?        S    16:15   0:06          |   \_ socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql    31330  0.0  0.0  96344 11380 ?        Sl   16:15   0:07          |   \_ mbstream -x
mysql    32729  0.0  0.0   7468   740 ?        S    16:15   0:00          \_ sleep 0.1
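The process list above already tells which SST script is running and in which role. For monitoring, that information can be pulled out of the ps output automatically. A small sketch, assuming the script names keep the wsrep_sst_<method> pattern and the --role argument seen above (sst_from_ps is a hypothetical helper name):

```shell
#!/bin/sh
# sst_from_ps: read process command lines on stdin and print one
# "<method> <role>" line for each distinct running SST script.
# Relies on the wsrep_sst_<method> naming and the --role argument
# visible in the process list; both are taken from the output above.
sst_from_ps() {
    awk -v q="'" '{
        method = ""; role = ""
        for (i = 1; i <= NF; i++) {
            # Field ends in wsrep_sst_<method>, with or without a path prefix.
            if ($i ~ /wsrep_sst_[a-z0-9_-]+$/) {
                method = $i
                sub(/.*wsrep_sst_/, "", method)
            }
            # The role follows --role and may be wrapped in single quotes.
            if ($i == "--role" && i < NF) {
                role = $(i + 1)
                gsub(q, "", role)
            }
        }
        if (method != "" && role != "") print method, role
    }' | sort -u
}

# Example: ps -eo args | sst_from_ps
```

An empty result means no SST script is currently running on that node.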

Moment of truth: were the applications still working? What did monitoring say? Indeed, MySQL queries still worked on the donor node (node02), and the applications were still up and running!

What about speed?

If mariabackup is so much better than rsync because it does not block the donor node, there must certainly be a disadvantage, right? Yet according to monitoring, the network throughput during the mariabackup sync was higher than during the rsync sync!
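Without a monitoring system at hand, the throughput of an SST can be roughly estimated on the joiner from the kernel's interface counters. This is a sketch under assumptions: it reads the Linux sysfs counter /sys/class/net/<iface>/statistics/rx_bytes, the function names are made up for this example, and eth0 is only a placeholder interface name. Other traffic on the same interface is counted as well.

```shell
#!/bin/sh
# rx_bytes IFACE: total bytes received on a network interface (Linux sysfs).
rx_bytes() {
    cat "/sys/class/net/$1/statistics/rx_bytes"
}

# avg_rate_mib START_BYTES END_BYTES SECONDS
# Average receive rate in MiB/s between two counter readings.
avg_rate_mib() {
    # Refuse a zero or negative interval to avoid a division by zero.
    [ "$3" -gt 0 ] || return 1
    LC_ALL=C awk -v a="$1" -v b="$2" -v s="$3" \
        'BEGIN { printf "%.2f\n", (b - a) / s / 1048576 }'
}

# Example on the joiner while the SST is running
# (eth0 is an assumed interface name):
#   start=$(rx_bytes eth0); sleep 10; end=$(rx_bytes eth0)
#   avg_rate_mib "$start" "$end" 10
```

Running this once during an rsync SST and once during a mariabackup SST gives two comparable numbers, provided the data set and the interval are the same.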

[Image: Galera wsrep sst rsync vs mariabackup]

It rarely happens that everyone's happy, but it just seems to be the case here: Applications don't experience downtime anymore and the cluster sync is faster than before!

Note: As mentioned before, this was tested on MariaDB 10.1, as part of a multi-version cluster upgrade. As of this writing, MariaDB 10.4 and a newer Galera version (galera-4) are available, which (probably) bring further improvements for SST.

Update: Verify privileges for SST user

September 23rd 2019: If you come across problems during SST sync with the error "xtrabackup_checkpoints missing, failed innobackupex/SST on donor", check out our article "Galera cluster unable to sync SST: xtrabackup checkpoints missing, failed innobackupex on donor".
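For reference, the mariabackup method needs a database user on the donor that is allowed to take the backup, and the same credentials in wsrep_sst_auth. The sketch below only prints the SQL. The privilege list (RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT) is, to our knowledge, the one the MariaDB documentation gives for the 10.1 to 10.4 series; treat it as an assumption and check the documentation for your version. The function name, user name and password are placeholders.

```shell
#!/bin/sh
# sst_user_sql USER PASSWORD
# Print SQL that creates a local SST user for the mariabackup method.
# Assumption: the privilege list matches MariaDB 10.1 to 10.4; newer
# releases may require additional privileges.
# Note: USER and PASSWORD must not contain single quotes.
sst_user_sql() {
    cat <<EOF
CREATE USER '$1'@'localhost' IDENTIFIED BY '$2';
GRANT RELOAD, PROCESS, LOCK TABLES, REPLICATION CLIENT ON *.* TO '$1'@'localhost';
EOF
}

# Example (placeholder credentials):
#   sst_user_sql sstuser 'secret' | mysql
# and in the Galera configuration:
#   wsrep_sst_auth=sstuser:secret
```

Printing the statements first makes it easy to review them before piping them into the mysql client.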


