For a new project, I evaluated whether Consul could be used as a backend system for data. After a quick read through the documentation, Consul looks promising so far. Although the installation was quick and painless, a couple of communication errors occurred due to blocked firewall ports. Unfortunately, the logged errors were not very helpful and gave no hint as to what exactly the problem could be. Here's an overview of the Consul installation and configuration, and of how to solve communication errors in the Consul cluster.
The setup is pretty easy, as Consul is just a single binary file plus a config file which you create and the binary reads. That's it. This means that the installation boils down to a download, unzip and move:
root@consul01:~# wget https://releases.hashicorp.com/consul/1.5.3/consul_1.5.3_linux_amd64.zip
root@consul01:~# unzip consul_1.5.3_linux_amd64.zip
root@consul01:~# mv consul /usr/local/bin/
Create a dedicated user/group for Consul and create the destination directory for its data:
root@consul01:~# useradd -m -d /home/consul -s /sbin/nologin consul
root@consul01:~# mkdir /var/lib/consul
root@consul01:~# chown -R consul:consul /var/lib/consul
Create the config file. It can have any name and be placed in any directory, as long as the "consul" user is able to read it.
Here's an example config file with placeholders:
root@consul01:~# cat /etc/consul.json
{
"server": true,
"node_name": "$NODE_NAME",
"datacenter": "dc1",
"data_dir": "$CONSUL_DATA_PATH",
"bind_addr": "0.0.0.0",
"client_addr": "0.0.0.0",
"advertise_addr": "$ADVERTISE_ADDR",
"bootstrap_expect": 3,
"retry_join": ["$JOIN1", "$JOIN2", "$JOIN3"],
"ui": true,
"log_level": "DEBUG",
"enable_syslog": true,
"acl_enforce_version_8": false
}
In order to launch Consul as a service, a Systemd unit file should be created:
root@consul01:~# cat /etc/systemd/system/consul.service
### BEGIN INIT INFO
# Provides: consul
# Required-Start: $local_fs $remote_fs
# Required-Stop: $local_fs $remote_fs
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: Consul agent
# Description: Consul service discovery framework
### END INIT INFO
[Unit]
Description=Consul server agent
Requires=network-online.target
After=network-online.target
[Service]
User=consul
Group=consul
PIDFile=/var/run/consul/consul.pid
PermissionsStartOnly=true
ExecStartPre=-/bin/mkdir -p /var/run/consul
ExecStartPre=/bin/chown -R consul:consul /var/run/consul
ExecStart=/usr/local/bin/consul agent \
-config-file=/etc/consul.json \
-pid-file=/var/run/consul/consul.pid
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
KillSignal=SIGTERM
Restart=on-failure
RestartSec=42s
[Install]
WantedBy=multi-user.target
Make sure the paths are correct!
Now enable the service:
root@consul01:~# systemctl daemon-reload
root@consul01:~# systemctl enable consul.service
A Consul cluster should have a minimum of three cluster nodes. In this example I created three nodes:
consul01: 192.168.253.8
consul02: 192.168.253.9
consul03: 10.10.1.45
Note: Although the third node runs in a different range, it's still part of the same LAN.
Once Consul is set up and prepared on all three nodes, the config files must be adjusted on each node. To do this quickly (and automated if you like):
# sed -i "s/\$NODE_NAME/$(hostname)/" /etc/consul.json
# sed -i "s/\$CONSUL_DATA_PATH/\/var\/lib\/consul/" /etc/consul.json
# sed -i "s/\$ADVERTISE_ADDR/$(ip addr show dev eth0 | awk '/inet.*eth0/ {print $2}' | cut -d "/" -f1)/" /etc/consul.json
# sed -i "s/\$JOIN1/192.168.253.8/" /etc/consul.json
# sed -i "s/\$JOIN2/192.168.253.9/" /etc/consul.json
# sed -i "s/\$JOIN3/10.10.1.45/" /etc/consul.json
This replaces the placeholders $NODE_NAME, $CONSUL_DATA_PATH and $ADVERTISE_ADDR with the local values. The $JOINn placeholders are used to define the cluster nodes (see above).
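A quick sanity check after the sed commands can catch a placeholder that was missed. The following is only a minimal sketch: the function name unresolved_placeholders is made up for this example, and it assumes that placeholders always look like $UPPER_CASE names, as in the example config above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Consul): list placeholders such as
# $NODE_NAME that are still present in a config file.
# Prints one placeholder per line, nothing if every value was substituted.
unresolved_placeholders() {
    grep -o '\$[A-Z_][A-Z0-9_]*' "$1" | sort -u
}
```

Calling unresolved_placeholders /etc/consul.json should print nothing on a fully prepared node. In addition, consul validate /etc/consul.json can be used to check that the file is still a valid Consul configuration.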
Once this is completed, the cluster is ready and Consul can be started on each node.
# systemctl start consul
Before going into the communication errors, the required ports should be explained. These need to be open between all the nodes, in both directions. The network architecture graph from the official documentation helps to understand how the nodes communicate with each other.
Something the documentation doesn't tell you: tcp,udp/8302 needs to be opened between LAN nodes, too, even though it is marked as "WAN Gossip".
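To make the port list concrete, here is a rough sketch that prints (but does not apply) iptables rules for one peer node. It rests on several assumptions: Consul's default ports are in use (8300/tcp for server RPC, 8301 tcp+udp for LAN gossip, 8302 tcp+udp for WAN gossip), the filtering happens with host-based iptables, and the function name consul_peer_rules is made up for this example. With a network firewall between the two ranges, the same port and protocol list applies, only the rule syntax differs.

```shell
#!/bin/sh
# Hypothetical helper: print (not apply) iptables rules which accept
# Consul cluster traffic coming from a single peer address.
# Ports are Consul's defaults; adjust them if the config overrides them.
consul_peer_rules() {
    peer="$1"
    # Server RPC only uses TCP.
    echo "iptables -A INPUT -p tcp -s $peer --dport 8300 -j ACCEPT"
    # Both gossip pools (LAN 8301, WAN 8302) use TCP and UDP.
    for port in 8301 8302; do
        for proto in tcp udp; do
            echo "iptables -A INPUT -p $proto -s $peer --dport $port -j ACCEPT"
        done
    done
}
```

On consul01, for example, consul_peer_rules 10.10.1.45 prints the five rules needed to accept traffic from consul03. The output can be reviewed first and only then piped into a shell.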
Fortunately, the setup, configuration and cluster architecture of Consul are not complex at all. But understanding the logged errors and interpreting them to find the exact problem can be a challenge.
The command consul members gives a quick overview of all cluster members and their current state. This is helpful for finding major communication problems. However, certain errors might still appear in the logs even though the output shows no problems.
root@consul01:~# consul members
Node      Address             Status  Type    Build  Protocol  DC   Segment
consul01  192.168.253.8:8301  alive   server  1.5.3  2         dc1
consul02  192.168.253.9:8301  alive   server  1.5.3  2         dc1
consul03  10.10.1.45:8301     alive   server  1.5.3  2         dc1
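For monitoring or scripting, the same output can be filtered so that only problematic members remain. This is a small sketch under the assumption that the column layout is the one shown above (node name in the first column, status in the third); the function name not_alive_members is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read "consul members" output on stdin and print
# every node whose status column is not "alive" (skipping the header line).
not_alive_members() {
    awk 'NR > 1 && $3 != "alive" { print $1 " (" $3 ")" }'
}
```

Used as consul members | not_alive_members, it prints nothing while all members are alive. As described above, an empty result does not rule out gossip problems, so the logs still need to be checked.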
The first error occurred right after the first cluster start:
Aug 19 13:50:46 consul02 consul[11395]: 2019/08/19 13:50:46 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: 2019/08/19 13:50:46 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: 2019/08/19 13:50:46 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:46 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:50:47 consul02 consul[11395]: 2019/08/19 13:50:47 [DEBUG] serf: messageJoinType: consul02
Aug 19 13:50:47 consul02 consul[11395]: serf: messageJoinType: consul02
Aug 19 13:51:01 consul02 consul[11395]: 2019/08/19 13:51:01 [ERR] agent: Coordinate update error: No cluster leader
Aug 19 13:51:01 consul02 consul[11395]: agent: Coordinate update error: No cluster leader
This happened because not all nodes were able to communicate with each other (in any direction) on port 8301. Using telnet, a quick verification can be done on each node:
root@consul01:~# telnet 192.168.253.8 8301
Trying 192.168.253.8...
Connected to 192.168.253.8.
Escape character is '^]'.
root@consul02:~# telnet 192.168.253.8 8301
Trying 192.168.253.8...
Connected to 192.168.253.8.
Escape character is '^]'.
root@consul03:~# telnet 192.168.253.8 8301
Trying 192.168.253.8...
telnet: Unable to connect to remote host: Connection timed out
It's pretty obvious: Node consul03 is not able to communicate with the other nodes on port 8301. Remember: consul03 is the one in the different network range.
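Instead of running telnet by hand for every node and port, the check can be looped. The following is a sketch only: the function names check_tcp and check_cluster_ports are made up, bash and the coreutils timeout command are assumed to be installed, and the ports are Consul's defaults. Note that, just like telnet, this only tests TCP. The gossip ports 8301 and 8302 also use UDP, which cannot be verified this way.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if a TCP connection to host ($1) and
# port ($2) can be established within 3 seconds, non-zero otherwise.
# Relies on bash's /dev/tcp redirection, hence the explicit "bash -c".
check_tcp() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Hypothetical helper: check the default Consul cluster ports (TCP only)
# on every node address given as argument and print one line per port.
check_cluster_ports() {
    for node in "$@"; do
        for port in 8300 8301 8302; do
            if check_tcp "$node" "$port"; then
                echo "$node:$port open"
            else
                echo "$node:$port BLOCKED"
            fi
        done
    done
}
```

Running check_cluster_ports 192.168.253.8 192.168.253.9 10.10.1.45 on each of the three nodes shows which direction is affected.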
Although all cluster members are shown as alive in consul members output, a lot of warnings might be logged, indicating failed pings and a misconfigured network:
Aug 26 14:43:40 consul01 consul[15046]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 14:43:42 consul01 consul[15046]: 2019/08/26 14:43:42 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 14:43:42 consul01 consul[15046]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 14:43:43 consul01 consul[15046]: 2019/08/26 14:43:43 [DEBUG] memberlist: Stream connection from=192.168.253.9:42330
Aug 26 14:43:43 consul01 consul[15046]: memberlist: Stream connection from=192.168.253.9:42330
Aug 26 14:43:45 consul01 consul[15046]: 2019/08/26 14:43:45 [DEBUG] memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 14:43:45 consul01 consul[15046]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 14:43:47 consul01 consul[15046]: 2019/08/26 14:43:47 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 14:43:47 consul01 consul[15046]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 12:43:39 consul02 consul[20492]: 2019/08/26 14:43:39 [DEBUG] memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:39 consul02 consul[20492]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:41 consul02 consul[20492]: 2019/08/26 14:43:41 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 12:43:41 consul02 consul[20492]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 12:43:47 consul02 consul[20492]: 2019/08/26 14:43:47 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.1.45:8301
Aug 26 12:43:47 consul02 consul[20492]: memberlist: Initiating push/pull sync with: 10.10.1.45:8301
Aug 26 12:43:49 consul02 consul[20492]: 2019/08/26 14:43:49 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.253.8:8302
Aug 26 12:43:49 consul02 consul[20492]: memberlist: Initiating push/pull sync with: 192.168.253.8:8302
Aug 26 12:43:54 consul02 consul[20492]: 2019/08/26 14:43:54 [DEBUG] memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:54 consul02 consul[20492]: memberlist: Failed ping: consul03.dc1 (timeout reached)
Aug 26 12:43:56 consul02 consul[20492]: 2019/08/26 14:43:56 [WARN] memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
Aug 26 12:43:56 consul02 consul[20492]: memberlist: Was able to connect to consul03.dc1 but other probes failed, network may be misconfigured
A helpful hint to correctly interpret these log entries can be found in issue #3058 in Consul's GitHub repository. Although all nodes are running as LAN nodes in the cluster, the WAN gossip ports tcp,udp/8302 need to be opened. As soon as these ports were opened, the log entries disappeared.
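With log level DEBUG the relevant entries are easily buried, so a small summary of failed pings per peer can help to spot the affected node. This is a sketch with a made-up function name (failed_ping_summary), and it assumes the message format shown in the excerpts above. Two hedged observations: the peer names carry the datacenter suffix (consul03.dc1), which, as far as I can tell, is how members of the WAN gossip pool are named and would fit the fix on port 8302. And since every message appears twice in the excerpts above (once with and once without timestamp and level), the counts are only meaningful relative to each other.

```shell
#!/bin/sh
# Hypothetical helper: read Consul log lines on stdin and count the
# "Failed ping" entries per peer, most affected peer first.
failed_ping_summary() {
    grep -o 'Failed ping: [^ ]*' |
        awk '{ count[$3]++ } END { for (p in count) print count[p], p }' |
        sort -rn
}
```

On a systemd host the input could come from journalctl -u consul, for example journalctl -u consul --since "1 hour ago" | failed_ping_summary.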
Another set of errors was logged on consul03:
Aug 26 15:34:50 consul03 consul[13097]: 2019/08/26 15:34:50 [ERR] agent: Coordinate update error: rpc error getting client: failed to get conn: dial tcp
Aug 26 15:34:50 consul03 consul[13097]: agent: Coordinate update error: rpc error getting client: failed to get conn: dial tcp
Aug 26 15:34:50 consul03 consul[13097]: 2019/08/26 15:34:50 [WARN] agent: Syncing node info failed. rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 15:34:50 consul03 consul[13097]: agent: Syncing node info failed. rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 15:34:50 consul03 consul[13097]: 2019/08/26 15:34:50 [ERR] agent: failed to sync remote state: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
Aug 26 15:34:50 consul03 consul[13097]: agent: failed to sync remote state: rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection
This error turned out to be caused by a misunderstanding when the firewall rules were created. Instead of opening all the ports mentioned above in both directions (bi-directional), the ports were only opened from Range 1 -> Range 2. The other way around (Range 2 -> Range 1) was still blocked by the firewall. Once this was fixed, these errors on consul03 disappeared, too.
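To avoid this kind of misunderstanding, it can help to hand the firewall team an explicit list of every flow, with each direction on its own line. A minimal sketch, assuming Consul's default ports and using a made-up function name (required_flows):

```shell
#!/bin/sh
# Hypothetical helper: print every ordered (source -> destination:port)
# combination between the given node addresses, so that both directions
# of each node pair show up explicitly.
required_flows() {
    for src in "$@"; do
        for dst in "$@"; do
            [ "$src" = "$dst" ] && continue
            for port in 8300 8301 8302; do
                echo "$src -> $dst:$port"
            done
        done
    done
}
```

For the three nodes of this example, required_flows 192.168.253.8 192.168.253.9 10.10.1.45 prints 18 lines. Ports 8301 and 8302 need to be allowed for both TCP and UDP.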