These days I'm testing Rancher as a potential candidate for a new Docker infrastructure. It's appealing so far: Rancher has a nice, intuitive user interface and, more importantly, an API to automatically trigger container creation (for example from Travis).
During a failover test, I rebooted one of the Rancher hosts and when it came back up, the connectivity to Rancher was lost. Why? Because I forgot to add the separate file system for /var/lib/docker, which I had prepared as a logical volume, to /etc/fstab - therefore all previous Docker data was gone, including the rancher-agent container.
Unfortunately I didn't spot the cause right away and simply decided to remove the host in Rancher and re-add it manually. Of course, once I had fixed the file system mount problem and rebooted, Rancher would no longer connect, because by then a new rancher-agent with a new ID had been installed.
To force a reset or cleanup of the Rancher host, one can do the following:
1. Deactivate the affected host in Rancher, then remove the host
2. Stop Docker service
service docker stop
3. Remove Docker and Rancher data:
rm -rf /var/lib/docker/*
rm -rf /var/lib/rancher/*
4. Start Docker service
service docker start
5. Add the host in Rancher
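Steps 2 to 4 can be wrapped into a small script. The run helper and the DRY_RUN switch below are my own additions (nothing Rancher ships), so the destructive commands can be previewed before actually running them:

```shell
#!/bin/sh
# Sketch of steps 2-4 above. With DRY_RUN=1 (the default) the commands
# are only printed; set DRY_RUN=0 to actually execute them.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "would run: $*"
  else
    "$@"
  fi
}

run service docker stop
# sh -c is needed so the wildcards are expanded at execution time, not now
run sh -c 'rm -rf /var/lib/docker/* /var/lib/rancher/*'
run service docker start
```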
The above commands apply to a Rancher 1.x environment. In Rancher 2.x, additional directories must be cleaned up:
1. Deactivate (drain) the affected host in Rancher, then remove the host. Either in the Rancher UI or for the "local" cluster in RKE's YAML config.
2. Stop Docker service
service docker stop
3. Remove Docker, Rancher, RKE and Kubernetes related data:
mount | grep kubelet | awk '{print $3}' | while read -r mount; do umount "$mount"; done
rm -rf /var/lib/docker/*
rm -rf /var/lib/rancher/*
rm -rf /var/lib/etcd
rm -rf /var/lib/kubelet/*
rm -rf /etc/kubernetes
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/cni
rm -rf /var/run/calico
rm -rf /run/secrets/kubernetes.io
test -d /opt/rancher && rm -rf /opt/rancher # For Single Rancher installs
test -d /opt/containerd && rm -rf /opt/containerd
test -d /opt/rke && rm -rf /opt/rke
4. Restart Docker service
service docker restart
Yes, although the Docker service was previously stopped, a simple "start" does not re-create the directories within /var/lib/docker (since Docker 20.10.x; see the article "Docker unable to pull images after clean up" for more information):
root@node:~# service docker start
root@node:~# ll /var/lib/docker/
total 0
A service restart however re-creates the missing directories:
root@node:~# service docker restart
root@node:~# ll /var/lib/docker/
total 44
drwx--x--x 4 root root 4096 Nov 11 14:06 buildkit
drwx--x--- 2 root root 4096 Nov 11 14:06 containers
drwx------ 3 root root 4096 Nov 11 14:06 image
drwxr-x--- 3 root root 4096 Nov 11 14:06 network
drwx--x--- 3 root root 4096 Nov 11 14:06 overlay2
drwx------ 4 root root 4096 Nov 11 14:06 plugins
drwx------ 2 root root 4096 Nov 11 14:06 runtimes
drwx------ 2 root root 4096 Nov 11 14:06 swarm
drwx------ 2 root root 4096 Nov 11 14:06 tmp
drwx------ 2 root root 4096 Nov 11 14:06 trust
drwx-----x 2 root root 4096 Nov 11 14:06 volumes
5. Add the host into a cluster using the sudo docker... command (shown in Rancher UI) or in RKE YAML
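The directory removals in step 3 above can also be expressed as a single loop. The preview_removals helper is my own addition; it only prints what it finds, and the echo should be swapped for rm -rf once the list has been double-checked:

```shell
#!/bin/sh
# Hypothetical helper: print each existing path instead of deleting it
preview_removals() {
  for d in "$@"; do
    [ -e "$d" ] && echo "would remove: $d"
  done
  return 0
}

# Paths taken from step 3 above
preview_removals /var/lib/docker /var/lib/rancher /var/lib/etcd /var/lib/kubelet \
  /etc/kubernetes /etc/cni /opt/cni /var/lib/cni /var/run/calico \
  /run/secrets/kubernetes.io /opt/rancher /opt/containerd /opt/rke
```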
[... in progress, to be verified ... ]
Kubernetes nodes in Rancher-managed downstream clusters run containers with their own deployment of containerd. The binaries are located in /var/lib/rancher/rke2/bin; they are not installed through the system package repositories.
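A quick way to check whether a node carries this bundled runtime is a plain directory check on the path mentioned above; has_rke2_bin is a helper name of my own:

```shell
#!/bin/sh
# Succeeds if the RKE2 binary directory (path from the text above) exists
has_rke2_bin() { [ -d "${1:-/var/lib/rancher/rke2/bin}" ]; }

if has_rke2_bin; then
  ls -l /var/lib/rancher/rke2/bin
else
  echo "/var/lib/rancher/rke2/bin not found on this node"
fi
```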
To reset a Rancher 2.7 downstream cluster node, use the following steps.
1. Deactivate (drain) the affected host in Rancher, then delete the node. Either in the Rancher UI or for the "local" cluster in RKE's YAML config.
2. Stop the RKE2 and rancher-system-agent services, then delete the related systemd unit files
systemctl stop rke2-server.service
systemctl stop rancher-system-agent.service
rm -f /etc/systemd/system/rancher-system*
rm -f /usr/local/lib/systemd/system/rke2-server.service
systemctl daemon-reload
This should (hopefully) stop all the containers (TO BE VERIFIED).
3. Remove Rancher, RKE and Kubernetes related data:
mount | grep kubelet | awk '{print $3}' | while read -r mount; do umount "$mount"; done
test -d /var/lib/docker && rm -rf /var/lib/docker/*
rm -rf /var/lib/rancher/*
rm -rf /var/lib/etcd
rm -rf /var/lib/kubelet/*
rm -rf /etc/kubernetes
rm -rf /etc/cni
rm -rf /opt/cni
rm -rf /var/lib/cni
rm -rf /var/run/calico
rm -rf /run/secrets/kubernetes.io
test -d /opt/rancher && rm -rf /opt/rancher # For Single Rancher installs
test -d /opt/containerd && rm -rf /opt/containerd
test -d /opt/rke && rm -rf /opt/rke
4. Reboot
reboot
Reboot the node and verify that no containerd-shim-runc-v2 processes are running.
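That final verification can be scripted as well. This is a hedged sketch relying only on pgrep (from procps); shims_running is my own helper name, not anything RKE2 provides:

```shell
#!/bin/sh
# Succeeds if any containerd-shim-runc-v2 process is still running
shims_running() {
  pgrep -f containerd-shim-runc-v2 >/dev/null 2>&1
}

if shims_running; then
  echo "WARNING: containerd-shim-runc-v2 processes still present:"
  pgrep -af containerd-shim-runc-v2
else
  echo "no containerd-shim-runc-v2 processes found"
fi
```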