When an (old) Rancher 2 managed Kubernetes cluster needed to be upgraded from Kubernetes 1.14 to 1.15, the upgrade failed, even though all the steps from the tutorial How to upgrade Kubernetes version in a Rancher 2 managed cluster were followed. What happened?
To enforce another (or newer) Kubernetes version, the original 3-node rke configuration YAML can be adjusted to specify the Kubernetes version to be used:
ck@linux:~/rancher$ cat RANCHER2_STAGE/3-node-rancher-stage.yml
nodes:
- address: 10.10.153.15
  user: ansible
  role: [controlplane,etcd,worker]
  ssh_key_path: ~/.ssh/id_rsa
- address: 10.10.153.16
  user: ansible
  role: [controlplane,etcd,worker]
  ssh_key_path: ~/.ssh/id_rsa
- address: 10.10.153.17
  user: ansible
  role: [controlplane,etcd,worker]
  ssh_key_path: ~/.ssh/id_rsa
kubernetes_version: "v1.15.12-rancher2-2"
services:
  etcd:
    snapshot: true
    creation: 6h
    retention: 24h
Note: Please read the special notes in How to upgrade Kubernetes version in a Rancher 2 managed cluster on which Kubernetes version to use; it depends on the RKE version.
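The Kubernetes versions supported by a given rke binary can be listed by the binary itself, using the documented --list-version flag (output shortened here):
ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 config --list-version --all
[...]
v1.15.12-rancher2-2
[...]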
With this minor change in the YAML config, rke up can be launched against the cluster and should upgrade the Kubernetes version (all other settings remain in place). In the following output, rke 1.1.2 was used against this staging cluster:
ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 up --config RANCHER2_STAGE/3-node-rancher-stage.yml
INFO[0000] Running RKE version: v1.1.2
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [state] Possible legacy cluster detected, trying to upgrade
INFO[0000] [reconcile] Rebuilding and updating local kube config
[...]
INFO[0056] Pre-pulling kubernetes images
INFO[0056] Pulling image [rancher/hyperkube:v1.15.12-rancher2] on host [10.10.153.15], try #1
INFO[0056] Pulling image [rancher/hyperkube:v1.15.12-rancher2] on host [10.10.153.17], try #1
INFO[0056] Pulling image [rancher/hyperkube:v1.15.12-rancher2] on host [10.10.153.16], try #1
INFO[0079] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.17]
INFO[0083] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.16]
INFO[0088] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.15]
INFO[0088] Kubernetes images pulled successfully
INFO[0088] [etcd] Building up etcd plane..
[...]
INFO[0167] [etcd] Successfully started etcd plane.. Checking etcd cluster health
WARN[0197] [etcd] host [10.10.153.15] failed to check etcd health: failed to get /health for host [10.10.153.15]: Get https://10.10.153.15:2379/health: remote error: tls: bad certificate
WARN[0223] [etcd] host [10.10.153.16] failed to check etcd health: failed to get /health for host [10.10.153.16]: Get https://10.10.153.16:2379/health: remote error: tls: bad certificate
WARN[0240] [etcd] host [10.10.153.17] failed to check etcd health: failed to get /health for host [10.10.153.17]: Get https://10.10.153.17:2379/health: remote error: tls: bad certificate
FATA[0240] [etcd] Failed to bring up Etcd Plane: etcd cluster is unhealthy: hosts [10.10.153.15,10.10.153.16,10.10.153.17] failed to report healthy. Check etcd container logs on each host for more information
The newer Kubernetes 1.15.12 images were successfully pulled, but in the end the Kubernetes upgrade failed due to a bad certificate.
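As the FATA message suggests, the etcd container logs on each host reveal the underlying TLS failure. The expiry date of the etcd certificates, which rke deploys under /etc/kubernetes/ssl by default, can also be checked directly with openssl; a hedged example on the first node (the certificate file name is derived from the node address):
ck@linux:~/rancher$ ssh ansible@10.10.153.15 "docker logs --tail 10 etcd"
[...]
ck@linux:~/rancher$ ssh ansible@10.10.153.15 "sudo openssl x509 -noout -enddate -in /etc/kubernetes/ssl/kube-etcd-10-10-153-15.pem"
notAfter=[...]
An expired certificate gives itself away with a notAfter date in the past.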
Unfortunately the error message does not reveal the exact reason why the Kubernetes certificates are considered "bad". But the most plausible reason is that the certificates expired. A similar problem was discovered once before and documented in Rancher 2 Kubernetes certificates expired! How to rotate your expired certificates. rke can be used to rotate and renew the Kubernetes certificates:
ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 cert rotate --config RANCHER2_STAGE/3-node-rancher-stage.yml
INFO[0000] Running RKE version: v1.1.2
INFO[0000] Initiating Kubernetes cluster
INFO[0000] Rotating Kubernetes cluster certificates
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating Kubernetes API server certificates
INFO[0000] [certificates] Generating Kube Controller certificates
INFO[0000] [certificates] Generating Kube Scheduler certificates
INFO[0000] [certificates] Generating Kube Proxy certificates
INFO[0001] [certificates] Generating Node certificate
INFO[0001] [certificates] Generating admin certificates and kubeconfig
INFO[0001] [certificates] Generating Kubernetes API server proxy client certificates
INFO[0001] [certificates] Generating kube-etcd-10-10-153-15 certificate and key
INFO[0001] [certificates] Generating kube-etcd-10-10-153-16 certificate and key
INFO[0001] [certificates] Generating kube-etcd-10-10-153-17 certificate and key
INFO[0002] Successfully Deployed state file at [RANCHER2_STAGE/3-node-rancher-stage.rkestate]
INFO[0002] Rebuilding Kubernetes cluster with rotated certificates
INFO[0002] [dialer] Setup tunnel for host [10.10.153.15]
INFO[0002] [dialer] Setup tunnel for host [10.10.153.16]
INFO[0002] [dialer] Setup tunnel for host [10.10.153.17]
INFO[0010] [certificates] Deploying kubernetes certificates to Cluster nodes
[...]
INFO[0058] [controlplane] Successfully restarted Controller Plane..
INFO[0058] [worker] Restarting Worker Plane..
INFO[0058] Restarting container [kubelet] on host [10.10.153.17], try #1
INFO[0058] Restarting container [kubelet] on host [10.10.153.15], try #1
INFO[0058] Restarting container [kubelet] on host [10.10.153.16], try #1
INFO[0060] [restart/kubelet] Successfully restarted container on host [10.10.153.17]
INFO[0060] Restarting container [kube-proxy] on host [10.10.153.17], try #1
INFO[0060] [restart/kubelet] Successfully restarted container on host [10.10.153.16]
INFO[0060] Restarting container [kube-proxy] on host [10.10.153.16], try #1
INFO[0060] [restart/kubelet] Successfully restarted container on host [10.10.153.15]
INFO[0060] Restarting container [kube-proxy] on host [10.10.153.15], try #1
INFO[0066] [restart/kube-proxy] Successfully restarted container on host [10.10.153.17]
INFO[0066] [restart/kube-proxy] Successfully restarted container on host [10.10.153.16]
INFO[0073] [restart/kube-proxy] Successfully restarted container on host [10.10.153.15]
INFO[0073] [worker] Successfully restarted Worker Plane..
The rke cert rotate command ran through successfully and the cluster remained up. After waiting a couple of minutes to make sure all services and pods were correctly restarted on the Kubernetes cluster, the Kubernetes upgrade can be launched again.
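Before re-running the upgrade, the renewed certificates and the cluster state can be verified; a minimal sketch, assuming kubectl is installed on the workstation and rke wrote its kubeconfig next to the cluster YAML (the default kube_config_<name>.yml naming):
ck@linux:~/rancher$ ssh ansible@10.10.153.15 "sudo openssl x509 -noout -enddate -in /etc/kubernetes/ssl/kube-etcd-10-10-153-15.pem"
notAfter=[...]
ck@linux:~/rancher$ kubectl --kubeconfig RANCHER2_STAGE/kube_config_3-node-rancher-stage.yml get nodes
ck@linux:~/rancher$ kubectl --kubeconfig RANCHER2_STAGE/kube_config_3-node-rancher-stage.yml get pods --all-namespaces
The notAfter date should now lie in the future, all three nodes should report Ready, and the system pods should be back to Running.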
The big question is of course: Was the certificate rotation enough to fix the problem? Let's find out:
ck@linux:~/rancher$ ./rke_linux-amd64-1.1.2 up --config RANCHER2_STAGE/3-node-rancher-stage.yml
INFO[0000] Running RKE version: v1.1.2
INFO[0000] Initiating Kubernetes cluster
INFO[0000] [certificates] GenerateServingCertificate is disabled, checking if there are unused kubelet certificates
INFO[0000] [certificates] Generating admin certificates and kubeconfig
INFO[0000] Successfully Deployed state file at [RANCHER2_STAGE/3-node-rancher-stage.rkestate]
INFO[0000] Building Kubernetes cluster
INFO[0000] [dialer] Setup tunnel for host [10.10.153.17]
INFO[0000] [dialer] Setup tunnel for host [10.10.153.15]
INFO[0000] [dialer] Setup tunnel for host [10.10.153.16]
[...]
INFO[0027] Pre-pulling kubernetes images
INFO[0027] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.16]
INFO[0027] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.15]
INFO[0027] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.17]
INFO[0027] Kubernetes images pulled successfully
INFO[0027] [etcd] Building up etcd plane..
[...]
INFO[0126] Checking if container [kubelet] is running on host [10.10.153.15], try #1
INFO[0126] Image [rancher/hyperkube:v1.15.12-rancher2] exists on host [10.10.153.15]
INFO[0126] Checking if container [old-kubelet] is running on host [10.10.153.15], try #1
INFO[0126] Stopping container [kubelet] on host [10.10.153.15] with stopTimeoutDuration [5s], try #1
INFO[0127] Waiting for [kubelet] container to exit on host [10.10.153.15]
INFO[0127] Renaming container [kubelet] to [old-kubelet] on host [10.10.153.15], try #1
INFO[0127] Starting container [kubelet] on host [10.10.153.15], try #1
INFO[0127] [worker] Successfully updated [kubelet] container on host [10.10.153.15]
INFO[0127] Removing container [old-kubelet] on host [10.10.153.15], try #1
INFO[0128] [healthcheck] Start Healthcheck on service [kubelet] on host [10.10.153.15]
INFO[0155] [healthcheck] service [kubelet] on host [10.10.153.15] is healthy
[...]
INFO[0387] [sync] Syncing nodes Labels and Taints
INFO[0387] [sync] Successfully synced nodes Labels and Taints
INFO[0387] [network] Setting up network plugin: canal
INFO[0387] [addons] Saving ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0387] [addons] Successfully saved ConfigMap for addon rke-network-plugin to Kubernetes
INFO[0387] [addons] Executing deploy job rke-network-plugin
INFO[0402] [addons] Setting up coredns
INFO[0402] [addons] Saving ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0402] [addons] Successfully saved ConfigMap for addon rke-coredns-addon to Kubernetes
INFO[0402] [addons] Executing deploy job rke-coredns-addon
INFO[0418] [addons] CoreDNS deployed successfully
INFO[0418] [dns] DNS provider coredns deployed successfully
INFO[0418] [addons] Setting up Metrics Server
INFO[0418] [addons] Saving ConfigMap for addon rke-metrics-addon to Kubernetes
INFO[0418] [addons] Successfully saved ConfigMap for addon rke-metrics-addon to Kubernetes
INFO[0418] [addons] Executing deploy job rke-metrics-addon
INFO[0438] [addons] Metrics Server deployed successfully
INFO[0438] [ingress] Setting up nginx ingress controller
INFO[0438] [addons] Saving ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0438] [addons] Successfully saved ConfigMap for addon rke-ingress-controller to Kubernetes
INFO[0438] [addons] Executing deploy job rke-ingress-controller
INFO[0448] [ingress] ingress controller nginx deployed successfully
INFO[0448] [addons] Setting up user addons
INFO[0448] [addons] no user addons defined
INFO[0448] Finished building Kubernetes cluster successfully
This time, rke continued its job and upgraded the Kubernetes cluster. Instead of a certificate error showing up, rke deployed the new Kubernetes version and renamed the previous K8s related containers (e.g. old-kubelet) before replacing them. While the upgrade was running, the progress could also be followed in the Rancher 2 user interface, where the node currently being upgraded shows up as cordoned.
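The new version can be confirmed with kubectl against the same kubeconfig; hypothetical output, with all three nodes now reporting v1.15.12:
ck@linux:~/rancher$ kubectl --kubeconfig RANCHER2_STAGE/kube_config_3-node-rancher-stage.yml get nodes
NAME           STATUS   ROLES                      AGE    VERSION
10.10.153.15   Ready    controlplane,etcd,worker   [...]  v1.15.12
10.10.153.16   Ready    controlplane,etcd,worker   [...]  v1.15.12
10.10.153.17   Ready    controlplane,etcd,worker   [...]  v1.15.12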
Depending on the load balancing or reverse proxy setup, a redirect problem might show up after this Kubernetes version upgrade when trying to access the Rancher 2 user interface.
With this Kubernetes upgrade also came a newer Nginx ingress version, which stopped listening on plain http and switched to https. If the reverse proxy was previously talking plain http to the Rancher 2 nodes, it should now use https instead. Here is the relevant snippet for HAProxy:
#####################
# Rancher
#####################
backend rancher-out
    balance hdr(X-Forwarded-For)
    timeout server 1h
    option httpchk GET /healthz HTTP/1.1\r\nHost:\ rancher2-stage.example.com\r\nConnection:\ close
    server onl-ran01-s 10.10.153.15:443 check ssl verify none inter 1000 fall 1 rise 2
    server onl-ran02-s 10.10.153.16:443 check ssl verify none inter 1000 fall 1 rise 2
    server onl-ran03-s 10.10.153.17:443 check ssl verify none inter 1000 fall 1 rise 2
Because the Kubernetes certificates generated by Rancher 2 (the default setting in rke) are self-signed, the option "ssl verify none" should be added to the server lines in HAProxy's backend configuration.
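Whether a node's ingress really only answers on https anymore can be verified quickly with curl; a hypothetical check against the first node, where -k skips verification of the self-signed certificate:
ck@linux:~/rancher$ curl -sI -H "Host: rancher2-stage.example.com" http://10.10.153.15/
ck@linux:~/rancher$ curl -skI -H "Host: rancher2-stage.example.com" https://10.10.153.15/
The https request should return the Rancher 2 response headers, while the plain http request no longer behaves as before, matching the redirect problem described above.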