Having working monitoring you can rely on is key for a production environment. But monitoring should not just be there to return green or red, OK or CRITICAL, 0 or 1. Well-implemented monitoring also reduces troubleshooting time by already pointing in the direction where a problem occurs.
And sometimes good monitoring even detects broken things which are not shown by the application itself.
This happened a few weeks ago with a Kubernetes cluster managed by Rancher 2. As we use the open-source monitoring plugin check_rancher2 for Rancher-managed Kubernetes clusters, our monitoring started to alert about a node being stuck in the cluster registering phase.
On the command line, the output looks like this:
$ ./check_rancher2.sh -H rancher.example.com -U token-xxxxx -P "secret" -S -t node
CHECK_RANCHER2 CRITICAL - null in cluster c-zs42v is registering -|'nodes_total'=67;;;; 'node_errors'=1;;;; 'node_ignored'=0;;;;
A couple of eyebrows went up when this alert appeared. Why is the node's name set to "null" instead of a real host name? Why is this particular node stuck in the "registering" phase? And why does this not show up in the Rancher 2 user interface?
At least the cluster ID is shown by check_rancher2, so we have an additional hint to follow. By using the -t info check type, all Kubernetes clusters managed by this Rancher 2 setup can be listed:
$ ./check_rancher2.sh -H rancher.example.com -U token-xxxxx -P "secret" -S -t info
CHECK_RANCHER2 OK - Found 9 clusters: c-6p529 alias me-prod - c-dzfvn alias prod-ext - c-gsczw alias aws-prod - c-hmgcp alias prod-int - c-pls9j alias vamp - c-s2c8b alias gamma - c-xjvzp alias et-prod - c-zhsdr alias azure-prod - local alias local - and 25 projects: [...] |'clusters'=9;;;; 'projects'=25;;;;
The important part: There is no such cluster with the ID c-zs42v!
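As a quick cross-check, the cluster resource can also be queried directly in the API. The following is just a sketch (re-using the same API token as above); a missing cluster should result in a 404, or at least a non-200, HTTP status code:
$ curl -s -o /dev/null -w "%{http_code}\n" -u token-xxxxx:secret https://rancher.example.com/v3/clusters/c-zs42v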
By running kubectl against the Rancher 2 (local) cluster, additional information can be retrieved from the Kubernetes API. In this particular situation we focus on the namespaces, as each cluster created by Rancher 2 (RKE) also gets its own namespace in the local cluster:
$ kubectl get ns
NAME                                         STATUS   AGE
c-6p529                                      Active   482d
c-dzfvn                                      Active   138d
c-gsczw                                      Active   524d
c-hmgcp                                      Active   54d
c-jfxkq                                      Active   54d
c-pls9j                                      Active   606d
c-s2c8b                                      Active   3y15d
c-xjvzp                                      Active   628d
c-zhsdr                                      Active   523d
c-zs42v                                      Active   55d
cattle-global-data                           Active   2y12d
cattle-global-nt                             Active   273d
cattle-system                                Active   3y15d
cluster-fleet-default-c-6p529-0a63de8fc176   Active   17m
cluster-fleet-default-c-dzfvn-db0ece01cc3b   Active   17m
cluster-fleet-default-c-gsczw-b67c2a857200   Active   17m
cluster-fleet-default-c-hmgcp-684fbe9142cb   Active   17m
cluster-fleet-default-c-pls9j-b8ab525e0c29   Active   17m
cluster-fleet-default-c-s2c8b-4e26ad7ae3c1   Active   17m
cluster-fleet-default-c-xjvzp-0b65f14fef6c   Active   17m
cluster-fleet-default-c-zhsdr-955f3b1ac907   Active   17m
cluster-fleet-local-local-1a3d67d0a899       Active   17m
default                                      Active   3y15d
fleet-clusters-system                        Active   18m
fleet-default                                Active   17m
fleet-local                                  Active   17m
[...]
All the known cluster IDs (seen before with the -t info check of check_rancher2) show up in the namespace list. But there is an additional namespace, c-zs42v: our missing cluster!
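To spot such an orphaned namespace more systematically, the cluster IDs known to the Rancher 2 API can be compared with the c-* namespaces in the local cluster. The following commands are only a sketch (re-using the API token from above and assuming jq is available; the /tmp file names are purely illustrative):
$ curl -s -u token-xxxxx:secret https://rancher.example.com/v3/clusters | jq -r '.data[].id' | grep '^c-' | sort > /tmp/api-clusters
$ kubectl get ns -o name | grep -E '^namespace/c-[0-9a-z]+$' | cut -d/ -f2 | sort > /tmp/ns-clusters
$ diff /tmp/api-clusters /tmp/ns-clusters
Namespaces which have no matching cluster in the API output (such as c-zs42v in this case) then show up on the ">" side of the diff.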
As we know from check_rancher2, there is a node stuck trying to register in this cluster. By looking at the cluster registration tokens in this namespace, we can find out which user launched this operation:
$ kubectl get clusterregistrationtokens.management.cattle.io --namespace c-zs42v -o json
{
    "apiVersion": "v1",
    "items": [
        {
            "apiVersion": "management.cattle.io/v3",
            "kind": "ClusterRegistrationToken",
            "metadata": {
                "annotations": {
                    "field.cattle.io/creatorId": "u-buuctqjhrm"
                },
                "creationTimestamp": "2021-09-28T12:08:57Z",
                "generateName": "crt-",
                "generation": 1,
                "labels": {
                    "cattle.io/creator": "norman"
                },
                "managedFields": [
[...]
With this information we now have an exact timestamp (creationTimestamp) and the user ID (creatorId).
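For the record, these two fields can also be extracted directly with jq instead of scrolling through the full JSON output. This is just a one-liner sketch; with a single registration token in the namespace it should return something like:
$ kubectl get clusterregistrationtokens.management.cattle.io -n c-zs42v -o json | jq -r '.items[].metadata | .annotations["field.cattle.io/creatorId"] + " " + .creationTimestamp'
u-buuctqjhrm 2021-09-28T12:08:57Z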
After the user ID was matched to another cluster administrator, we asked this user what had happened on that day. It turned out that this person had tried to create a new cluster in Rancher 2 but forgot to create the required security groups (firewall rules). This led to a cluster in a failed state, unable to actually deploy Kubernetes. The user then deleted the cluster in the Rancher 2 user interface. As the cluster disappeared, the user assumed everything was fine and went on to create another cluster (this time successfully).
But, as our monitoring shows, something was still lingering in the background. We now know the reason why - but we still need to clean it up.
The check_rancher2 monitoring plugin reads the node information from the Rancher 2 API (accessible under the /v3 path). Even though the node's name is shown as "null", we can still query the API and use jq to filter the JSON output for a specific cluster:
$ curl -s -u token-xxxxx:secret https://rancher.example.com/v3/nodes | jq -r '.data[] | select(.clusterId == "c-zs42v")'
{
  "appliedNodeVersion": 0,
  "baseType": "node",
  "clusterId": "c-zs42v",
  "conditions": [
    {
      "status": "True",
      "type": "Initialized"
    },
    {
      "message": "waiting to register with Kubernetes",
      "status": "Unknown",
      "type": "Registered"
    },
    {
      "status": "True",
      "type": "Provisioned"
    }
  ],
  "controlPlane": true,
  "created": "2021-09-28T12:46:57Z",
  "createdTS": 1632833217000,
  "creatorId": null,
  "customConfig": {
    "address": "10.10.204.124",
    "type": "/v3/schemas/customConfig"
  },
  "dockerInfo": {
    "debug": false,
    "experimentalBuild": false,
    "type": "/v3/schemas/dockerInfo"
  },
  "etcd": true,
  "id": "c-zs42v:m-a4c0d00d69b6",
  "imported": true,
  "info": {
    "cpu": {
      "count": 0
    },
    "kubernetes": {
      "kubeProxyVersion": "",
      "kubeletVersion": ""
    },
    "memory": {
      "memTotalKiB": 0
    },
    "os": {
      "dockerVersion": "",
      "kernelVersion": "",
      "operatingSystem": ""
    }
  },
  "ipAddress": "10.10.204.124",
  "links": {
    "remove": "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6",
    "self": "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6",
    "update": "https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6"
  },
  "name": "",
  "namespaceId": null,
  "nodePoolId": "",
  "nodeTemplateId": null,
  "requestedHostname": "xyz-node1",
  "sshUser": "root",
  "state": "registering",
  "transitioning": "yes",
  "transitioningMessage": "waiting to register with Kubernetes",
  "type": "node",
  "unschedulable": false,
  "uuid": "56e0a435-8ad5-48c3-9560-69d0592a9afa",
  "worker": false
}
Thanks to this detailed output, we now also know the original IP address ("ipAddress": "10.10.204.124") and the host name ("requestedHostname": "xyz-node1"). We can also see the same state ("state": "registering") that the monitoring reported. And even though check_rancher2 showed "null" as the node name (retrieved from the empty "name" field), there is a unique ID for this node: "id": "c-zs42v:m-a4c0d00d69b6".
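If only these key fields are of interest, the jq filter from above can be narrowed down a bit further. Again just a sketch, which with this node data should print something along these lines:
$ curl -s -u token-xxxxx:secret https://rancher.example.com/v3/nodes | jq -r '.data[] | select(.clusterId == "c-zs42v") | [.id, .ipAddress, .requestedHostname, .state] | @tsv'
c-zs42v:m-a4c0d00d69b6   10.10.204.124   xyz-node1   registering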
The API output also contains the API URLs (links) to access this specific node. By opening one of these URLs in the browser (while already being logged in to the Rancher 2 UI), the same output can be seen. In addition to the JSON output, multiple operations, including delete, can be triggered on the right side.
Triggering the delete operation opens an "API Request" layer where the resulting API request is shown as a curl command. It can also be executed directly by clicking the [Send Request] button.
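The request shown there is essentially a DELETE on the node's remove link from the API output above. From the command line, the equivalent request should look roughly like this (a sketch, re-using the same API token as before):
$ curl -s -u token-xxxxx:secret -X DELETE https://rancher.example.com/v3/nodes/c-zs42v:m-a4c0d00d69b6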
As soon as this request was executed, the node was finally (and properly) deleted from the API.
The same check_rancher2 node check now returns OK:
$ ./check_rancher2.sh -H rancher.example.com -U token-xxxxx -P "secret" -S -t node
CHECK_RANCHER2 OK - All 66 nodes are active|'nodes_total'=66;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;;
Although Kubernetes is currently called "the de facto container infrastructure", it is anything but easy. Its complexity adds additional problems and considerations. We at Infiniroot love to share our troubleshooting knowledge when we need to tackle certain issues - but we also know this is not for everyone ("it just needs to work"). So if you are looking for a managed and dedicated Kubernetes environment, managed by Rancher 2, with server location in Switzerland or even in your own on-premises datacenter, check out our Private Kubernetes Container Cloud Infrastructure service at Infiniroot.