For the last couple of days we've witnessed an increasing number of slow nodes in one of our production Kubernetes clusters managed by Rancher 2. The workaround was always to cordon and drain the affected node and then uncordon the node again. But what is actually causing this slowness? And why only on a single node and not on all nodes of the cluster? We're about to find out, the troubleshooting way!
Our Icinga 2 monitoring alerted about very high IOWAITs on the CPU of one particular Kubernetes node. These IOWAITs started all of a sudden (at 7am), as can be nicely seen in the following graph:
On the node itself, the IOWAITs were confirmed with iostat:
root@k8snode:~# iostat 1
avg-cpu: %user %nice %system %iowait %steal %idle
2.01 0.00 1.63 45.04 1.76 49.56
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.00 0.00 0 0
loop1 0.00 0.00 0.00 0 0
loop2 0.00 0.00 0.00 0 0
loop3 0.00 0.00 0.00 0 0
loop4 0.00 0.00 0.00 0 0
loop5 0.00 0.00 0.00 0 0
nvme0n1 83.00 0.00 1020.00 0 1020
nvme1n1 100.00 400.00 32.00 400 32
dm-0 132.00 0.00 1020.00 0 1020
After a hardware defect (a dying disk could be the reason for such IOWAIT spikes) was ruled out, it was time to find the container causing this.
Obviously some container is using a lot of IO. Using docker stats all containers can be listed with their block i/o statistics:
root@k8snode:~# docker stats --format "table {{.Container}}\t{{.BlockIO}}" --no-stream --no-trunc
CONTAINER BLOCK I/O
62c08d56d45d 262kB / 0B
82f35e25edf3 0B / 57.3kB
13fa7f80c99b 0B / 24.6kB
e445587401aa 0B / 847MB
04eb41b7a084 0B / 1.86GB
27eaa630ec84 0B / 0B
3341a7bea2d7 0B / 0B
21ac333b0e33 0B / 32.8kB
700aee319ee6 0B / 0B
e36a3d2dd55b 0B / 32.8kB
dafe40a694ed 0B / 32.8kB
293e4febf708 0B / 32.8kB
6c75c58a1830 0B / 0B
a27279e0681b 0B / 0B
7a1e141f83f9 0B / 0B
0ab7ce23d3ab 0B / 0B
7ae0a5533bfb 0B / 0B
0080651a7fc8 0B / 0B
5e49d58e98c7 0B / 0B
e0dcf15afd92 262kB / 0B
260a9dc57cda 61.5GB / 478MB
f60d8c63409b 0B / 0B
a428551fed97 0B / 0B
2ecb2bc61ab1 0B / 79.4MB
34a18c45cb50 0B / 0B
One container (260a9dc57cda) stand out with a total of 61.5 GB (!!!) read and 478 MB written to the disk. Let's get some additional stats about that container:
root@k8snode:~# docker stats 260a9dc57cda --format "table {{.Container}}\t{{.BlockIO}}\t{{.CPUPerc}}\t{{.MemPerc}}\t{{.MemUsage}}" --no-stream
CONTAINER BLOCK I/O CPU % MEM % MEM USAGE / LIMIT
260a9dc57cda 62.5GB / 478MB 0.58% 99.97% 999.7MiB / 1000MiB
And while we're at it, let's find out the containers (human) name:
root@k8snode:~# docker stats 260a9dc57cda --format "table {{.Container}}\t{{.Name}}" --no-stream
CONTAINER NAME
260a9dc57cda k8s_prometheus_prometheus-cluster-monitoring-0_cattle-prometheus_ca3e7366-fc1f-11ea-acee-067cfe15a57a_1
So we've got two interesting facts here:
1) This is the container running Prometheus monitoring. In this Rancher 2 cluster monitoring is enabled. When cluster monitoring is enabled, Rancher 2 fires up Prometheus in the background. Seems like Prometheus is running on this particular node and is somehow causing problems.
2) The memory usage in percent column shows almost 100% memory used. According to the docker stats documentation, MemPerc means:
the percentage of the host’s CPU and memory the container is using
But the host itself still has plenty of memory available. This can be verified with free:
root@k8snode:~# free
total used free shared buff/cache available
Mem: 32525732 7304924 12375740 4940 12845068 26222492
Swap: 4194300 1069696 3124604
And systems monitoring also confirms there is still a lot of memory available:
Now that it is known that Prometheus (the Kubernetes monitoring enabled by Rancher 2) is causing problems, what does the Rancher 2 user interface have to say?
Interestingly the overview of the affected cluster already showed a red "Monitoring API is not ready", which kind of gives a hint:
To view details, one can change into the "System" project and within the "cattle-prometheus" namespace, the "prometheus-cluster-monitoring" workload is showing problems (although still marked in Active State). To be more precise: The pod(s) of the workload are showing up as red (see "Scale" column):
With the current findings we know that:
Using iotop on the node reveals kworker flush and our now known prometheus container as top IO processes:
root@k8snode:~# iotop
Total DISK READ : 0.00 B/s | Total DISK WRITE : 0.00 B/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 971.53 K/s
PID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
9854 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % [kworker/u16:2+flush-259:1]
22841 be/4 ubuntu 0.00 B/s 0.00 B/s 0.00 % 27.46 % prometheus --web.console.templates=/etc/prometheus/consoles --web.console.libraries=/etc/prometheu
s/console_libraries --config.file=/etc/prometheus/config_out/prometheus.env.yaml --storage.tsdb.path=/prometheus --storage.tsdb.retention=12h --web.enable-lif
ecycle --storage.tsdb.no-lockfile --web.route-prefix=/ --web.listen-address=127.0.0.1:9090
1243 be/4 root 0.00 B/s 0.00 B/s 0.00 % 1.49 % containerd
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init
[...]
Could this container be swapping? By using the process ID (22841), the PIDs memory consumption can be checked in /proc/22841/smaps. Each swap entry can be collected to find the full swap usage of this process:
root@k8snode:~# SWAPUSAGE=0; for SWAP in $(grep Swap /proc/22841/smaps|awk '{print $2}'); do let SWAPUSAGE=${SWAPUSAGE}+${SWAP}; done; echo $SWAPUSAGE
1992728
Yes, this process (and therefore this container) is definitely swapping. As it uses 100% memory (from what? we're about to find out later), memory is constantly being swapped out read again - causing extremely high I/O. So far this would explain the IOWAITs, but why is this happening?
By using docker stats above, we found out that the container is using 100% of memory. But the node itself has still plenty of memory available. There is only one possibility: This container has resource limits!
However docker inspect didn't show any information related to limits. Maybe the container start command (docker run) would show more relevant information? To retrieve this information from an already running container, the runlike command is very helpful. This is a python program which can be installed using pip:
root@k8snode:~# pip install runlike
Once installed, runlike awaits the container id as input. From the previous troubleshooting steps we already know that ID:
root@k8snode:~# runlike 260a9dc57cda
docker run --name=k8s_prometheus_prometheus-cluster-monitoring-0_cattle-prometheus_ca3e7366-fc1f-11ea-acee-067cfe15a57a_1 --hostname=prometheus-cluster-monitoring-0 --user=1000 --env=ACCESS_PROMETHEUS_PORT_80_TCP_PORT=80 --env=KUBERNETES_PORT_443_TCP_PROTO=tcp --env=ACCESS_GRAFANA_SERVICE_HOST=10.43.47.52 --env=KUBERNETES_SERVICE_PORT_HTTPS=443 --env=KUBERNETES_PORT=tcp://10.43.0.1:443 --env=KUBERNETES_PORT_443_TCP=tcp://10.43.0.1:443 --env=ACCESS_PROMETHEUS_SERVICE_HOST=10.43.132.44 --env=ACCESS_PROMETHEUS_SERVICE_PORT=80 --env=ACCESS_PROMETHEUS_SERVICE_PORT_HTTP=80 --env=ACCESS_PROMETHEUS_PORT_80_TCP_PROTO=tcp --env=KUBERNETES_SERVICE_HOST=10.43.0.1 --env=ACCESS_GRAFANA_PORT_80_TCP_PROTO=tcp --env=ACCESS_GRAFANA_PORT_80_TCP_PORT=80 --env=KUBERNETES_PORT_443_TCP_PORT=443 --env=KUBERNETES_PORT_443_TCP_ADDR=10.43.0.1 --env=ACCESS_GRAFANA_SERVICE_PORT_HTTP=80 --env=ACCESS_GRAFANA_PORT=tcp://10.43.47.52:80 --env=ACCESS_GRAFANA_PORT_80_TCP=tcp://10.43.47.52:80 --env=ACCESS_PROMETHEUS_PORT=tcp://10.43.132.44:80 --env=ACCESS_PROMETHEUS_PORT_80_TCP_ADDR=10.43.132.44 --env=KUBERNETES_SERVICE_PORT=443 --env=ACCESS_GRAFANA_SERVICE_PORT=80 --env=ACCESS_GRAFANA_PORT_80_TCP_ADDR=10.43.47.52 --env=ACCESS_PROMETHEUS_PORT_80_TCP=tcp://10.43.132.44:80 --env=PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~empty-dir/config-out:/etc/prometheus/config_out:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~empty-dir/prometheus-cluster-monitoring-db:/prometheus' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~configmap/prometheus-cluster-monitoring-rulefiles-0:/etc/prometheus/rules/prometheus-cluster-monitoring-rulefiles-0:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~secret/secret-exporter-etcd-cert:/etc/prometheus/secrets/exporter-etcd-cert:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~configmap/configmap-prometheus-cluster-monitoring-nginx:/etc/prometheus/configmaps/prometheus-cluster-monitoring-nginx:ro' --volume='/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/volumes/kubernetes.io~secret/cluster-monitoring-token-5zhd7:/var/run/secrets/kubernetes.io/serviceaccount:ro' --volume=/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/etc-hosts:/etc/hosts --volume=/var/lib/kubelet/pods/ca3e7366-fc1f-11ea-acee-067cfe15a57a/containers/prometheus/415dc099:/dev/termination-log --volume=/prometheus --network=container:ac6587b4a86b605b7937586f70efbc9e2fedf6fae45ea91ff46a8a7a4a0d0973 --workdir=/prometheus --expose=9090/tcp --restart=no --label='maintainer=The Prometheus Authors
But even in the original docker run command, no limits could be found. Is it surprising? Probably not because Kubernetes is used as orchestration layer to manage all these containers. So it's probably time to get into the Kubernetes cluster and search for answers there.
First let's find out what exactly is running inside the cattle-prometheus namespace (which we found out from the Rancher UI above):
> kubectl get all -n cattle-prometheus
NAME READY STATUS RESTARTS AGE
pod/exporter-kube-state-cluster-monitoring-75567dff98-5nl8c 1/1 Running 0 11d
pod/exporter-node-cluster-monitoring-4clsp 1/1 Running 0 14d
pod/exporter-node-cluster-monitoring-7mmtr 1/1 Running 3 97d
pod/exporter-node-cluster-monitoring-bwv7w 1/1 Running 0 97d
pod/exporter-node-cluster-monitoring-mzmfz 1/1 Running 0 97d
pod/exporter-node-cluster-monitoring-tj2hd 1/1 Running 1 97d
pod/exporter-node-cluster-monitoring-vhntm 1/1 Running 0 97d
pod/exporter-node-cluster-monitoring-wgtbz 1/1 Running 0 97d
pod/exporter-node-cluster-monitoring-wjxdw 1/1 Running 0 97d
pod/grafana-cluster-monitoring-58459df585-96w4r 2/2 Running 0 39h
pod/prometheus-cluster-monitoring-0 5/5 Running 31 39h
pod/prometheus-operator-monitoring-operator-c474f7475-8zsqq 1/1 Running 0 97d
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/access-grafana ClusterIP 10.43.47.52
service/access-prometheus ClusterIP 10.43.132.44
service/expose-grafana-metrics ClusterIP None
service/expose-kube-cm-metrics ClusterIP None
service/expose-kube-etcd-metrics ClusterIP None
service/expose-kube-scheduler-metrics ClusterIP None
service/expose-kubelets-metrics ClusterIP None
service/expose-kubernetes-metrics ClusterIP None
service/expose-node-metrics ClusterIP None
service/expose-operator-metrics ClusterIP None
service/expose-prometheus-metrics ClusterIP None
service/prometheus-operated ClusterIP None
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
daemonset.apps/exporter-node-cluster-monitoring 8 8 6 8 6 beta.kubernetes.io/os=linux 97d
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/exporter-kube-state-cluster-monitoring 1/1 1 1 97d
deployment.apps/grafana-cluster-monitoring 1/1 1 1 97d
deployment.apps/prometheus-operator-monitoring-operator 1/1 1 1 97d
NAME DESIRED CURRENT READY AGE
replicaset.apps/exporter-kube-state-cluster-monitoring-75567dff98 1 1 1 97d
replicaset.apps/grafana-cluster-monitoring-576c5d879f 0 0 0 97d
replicaset.apps/grafana-cluster-monitoring-58459df585 1 1 1 39h
replicaset.apps/grafana-cluster-monitoring-664d8494c6 0 0 0 39h
replicaset.apps/prometheus-operator-monitoring-operator-c474f7475 1 1 1 97d
NAME READY AGE
statefulset.apps/prometheus-cluster-monitoring 1/1 97d
The kubectl get all command inside the cattle-prometheus namespace (-n cattle-prometheus) shows all the pods, services and actual deployments. Most interesting here is the pod/prometheus-cluster-monitoring-0 (highlighted in the above output). It was already restarted 31 times - definitely an eye-catching fact.
Let's find out more about this specific pod:
> kubectl describe pods -n cattle-prometheus prometheus-cluster-monitoring-0
Name: prometheus-cluster-monitoring-0
Namespace: cattle-prometheus
Priority: 0
PriorityClassName:
Node: k8snode/10.11.3.61
Start Time: Mon, 21 Sep 2020 15:33:25 +0000
Labels: app=prometheus
chart=prometheus-0.0.1
controller-revision-hash=prometheus-cluster-monitoring-6ff5559496
monitoring.coreos.com=true
prometheus=cluster-monitoring
release=cluster-monitoring
statefulset.kubernetes.io/pod-name=prometheus-cluster-monitoring-0
Annotations: cattle.io/timestamp: 2020-09-21T15:22:26Z
cni.projectcalico.org/podIP: 10.42.5.152/32
field.cattle.io/ports:
[[{"containerPort":80,"dnsName":"prometheus-cluster-monitoring-","name":"http","protocol":"TCP","sourcePort":0}],[{"containerPort":9090,"d...
Status: Running
IP: 10.42.5.152
Controlled By: StatefulSet/prometheus-cluster-monitoring
Containers:
prometheus:
Container ID: docker://260a9dc57cdaae59106fad9eabfd3a346641a71ee103453e4335afa46ada6706
Image: rancher/prom-prometheus:v2.7.1
Image ID: docker-pullable://rancher/prom-prometheus@sha256:b32864cfe5f7f5d146820ddc6dcc5f78f40d8ab6d55e80decd5fc222b16344dc
Port: 80/TCP
Host Port: 0/TCP
Args:
--web.console.templates=/etc/prometheus/consoles
--web.console.libraries=/etc/prometheus/console_libraries
--config.file=/etc/prometheus/config_out/prometheus.env.yaml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention=12h
--web.enable-lifecycle
--storage.tsdb.no-lockfile
--web.route-prefix=/
--web.listen-address=127.0.0.1:9090
State: Running
Started: Mon, 21 Sep 2020 15:33:39 +0000
Ready: True
Restart Count: 1
Limits:
cpu: 1
memory: 1000Mi
Requests:
cpu: 750m
memory: 750Mi
Environment:
[... cutting mounts and rest until events ...]
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 43m (x223 over 24h) kubelet, onl-radoaie26-p Liveness probe failed: Get http://10.42.5.152:9090/-/healthy: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning BackOff 9m13s (x126 over 103m) kubelet, onl-radoaie26-p Back-off restarting failed container
Warning Unhealthy 4m16s (x328 over 24h) kubelet, onl-radoaie26-p Readiness probe failed: Get http://10.42.5.152:9090/-/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
And at last, the limits were found! From the pod's description we can see that a cpu limit of 1 (which means Prometheus can use 100% of one CPU core of the node) and a memory limit of 1000MB were defined. 1000MB - haven't we seen this before?
Memory usage: 999.7MiB / 1000MiB
Ah yes, we did! This finally explains why docker stats showed a ~100% memory usage of this container; because it has a 1000MB memory limit and Prometheus in it simply uses up everything. As this is a 8 node Kubernetes cluster in production and with quite some workloads and pods deployed, it makes sense that Prometheus after a short period of time (39 hours according to kubectl get all output above) is simply getting overwhelmed with the amount of data coming in.
The solution? Increase that memory limit and give Prometheus more air (memory) to breath!
In Kubernetes, limits are defined on a deployment (workload in Rancher 2 terms) and not on a single pod - or basically the deployment (e.g. compose file) tells Kubernetes what kinds of limits to apply on which pods part of that deployment.
Now the big question is: Can we adjust these limits in Kubernetes? The answer is: YES (see this stackoverflow question)! With kubectl edit an already existing/deployed deployment can be adjusted.
From the pod description above we can see which service is responsible for the pod (= manages the pod):
> kubectl describe pods -n cattle-prometheus prometheus-cluster-monitoring-0 | grep -i controlled
Controlled By: StatefulSet/prometheus-cluster-monitoring
With this information, we can now run kubectl edit on StatefulSet/prometheus-cluster-monitoring:
> kubectl edit StatefulSet/prometheus-cluster-monitoring -n cattle-prometheus
[...]
resources:
limits:
cpu: "1"
memory: 1000Mi
requests:
cpu: 750m
memory: 750Mi
[...]
kubectl edit (on a correct target deployment) basically opens an editor where the relevant configuration can be adjusted. In this case the 1000Mi limit was increased to 2000Mi:
So far so good - but the Prometheus workload or at least the prometheus-cluster-monitoring-0 pod needs to be restarted.
The service responsible for the pod prometheus-cluster-monitoring-0 (which we know is StatefulSet/prometheus-cluster-monitoring) can now be told to downscale its number of pods to zero by setting --replicas=0:
> kubectl scale StatefulSet/prometheus-cluster-monitoring -n cattle-prometheus --replicas=0
statefulset.apps/prometheus-cluster-monitoring scaled
Right after this command, the pods inside the cattle-prometheus can be listed and this shows that pod prometheus-cluster-monitoring-0 is terminating:
> kubectl get pods -n cattle-prometheus
NAME READY STATUS RESTARTS AGE
exporter-kube-state-cluster-monitoring-75567dff98-5nl8c 1/1 Running 0 11d
exporter-node-cluster-monitoring-4clsp 1/1 Running 0 14d
exporter-node-cluster-monitoring-7mmtr 1/1 Running 3 97d
exporter-node-cluster-monitoring-bwv7w 1/1 Running 0 97d
exporter-node-cluster-monitoring-mzmfz 1/1 Running 0 97d
exporter-node-cluster-monitoring-tj2hd 1/1 Running 1 97d
exporter-node-cluster-monitoring-vhntm 1/1 Running 0 97d
exporter-node-cluster-monitoring-wgtbz 1/1 Running 0 97d
exporter-node-cluster-monitoring-wjxdw 1/1 Running 0 97d
grafana-cluster-monitoring-58459df585-96w4r 2/2 Running 0 40h
prometheus-cluster-monitoring-0 4/5 Terminating 35 40h
prometheus-operator-monitoring-operator-c474f7475-8zsqq 1/1 Running 0 97d
A few seconds later, the pod is gone:
> kubectl get pods -n cattle-prometheus
NAME READY STATUS RESTARTS AGE
exporter-kube-state-cluster-monitoring-75567dff98-5nl8c 1/1 Running 0 11d
exporter-node-cluster-monitoring-4clsp 1/1 Running 0 14d
exporter-node-cluster-monitoring-7mmtr 1/1 Running 3 97d
exporter-node-cluster-monitoring-bwv7w 1/1 Running 0 97d
exporter-node-cluster-monitoring-mzmfz 1/1 Running 0 97d
exporter-node-cluster-monitoring-tj2hd 1/1 Running 1 97d
exporter-node-cluster-monitoring-vhntm 1/1 Running 0 97d
exporter-node-cluster-monitoring-wgtbz 1/1 Running 0 97d
exporter-node-cluster-monitoring-wjxdw 1/1 Running 0 97d
grafana-cluster-monitoring-58459df585-96w4r 2/2 Running 0 40h
prometheus-operator-monitoring-operator-c474f7475-8zsqq 1/1 Running 0 97d
Now the service can be set to scale 1 again:
> kubectl scale StatefulSet/prometheus-cluster-monitoring -n cattle-prometheus --replicas=1
statefulset.apps/prometheus-cluster-monitoring scaled
And the pod prometheus-cluster-monitoring-0 shows up again:
> kubectl get pods -n cattle-prometheus
NAME READY STATUS RESTARTS AGE
exporter-kube-state-cluster-monitoring-75567dff98-5nl8c 1/1 Running 0 11d
exporter-node-cluster-monitoring-4clsp 1/1 Running 0 14d
exporter-node-cluster-monitoring-7mmtr 1/1 Running 3 97d
exporter-node-cluster-monitoring-bwv7w 1/1 Running 0 97d
exporter-node-cluster-monitoring-mzmfz 1/1 Running 0 97d
exporter-node-cluster-monitoring-tj2hd 1/1 Running 1 97d
exporter-node-cluster-monitoring-vhntm 1/1 Running 0 97d
exporter-node-cluster-monitoring-wgtbz 1/1 Running 0 97d
exporter-node-cluster-monitoring-wjxdw 1/1 Running 0 97d
grafana-cluster-monitoring-58459df585-96w4r 2/2 Running 0 40h
prometheus-cluster-monitoring-0 5/5 Running 1 16s
prometheus-operator-monitoring-operator-c474f7475-8zsqq 1/1 Running 0 97d
Note: According to this article on medium, Kubernetes 1.15 introduced the command kubectl rollout restart for this purpose. But as this cluster still runs on Kubernetes 1.14 the command would not work and we had to go with the (re-)scale solution.
Now to the most important part: Did the rescaling/redeployment do the trick and apply the new memory limit? Let's verify with kubectl describe:
> kubectl describe pods -n cattle-prometheus prometheus-cluster-monitoring-0 | grep memory
memory: 2000Mi
memory: 750Mi
memory: 50Mi
memory: 50Mi
memory: 100Mi
memory: 50Mi
memory: 200Mi
memory: 100Mi
memory: 25Mi
memory: 25Mi
Yes, the very first memory limit is now showing up with 2000Mi (instead of 1000Mi)!
A lot of troubleshooting, in-depth Docker and Kubernetes commands and finally a (dynamic) change of pod limits - but did this help? Immediately after the Prometheus pod was redeployed, iostat on the node confirmed that the IOWAITs were gone:
root@k8snode:~# iostat 1
avg-cpu: %user %nice %system %iowait %steal %idle
5.74 0.00 1.12 0.00 1.75 91.40
Device tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 0.00 0.00 0.00 0 0
loop1 0.00 0.00 0.00 0 0
loop2 0.00 0.00 0.00 0 0
loop3 0.00 0.00 0.00 0 0
loop4 0.00 0.00 0.00 0 0
loop5 0.00 0.00 0.00 0 0
nvme0n1 0.00 0.00 0.00 0 0
nvme1n1 5.00 0.00 108.00 0 108
dm-0 0.00 0.00 0.00 0 0
And the graphs from our monitoring confirmed, too (shortly before 9:40 am):
In Rancher 2, the cluster overview started showing the cluster statistics (read from Prometheus) again:
And inside the "System" project, the prometheus-cluster-monitoring-0 pod inside the prometheus-cluster-monitoring workload shows up green again:
The new memory limits can also be seen in Rancher 2 when manually editing the prometheus-cluster-monitoring workload:
Will the new limit of 2000MB be enough for Prometheus in this particular cluster? Only time will tell...
Although the active prometheus-cluster-monitoring service in the Kubernetes cluster is now set to a new limit of 2000MB, the original monitoring settings in Rancher 2 are still the same. So if monitoring were to be disabled on the cluster level and the enabled again (or a newer version of monitoring would be deployed), the original settings (with a limit of 1000MB) would be defined again.
But the default Rancher 2 default settings can be changed, too. Inside the cluster, click on Tools -> Monitoring.
Then increase the memory limit (Prometheus Memory Limit) in the form and save:
Wayne Philip from Cape Town wrote on Nov 30th, 2021:
Great article -- same issue -- this gives us something to work with.
Thank you
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder