check_rancher2 1.8.0: More performance data, new resource threshold (cpu, memory, pods) monitoring



A new version of check_rancher2, an open source monitoring plugin for Kubernetes clusters managed by SUSE Rancher, is available! Version 1.8.0 adds more performance data, long parameters, and warning and critical thresholds for CPU, memory and pod usage.

Give me more data!

Version 1.8.0 adds more performance data (statistics) to the plugin. When monitoring the Kubernetes nodes (using -t node), the plugin previously only reported the number of nodes in the cluster:

$ ./check_rancher2-1.7.1.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t node
CHECK_RANCHER2 OK - All 73 nodes are active|'nodes_total'=73;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;;

The new version also shows resource usage statistics across the whole Rancher 2 environment:

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t node
CHECK_RANCHER2 OK - All 73 nodes are active|'nodes_total'=73;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;; 'nodes_cpu_total'=98149;;;0;372000 'nodes_memory_total'=211285966848B;;;0;1321235513344 'nodes_pods_total'=1430;;;0;8030

When checking the nodes of a specific cluster (using -c c-xxxxx), the performance data shows the resource usage of each node in the cluster:

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t node -c c-xxxxx
CHECK_RANCHER2 OK - All 9 nodes are active|'nodes_total'=9;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;; 'nodes_cpu_total'=21100;;;0;66000 'nodes_memory_total'=39153827840B;;;0;205824241664 'nodes_pods_total'=200;;;0;990 k-node17-p_cpu=3410;;;0;8000 k-node17-p_memory=4833935360B;;;0;25221185536 k-node17-p_pods=19;;;0;110 k-node14-p_cpu=2550;;;0;8000 k-node14-p_memory=4949278720B;;;0;25221185536 k-node14-p_pods=27;;;0;110 onl-radoade11-p_cpu=400;;;0;2000 onl-radoade11-p_memory=62914560B;;;0;4054773760 onl-radoade11-p_pods=4;;;0;110 k-node18-p_cpu=2400;;;0;8000 k-node18-p_memory=4508876800B;;;0;25221185536 k-node18-p_pods=24;;;0;110 k-node11-p_cpu=2440;;;0;8000 k-node11-p_memory=5253365760B;;;0;25221185536 k-node11-p_pods=28;;;0;110 k-node16-p_cpu=2230;;;0;8000 k-node16-p_memory=4466933760B;;;0;25221177344 k-node16-p_pods=26;;;0;110 k-node13-p_cpu=2510;;;0;8000 k-node13-p_memory=5001707520B;;;0;25221185536 k-node13-p_pods=28;;;0;110 k-node12-p_cpu=2660;;;0;8000 k-node12-p_memory=5117050880B;;;0;25221177344 k-node12-p_pods=23;;;0;110 k-node15-p_cpu=2500;;;0;8000 k-node15-p_memory=4959764480B;;;0;25221185536 k-node15-p_pods=21;;;0;110
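
The per-node values follow the usual Nagios performance data layout (label=value[UOM];warn;crit;min;max), so the last field of each entry is the node's capacity. As a rough sketch of how these entries could be pulled out of the plugin output for further processing: the function name node_cpu_metrics is made up for this example and is not part of check_rancher2, and the sample line is a shortened copy of the output above.

```shell
# Hypothetical helper, not part of check_rancher2.
# Input:  one full plugin output line.
# Output: "<label> <used> <capacity>" for every per-node CPU metric,
#         i.e. every perfdata label that ends in _cpu.
node_cpu_metrics() {
  perfdata=${1#*"|"}   # keep only the part after the first "|"
  printf '%s\n' "$perfdata" | tr ' ' '\n' |
    awk -F'[=;]' '$1 ~ /_cpu$/ { print $1, $2, $6 }'
}

# Shortened copy of the node check output shown above.
output="CHECK_RANCHER2 OK - All 9 nodes are active|'nodes_total'=9;;;; 'nodes_cpu_total'=21100;;;0;66000 k-node17-p_cpu=3410;;;0;8000 k-node17-p_memory=4833935360B;;;0;25221185536 onl-radoade11-p_cpu=400;;;0;2000"
node_cpu_metrics "$output"
```

For the shortened sample this prints one line for k-node17-p_cpu (3410 of 8000) and one for onl-radoade11-p_cpu (400 of 2000). The quoted totals such as 'nodes_cpu_total' are skipped because their labels do not end in _cpu.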

When checking a single cluster (-t cluster -c c-xxxxx), the resource usage of the whole cluster shows up in the performance data:

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t cluster -c c-xxxxx
CHECK_RANCHER2 OK - Cluster my-cluster is healthy|'cluster_healthy'=1;;;; 'component_errors'=0;;;; 'cpu'=20700;;;;64000 'memory'=39090913280B;;;0;201769467904 'pods'=196;;;;880 'usage_cpu'=32%;;;0;100 'usage_memory'=19%;;;0;100 'usage_pods'=22%;;;0;100
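
The usage_* percentages in this output are consistent with dividing each used value by its maximum (the last performance data field) and dropping the fraction. A minimal sketch of that arithmetic, assuming integer truncation; the helper name usage_percent is made up for illustration and is not taken from the plugin:

```shell
# Hypothetical helper, not part of check_rancher2:
# integer percentage of used vs. capacity (the fraction is truncated).
usage_percent() {
  echo $(( $1 * 100 / $2 ))
}

usage_percent 20700 64000               # cpu
usage_percent 39090913280 201769467904  # memory (bytes)
usage_percent 196 880                   # pods
```

With the values from the output above this prints 32, 19 and 22, which matches usage_cpu, usage_memory and usage_pods.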

Note, however, that these statistics only show up when checking a specific cluster. When checking all clusters (without specifying -c), the performance data only shows the number of discovered clusters:

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t cluster
CHECK_RANCHER2 OK - All clusters (9) are healthy|'clusters_total'=9;;;; 'clusters_errors'=0;;;;

Long parameters

Release 1.8.0 handles the parameter specification a bit differently than before, which allows the use of "long parameters". For example, the -H parameter can now also be given as --apihost.

The documentation of check_rancher2 was updated and shows all the parameters.
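
To illustrate the idea of short and long parameters living side by side, here is a small sketch of an argument loop in shell. It is not the plugin's actual parser: the function parse_args and the variable names are invented for this example, and only -H, --apihost and --cpu-warn are handled.

```shell
# Hypothetical sketch, not the plugin's actual parser.
# Accepts -H and --apihost as synonyms, plus the long-only --cpu-warn.
parse_args() {
  apihost=""
  cpu_warn=""
  while [ $# -gt 0 ]; do
    case "$1" in
      -H|--apihost) apihost=${2-};  shift ;;  # drop the option name
      --cpu-warn)   cpu_warn=${2-}; shift ;;
    esac
    # Drop the option value (or an unrecognized argument), if one is left.
    if [ $# -gt 0 ]; then shift; fi
  done
}

parse_args --apihost rancher2.example.com --cpu-warn 30
echo "apihost=$apihost cpu_warn=$cpu_warn"
```

Both spellings end up in the same variable, so the rest of a script does not need to know which form was used on the command line.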

Resource thresholds (warning and critical)

The new version adds threshold checks against CPU, memory and pod usage. They are set with the newly added (long) parameters:

  • --cpu-warn
  • --cpu-crit
  • --memory-warn
  • --memory-crit
  • --pods-warn
  • --pods-crit

For example, if you want to monitor the CPU usage across a cluster and be alerted when the usage rises above 30% of the capacity (warning) or above 60% (critical), use the following:

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t cluster -c c-xxxxx --cpu-warn 30 --cpu-crit 60
CHECK_RANCHER2 CRITICAL - Cluster my-cluster has resource problems|'cluster_healthy'=0;;;; 'component_errors'=0;;;; 'cpu'=20700;;;;64000 'memory'=39090913280B;;;0;201769467904 'pods'=196;;;;880 'usage_cpu'=32%;30;60;0;100 'usage_memory'=19%;;;0;100 'usage_pods'=22%;;;0;100
CPU usage 32 higher than warn threshold of 30

The resource thresholds work when checking a specific cluster (-t cluster -c c-xxxxx) and also on node checks (-t node):

$ ./check_rancher2.sh -H rancher2.example.com -U token-xxxxx -P "secret" -S -t node -c c-xxxxx --cpu-warn 30 --cpu-crit 60
CHECK_RANCHER2 CRITICAL - Nodes with resource problems|'nodes_total'=9;;;; 'node_errors'=0;;;; 'node_ignored'=0;;;; 'nodes_cpu_total'=21100;;;0;66000 'nodes_memory_total'=39153827840B;;;0;205824241664 'nodes_pods_total'=200;;;0;990 k-node17-p_cpu=3410;;;0;8000 k-node17-p_memory=4833935360B;;;0;25221185536 k-node17-p_pods=19;;;0;110 k-node14-p_cpu=2550;;;0;8000 k-node14-p_memory=4949278720B;;;0;25221185536 k-node14-p_pods=27;;;0;110 onl-radoade11-p_cpu=400;;;0;2000 onl-radoade11-p_memory=62914560B;;;0;4054773760 onl-radoade11-p_pods=4;;;0;110 k-node18-p_cpu=2400;;;0;8000 k-node18-p_memory=4508876800B;;;0;25221185536 k-node18-p_pods=24;;;0;110 k-node11-p_cpu=2440;;;0;8000 k-node11-p_memory=5253365760B;;;0;25221185536 k-node11-p_pods=28;;;0;110 k-node16-p_cpu=2230;;;0;8000 k-node16-p_memory=4466933760B;;;0;25221177344 k-node16-p_pods=26;;;0;110 k-node13-p_cpu=2510;;;0;8000 k-node13-p_memory=5001707520B;;;0;25221185536 k-node13-p_pods=28;;;0;110 k-node12-p_cpu=2660;;;0;8000 k-node12-p_memory=5117050880B;;;0;25221177344 k-node12-p_pods=23;;;0;110 k-node15-p_cpu=2500;;;0;8000 k-node15-p_memory=4959764480B;;;0;25221185536 k-node15-p_pods=21;;;0;110
k-node17-p - CPU usage 42 higher than warn threshold of 30
k-node14-p - CPU usage 31 higher than warn threshold of 30
k-node13-p - CPU usage 31 higher than warn threshold of 30
k-node12-p - CPU usage 33 higher than warn threshold of 30
k-node15-p - CPU usage 31 higher than warn threshold of 30 
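
The messages above are consistent with a truncated integer percentage and a strictly "higher than" comparison: k-node18-p uses 2400 of 8000 (exactly 30%) and is not listed. The following sketch reproduces that comparison for a single value. It rests on those two assumptions and is not the plugin's code; the name threshold_state is invented here, and the sketch makes no claim about the overall status line the plugin prints.

```shell
# Hypothetical helper, not part of check_rancher2.
# Prints OK, WARNING or CRITICAL for one usage percentage.
# Assumption: a threshold counts as breached only when usage is strictly higher.
threshold_state() {
  if [ "$1" -gt "$3" ]; then
    echo CRITICAL
  elif [ "$1" -gt "$2" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# CPU values (name, used, capacity) of two nodes from the output above.
while read -r name used capacity; do
  pct=$(( $used * 100 / $capacity ))   # truncated integer percentage
  echo "$name cpu=${pct}% $(threshold_state "$pct" 30 60)"
done <<EOF
k-node17-p 3410 8000
k-node18-p 2400 8000
EOF
```

With --cpu-warn 30 and --cpu-crit 60 this reports k-node17-p at 42% as WARNING and k-node18-p at 30% as OK, in line with the per-node messages shown above.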

Kudos to Steffen Eichler!

This is a very big change in the monitoring plugin, and PR #31 is certainly a major improvement of check_rancher2! Thanks and credit go to Steffen Eichler, who is behind this big pull request!


