Troubleshooting service check output and performance data in Icinga 2 using the debug console


Sometimes a check is configured in Icinga 2, but its output differs from what the same plugin returns when run on the command line. In such a situation, advanced debugging tools such as the Icinga 2 console can help.

Different plugin performance data in CLI and GUI

When a monitoring plugin (check_netapp_ontap) was executed on the command line to check the current usage of a NetApp volume, it ran fine and displayed the volume's current disk usage:

# /usr/lib/nagios/plugins/check_netapp_ontap.pl -H netapp -u "user" -p "pass" -o volume_health -m include,cifsserver/v_media_share1
OK - No problem found (1 checked) | 'cifsserver/v_media_share1_usage'=160400691200B;;;0;255013683200
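
The part after the pipe follows the usual monitoring plugin performance data layout, 'label'=value[UOM];[warn];[crit];[min];[max]. As a minimal sketch, the fields can be split in Python to see what exactly the plugin reports. The helper name parse_perfdata is made up for illustration and is not part of Icinga 2 or the plugin:

```python
import re

# Pattern for one perfdata item: 'label'=value[UOM];[warn];[crit];[min];[max]
PERFDATA_RE = re.compile(
    r"^'?(?P<label>[^'=]+)'?=(?P<value>-?[\d.]+)(?P<uom>[^;\s]*)"
    r"(?:;(?P<warn>[^;]*))?(?:;(?P<crit>[^;]*))?"
    r"(?:;(?P<min>[^;]*))?(?:;(?P<max>[^;]*))?$"
)


def parse_perfdata(item: str) -> dict:
    """Split a single performance data item into its named fields (hypothetical helper)."""
    match = PERFDATA_RE.match(item.strip())
    if match is None:
        raise ValueError(f"not a perfdata item: {item!r}")
    fields = match.groupdict()
    fields["value"] = float(fields["value"])
    return fields


# Plugin output copied from the command line run above.
output = ("OK - No problem found (1 checked) | "
          "'cifsserver/v_media_share1_usage'=160400691200B;;;0;255013683200")

# Everything after the pipe is performance data.
perfdata = parse_perfdata(output.split("|", 1)[1])
used_percent = perfdata["value"] / float(perfdata["max"]) * 100

print(perfdata["label"])            # cifsserver/v_media_share1_usage
print(f"{used_percent:.1f}% used")  # 62.9% used
```

The label is the interesting part here: it tells which metric the plugin decided to report, and the empty warn and crit fields show that no thresholds were passed in this manual run.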

However, once the check was added in Icinga 2, the Icingaweb user interface showed completely different performance data: instead of the expected "_usage", it showed "_inodes". Where is this coming from?

Using the Icinga 2 console to find out more

To look into the details of this particular check, the Icinga 2 console is a very helpful tool. The console uses the Icinga 2 API in the background. This requires that the API feature is enabled and that a user with the relevant credentials exists. Check whether /etc/icinga2/features-enabled/api.conf exists; if not, enable the API:

# icinga2 feature enable api

Then make sure a valid API user is configured. By default, API users are defined in /etc/icinga2/conf.d/api-users.conf. A "root" user usually already exists. If not, create one like this:

object ApiUser "root" {
  password = "secret"
  // client_cn = ""

  permissions = [ "*" ]
}

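If the API feature or the API user was newly added, restart Icinga 2 so that the change takes effect. The console can then be connected to the local Icinga 2 API, using the credentials from above: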

# icinga2 console --connect 'https://root:secret@localhost:5665/'

In the console, the specific check (service) can be retrieved with the get_service() function, which takes the host name and the service name:

<1> => get_service("netapp", "Volume v_media_share1")
{
        __name = "netapp!Volume v_media_share1"
        acknowledgement = 0.000000
        acknowledgement_expiry = 0.000000
        active = true
        check_attempt = 1.000000
        check_command = "check_netapp_ontap"
        check_interval = 7200.000000
        check_period = "24x7"
        check_timeout = null
        command_endpoint = ""
        display_name = "Volume v_media_share1"
        downtime_depth = 0.000000
[...]

The output can be pretty long. In this particular situation we want to find out what exactly happened during the last check execution. This information is further down, inside the nested last_check_result attribute (appending .last_check_result to the get_service() call prints only this part):

        last_check_result = {
                active = true
                check_source = "inf-monm02-p"
                command = [ "/usr/lib/nagios/plugins/check_netapp_ontap.pl", "-H", "netapp", "-c", "95", "-m", "include,cifsserver/v_media_share1", "-o", "volume_health", "-p", "pass", "-u", "user", "-w", "80" ]
                execution_end = 1607095462.990065
                execution_start = 1607095458.837354
                exit_status = 0.000000
                output = "OK - No problem found (1 checked) "
                performance_data = [ "cifsserver/v_media_share1_inodes=908161B;;;0;7782389" ]
                schedule_end = 1607095462.990105
                schedule_start = 1607095458.000000
                state = 0.000000
                ttl = 0.000000
                type = "CheckResult"
                vars_after = {
                        attempt = 1.000000
                        reachable = true
                        state = 0.000000
                        state_type = 1.000000
                }
                vars_before = {
                        attempt = 1.000000
                        reachable = true
                        state = 0.000000
                        state_type = 1.000000
                }
        }

Thanks to this, we now know the exact command (including all parameters) that was executed on the monitoring server "inf-monm02-p" (see check_source). The check result is also shown, split into the output and performance_data fields. And yes, performance_data clearly contains the inodes data, the same as shown in the Icingaweb interface.
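
To make the fields of last_check_result a bit more tangible, here is a small sketch that summarizes the values shown above: where the plugin ran, how long it took, when it started and which performance data labels it returned. The dictionary is simply transcribed by hand from the console output (shortened); it is not fetched from Icinga 2.

```python
from datetime import datetime, timezone

# Values transcribed from the last_check_result shown above (shortened).
last_check_result = {
    "check_source": "inf-monm02-p",
    "execution_start": 1607095458.837354,
    "execution_end": 1607095462.990065,
    "exit_status": 0,
    "output": "OK - No problem found (1 checked) ",
    "performance_data": ["cifsserver/v_media_share1_inodes=908161B;;;0;7782389"],
}

# How long the plugin ran on the check source.
duration = last_check_result["execution_end"] - last_check_result["execution_start"]

# When the plugin was started, as a readable UTC timestamp.
started = datetime.fromtimestamp(last_check_result["execution_start"], tz=timezone.utc)

# The label is everything before the first "=" of each perfdata item.
labels = [item.split("=", 1)[0].strip("'")
          for item in last_check_result["performance_data"]]

print(f"ran on {last_check_result['check_source']} for {duration:.2f}s")  # ran on inf-monm02-p for 4.15s
print(started.strftime("%Y-%m-%d %H:%M:%S UTC"))                          # 2020-12-04 15:24:18 UTC
print(labels)                                                             # ['cifsserver/v_media_share1_inodes']
```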

Re-running the plugin using this information

The console output reveals that Icinga 2 passes warning and critical thresholds (-w 80 and -c 95), which were not part of the manual test. The exact same command can now be launched on the same server to see whether the output differs:

$ /usr/lib/nagios/plugins/check_netapp_ontap.pl -H "netapp" -c "95" -m "include,cifsserver/v_media_share1" -o "volume_health" -p "pass" -u "user" -w "80"
OK - No problem found (1 checked) | 'cifsserver/v_media_share1_inodes'=908187B;;;0;7782389

And indeed, now only inodes show up in the performance data!
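
Turning the command array from last_check_result into a command line by hand is error-prone, so here is a minimal Python sketch that does it with shlex and also lists the options Icinga 2 passed in addition to the manual test. The helper options() is made up for this example and assumes that every flag is followed by exactly one value, which holds for this command but not for plugins in general.

```python
import shlex

# Command array copied from last_check_result.command above.
executed = [
    "/usr/lib/nagios/plugins/check_netapp_ontap.pl",
    "-H", "netapp", "-c", "95",
    "-m", "include,cifsserver/v_media_share1",
    "-o", "volume_health", "-p", "pass", "-u", "user", "-w", "80",
]

# The manual test from the beginning of the article.
manual = shlex.split(
    '/usr/lib/nagios/plugins/check_netapp_ontap.pl -H netapp -u "user" -p "pass" '
    "-o volume_health -m include,cifsserver/v_media_share1"
)


def options(argv: list) -> dict:
    """Map each option flag to the value that follows it (flag/value pairs assumed)."""
    args = argv[1:]
    return dict(zip(args[0::2], args[1::2]))


# Options Icinga 2 passed that the manual run did not.
extra = {flag: value for flag, value in options(executed).items()
         if flag not in options(manual)}

print(shlex.join(executed))  # a shell-safe command line, ready to paste
print(extra)                 # {'-c': '95', '-w': '80'}
```

shlex.join() quotes only the arguments that need quoting, so the result can be pasted into a shell on the check source without worrying about spaces or special characters in the arguments.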

Taking a closer look at the threshold documentation of this plugin reveals:

volume_health
        desc: Check the space and inode health of a vServer volume. If space % and space in *B are both defined the smaller value of the two will be used when deciding if the volume is in a warning or critical state. This allows you to better accomodate large volume monitoring.
        thresh: Space % used, space in *B (i.e MB) remaining, inode count remaining, inode % used (Usage example: 80%i), "offline" keyword.
        node: The node option restricts this check by vserver name.

In other words, a space (usage) threshold must be followed by either a percent sign (%) or a size unit such as MB or GB. A bare number is apparently treated as an inode threshold, so the plugin only checks and reports inodes. With the percent sign added to both thresholds, the plugin finally shows the correct volume usage:

$ /usr/lib/nagios/plugins/check_netapp_ontap.pl -H "netapp" -c "95%" -m "include,cifsserver/v_media_share1" -o "volume_health" -p "pass" -u "user" -w "80%"
OK - No problem found (1 checked) | 'cifsserver/v_media_share1_usage'=160409325568B;;;0;255013683200

Once the thresholds were adjusted in the Service object, the usage performance data also appeared in the Icingaweb user interface.
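
To double check threshold values before putting them into a Service object, the documented forms can be written down as a small classifier. This is only a sketch of my reading of the help text quoted above; the function threshold_kind is hypothetical and does not reproduce the plugin's actual parsing code.

```python
import re


def threshold_kind(threshold: str) -> str:
    """Guess what a volume_health threshold refers to, based on the help text.

    Assumption: this mirrors the documented forms only; it is not the
    plugin's actual parsing code.
    """
    if threshold == "offline":
        return "offline keyword"
    if re.fullmatch(r"\d+(\.\d+)?%i", threshold):
        return "inode % used"
    if re.fullmatch(r"\d+(\.\d+)?%", threshold):
        return "space % used"
    if re.fullmatch(r"\d+(\.\d+)?[A-Za-z]*B", threshold):
        return "space remaining"
    if re.fullmatch(r"\d+", threshold):
        return "inode count remaining"
    return "unknown"


# The thresholds from this article, plus two other documented forms.
for value in ("80", "95", "80%", "95%", "10GB", "80%i"):
    print(f"{value:>5} -> {threshold_kind(value)}")
```

Under this reading, the original thresholds 80 and 95 fall into the inode category, while 80% and 95% are space thresholds, which matches the change in the performance data seen above.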

