Creating custom PNP4Nagios templates in Icinga 2 for NRPE checks

Written by - 0 comments

Published on - Listed in Nagios Icinga Monitoring


Since my early Nagios days (2005), I've used Nagiosgraph as my graphing service of choice. But in the last few years, other technologies came up. PNP4Nagios has became the de facto graphing standard for Nagios and Icinga installations. On big setups with several hundreds of hosts and thousands of services this is a wise choice; PNP4Nagios is a lot faster than Nagiosgraph. But Nagiosgraph can be more easily adapted to create custom graphs using the "map" file. That's why I ran PNP4Nagios and Nagiosgraph in parallel for the last few years on my Icinga 2 installation.

The main reason why I couldn't get rid of Nagiosgraph were performance data which were retrieved by plugins executed through check_nrpe. For example the monitoring plugin check_netio:

$ /usr/lib/nagios/plugins/check_nrpe -H remotehost -c check_netio -a eth0
NETIO OK - eth0: RX=2849414346, TX=1809023474|NET_eth0_RX=2849414346B;;;; NET_eth0_TX=1809023474B;;;;

The plugin reads the RX and TX values from the ifconfig command. As we know, these are counter values; a value which starts from 0 (at boot time) and increases with the number of Bytes passed through that interface.
While a check_disk through NRPE gives correct graphs in PNP4Nagios, the mentioned check_netio didn't:

PNP4Nagios NRPE check_disk graph
PNP4Nagios NRPE check_netio graph FAIL

The first graph on top shows the values from a check_disk plugin. The second graph below represents the values from the check_netio plugin. Both plugins were executed through NRPE on the remote host.

The comparison between the two graphs shows pretty clearly that only unique values (GAUGE in RRD terms; a good example: temperature) are working correctly. The counter values are shown with their increasing value instead of the difference between two values to determine the change.

Where does this come from? Why does PNP4Nagios doesn't reflect these values correctly? The problem can be found in the communication between Icinga 2 and PNP4Nagios.
Each time a host or service is checked in Icinga 2, the perfdata feature writes the performance data log file - by default in /var/spool/icinga2/perfdata. Inside such a log file Icinga 2 shows the following information:

$ cat /var/spool/icinga2/perfdata/service-perfdata*
[...]
DATATYPE::SERVICEPERFDATA    TIMET::1483441246    HOSTNAME::remotehost    SERVICEDESC::Network IO eth0    SERVICEPERFDATA::NET_eth0_RX=2316977534837B;;;; NET_eth0_TX=41612087322B;;;;    SERVICECHECKCOMMAND::nrpe    HOSTSTATE::UP    HOSTSTATETYPE::HARD    SERVICESTATE::OK    SERVICESTATETYPE::HARD
[...]

Take a closer look at the variable SERVICECHECKCOMMAND and you now see that it only contains nrpe - for each remote plugin executed through NRPE, whether this is check_disk, check_netio, check_ntp or whatever.
So Icinga 2 feeds this infromation to poor PNP4Nagios, which of course thinks all the checks are the same (nrpe) and handle all the graphs exactly the same (GAUGE by default). Which explains why the graphs for plugins with COUNTER results fail.

In order to tell PNP4Nagios that we're running different kinds of plugins and values behind NRPE, Icinga 2's PerfdataWriter needs to be adapted a little bit. I edited the default PerfdataWriter object called "perfdata":

$ cat /etc/icinga2/features-enabled/perfdata.conf
object PerfdataWriter "perfdata" {
  service_format_template = "DATATYPE::SERVICEPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tSERVICEDESC::$service.name$\tSERVICEPERFDATA::$service.perfdata$\tSERVICECHECKCOMMAND::$service.check_command$$pnp_check_arg1$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.state_type$\tSERVICESTATE::$service.state$\tSERVICESTATETYPE::$service.state_type$"
  rotation_interval = 15s
}

I only changed the definition of the service_format_template. All other configurable options are still default. And even this is only a minor change, which in short looks like this:

SERVICECHECKCOMMAND::$service.check_command$$pnp_check_arg1$

With that change, Icinga 2's PerfdataWriter is ready. But the variable needs yet to be set within the service object. As I use apply rules on such generic service checks as "Network IO", this was a quick modification in the apply rule of this service:

$ cat /etc/icinga2/zones.d/global-templates/applyrules/networkio.conf
apply Service "Network IO " for (interface in host.vars.interfaces) {
  import "generic-service"

  check_command = "nrpe"
  vars.nrpe_command = "check_netio"
  vars.nrpe_arguments = [ interface ]
  vars.pnp_check_arg1 = "_$nrpe_command$"

  assign where host.address && host.vars.interfaces && host.vars.os == "Linux"
  ignore where host.vars.applyignore.networkio == true
}

In this apply rule, where the "Network IO" service object is assigned to all Linux hosts (host.vars.os == "Linux") with existing interfaces (host.vars.interfaces), I simply added the value for the vars.pnp_check_arg1 variable. Which, in this case, is an underscore followed by the actual command launched by NRPE: "_check_netio".

After a reload of Icinga 2 and a manual check in the performance log file, all things look good. Which means: The SERVICECHECKCOMMAND now contains both nrpe and the remote command (nrpe_check_netio):

$ cat /var/spool/icinga2/perfdata/service-perfdata*
[...]
DATATYPE::SERVICEPERFDATA    TIMET::1483441246    HOSTNAME::remotehost    SERVICEDESC::Network IO eth0    SERVICEPERFDATA::NET_eth0_RX=2316977634837B;;;; NET_eth0_TX=41612088322B;;;;    SERVICECHECKCOMMAND::nrpe_check_netio    HOSTSTATE::UP    HOSTSTATETYPE::HARD    SERVICESTATE::OK    SERVICESTATETYPE::HARD
[...]

Icinga 2 now gives the correct and unique information to PNP4Nagios. But PNP4Nagios still needs to be told what to do. PNP4Nagios parses every line of the performance data it gets from Icinga 2 and checks if there is any template for the found command. Prior to the changes in the PerfdataWriter this was always only "nrpe", so PNP4Nagios used the following file: /etc/pnp4nagios/check_commands/check_nrpe.cfg. This is a standard file which comes with the PNP4Nagios installation.
Now that the command is "nrpe_check_netio", PNP4Nagios checks if there is any command definition called like this. When log level >=2 is activated within PNP4Nagios' perfdata process (set LOG_LEVEL to at least 2 in /etc/pnp4nagios/process_perfdata.cfg), the LOG_FILE (usually /var/log/pnp4nagios/perfdata.log) will show the following information:

$ cat /var/log/pnp4nagios/perfdata.log
[...]
2017-01-03 12:22:55 [15957] [3] DEBUG: RAW Command -> nrpe_check_netio
2017-01-03 12:22:55 [15958] [3]   -- name -> pl
2017-01-03 12:22:55 [15958] [3]   -- rrd_heartbeat -> 8460
2017-01-03 12:22:55 [15957] [2] No Custom Template found for nrpe_check_netio (/etc/pnp4nagios/check_commands/nrpe_check_netio.cfg)
[...]

PNP4Nagios now correctly understood that this is performance data for the command "nrpe_check_netio". And now we can create this config file and tell PNP4Nagios to create DERIVE graphs. DERIVE is another kind of COUNTER data type with the difference that DERIVE values can be resetted to 0, which is the case for the values in ifconfig.

$ cat /etc/pnp4nagios/check_commands/nrpe_check_netio.cfg
#
# Adapt the Template if check_command should not be the PNP Template
#
# check_command check_nrpe!check_disk!20%!10%
# ________0__________|          |      |  |
# ________1_____________________|      |  |
# ________2____________________________|  |
# ________3_______________________________|
#
CUSTOM_TEMPLATE = 1
#
# Change the RRD Datatype based on the check_command Name.
# Defaults to GAUGE.
#
# Adjust the whole RRD Database
DATATYPE = DERIVE
#
# Adjust every single DS by using a List of Datatypes.
DATATYPE = DERIVE,DERIVE

# Use the MIN value for newly created RRD Databases.
# This value defaults to 0
# USE_MIN_ON_CREATE = 1
#
# Use the MAX value for newly created RRD Databases.
# This value defaults to 0
# USE_MAX_ON_CREATE = 1

# Use a single RRD Database per Service
# This Option is only used while creating new RRD Databases
#
#RRD_STORAGE_TYPE = SINGLE
#
# Use multiple RRD Databases per Service
# One RRD Database per Datasource.
# RRD_STORAGE_TYPE = MULTIPLE
#
RRD_STORAGE_TYPE = MULTIPLE

# RRD Heartbeat in seconds
# This Option is only used while creating new RRD Databases
# Existing RRDs can be changed by "rrdtool tune"
# More on http://oss.oetiker.ch/rrdtool/doc/rrdtune.en.html
#
# This value defaults to 8640
# RRD_HEARTBEAT = 305

After a new check of a Network IO service, the xml file of that particular service was re-created with the following information:

# cat /var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0.xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<NAGIOS>
  <DATASOURCE>
    <TEMPLATE>nrpe_check_netio</TEMPLATE>
    <RRDFILE>/var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0_NET_eth0_RX.rrd</RRDFILE>
    <RRD_STORAGE_TYPE>MULTIPLE</RRD_STORAGE_TYPE>
    <RRD_HEARTBEAT>8460</RRD_HEARTBEAT>
    <IS_MULTI>0</IS_MULTI>
    <DS>1</DS>
    <NAME>NET_eth0_RX</NAME>
    <LABEL>NET_eth0_RX</LABEL>
    <UNIT>B</UNIT>
    <ACT>1462883655</ACT>
    <WARN></WARN>
    <WARN_MIN></WARN_MIN>
    <WARN_MAX></WARN_MAX>
    <WARN_RANGE_TYPE></WARN_RANGE_TYPE>
    <CRIT></CRIT>
    <CRIT_MIN></CRIT_MIN>
    <CRIT_MAX></CRIT_MAX>
    <CRIT_RANGE_TYPE></CRIT_RANGE_TYPE>
    <MIN></MIN>
    <MAX></MAX>
  </DATASOURCE>
  <DATASOURCE>
    <TEMPLATE>nrpe_check_netio</TEMPLATE>
    <RRDFILE>/var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0_NET_eth0_TX.rrd</RRDFILE>
    <RRD_STORAGE_TYPE>MULTIPLE</RRD_STORAGE_TYPE>
    <RRD_HEARTBEAT>8460</RRD_HEARTBEAT>
    <IS_MULTI>0</IS_MULTI>
    <DS>1</DS>
    <NAME>NET_eth0_TX</NAME>
    <LABEL>NET_eth0_TX</LABEL>
    <UNIT>B</UNIT>
    <ACT>1567726688</ACT>
    <WARN></WARN>
    <WARN_MIN></WARN_MIN>
    <WARN_MAX></WARN_MAX>
    <WARN_RANGE_TYPE></WARN_RANGE_TYPE>
    <CRIT></CRIT>
    <CRIT_MIN></CRIT_MIN>
    <CRIT_MAX></CRIT_MAX>
    <CRIT_RANGE_TYPE></CRIT_RANGE_TYPE>
    <MIN></MIN>
    <MAX></MAX>
  </DATASOURCE>
  <RRD>
    <RC>0</RC>
    <TXT>successful updated</TXT>
  </RRD>
  <NAGIOS_AUTH_HOSTNAME>remotehost</NAGIOS_AUTH_HOSTNAME>
  <NAGIOS_AUTH_SERVICEDESC>Network IO eth0</NAGIOS_AUTH_SERVICEDESC>
  <NAGIOS_CHECK_COMMAND>nrpe_check_netio</NAGIOS_CHECK_COMMAND>
  <NAGIOS_DATATYPE>SERVICEPERFDATA</NAGIOS_DATATYPE>
  <NAGIOS_DISP_HOSTNAME>remotehost</NAGIOS_DISP_HOSTNAME>
  <NAGIOS_DISP_SERVICEDESC>Network IO eth0</NAGIOS_DISP_SERVICEDESC>
  <NAGIOS_HOSTNAME>remotehost</NAGIOS_HOSTNAME>
  <NAGIOS_HOSTSTATE>UP</NAGIOS_HOSTSTATE>
  <NAGIOS_HOSTSTATETYPE>HARD</NAGIOS_HOSTSTATETYPE>
  <NAGIOS_MULTI_PARENT></NAGIOS_MULTI_PARENT>
  <NAGIOS_PERFDATA>NET_eth0_RX=1462883655B;;;; NET_eth0_TX=1567726688B;;;; </NAGIOS_PERFDATA>
  <NAGIOS_RRDFILE></NAGIOS_RRDFILE>
  <NAGIOS_SERVICECHECKCOMMAND>nrpe_check_netio</NAGIOS_SERVICECHECKCOMMAND>
  <NAGIOS_SERVICEDESC>Network_IO_eth0</NAGIOS_SERVICEDESC>
  <NAGIOS_SERVICEPERFDATA>NET_eth0_RX=1462883655B;;;; NET_eth0_TX=1567726688B;;;;</NAGIOS_SERVICEPERFDATA>
  <NAGIOS_SERVICESTATE>OK</NAGIOS_SERVICESTATE>
  <NAGIOS_SERVICESTATETYPE>HARD</NAGIOS_SERVICESTATETYPE>
  <NAGIOS_TIMET>1483442747</NAGIOS_TIMET>
  <NAGIOS_XMLFILE>/var/lib/pnp4nagios/perfdata/remotehost/Network_IO_eth0.xml</NAGIOS_XMLFILE>
  <XML>
   <VERSION>4</VERSION>
  </XML>
</NAGIOS>

 The xml file shows that the nrpe_check_netio  PNP4Nagios template is now used:

<TEMPLATE>nrpe_check_netio</TEMPLATE>

and the service check command is correctly identified as nrpe_check_netio:

<NAGIOS_CHECK_COMMAND>nrpe_check_netio</NAGIOS_CHECK_COMMAND>

Once /etc/pnp4nagios/check_commands/nrpe_check_netio.cfg was created, all the other hosts with this "Network IO" check were adapted and are now showing the correct graphs.

The same procedure can now be created for all kinds of plugins which are executed through NRPE and output counter/derive values, for example check_diskio


Add a comment

Show form to leave a comment

Comments (newest first)

No comments yet.

RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   OpenSearch   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder