Major release: check_smart plugin now allows multi attribute checks, supports multiple warning thresholds


Published on - Listed in Monitoring Icinga Nagios Perl Hardware


With much joy I can finally announce version 6.0 of the monitoring plugin check_smart, which monitors the SMART status of physical block devices (hard drives as well as solid state drives). I have wanted to introduce multi-attribute checks for a while, and I also wanted to offer a multi-threshold option. Today is the day!

New features! Yay!

The new version is a major release and a couple of new features were added:

  •  The plugin now checks the raw values of multiple attributes. Previously this happened only for the "Current_Pending_Sector" attribute. The default list of checked attributes (as of now):
    - Current_Pending_Sector
    - Reallocated_Sector_Ct
    - Program_Fail_Cnt_Total
    - Uncorrectable_Error_Cnt
    - Offline_Uncorrectable
    - Runtime_Bad_Block
    This default list might of course change over time. You want to override it? Sure, take a look at the next added feature.

  • The new -r / --raw parameter lets you override the default list of SMART attributes whose raw values are checked.

  • The new -w / --warn parameter lets you set a threshold for each SMART attribute. This deprecates the previously used -b (bad) parameter for ATA drives.

With these added features, the plugin is now capable of detecting and reporting failures or increased counts on multiple SMART attributes. Yet you still have the freedom to override the defaults or to set thresholds to your liking.

This is especially helpful when you want to see how a drive ages over time and create graphs from the performance data, but don't want to be alerted every time a new defective sector is detected.
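If you want to feed these values into your own graphing, the performance data can be split off the plugin output fairly easily. The following is only a minimal Python sketch, not part of the plugin: the function name parse_perfdata is made up for illustration, and it assumes the plain "label=integer" pairs after the "|" separator as they appear in the example outputs in this post (units or thresholds inside the performance data are not handled).

```python
def parse_perfdata(plugin_output):
    """Return the 'label=value' pairs that follow the first '|' as a dict.

    Hypothetical helper: only plain non-negative integer values are kept,
    which matches the output format shown in this post.
    """
    _, _, perfdata = plugin_output.partition("|")
    values = {}
    for pair in perfdata.split():
        label, sep, value = pair.partition("=")
        if sep and value.isdigit():
            values[label] = int(value)
    return values


# Shortened sample in the same format as the plugin output in this post.
sample = (
    "WARNING: Reallocated_Sector_Ct is non-zero (16)"
    "|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Uncorrectable_Error_Cnt=33281"
)

metrics = parse_perfdata(sample)
print(metrics["Power_On_Hours"])
```

The resulting dictionary can then be handed to whatever time series database or graphing tool you already use.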

Some examples, please

Sure! Here they are.

In the following example we have a (likely) dying SSD. Running the plugin now shows several issues.

# ./check_smart.pl -d /dev/sdc -i ata
WARNING: Reallocated_Sector_Ct is non-zero (16), Program_Fail_Cnt_Total is non-zero (16), Runtime_Bad_Block is non-zero (16), Uncorrectable_Error_Cnt is non-zero (33281)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249342716 Total_LBAs_Read=2781582604

Before version 6.0, only the Current_Pending_Sector attribute was checked, but this SSD has no such attribute. The plugin would have returned OK all the time, although something is clearly going wrong with this solid state drive.

Let's assume that we don't want to be alerted yet, but only once the values of these attributes worsen. We can now set thresholds for each attribute:

# ./check_smart.pl -d /dev/sdc -i ata -w 'Reallocated_Sector_Ct=17,Program_Fail_Cnt_Total=17,Runtime_Bad_Block=17,Uncorrectable_Error_Cnt=34000'
OK: no SMART errors detected. Reallocated_Sector_Ct is non-zero (16) (but less than threshold 17), Program_Fail_Cnt_Total is non-zero (16) (but less than threshold 17), Runtime_Bad_Block is non-zero (16) (but less than threshold 17), Uncorrectable_Error_Cnt is non-zero (33281) (but less than threshold 34000)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249534028 Total_LBAs_Read=2781582604
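To make the threshold logic easier to follow, here is a minimal Python sketch of it. It is not the plugin's Perl code. The function names parse_thresholds and evaluate are hypothetical, and the comparison rule (a warning as soon as the raw value is no longer below the threshold, and on any non-zero value when no threshold is set) is an assumption inferred from the "but less than threshold" wording in the output above.

```python
# Default attribute list as given in this post.
DEFAULT_RAW_ATTRIBUTES = [
    "Current_Pending_Sector",
    "Reallocated_Sector_Ct",
    "Program_Fail_Cnt_Total",
    "Uncorrectable_Error_Cnt",
    "Offline_Uncorrectable",
    "Runtime_Bad_Block",
]


def parse_thresholds(spec):
    """Turn 'Attr1=17,Attr2=34000' into {'Attr1': 17, 'Attr2': 34000}."""
    thresholds = {}
    for item in spec.split(","):
        name, sep, value = item.partition("=")
        if sep:
            thresholds[name.strip()] = int(value)
    return thresholds


def evaluate(raw_values, thresholds=None, checked=DEFAULT_RAW_ATTRIBUTES):
    """Return (status, attributes at/over limit, attributes below limit)."""
    thresholds = thresholds or {}
    over, under = [], []
    for name in checked:
        value = raw_values.get(name, 0)
        if value == 0:
            continue  # attribute missing on this drive, or still zero
        limit = thresholds.get(name)
        if limit is not None and value < limit:
            under.append(name)  # non-zero, but tolerated (assumed rule)
        else:
            over.append(name)
    return ("WARNING" if over else "OK"), over, under


# Raw values taken from the example SSD in this post.
raw = {
    "Reallocated_Sector_Ct": 16,
    "Program_Fail_Cnt_Total": 16,
    "Runtime_Bad_Block": 16,
    "Uncorrectable_Error_Cnt": 33281,
    "Offline_Uncorrectable": 0,
}
spec = ("Reallocated_Sector_Ct=17,Program_Fail_Cnt_Total=17,"
        "Runtime_Bad_Block=17,Uncorrectable_Error_Cnt=34000")

print(evaluate(raw)[0])                          # without thresholds
print(evaluate(raw, parse_thresholds(spec))[0])  # with thresholds
```

With the values from this SSD, the sketch lands on the same states as the two plugin runs shown above: WARNING without thresholds, OK with them.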

We can also tell the plugin to completely ignore certain attributes with the already existing -e (--exclude) parameter:

# ./check_smart.pl -d /dev/sdc -i ata -e 'Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Runtime_Bad_Block'
WARNING: Uncorrectable_Error_Cnt is non-zero (33281)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249583900 Total_LBAs_Read=2781582604

As stated above, the plugin by default checks the SMART attributes that are defined as a list in the plugin code. But this list can be overridden. Let's assume you only want to check the raw value of "Reallocated_Sector_Ct":

# ./check_smart.pl -d /dev/sdc -i ata -r 'Reallocated_Sector_Ct'
WARNING: Reallocated_Sector_Ct is non-zero (16)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249682452 Total_LBAs_Read=2781582604
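How the default list, the -r override and the -e exclusions play together can be pictured roughly like this. Again, this is only a Python sketch under assumptions: the function name attributes_to_check is made up, and the order in which the plugin applies the override and the exclusions internally is not confirmed by this post.

```python
# Default attribute list as given in this post.
DEFAULT_RAW_ATTRIBUTES = [
    "Current_Pending_Sector",
    "Reallocated_Sector_Ct",
    "Program_Fail_Cnt_Total",
    "Uncorrectable_Error_Cnt",
    "Offline_Uncorrectable",
    "Runtime_Bad_Block",
]


def attributes_to_check(raw_option=None, exclude_option=None):
    """Resolve which attributes get their raw value checked.

    raw_option     -- comma separated list as passed to -r, or None
    exclude_option -- comma separated list as passed to -e, or None
    Assumption: the override replaces the defaults, then exclusions apply.
    """
    if raw_option:
        names = [n.strip() for n in raw_option.split(",") if n.strip()]
    else:
        names = list(DEFAULT_RAW_ATTRIBUTES)
    excluded = set()
    if exclude_option:
        excluded = {n.strip() for n in exclude_option.split(",")}
    return [n for n in names if n not in excluded]


# Mirrors the -r example above.
print(attributes_to_check(raw_option="Reallocated_Sector_Ct"))

# Mirrors the earlier -e example.
print(attributes_to_check(
    exclude_option="Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Runtime_Bad_Block"
))
```

Note that in the plugin outputs above the performance data still contains every attribute; only the set of attributes that can trigger an alert changes.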

A word to users of the -b parameter

The -b parameter was used to set manual thresholds for "defect sectors". In the plugin this meant that only the ATA attribute "Current_Pending_Sector" was checked. For SCSI drives this checked the value of the "grown defect list".

Because of the new -w parameter, the -b parameter is no longer needed, at least not on ATA drives. For backward compatibility, I adjusted the plugin so that a given -b parameter automatically adds "Current_Pending_Sector" and its value to the threshold list.

But SCSI drives have no SMART attributes as we know them from ATA drives; there is only the "grown defect list". So the -b parameter can still be used to set a threshold for this value on SCSI drives.
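The backward-compatible handling of -b described above could be sketched as follows. This is a Python illustration only, not the plugin's Perl code: the function name merge_legacy_bad_option and the interface string "scsi" are assumptions for this example, and so is the choice that an explicit -w threshold for "Current_Pending_Sector" takes precedence over the value from -b.

```python
def merge_legacy_bad_option(thresholds, bad=None, interface="ata"):
    """Fold a legacy -b value into the per-attribute threshold list.

    thresholds -- dict built from -w, e.g. {'Reallocated_Sector_Ct': 17}
    bad        -- integer passed to -b, or None if -b was not used
    interface  -- 'ata' as in the examples of this post; any other value
                  (for example 'scsi') leaves the threshold list untouched
    """
    merged = dict(thresholds)  # do not modify the caller's dict
    if bad is not None and interface == "ata":
        # Assumption: an explicit -w entry wins over the legacy -b value.
        merged.setdefault("Current_Pending_Sector", bad)
    return merged


print(merge_legacy_bad_option({}, bad=5))
print(merge_legacy_bad_option({"Reallocated_Sector_Ct": 17}, bad=5, interface="scsi"))
```

On SCSI drives the -b value would instead be compared directly against the grown defect list, which is outside the scope of this sketch.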

