Monitoring plugin check_smart 6.14.0 released: Multiple improvements and enhancements


A new version of check_smart, a plugin for monitoring physical hard drives, solid state drives and NVMe drives, is now available. One year has passed since the last release, so it's time for a new version!

Version 6.14.0 contains three enhancements to the widely used check_smart monitoring plugin.

Show drive(s) causing UNKNOWN state (when using multi-drive check)

When using the -g / --global parameter to do a multi-drive check, the plugin may return an UNKNOWN status when one of the drives matching the glob pattern doesn't return valid SMART information from smartctl.

This can happen when mixed devices are attached to the system (e.g. a memory stick) or when the glob pattern is used on device interfaces, such as -g 'cciss[0-9]', and one of the matches is the RAID controller itself.

Before this release, the plugin returned an UNKNOWN state in this situation but did not show which drive(s) caused it:

root@nas:~# ./check_smart.pl -g '/dev/sd[a-z]' -i auto
UNKNOWN: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean --- [/dev/sdc] - Device is clean --- [/dev/sdd] - Device is clean --- [/dev/sdf] - Device is clean|

Since check_smart.pl 6.14.0, the drive(s) causing the UNKNOWN state are shown in the output:

root@nas:~# ./check_smart.pl -g '/dev/sd[a-z]' -i auto
UNKNOWN: [/dev/sde] - No health status line found[/dev/sde] - [/dev/sde] -  --- [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean --- [/dev/sdc] - Device is clean --- [/dev/sdd] - Device is clean --- [/dev/sdf] - Device is clean|

The output clearly shows that the drive /dev/sde has caused the UNKNOWN exit status. This improvement potentially saves several minutes of troubleshooting.

This code change was added by Nick Bertrand in pull request #89. Thanks a lot for the contribution!
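If the multi-drive output is processed further (for example in a wrapper script or a notification handler), the offending drive(s) can be extracted from the status line. The following is only a minimal sketch in Python, not part of the plugin itself: the function name drives_not_clean is hypothetical, and the parsing assumes the output format shown above (drive sections separated by " --- ", healthy drives reported as "Device is clean").

```python
import re
from typing import List


def drives_not_clean(plugin_output: str) -> List[str]:
    """Return the devices whose section is not 'Device is clean'.

    Assumes the multi-drive output format shown in this article;
    this is an illustration, not the plugin's own logic.
    """
    # Drop the performance data after the pipe character.
    text = plugin_output.split("|", 1)[0]
    # Drop the leading status word ("OK: ", "UNKNOWN: ", ...).
    _, _, body = text.partition(": ")

    suspects: List[str] = []
    for segment in body.split(" --- "):
        match = re.match(r"\s*\[([^\]]+)\]\s*-\s*(.*)", segment)
        if not match:
            continue
        device = match.group(1)
        message = match.group(2).strip()
        if message != "Device is clean" and device not in suspects:
            suspects.append(device)
    return suspects
```

Fed with the 6.14.0 output from above, this sketch returns a list containing only /dev/sde. With the older output it returns an empty list, because the failing drive was never mentioned.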

Hide Serial Number in output

On a single drive check, the plugin shows the drive's model/product number and the serial number. This is handy information to identify the drive (e.g. in an asset database) for further support or investigation.

root@nas:~# ./check_smart.pl -d /dev/sdf -i auto
OK: Drive  WDC WD20SPZX-22CRAT0 S/N WD-WX31XXXXX222: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=1850 Start_Stop_Count=8 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=6 Power-Off_Retract_Count=4 Load_Cycle_Count=1239994 Temperature_Celsius=29 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

Sometimes this information should remain hidden though. By using the newly added --hide-sn parameter, the serial number is now shown as <HIDDEN> in the plugin output (and therefore also in your central monitoring GUI or alerts):

root@nas:~# ./check_smart.pl -d /dev/sdf -i auto --hide-sn
OK: Drive  WDC WD20SPZX-22CRAT0 S/N <HIDDEN>: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=1850 Start_Stop_Count=8 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=6 Power-Off_Retract_Count=4 Load_Cycle_Count=1239994 Temperature_Celsius=29 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
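For output that was already collected without --hide-sn (for example archived alerts), a similar masking can be applied afterwards. The snippet below is a hedged sketch in Python and not the plugin's implementation: the function name hide_serial is hypothetical, and the regular expression assumes the "S/N <serial>:" layout seen in the single drive output above.

```python
import re


def hide_serial(output: str, placeholder: str = "<HIDDEN>") -> str:
    """Replace the token following 'S/N ' with a placeholder.

    Assumes the serial number contains no whitespace or colon,
    as in the example output of this article.
    """
    return re.sub(
        r"(S/N )[^\s:]+",
        lambda match: match.group(1) + placeholder,
        output,
    )
```

Applied to the first line of output in this section, the result has the same "S/N <HIDDEN>:" form as the output produced with --hide-sn.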

Alert on high load/unload cycles on hard drives

After a recent server issue caused by hard drives with a high number of load/unload cycles, I became aware of an underestimated but important SMART attribute: 193, Load_Cycle_Count.

According to some quick research, the hard drive vendors seem to align (on most models) and mention a "safe value" of 600'000 (600K) load/unload cycles. A value above this can cause serious performance and/or durability problems.

To save you the troubleshooting time of figuring this out, the check_smart.pl plugin now (by default) alerts with a CRITICAL state when the Load_Cycle_Count value reaches or exceeds 600'000 (600K) cycles. A WARNING state is issued when the value is above 550'000 cycles.

ck@mint ~/Git/check_smart $ sudo ./check_smart.pl -d /dev/sdd -i auto
CRITICAL: Drive  WDC WD20SPZX-22CRAT0 S/N WD-WX31XXXXX222:  Load_Cycle_Count is above 600K load cycles (1239994) causing possible performance and durability impact, |Raw_Read_Error_Rate=0 Spin_Up_Time=1841 Start_Stop_Count=9 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=7 Power-Off_Retract_Count=5 Load_Cycle_Count=1239994 Temperature_Celsius=30 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

My advice in such a situation: Replace the drive. It's not as urgent as when bad sectors are detected, but it is better to plan ahead and order a replacement drive.

If the plugin should not alert on the Load_Cycle_Count number (for whatever reason), use the newly added --skip-load-cycles parameter:

ck@mint ~/Git/check_smart $ ./check_smart.pl -d /dev/sdd -i auto --skip-load-cycles
OK: Drive  WDC WD20SPZX-22CRAT0 S/N WD-WX31XXXXX222: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=1841 Start_Stop_Count=9 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=7 Power-Off_Retract_Count=5 Load_Cycle_Count=1239994 Temperature_Celsius=30 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0

In this case, the plugin happily reports that everything is well with that drive.
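The threshold logic described in this section can be sketched in a few lines of Python. This is an illustration only and not the plugin's Perl code: the names parse_perfdata and load_cycle_state as well as the skip_load_cycles argument are hypothetical, and the perfdata parser assumes plain label=integer pairs as in the output shown above.

```python
from typing import Dict

WARNING_THRESHOLD = 550_000   # WARNING when the value is above this
CRITICAL_THRESHOLD = 600_000  # CRITICAL when the value reaches or exceeds this


def parse_perfdata(output: str) -> Dict[str, int]:
    """Parse 'label=value' pairs found after the pipe character.

    Only plain integer values are kept; anything else is ignored.
    """
    _, _, perfdata = output.partition("|")
    values: Dict[str, int] = {}
    for item in perfdata.split():
        label, separator, value = item.partition("=")
        if separator and value.isdigit():
            values[label] = int(value)
    return values


def load_cycle_state(load_cycles: int, skip_load_cycles: bool = False) -> str:
    """Map a Load_Cycle_Count value to a monitoring state."""
    if skip_load_cycles:
        return "OK"
    if load_cycles >= CRITICAL_THRESHOLD:
        return "CRITICAL"
    if load_cycles > WARNING_THRESHOLD:
        return "WARNING"
    return "OK"
```

With the Load_Cycle_Count of 1239994 from the example drive, load_cycle_state returns CRITICAL. Passing skip_load_cycles=True returns OK, mirroring the behaviour of --skip-load-cycles.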

