A new version of check_smart, a monitoring plugin to monitor physical hard drives, solid state drives and NVMe drives, is now available. One year has passed since the last release; it's time for a new version!
Version 6.14.0 contains three improvements to the widely used check_smart monitoring plugin.
When using the -g / --global parameter to do a multi-drive check, the plugin may return an UNKNOWN status when one of the drives matching the glob pattern doesn't return valid SMART information from smartctl.
This can be the case when mixed drives are attached to an operating system (e.g. a memory stick) or when the glob pattern is used on device interfaces, such as -g 'cciss[0-9]', and one of the matches is the RAID controller itself.
Until now, the plugin returned an UNKNOWN state in this situation, but did not show the drive(s) causing it:
root@nas:~# ./check_smart.pl -g '/dev/sd[a-z]' -i auto
UNKNOWN: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean --- [/dev/sdc] - Device is clean --- [/dev/sdd] - Device is clean --- [/dev/sdf] - Device is clean|
Since check_smart.pl 6.14.0, the drive(s) causing the UNKNOWN state are shown in the output:
root@nas:~# ./check_smart.pl -g '/dev/sd[a-z]' -i auto
UNKNOWN: [/dev/sde] - No health status line found[/dev/sde] - [/dev/sde] - --- [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean --- [/dev/sdc] - Device is clean --- [/dev/sdd] - Device is clean --- [/dev/sdf] - Device is clean|
The output clearly shows that the drive /dev/sde has caused the UNKNOWN exit status. This improvement potentially saves several minutes of troubleshooting.
This code change was added by Nick Bertrand in pull request #89. Thanks a lot for the contribution!
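To illustrate the idea behind this change, here is a minimal Python sketch of how per-drive results from a glob check could be combined into one status line with the problem drives listed first. It is not the plugin's actual Perl code: the function name summarize, the tuple layout and the severity ranking (CRITICAL before UNKNOWN before WARNING) are assumptions made for this example.

```python
# Hypothetical sketch, not taken from check_smart.pl.
# Standard monitoring plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed ranking for this sketch: higher number means more severe.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}


def summarize(results):
    """Combine (device, status_code, message) tuples into one status line.

    Returns (exit_code, output_line). Drives with problems are listed
    first so the drive causing a non-OK state is immediately visible.
    """
    worst = max((status for _, status, _ in results),
                key=SEVERITY.__getitem__, default=OK)
    # sorted() is stable: clean drives keep their original order.
    ordered = sorted(results, key=lambda r: -SEVERITY[r[1]])
    body = " --- ".join(f"[{dev}] - {msg}" for dev, _, msg in ordered)
    return worst, f"{LABELS[worst]}: {body}"


if __name__ == "__main__":
    code, line = summarize([
        ("/dev/sda", OK, "Device is clean"),
        ("/dev/sde", UNKNOWN, "No health status line found"),
        ("/dev/sdf", OK, "Device is clean"),
    ])
    print(line)
    raise SystemExit(code)
```

The stable sort is the point of the sketch: the failing drive moves to the front of the output, while the healthy drives stay in the order the glob returned them.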
On a single drive check, the plugin shows the drive's model/product number and the serial number. This is handy information to identify the drive (e.g. in an asset database) for further support or investigation.
root@nas:~# ./check_smart.pl -d /dev/sdf -i auto
OK: Drive WDC WD20SPZX-22CRAT0 S/N WD-WX31XXXXX222: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=1850 Start_Stop_Count=8 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=6 Power-Off_Retract_Count=4 Load_Cycle_Count=1239994 Temperature_Celsius=29 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
Sometimes this information should remain hidden though. By using the newly added --hide-sn parameter, the serial number is now shown as <HIDDEN> in the plugin output (and therefore also in your central monitoring GUI or alerts):
root@nas:~# ./check_smart.pl -d /dev/sdf -i auto --hide-sn
OK: Drive WDC WD20SPZX-22CRAT0 S/N <HIDDEN>: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=1850 Start_Stop_Count=8 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=6 Power-Off_Retract_Count=4 Load_Cycle_Count=1239994 Temperature_Celsius=29 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
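The masking itself is simple. As a rough Python sketch (again not the plugin's Perl code; the function name format_drive_line and its parameters are made up for illustration), the serial number is swapped for a placeholder before the status line is built:

```python
# Hypothetical sketch, not taken from check_smart.pl.
def format_drive_line(model, serial, hide_sn=False):
    """Build the human-readable part of the status line.

    With hide_sn=True the serial number is replaced by the literal
    placeholder <HIDDEN>, mirroring the output shown above.
    """
    shown_serial = "<HIDDEN>" if hide_sn else serial
    return f"OK: Drive {model} S/N {shown_serial}: no SMART errors detected."


if __name__ == "__main__":
    print(format_drive_line("WDC WD20SPZX-22CRAT0", "WD-WX31XXXXX222"))
    print(format_drive_line("WDC WD20SPZX-22CRAT0", "WD-WX31XXXXX222", hide_sn=True))
```

Masking at the point where the output is assembled means the serial number never reaches the monitoring server, notifications or logs.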
Due to a recent server issue caused by hard drives with high load/unload cycles, I personally became aware of this underestimated but important SMART attribute 193: Load_Cycle_Count.
According to some quick research, the hard drive vendors seem to align (on most models) and mention a "safe value" of 600'000 (600K) load/unload cycles. A value above that can cause serious performance and/or durability problems.
To save you the troubleshooting time figuring this out, the check_smart.pl plugin now (by default) alerts and shows a CRITICAL state when the Load_Cycle_Count value has reached or is higher than 600K. A WARNING state is issued when the value is above 550'000 cycles.
ck@mint ~/Git/check_smart $ sudo ./check_smart.pl -d /dev/sdd -i auto
CRITICAL: Drive WDC WD20SPZX-22CRAT0 S/N WD-WX31XXXXX222: Load_Cycle_Count is above 600K load cycles (1239994) causing possible performance and durability impact, |Raw_Read_Error_Rate=0 Spin_Up_Time=1841 Start_Stop_Count=9 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=7 Power-Off_Retract_Count=5 Load_Cycle_Count=1239994 Temperature_Celsius=30 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
My advice in such a situation: Replace the drive. It's not as urgent as when bad sectors are detected, but it is better to plan ahead and order a replacement drive.
If the plugin should not alert on the Load_Cycle_Count number (for whatever reason), use the newly added --skip-load-cycles parameter:
ck@mint ~/Git/check_smart $ ./check_smart.pl -d /dev/sdd -i auto --skip-load-cycles
OK: Drive WDC WD20SPZX-22CRAT0 S/N WD-WX31XXXXX222: no SMART errors detected. |Raw_Read_Error_Rate=0 Spin_Up_Time=1841 Start_Stop_Count=9 Reallocated_Sector_Ct=0 Seek_Error_Rate=0 Power_On_Hours=38056 Spin_Retry_Count=0 Calibration_Retry_Count=0 Power_Cycle_Count=7 Power-Off_Retract_Count=5 Load_Cycle_Count=1239994 Temperature_Celsius=30 Reallocated_Event_Count=0 Current_Pending_Sector=0 Offline_Uncorrectable=0 UDMA_CRC_Error_Count=0 Multi_Zone_Error_Rate=0
In this case, the plugin happily reports that everything is well with that drive.
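The threshold logic described above can be sketched in a few lines of Python. This is an illustration only, not the plugin's Perl implementation: the constant and function names are invented for this example, while the two thresholds and the effect of --skip-load-cycles follow the description in this article.

```python
# Hypothetical sketch, not taken from check_smart.pl.
LOAD_CYCLE_WARNING = 550_000   # WARNING when the value is above this
LOAD_CYCLE_CRITICAL = 600_000  # CRITICAL when the value reaches this


def load_cycle_state(load_cycle_count, skip_load_cycles=False):
    """Return the state for SMART attribute 193 (Load_Cycle_Count)."""
    if skip_load_cycles:
        # Equivalent of --skip-load-cycles: never alert on this attribute.
        return "OK"
    if load_cycle_count >= LOAD_CYCLE_CRITICAL:
        return "CRITICAL"
    if load_cycle_count > LOAD_CYCLE_WARNING:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    for value in (8, 550_001, 600_000, 1_239_994):
        print(value, load_cycle_state(value))
    print(1_239_994, load_cycle_state(1_239_994, skip_load_cycles=True))
```

Note the asymmetry in the comparisons: CRITICAL is reached at exactly 600'000 cycles, whereas WARNING only starts above 550'000.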