With much joy I can finally announce version 6.0 of the monitoring plugin check_smart, which monitors the SMART status of physical block devices (hard drives and solid state drives alike). I had been thinking for a while about introducing multi-attribute checks, and I also wanted to offer a multi-threshold option. Today is the day!
The new version is a major release, and a couple of new features were added:
With these added features, the plugin can now detect and report failures or increased counts on multiple SMART attributes. You still have the freedom to override the defaults and to set thresholds to your liking.
This is especially helpful when you want to see how a drive ages over time and create graphs from the performance data, but don't want to be alerted every time a new defective sector is detected.
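For graphing, the performance data is everything after the `|` in the plugin output, given as space-separated `label=value` pairs. The following is a minimal Python sketch of how a consumer could read these pairs. It is not part of the plugin; the function name `parse_perfdata` is hypothetical, and it assumes every value is a plain integer, as in the outputs shown in this post.

```python
def parse_perfdata(plugin_output):
    """Return the label=value pairs found after the first '|' as a dict.

    Assumption: values are plain integers without units or thresholds.
    """
    _, _, perfdata = plugin_output.partition("|")
    values = {}
    for pair in perfdata.split():
        label, _, raw = pair.partition("=")
        values[label] = int(raw)
    return values


sample = (
    "WARNING: Reallocated_Sector_Ct is non-zero (16)"
    "|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890"
)
metrics = parse_perfdata(sample)
print(metrics["Power_On_Hours"])  # prints 25058
```

In practice the monitoring core (Nagios, Icinga and the like) does this parsing for you and hands the values to the graphing backend; the sketch only illustrates the format.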
Sure! Here they are.
In the following example we have a (likely) dying SSD. Running the plugin now shows several issues.
# ./check_smart.pl -d /dev/sdc -i ata
WARNING: Reallocated_Sector_Ct is non-zero (16), Program_Fail_Cnt_Total is non-zero (16), Runtime_Bad_Block is non-zero (16), Uncorrectable_Error_Cnt is non-zero (33281)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249342716 Total_LBAs_Read=2781582604
Before version 6.0, only the Current_Pending_Sector attribute was checked, but this SSD has no such attribute. The plugin would have returned OK all the time, although something bad is clearly happening with this solid state drive.
Let's assume that we don't want to be alerted yet, but only once the values of these attributes get worse. We can now set a threshold for each attribute with the new -w parameter:
# ./check_smart.pl -d /dev/sdc -i ata -w 'Reallocated_Sector_Ct=17,Program_Fail_Cnt_Total=17,Runtime_Bad_Block=17,Uncorrectable_Error_Cnt=34000'
OK: no SMART errors detected. Reallocated_Sector_Ct is non-zero (16) (but less than threshold 17), Program_Fail_Cnt_Total is non-zero (16) (but less than threshold 17), Runtime_Bad_Block is non-zero (16) (but less than threshold 17), Uncorrectable_Error_Cnt is non-zero (33281) (but less than threshold 34000)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249534028 Total_LBAs_Read=2781582604
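The threshold behaviour seen in the two outputs above can be sketched roughly as follows. This is an illustration in Python, not the plugin's actual Perl code: the function names `parse_thresholds` and `evaluate` are hypothetical, and I assume here that a value equal to its threshold already counts as a warning, since the output only confirms that "less than threshold" is OK.

```python
def parse_thresholds(spec):
    """Turn 'Name=17,Other=34000' into {'Name': 17, 'Other': 34000}."""
    thresholds = {}
    for item in spec.split(","):
        if not item:
            continue
        name, _, value = item.partition("=")
        thresholds[name.strip()] = int(value)
    return thresholds


def evaluate(raw_values, checked, thresholds):
    """Return (status, messages) for the attributes listed in `checked`."""
    status = "OK"
    messages = []
    for name in checked:
        value = raw_values.get(name, 0)
        if value == 0:
            continue  # a zero raw value is never reported
        limit = thresholds.get(name)
        if limit is not None and value < limit:
            # Non-zero, but still below the user-defined threshold
            messages.append(
                f"{name} is non-zero ({value}) (but less than threshold {limit})"
            )
        else:
            # No threshold given, or threshold reached (assumption: >= warns)
            status = "WARNING"
            messages.append(f"{name} is non-zero ({value})")
    return status, messages


raw = {"Reallocated_Sector_Ct": 16, "Uncorrectable_Error_Cnt": 33281}
limits = parse_thresholds("Reallocated_Sector_Ct=17,Uncorrectable_Error_Cnt=34000")
print(evaluate(raw, list(raw), {})[0])      # prints WARNING
print(evaluate(raw, list(raw), limits)[0])  # prints OK
```

Note that the below-threshold attributes are still mentioned in the message, so the information is not lost even though the state is OK.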
We can also tell the plugin to completely ignore certain attributes with the already existing -e (--exclude) parameter:
# ./check_smart.pl -d /dev/sdc -i ata -e 'Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Runtime_Bad_Block'
WARNING: Uncorrectable_Error_Cnt is non-zero (33281)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249583900 Total_LBAs_Read=2781582604
As stated above, the plugin by default checks a list of SMART attributes that is defined in the plugin code. This list can be overridden, however. Let's assume you only want to check the raw value of "Reallocated_Sector_Ct":
# ./check_smart.pl -d /dev/sdc -i ata -r 'Reallocated_Sector_Ct'
WARNING: Reallocated_Sector_Ct is non-zero (16)|Reallocated_Sector_Ct=16 Power_On_Hours=25058 Power_Cycle_Count=890 Program_Fail_Count_Chip=11 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=870 Used_Rsvd_Blk_Cnt_Chip=396 Used_Rsvd_Blk_Cnt_Tot=722 Unused_Rsvd_Blk_Cnt_Tot=3310 Program_Fail_Cnt_Total=16 Erase_Fail_Count_Total=0 Runtime_Bad_Block=16 Uncorrectable_Error_Cnt=33281 ECC_Error_Rate=33281 Offline_Uncorrectable=0 CRC_Error_Count=0 Available_Reservd_Space=1620 Total_LBAs_Written=2249682452 Total_LBAs_Read=2781582604
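How the default list, the -r override and the -e exclusions fit together could look roughly like this Python sketch. It is not taken from the plugin: `DEFAULT_RAW_LIST` is a hypothetical stand-in built only from attributes mentioned in this post (the real default list lives in the plugin code), `attributes_to_check` is a made-up name, and applying the exclude list on top of a custom raw list is my assumption.

```python
# Hypothetical default list for illustration only; the real one is in the plugin.
DEFAULT_RAW_LIST = [
    "Current_Pending_Sector",
    "Reallocated_Sector_Ct",
    "Program_Fail_Cnt_Total",
    "Runtime_Bad_Block",
    "Uncorrectable_Error_Cnt",
]


def attributes_to_check(raw_list=None, exclude_list=None):
    """Return the attribute names whose raw values should be checked.

    raw_list and exclude_list are comma-separated strings, like the
    arguments given to -r and -e on the command line.
    """
    names = raw_list.split(",") if raw_list else list(DEFAULT_RAW_LIST)
    excluded = set(exclude_list.split(",")) if exclude_list else set()
    return [name for name in names if name not in excluded]


print(attributes_to_check(raw_list="Reallocated_Sector_Ct"))
print(attributes_to_check(
    exclude_list="Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Runtime_Bad_Block"
))
```

Either way, the performance data still contains all attributes, as the outputs above show; only the alerting is affected.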
The -b parameter was used to set manual thresholds for "defect sectors". In the plugin this meant that only the ATA attribute "Current_Pending_Sector" was checked. For SCSI drives this checked the value of the "grown defect list".
Because of the new -w parameter, the -b parameter is no longer needed, at least not on ATA drives. For backward compatibility, I adjusted the plugin so that a -b parameter, when used, automatically adds "Current_Pending_Sector" and its value to the threshold list.
SCSI drives, however, have no SMART attributes as we know them from ATA drives; there is only the "grown defect list". On SCSI drives, the -b parameter can therefore still be used to set a threshold for this value.
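The backward-compatible handling of -b described above could be sketched like this in Python. Again, this is only an illustration under assumptions, not the plugin's code: `apply_bad_option` is a hypothetical name, only the interface values "ata" and "scsi" are considered, and letting an explicit -w threshold for "Current_Pending_Sector" win over -b is my assumption.

```python
def apply_bad_option(interface, bad, thresholds):
    """Return (thresholds, scsi_defect_limit) after taking -b into account.

    interface:  "ata" or "scsi" (other interfaces are not covered here)
    bad:        the integer given to -b, or None if -b was not used
    thresholds: attribute thresholds already parsed from -w
    """
    merged = dict(thresholds)  # do not modify the caller's dict
    scsi_defect_limit = None
    if bad is None:
        return merged, scsi_defect_limit
    if interface == "scsi":
        # SCSI: -b still applies to the grown defect list
        scsi_defect_limit = bad
    else:
        # ATA: -b becomes a threshold for Current_Pending_Sector
        # (assumption: an explicit -w value takes precedence)
        merged.setdefault("Current_Pending_Sector", bad)
    return merged, scsi_defect_limit


print(apply_bad_option("ata", 5, {"Reallocated_Sector_Ct": 17}))
print(apply_bad_option("scsi", 5, {}))
```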