On an old HP ProLiant DL380 G4 server running FreeBSD 6.0, I noticed strange behavior when the machine booted.
At first glance I'd say it looks like a file system check. I'm no BSD expert, but this assumption makes sense. Because I suspected a disk failure, I wanted to check the SMART values of all disks. But that's easier said than done. First of all, the disks sit behind an HP RAID controller, so they are not presented to FreeBSD as ordinary ATA or SCSI disks; they have to be addressed through the controller's ciss device (cciss in smartmontools terms).
Now to the next downer: smartmontools has only supported cciss since version 5.38. Guess what? The smartmontools package for FreeBSD 6.0 is version 5.33 (see FreeBSD's FTP archive for 6.0). Fortunately, the smartmontools package was updated to 5.38 in FreeBSD 6.4 (see FreeBSD's FTP archive for 6.4), and that package can be installed on FreeBSD 6.0, too.
So I downloaded and installed smartmontools:
pkg_add smartmontools-5.38.tbz
smartmontools has been installed
To check the status of drives, use the following:
/usr/local/sbin/smartctl -a /dev/ad0 for first ATA drive
/usr/local/sbin/smartctl -a /dev/da0 for first SCSI drive
To enable monitor of drives, you can use /usr/local/sbin/smartd
A sample configuration file has been installed as
/usr/local/etc/smartd.conf.sample
Copy this file to /usr/local/etc/smartd.conf and edit appropriately
To have smartd start at boot
echo 'smartd_enable="YES"' >> /etc/rc.conf
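Since everything that follows depends on having at least version 5.38, it can be worth confirming which version actually got installed. The following is a minimal sketch, not something from the package itself: the function names parse_version and supports_cciss are made up for illustration, and it assumes the version header looks like "smartctl version 5.38 [...]", as printed by the 5.x releases.

```shell
#!/bin/sh
# Sketch: check that the installed smartctl is new enough for "-d cciss,N".
# Assumption: the install path matches the package message above.
SMARTCTL="${SMARTCTL:-/usr/local/sbin/smartctl}"

# Extract "5.38" from a header line such as
# "smartctl version 5.38 [i386-portbld-freebsd6.4] Copyright ...".
parse_version() {
    sed -n 's/^smartctl version \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 (major.minor) is at least 5.38.
supports_cciss() {
    [ -n "$1" ] || return 1
    major=${1%%.*}
    minor=${1#*.}
    [ "$major" -gt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -ge 38 ]; }
}

# Only query the real binary when it is actually installed.
if [ -x "$SMARTCTL" ]; then
    ver=$("$SMARTCTL" -V | parse_version)
    if supports_cciss "$ver"; then
        echo "smartctl $ver: cciss should be supported"
    else
        echo "smartctl ${ver:-unknown}: too old for -d cciss"
    fi
fi
```

With the 5.33 package from FreeBSD 6.0 this should report the version as too old; with the 5.38 package it should pass.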
It took me a while to figure out the syntax for disks behind cciss, but eventually I got the first results:
smartctl -iH -d cciss,0 /dev/ciss0
smartctl version 5.38 [i386-portbld-freebsd6.4] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: COMPAQ BF0368A4CA Version: HPB5
Serial number: 3WQ18WXXXXXXXXXXXQQ
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Sun Nov 3 20:40:59 2013 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
smartctl -iH -d cciss,1 /dev/ciss0
smartctl version 5.38 [i386-portbld-freebsd6.4] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: COMPAQ BF03688284 Version: HPB5
Serial number: 3WQ15KZMWXXDFDFWWWV
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Sun Nov 3 20:41:12 2013 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
The important part here is that the device at the end is always /dev/ciss0, which is the RAID controller and not an individual disk. To select one of the physical disks attached to /dev/ciss0, "-d cciss,N" must be used, where N is the index of the disk. This server has 6 drives, so I could go from "cciss,0" up to "cciss,5".
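Typing the command six times gets old quickly, so the same check can be wrapped in a small loop. This is only a sketch under the assumptions of this particular server: one controller at /dev/ciss0 and 6 drives. The variable names and the function check_all_drives are my own and not part of smartmontools.

```shell
#!/bin/sh
# Sketch: run "smartctl -iH" against every physical drive behind the controller.
# Assumptions: a single controller at /dev/ciss0 with 6 drives (cciss,0 to cciss,5).
SMARTCTL="${SMARTCTL:-/usr/local/sbin/smartctl}"
CTRL="${CTRL:-/dev/ciss0}"
NDRIVES="${NDRIVES:-6}"

check_all_drives() {
    n=0
    while [ "$n" -lt "$NDRIVES" ]; do
        echo "=== cciss,$n on $CTRL ==="
        # -i prints the drive information, -H the health status
        "$SMARTCTL" -iH -d "cciss,$n" "$CTRL"
        n=$((n + 1))
    done
}

# Only query the real binary when it is actually installed.
if [ -x "$SMARTCTL" ]; then
    check_all_drives
else
    echo "smartctl not found at $SMARTCTL" >&2
fi
```

For a different drive count, NDRIVES can be overridden on the command line, e.g. NDRIVES=2 sh check_drives.sh (the script name is just an example).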
The parameters -iH at the beginning mean "show me the disk's information" (-i) and "show me the disk's health status" (-H).
To read more values (e.g. temperature, read errors, etc.), the parameter -a needs to be used:
smartctl -d cciss,3 /dev/ciss0 -a
smartctl version 5.38 [i386-portbld-freebsd6.4] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
Device: COMPAQ BD1468A4C5 Version: HPB4
Serial number: 3KS2TRV30000762072WC
Device type: disk
Transport protocol: Parallel SCSI (SPI-4)
Local Time is: Sun Nov 3 21:17:59 2013 CET
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 23 C
Drive Trip Temperature: 68 C
Elements in grown defect list: 48
Vendor (Seagate) cache information
Blocks sent to initiator = 1088366961
Blocks received from initiator = 3948371350
Blocks read from cache and sent to initiator = 794704138
Number of read and write commands whose size <= segment size = 3304384398
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 67077.17
number of minutes until next internal SMART test = 78
Error counter log:
          Errors Corrected by          Total      Correction    Gigabytes     Total
              ECC          rereads/    errors     algorithm     processed     uncorrected
          fast | delayed   rewrites    corrected  invocations   [10^9 bytes]  errors
read:        0        0        0          0           0            0.000         0
write:       0        0        0          0           0            0.000         0
Non-medium error count: 218
SMART Self-test log
Num  Test              Status       segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                    number   (hours)
# 1  Background short  Completed       -        0            -       [-   -   -]
Long (extended) Self Test duration: 2643 seconds [44.0 minutes]
Take a look at the following line: Elements in grown defect list: 48
The grown defect list counts sectors the drive has had to remap since it left the factory. Disk 4 (cciss,3) was the only disk with elements in this list. Looks like I found the bad guy.
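To compare all drives at a glance, the defect count can be pulled out of the full output for each disk. Again a sketch only: grown_defects and report_defects are hypothetical helper names, the controller path and drive count are this server's, and the parsing assumes the line is printed exactly as in the output above. If a drive does not print that line at all, the script reports "unknown" instead of guessing.

```shell
#!/bin/sh
# Sketch: print the grown defect count for each drive behind the controller.
# Assumptions: a single controller at /dev/ciss0 with 6 drives (cciss,0 to cciss,5).
SMARTCTL="${SMARTCTL:-/usr/local/sbin/smartctl}"
CTRL="${CTRL:-/dev/ciss0}"
NDRIVES="${NDRIVES:-6}"

# Read "smartctl -a" output on stdin and print only the defect count.
grown_defects() {
    sed -n 's/^Elements in grown defect list: *\([0-9][0-9]*\).*/\1/p'
}

report_defects() {
    n=0
    while [ "$n" -lt "$NDRIVES" ]; do
        count=$("$SMARTCTL" -a -d "cciss,$n" "$CTRL" | grown_defects)
        echo "cciss,$n: ${count:-unknown}"
        n=$((n + 1))
    done
}

# Only query the real binary when it is actually installed.
if [ -x "$SMARTCTL" ]; then
    report_defects
else
    echo "smartctl not found at $SMARTCTL" >&2
fi
```

Any drive that shows a count above zero is a candidate for a closer look with the full -a output.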