Last week I wrote several posts about S.M.A.R.T. checks on FreeBSD. They work and can definitely be used for monitoring production servers, but there is one issue that needs to be addressed: the drive order used by smartctl (cciss,N) is not necessarily the physical order!
Let's go into some detail. Last week I got an alert from check_smart.pl that a disk in an HP ProLiant DL380 G5 running FreeBSD 9.1 had defective sectors (elements in grown defect list). I verified this manually with the smartctl command:
smartctl -d cciss,0 /dev/ciss0 -a
smartctl 6.0 2012-10-10 r3643 [FreeBSD 9.1-RELEASE-p4 amd64] (local build)
Copyright (C) 2012-12, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/ciss0 [cciss_disk_00] [SCSI]: Device open changed type from 'sat,auto' to 'cciss'
Vendor: HP
[...]
Serial number: 123450000999VE
Device type: disk
Transport protocol: SAS
Local Time is: Fri Nov 8 13:59:19 2013 CET
[...]
Elements in grown defect list: 12
Logically, to me, "cciss,0" means the very first disk of the server. So that would be drive slot #1.
I exchanged the drive and ran smartctl again:
smartctl -d cciss,0 /dev/ciss0 -a
smartctl 6.0 2012-10-10 r3643 [FreeBSD 9.1-RELEASE-p4 amd64] (local build)
Copyright (C) 2012-12, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/ciss0 [cciss_disk_00] [SCSI]: Device open changed type from 'sat,auto' to 'cciss'
Vendor: HP
[...]
Serial number: 123450000999VE
Device type: disk
Transport protocol: SAS
Local Time is: Fri Nov 8 15:27:33 2013 CET
[...]
Elements in grown defect list: 14
Did you notice that the drive behind cciss,0 still reports the exact same serial number? That means I replaced the wrong disk.
After some research, I found this archived FreeBSD mailing list article from 2008: http://lists.freebsd.org/pipermail/freebsd-ports/2008-April/048312.html
The author of the post describes the exact same phenomenon on his FreeBSD machine:
The recent incorporation of the FreeBSD CISS SMART support into the
mainstream smartmontools distribution has had some unexpected results on
several HP ProLiant DL380 G3 machines. I have five DL380/G3s with four
drives each; all have the same symptoms now: querying a given ciss/scsi
target gives results for the wrong drive
It seems the disk numbering was correct before smartmontools 5.38. Unfortunately, FreeBSD does not ship a tool to list all physical drives behind the controller; camcontrol devlist only shows the RAID controller's logical drives.
As stupid as it sounds, labeling each drive with a sticker showing its serial number can help you identify the disk in its physical slot. You can find the serial number in the smartctl output and match it against the physical drive.
So if you run FreeBSD behind a CCISS (HP Smart Array) RAID controller, be extra careful and don't trust the cciss numbering!
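To make the serial number comparison less error-prone, the lookup can be scripted. The following is only a minimal sketch: the function names extract_serial and list_cciss_serials are made up for illustration, and the device node and drive count are assumptions you have to adapt to your server. It relies only on the "Serial number:" line that smartctl prints for SCSI/SAS drives, as seen in the output above.

```shell
#!/bin/sh
# Print the serial number found in `smartctl -i` output read from stdin.
# Assumes the "Serial number:" label shown in the output above.
extract_serial() {
    awk -F': *' '/^Serial number:/ { print $2; exit }'
}

# Print one "cciss,N <TAB> serial" line per index.
# $1 = controller device node, $2 = number of physical drives
# (both are assumptions you pass in; nothing is auto-detected).
list_cciss_serials() {
    dev=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        serial=$(smartctl -i -d "cciss,$i" "$dev" | extract_serial)
        printf 'cciss,%s\t%s\n' "$i" "${serial:-unknown}"
        i=$((i + 1))
    done
}

# Example, run as root on the affected host:
# list_cciss_serials /dev/ciss0 4
```

Running this before and after a disk swap shows right away whether the serial number behind the suspicious index has actually changed.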
Update, still Nov 11th 2013:
After some replacement tests, it seems that FreeBSD sees the disks the other way around: cciss,0 is the last disk and cciss,3 the first (in a server with 4 physical disks). If it is always like this, the physical disk can be identified. But what happens if a new disk is inserted? Is a recount necessary when disk #5 appears as cciss,0 or will it appear as cciss,5? I have no idea...
Update 2, again Nov 11th 2013:
I just came across the command cciss_vol_status which can be compiled on FreeBSD and Linux from http://sourceforge.net/projects/cciss/files/cciss_vol_status/. So I gave it a shot and installed it:
cd /tmp
fetch http://downloads.sourceforge.net/project/cciss/cciss_vol_status/cciss_vol_status-1.11.tar.gz
tar -xzf cciss_vol_status-1.11.tar.gz
cd cciss_vol_status-1.11
./configure
make
make install
Then I ran the command against the /dev/ciss0 device and at first I was disappointed - again:
cciss_vol_status -s /dev/ciss0
/dev/ciss0: (Smart Array P400) RAID 1 Volume 0 status: OK.
/dev/ciss0: (Smart Array P400) RAID 1 Volume 1 status: OK.
My face brightened up when I tried the verbose option (-V):
cciss_vol_status -V /dev/ciss0
Controller: Smart Array P400
Board ID: 0x3234103c
Logical drives: 2
Running firmware: 5.20
ROM firmware: 5.20
/dev/ciss0: (Smart Array P400) RAID 1 Volume 0 status: OK.
/dev/ciss0: (Smart Array P400) RAID 1 Volume 1 status: OK.
Physical drives: 4
connector 2I box 1 bay 4 HP DG072ABAB3 XXXXXXXX00009732RCV7 HPDD OK
connector 2I box 1 bay 3 HP DG072BB975 XXXXXXXX00009907Q0VR HPDC OK
connector 2I box 1 bay 2 HP DG072BB975 XXXXXXXX00009906P4DN HPDC OK
connector 2I box 1 bay 1 HP DG072BB975 XXXXXXXX00009907RPKW HPDC OK
/dev/ciss0(Smart Array P400:0): Non-Volatile Cache status:
Cache configured: Yes
Read cache memory: 52 MiB
Write cache memory: 156 MiB
Write cache enabled: Yes
So THIS is exactly what I needed! I can now finally compare the serial number from smartctl output and match it against the correct physical slot. Problem solved!
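The matching step can be scripted as well. This is a rough sketch, not a tested tool: the function name bay_for_serial is hypothetical, and the field positions (serial number in the 9th column) are taken from the sample output above, so they may differ with other controllers or versions of cciss_vol_status. It also assumes that smartctl and cciss_vol_status report the serial number in the same format.

```shell
#!/bin/sh
# Look up the physical location of a drive by its serial number.
# Reads `cciss_vol_status -V` output from stdin; $1 is the serial number.
# Field positions are assumed from the sample output above:
#   connector <c> box <b> bay <n> <vendor> <model> <serial> <fw> <status>
bay_for_serial() {
    awk -v serial="$1" '
        $1 == "connector" && $5 == "bay" && $9 == serial {
            printf "connector %s box %s bay %s\n", $2, $4, $6
            found = 1
        }
        # Exit non-zero when no drive line matched the serial number.
        END { exit (found ? 0 : 1) }
    '
}

# Example, run as root on the affected host:
# serial=$(smartctl -i -d cciss,0 /dev/ciss0 | awk -F': *' '/^Serial number:/ { print $2; exit }')
# cciss_vol_status -V /dev/ciss0 | bay_for_serial "$serial"
```

The function returns a non-zero exit status when the serial number is not found, so a wrapper script can refuse to name a bay instead of guessing.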
macan wrote on Jun 8th, 2016:
cciss_vol_status is in ports:
/usr/ports/sysutils/cciss_vol_status/