I wanted to monitor the RAID status of an IBM x3650 M4 server using check_raid. I've been using this plugin for years, and it supports most software and hardware RAID controllers. I never had any problems with it (once the required CLI tools for each hardware controller were installed) until today.
Due to a very strange hardware setup, inherited from an ex-colleague, the server turns out to have two different RAID controllers active: 12 physical drives are attached to one controller, 2 physical drives to the other.
Once I installed the megacli command (from http://hwraid.le-vert.net/), the plugin correctly identified the physical drives behind /dev/sda:
# /usr/lib/nagios/plugins/check_raid -l
megacli
1 active plugins
# /usr/lib/nagios/plugins/check_raid
WARNING: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]
To suppress the warning about the disabled WriteCache:
# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]
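The only difference between the two runs above is the leading status word, which follows the usual Nagios plugin convention (OK=0, WARNING=1, CRITICAL=2, UNKNOWN=3). As a minimal sketch, assuming you have the status line as a string, the state and the WriteCache flag can be pulled out like this. The function name `summarize_status` is made up for illustration and is not part of check_raid:

```python
# Exit codes defined by the Nagios plugin convention.
NAGIOS_EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def summarize_status(line):
    """Return (state, exit_code, write_cache_disabled) for one status line.

    Hypothetical helper: it only looks at the text of the line and does not
    run check_raid itself.
    """
    state = line.split(":", 1)[0].strip()
    # Assumption: anything that is not a known state is reported as UNKNOWN.
    if state not in NAGIOS_EXIT_CODES:
        state = "UNKNOWN"
    return state, NAGIOS_EXIT_CODES[state], "WriteCache:DISABLED" in line


if __name__ == "__main__":
    sample = (
        "WARNING: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; "
        "Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]"
    )
    print(summarize_status(sample))
```

With --cache-fail=OK the WriteCache text stays in the output, only the resulting state changes.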
But where are the other two physical drives? From my experience with hardware RAID controllers, I was pretty sure that megacli can detect multiple controllers and retrieve the drive information from all of them.
A manual check with megacli still returned only 12 drives (two groups of six):
# megacli -CfgDsply -aall |grep Physical
Physical Disk Information:
Physical Disk: 0
Physical Sector Size: 512
Physical Disk: 1
Physical Sector Size: 512
Physical Disk: 2
Physical Sector Size: 512
Physical Disk: 3
Physical Sector Size: 512
Physical Disk: 4
Physical Sector Size: 512
Physical Disk: 5
Physical Sector Size: 512
Physical Disk Information:
Physical Disk: 0
Physical Sector Size: 512
Physical Disk: 1
Physical Sector Size: 512
Physical Disk: 2
Physical Sector Size: 512
Physical Disk: 3
Physical Sector Size: 512
Physical Disk: 4
Physical Sector Size: 512
Physical Disk: 5
Physical Sector Size: 512
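Counting those lines by eye is error-prone, so here is a small sketch that tallies the "Physical Disk: N" entries per "Physical Disk Information:" block. It assumes the text looks like the grep-filtered output above. The function name `count_physical_disks` is my own and not part of megacli:

```python
import re

DISK_LINE = re.compile(r"Physical Disk: \d+")


def count_physical_disks(cfgdsply_output):
    """Return a list with the number of physical disks per information block."""
    groups = []
    for raw_line in cfgdsply_output.splitlines():
        line = raw_line.strip()
        if line == "Physical Disk Information:":
            # A new block starts; begin counting from zero.
            groups.append(0)
        elif DISK_LINE.fullmatch(line):
            if not groups:
                # Disk line without a preceding header: count it anyway.
                groups.append(0)
            groups[-1] += 1
    return groups


if __name__ == "__main__":
    block = "Physical Disk Information:\n" + "".join(
        f"Physical Disk: {n}\nPhysical Sector Size: 512\n" for n in range(6)
    )
    groups = count_physical_disks(block * 2)
    print(groups, "total:", sum(groups))
```

For the output above this gives two blocks of six disks each, so 12 drives in total and still two short of the 14 in the server.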
Thankfully, a colleague who had recently worked on that particular server took a screenshot of the storage controller menu during the boot process.
As it turns out, there are two different storage controllers built into that server: a MegaRAID controller (ServeRAID M5210) and an MPT controller:
# lspci | grep -i LSI
0a:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)
14:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS-3 3108 [Invader] (rev 02)
No wonder megacli wasn't able to find the drives!
I tried again with "mpt-status" (http://hwraid.le-vert.net/wiki/LSIFusionMPT), but it didn't show any configuration:
# apt-get install mpt-status
# /usr/sbin/mpt-status -p
Checking for SCSI ID:0
ioctl: No such device
I removed mpt-status again and tried "sas2ircu", the tool for newer MPT cards. Finally, I got some output:
# apt-get install sas2ircu
# sas2ircu LIST
LSI Corporation SAS2 IR Configuration Utility.
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2009-2013 LSI Corporation. All rights reserved.
         Adapter      Vendor  Device                       SubSys  SubSys
 Index    Type          ID      ID    Pci Address          Ven ID  Dev ID
 -----  ------------  ------  ------  -----------------    ------  ------
   0     SAS2004     1000h    70h   00h:0ah:00h:00h      1014h   040eh
SAS2IRCU: Utility Completed Successfully.
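If you want the adapter index and type in a script (for example to loop over all adapters), the table rows can be picked out of the LIST output with a regular expression. This is only a sketch based on the single sample above. The function name `parse_adapters` and the dictionary keys are my own choice, and other sas2ircu versions may format the table differently:

```python
import re

# One table row: index, adapter type, vendor ID, device ID, PCI address,
# subsystem vendor ID, subsystem device ID (hex values end in "h").
ADAPTER_ROW = re.compile(
    r"(\d+)\s+(\S+)\s+([0-9A-Fa-f]+h)\s+([0-9A-Fa-f]+h)\s+(\S+)"
    r"\s+([0-9A-Fa-f]+h)\s+([0-9A-Fa-f]+h)"
)


def parse_adapters(list_output):
    """Return one dict per adapter row found in 'sas2ircu LIST' output."""
    adapters = []
    for line in list_output.splitlines():
        match = ADAPTER_ROW.fullmatch(line.strip())
        if match is None:
            continue  # banner, header, separator or footer line
        index, adapter_type, vendor, device, pci, sub_ven, sub_dev = match.groups()
        adapters.append(
            {
                "index": int(index),
                "type": adapter_type,
                "vendor_id": vendor,
                "device_id": device,
                "pci_address": pci,
                "subsys_vendor_id": sub_ven,
                "subsys_device_id": sub_dev,
            }
        )
    return adapters


if __name__ == "__main__":
    row = "0 SAS2004 1000h 70h 00h:0ah:00h:00h 1014h 040eh"
    for adapter in parse_adapters(row):
        print(adapter["index"], adapter["type"], adapter["pci_address"])
```

The PCI address 00h:0ah:00h:00h matches the 0a:00.0 entry that lspci reported for the Fusion-MPT controller.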
And, hurray, check_raid was now able to read the information from both controllers:
# /usr/lib/nagios/plugins/check_raid -l
megacli
sas2ircu
2 active plugins
# /usr/lib/nagios/plugins/check_raid --cache-fail=OK
OK: megacli:[Volumes(1): DISK0.0:Optimal,WriteCache:DISABLED; Devices(12): 11,08,01,03,09,10,04,06,12,07,02,05=Online]; sas2ircu:[ctrl #0: 1 Vols: Optimal: 2 Drives: Optimal (OPT)::]
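In hindsight, the lspci output already told me which CLI tool each controller needs. A rough sketch of that lookup follows. The mapping is an assumption based only on what worked on this one server (MegaRAID with megacli, Fusion-MPT SAS-2 with sas2ircu), and `suggest_tools` is a made-up helper name:

```python
# Assumption: substring of the lspci description -> CLI tool that worked here.
TOOL_HINTS = (
    ("MegaRAID", "megacli"),
    ("Fusion-MPT SAS-2", "sas2ircu"),
)


def suggest_tools(lspci_output):
    """Map each PCI address in lspci output to a suggested CLI tool."""
    hints = {}
    for line in lspci_output.splitlines():
        # lspci lines start with the PCI address, followed by the description.
        address, _, description = line.strip().partition(" ")
        for marker, tool in TOOL_HINTS:
            if marker in description:
                hints[address] = tool
                break
    return hints


if __name__ == "__main__":
    sample = (
        "0a:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic "
        "SAS2004 PCI-Express Fusion-MPT SAS-2 [Spitfire] (rev 03)\n"
        "14:00.0 RAID bus controller: LSI Logic / Symbios Logic "
        "MegaRAID SAS-3 3108 [Invader] (rev 02)"
    )
    for address, tool in suggest_tools(sample).items():
        print(address, "->", tool)
```

Controllers that match none of the markers are simply left out of the result, so an empty dictionary means "go and look at the hardware".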
Update November 15th 2018:
This article helped me again today when the check_raid plugin alerted on a failed drive:
# sas2ircu-status
-- Controller informations --
-- ID | Model
c0 | SAS2004
-- Arrays informations --
-- ID | Type | Size | Status
c0u0 | RAID1 | 1906G | Degraded (DGD)
-- Disks informations
-- ID | Model | Status
c0u0p0 | WD2000FYYZ-23UL (WDWMC1P051XXXX) | Failed (FLD)
c0u0p1 | WD2000FYYZ-23UL (WDWMC1P0D3XXXX) | Optimal (OPT)
There is at least one disk/array in a NOT OPTIMAL state.
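To find the broken disk without reading the whole table, the disk rows can be filtered by status. A minimal sketch, assuming the pipe-separated layout shown above and disk IDs of the form c0u0p0. The names `parse_disks` and `non_optimal` are hypothetical helpers, not part of sas2ircu-status:

```python
import re

# Disk IDs look like c<controller>u<unit>p<port>, e.g. c0u0p0.
DISK_ID = re.compile(r"c\d+u\d+p\d+")


def parse_disks(status_output):
    """Return (id, model, status) tuples for every disk row in the output."""
    disks = []
    for line in status_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Disk rows have exactly three columns and a disk-style ID.
        if len(fields) == 3 and DISK_ID.fullmatch(fields[0]):
            disks.append(tuple(fields))
    return disks


def non_optimal(disks):
    """Keep only the disks whose status does not start with 'Optimal'."""
    return [disk for disk in disks if not disk[2].startswith("Optimal")]


if __name__ == "__main__":
    sample = (
        "c0u0 | RAID1 | 1906G | Degraded (DGD)\n"
        "c0u0p0 | WD2000FYYZ-23UL (WDWMC1P051XXXX) | Failed (FLD)\n"
        "c0u0p1 | WD2000FYYZ-23UL (WDWMC1P0D3XXXX) | Optimal (OPT)"
    )
    for disk_id, model, status in non_optimal(parse_disks(sample)):
        print(disk_id, model, status)
```

The array row (c0u0) is skipped on purpose, because it has four columns and no port number in its ID.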