In our Icinga2 monitoring at Infiniroot, one of our priorities is to cover all hardware elements of our servers. There's nothing worse than ignoring hardware defects or not monitoring hardware at all. As a monitoring consultant for many companies (small and large), I see this very often: hardware checks are simply not implemented.
Besides monitoring HDD, SSD and NVMe drives with our own check_smart monitoring plugin, we also query the hardware status of the server's remote management card (iLO on HPE servers, CIMC on Cisco servers).
For our Cisco servers we've used the check_cisco_imc.py Nagios plugin in the past. When we built our new server infrastructure with newer servers in July 2024, we switched to the check_redfish plugin. Not because the former plugin did not work correctly, but because it only supports Python 2.
Recently one of our Cisco UCS C220 servers reported the following hardware alert:
[CRITICAL]: Voltage P3V_BAT_V_MOIN (status: CRITICAL/Enabled): 2.48V
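For anyone who wants to post-process such an alert (for example to graph the voltage), here is a minimal Python sketch that splits the line into its fields. The regular expression is only derived from the single output line shown above, so the exact format is an assumption, and the function name parse_voltage_alert is made up for this example; it is not part of the check_redfish plugin.

```python
import re

# Pattern for lines shaped like the alert above.
# Assumption: the format is inferred from that one example line only.
ALERT_RE = re.compile(
    r"\[(?P<state>\w+)\]: Voltage (?P<sensor>\S+) "
    r"\(status: (?P<health>\w+)/(?P<enabled>\w+)\): (?P<volts>\d+(?:\.\d+)?)V"
)

def parse_voltage_alert(line):
    """Return the fields of a voltage alert line as a dict, or None if it does not match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["volts"] = float(fields["volts"])  # convert the reading to a number
    return fields

alert = parse_voltage_alert(
    "[CRITICAL]: Voltage P3V_BAT_V_MOIN (status: CRITICAL/Enabled): 2.48V"
)
print(alert["sensor"], alert["volts"])
```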
The P3V_BAT_V_MOIN sensor reports the voltage of the server's CMOS battery. This is a CR2032 battery, a very common battery for motherboards (CMOS battery) and... kitchen appliances (such as scales). :-) These CR2032 batteries can be bought anywhere and are not difficult to find.
The reason why this alerts is that after some usage, the battery's voltage decreases. A CR2032 battery is supposed to deliver around 3V. Many devices have an internal threshold of 2.8V or 2.7V. If the voltage of the battery drops below this threshold, the device - in this case the CIMC battery voltage sensor - alerts.
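The threshold logic itself is simple. The following Python sketch compares the reported 2.48V against the two thresholds mentioned above. The function name and the default threshold are chosen for illustration only; the actual threshold used by the CIMC is not known to me.

```python
def battery_below_threshold(reading_volts, threshold_volts=2.7):
    """True if the measured voltage has dropped below the alert threshold.

    Assumption: 2.7V is only an illustrative default, not the CIMC's real value.
    """
    return reading_volts < threshold_volts

reading = 2.48  # value reported in the alert
for threshold in (2.8, 2.7):
    state = "ALERT" if battery_below_threshold(reading, threshold) else "OK"
    print(f"{reading:.2f}V vs. threshold {threshold:.1f}V: {state}")
```

With a reading of 2.48V, both the 2.8V and the 2.7V threshold would trigger the alert, while a fresh battery at around 3V would pass.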
First of all, the Cisco UCS server needs to be shut down and completely powered off. All cables need to be unplugged to make sure there is no power running through the equipment anymore.
The CMOS battery can be found beside CPU1, just below the plastic cover (called the air baffle).
Locating the CMOS battery is the easy part. The tricky part, where you will discover new cuss words, is how to get the %ç&!ing battery out of the battery holder.
The manual only mentions:
"Gently remove the battery from the holder on the motherboard"
Let me tell you: your fingers (and especially your fingernails) won't do it! Believe these wise words.
The trick that works is to take a small screwdriver, slide it into the opening on the left side, get slightly below the inserted CR2032 battery and push it upwards. This way the battery slides out of the hard plastic holder.
Note: I removed fan #2 to get better access to the CMOS battery.
Once the battery was replaced and the server powered on again, it took a couple of minutes until the CIMC was running smoothly again. The next hardware check from our monitoring then showed that all hardware elements were OK again.