Since the first Release Candidate (RC1) of Debian 11 (Bullseye) became available, I have been testing the new version on an older HP Proliant DL380 G7 server. Although old and EOL, this server model still runs smoothly and performs well with Linux installed. But I ran into unexpected boot failures - even with RC2 and the official production release.
Interestingly though, once Bullseye was installed, the Operating System did not boot properly: the server ran into a freeze. Neither the CTRL+ALT+DEL key combo nor a momentary press of the power button would work. Only a (hard) server reset would release the machine and force a power down.
With normal boot parameters, the "crash" would not be visible. But with an additional "debug" parameter added to the Kernel command line in the Grub config, the following problems could be spotted.
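As a side note, here is a minimal sketch of how such a parameter can be added persistently, assuming the default Debian layout of /etc/default/grub (for a one-off test, the Kernel command line can also be edited directly by pressing "e" in the Grub boot menu):

root@bullseye:~# grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet debug"
root@bullseye:~# update-grub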
I also recorded a video of the failed Bullseye boot:
With every failed boot, the logged error mentioned cpuidle_enter. This raised a couple of questions concerning the source of the problem.
But then at the next boot it would work again, booting all the way to the login prompt. If there really was a hardware incompatibility, why would the server boot fine sometimes and freeze at other times?
To figure out how stable (or unstable) Bullseye on this HP Proliant DL380 G7 actually is, I decided to do mass boot testing. Once Debian Bullseye booted 10x in a row without a hiccup, I would call it stable.
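As a side note, the systemd journal can help keep track of such boot attempts, assuming persistent journal storage is enabled (an existing /var/log/journal directory or Storage=persistent in journald.conf):

root@bullseye:~# journalctl --list-boots
root@bullseye:~# journalctl -k -b -1

The first command lists all recorded boots, the second one shows the Kernel messages of the previous boot - although with a hard freeze like this one, the final messages often never make it to disk.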
Boot Attempt | Success or Fail | Changes / Description
#1 | FAIL | -
#2 | FAIL | BIOS -> Power Management: HP Power Profile changed to "Custom"; HP Power Regulator changed to "OS Control Mode"
#3 | OK |
#4 | OK |
#5 | OK |
#6 | OK |
#7 | FAIL |
#8 | OK |
#9 | OK | Installed "intel-microcode" package
#10 | OK | Freeze of system after 2 minutes -> FAIL
#11 | FAIL | BIOS -> Power Management Options -> Advanced Power Management Options: Minimum Proc Idle Power Core State set to "No C-States"; Minimum Proc Idle Power Package State set to "No Package States"
#12 | OK |
#13 | FAIL |
#14 | OK |
#15 | FAIL |
#16 | FAIL | Noticed the following error in the console: iTCO_wdt unable to reset NO_REBOOT flag, device disabled by hardware/BIOS
#17 | OK |
#18 | FAIL |
#19 | OK | Added "idle=nomwait" to Grub config (Kernel cmdline), see the snippet after this table
#20 | FAIL |
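For completeness, here is a sketch of the idle=nomwait change from attempt #19 (same assumption about the default /etc/default/grub layout as above); after the reboot, the active parameters and the cpuidle driver in use can be verified with the last two commands:

root@bullseye:~# grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet debug idle=nomwait"
root@bullseye:~# update-grub
root@bullseye:~# cat /proc/cmdline
root@bullseye:~# cat /sys/devices/system/cpu/cpuidle/current_driver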
As you can see from all the boot tests, the wanted ten successful boots in a row did not happen. But is this problem caused by Kernel 5.10 or by the new Debian Bullseye itself (maybe a certain way of configuration)?
To answer this, another Operating System (Ubuntu 20.04) was installed and tested.
The idea behind installing Ubuntu 20.04 was to test the older Kernel 5.4. Would this one work? And how would the Ubuntu installation behave in general compared to Debian?
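Which Kernel each installation actually runs can be quickly confirmed with uname; a stock Ubuntu 20.04 (GA kernel) should report a 5.4 series Kernel, while Bullseye reports 5.10 (the ubuntu hostname in the prompt is just illustrative):

root@ubuntu:~# uname -r
root@bullseye:~# uname -r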
Boot Attempt | Success or Fail | Changes / Description
#1 | OK |
#2 | OK |
#3 | OK |
#4 | OK |
#5 | OK |
#6 | OK |
#7 | OK |
#8 | OK |
#9 | OK |
#10 | OK |
Surprise, surprise! The server booted correctly 10x in a row with Ubuntu 20.04! To be honest, I did not expect that. A hardware defect can therefore definitely be ruled out in this case. But is it the Kernel version causing problems or something distribution-specific?
Meanwhile, back on Debian 11 again, the tests were about to continue. I prepared myself to run a Kernel bisect to find out which Kernel version actually triggered the boot issues. Unfortunately the documentation seems to be so out of date that all attempts to correctly run a bisect failed. At that point I wanted to focus on preparing ILO3 for additional hardware monitoring (using the check_ilo2_health monitoring plugin) and then get back to the bisect procedure.

While testing the monitoring plugin with a read-only user, ILO completely froze; not only the XML Remote API but also the User Interface in the browser. Nothing too bad, I thought, I'd just stop the monitoring and let ILO recover. But I quickly realized that the server itself had stopped responding to pings. A look at the console revealed that the server (again running Debian Bullseye) had frozen and run into a Kernel panic:
According to the console output, the crash was triggered by an NMI:
<NMI>
dump_stack+0x6b/0x83
panic+0x101/0x2d7
nmi_panic.cold+0xc/0xc
hpwdt_pretimeout+0x7f/0xd0 [hpwdt]
nmi_handle+0x58/0x100
default_do_nmi+0x98/0x130
exc_nmi+0x12f/0x150
end_repeat_nmi+0x16/0x55
RIP: 0010:mwait_idle_with_hints.constprop.0+0x4b/0x90
Code: 65 48 8b 04 25 c0 7b 01 00 0f 01 c8 48 8b 00 a8 08 75 17 e9 07 [...]
RSP: 0018:ffffffff87203e58 EFLAGS: 00000046
RAX: 0000000000000001 RBX: 0000000000000002 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffffff873ae6c0 RDI: 0000000000000001
RBP: ffffccdbff218e00 R08: 0000068289c6860d R09: 0000068439b22cc1
R10: 0000000000000f8e R11: 00000000001d86ca R12: 0000000000000002
R13: ffffffff873ae7a8 R14: 0000000000000002 R15: 0000000000000000
? mwait_idle_with_hints.constprop.0+0x4b/0x90
? mwait_idle_with_hints.constprop.0+0x4b/0x90
</NMI>
intel_idle+0x1f/0x30
cpuidle_enter_state+0x89/0x350
cpuidle_enter+0x29/0x40
do_idle+0x1ef/0x2b0
cpu_startup_entry+0x19/0x20
start_kernel+0x587/0x5a8
secondary_startup_64_no_verify+0xb0/0xbb
The logged entries looked eerily similar to the ones seen during the boot freezes (cpuidle_enter), but with one major difference: at the beginning of this Kernel panic, the module causing it can be spotted:
hpwdt_pretimeout+0x7f/0xd0 [hpwdt]
With finally more information at hand, it was time to find out more about this hpwdt Kernel module and what it actually does. It turns out that this Kernel module triggers an NMI on the Operating System in certain situations, for example if ILO is hanging:
If the system gets into a bad state and hangs, the HPE ProLiant iLO timer register will not be updated in a timely fashion and a hardware system reset (also known as an Automatic Server Recovery (ASR)) event will occur.
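On a running system, the module's description and available parameters can be listed with modinfo (output omitted here):

root@bullseye:~# modinfo hpwdt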
The pretimeout seen before is actually one of these module parameters:
pretimeout - allows the user to set the watchdog pretimeout value.
This is the number of seconds before timeout when an
NMI is delivered to the system. Setting the value to
zero disables the pretimeout NMI.
Default value is 9 seconds.
This explains the Kernel panic from above: ILO was not responding and after 9 seconds the HP watchdog was still unable to communicate with ILO - forcing an NMI on the system.
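As a side note derived from the parameter documentation above: the pretimeout NMI itself could be disabled by setting the parameter to zero via a modprobe option. This is just a sketch and was not tested here:

root@bullseye:~# echo "options hpwdt pretimeout=0" > /etc/modprobe.d/hpwdt.conf
root@bullseye:~# update-initramfs -k all -u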
But could hpwdt also be responsible for the boot problems? After all, the logged errors in the console looked very similar.
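Whether the module is automatically loaded on such a system can be checked quickly:

root@bullseye:~# lsmod | grep hpwdt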
Additional research led to a mailing list conversation (HPWDT watchdog module leads to panics), based on the reported Ubuntu bug #1432837:
We have been seeing random crashes from various HP systems, this has been tracked to loading of the hpwdt watchdog modules. Basically these modules are a loaded gun and unless you know exactly what you are doing you are likely to take off your own head. For this reason we already blacklist "all" of these modules in kmod/module-init-tools blacklists.
This basically means: Ubuntu has been disabling (blacklisting) the hpwdt module since 2015! This would perfectly explain why Ubuntu 20.04 booted successfully, ten times in a row.
But was this bug only reported in Ubuntu? No! Almost every Linux distribution got a bug report concerning panics with hpwdt.
Let's talk turkey! Assuming the hpwdt module really causes the boot problems, let's disable (blacklist) it, the same way Ubuntu mentions in the bug report, and update the initramfs:
root@bullseye:~# echo "blacklist hpwdt" >> /etc/modprobe.d/blacklist-hp.conf
root@bullseye:~# cat /etc/modprobe.d/blacklist-hp.conf
blacklist hpwdt
root@bullseye:~# update-initramfs -k all -u
root@bullseye:~# update-grub
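As an alternative for a quick one-off test (my own suggestion, not part of the procedure above), the module can also be kept from loading for a single boot, without touching the initramfs, by appending the following parameter to the Kernel command line in the Grub boot menu:

modprobe.blacklist=hpwdt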
After the blacklist change, the server was rebooted and the multi boot testing started again:
Boot Attempt | Success or Fail | Changes / Description
#1 | OK |
#2 | OK |
#3 | OK |
#4 | OK |
#5 | OK |
#6 | OK |
#7 | OK |
#8 | OK |
#9 | OK |
#10 | OK |
And here we go! Ten boots in a row without any freezes or crashes, once hpwdt was disabled!
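To double check that the blacklist is effective, the loaded module list should now come back empty:

root@bullseye:~# lsmod | grep hpwdt
root@bullseye:~#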
To determine whether or not a Kernel module should be loaded (or unloaded), there's actually only one question to be answered: Do you need it?
Looking at the specifics of hpwdt and what it actually does:
The HPE iLO NMI Watchdog driver is a kernel module that provides basic watchdog functionality and handler for the iLO "Generate NMI to System" virtual button.
How often does it happen that someone needs to trigger an NMI (non-maskable interrupt) via ILO in production? Probably never. I did that once or twice on test machines in the last 10 years. So at least in our situation we're fine with disabling this module - even more so when comparing the positives (more stable system, no boot problems) with the negatives (can't launch an NMI from ILO).
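For anyone making the same call on their own systems: before blacklisting, it may be worth verifying that no process actually uses the watchdog device (fuser is part of the psmisc package):

root@bullseye:~# ls -l /dev/watchdog*
root@bullseye:~# fuser -v /dev/watchdog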
During this troubleshooting and research, a lot of documents, bug reports, mailing list posts etc. about hpwdt have been read. Here are some interesting ones:
Juan Pablo Madrigal Calderon from Costa Rica wrote on Jul 19th, 2024:
It also worked in Debian 12 in kernel 6.10 for a HP Proliant 360DL gen 7.
You made my day, thank you
Lars from Germany wrote on Dec 1st, 2023:
Thanks a lot! We were finally able to fix our booting problem. Once and for all!
Marcelo from Argentina wrote on Jul 18th, 2023:
It helped me a lot with an old but noble DL380 G7.
My problem was solved.
Thank you, Master!
Marcelo.-
Klaus wrote on Mar 23rd, 2023:
Thanks, you saved my day
Alberto wrote on Mar 13th, 2023:
Thank you so much for your detailed investigation, not an easy bug to hunt. Kudos!
daetsch wrote on Nov 12th, 2022:
Thank you my friend. Well then, Nagios it is :-)
ck from Switzerland wrote on Nov 11th, 2022:
Hi daetsch. I was using SMH in the past and the pain is not new (see The Fight to install HP Management Agents and System Management Homepage). I have ditched SMH completely in the past years and rely on HP ILO for hardware monitoring. Hope this helps.
daetsch wrote on Nov 11th, 2022:
Worked. Thanks a lot for that.
In addition to DL380 with bullseye: since my upgrade the HP System Management Homepage (package hpsmh) isn't showing anything anymore. All I found in the logs so far is that they complain about missing libraries (libnetsnmp.so.30, which doesn't exist anymore in bullseye). Not sure if this is the real reason.
Do you use those services as well and what is your experience?
Hayden .A.N.G from Sheffield, United Kingdom wrote on Aug 4th, 2022:
Brilliant! Had a lot of HP Proliant servers to install and this guide made my job easy. Thank you ever so much!
sophware wrote on Jan 31st, 2022:
This did the trick for a server I'm looking to donate. Thank you for all the work and for writing it up.