A couple of weeks ago, we've experienced one of these strange cases, where you scratch your head more than once.
An EC2 instance which was upgraded from Ubuntu 18.04 to 20.04 a while ago, was all of a sudden marked as DOWN in our monitoring (on a Saturday night, of course!). No recent changes would explain the system going down, that is why we expected an issue with the underlying host - something we've already experienced a couple of times in AWS. Only a forced shut down and a new start of the EC2 instance in the AWS console helps in such a case.
The instance came back up again. Problem solved - we thought. We were wrong.
One hour after the boot (at 21:45), the monitoring alerted that the very same instance is down - again. Could it be coincidence? Did the EC2 machine start up on another faulty AWS host? It's not impossible, but the chances are slim.
After connecting to the EC2 Serial Console, this system was still running but it indeed lost its IP address:
As we knew that the EC2 instances get their internal IP addresses from the VPC using (reserved) DHCP, we manually launched dhclient and helas - the IP is back:
This worked fine as a workaround - the system remained up. But we still needed to find out what actually happened.
By analyzing the system logs, we came across something very interesting at the time our monitoring alerted:
Jan 18 21:44:52 focal avahi-daemon[941]: Withdrawing address record for 10.10.1.41 on eth0.
Jan 18 21:44:52 focal avahi-daemon[941]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.10.1.41.
Jan 18 21:44:52 focal avahi-daemon[941]: Interface eth0.IPv4 no longer relevant for mDNS.
By using last, we can see the exact last system boot time:
ck@focal:~$ last | head
ck pts/0 84.20.57.136 Tue Jan 18 22:11 still logged in
root ttyS0 Tue Jan 18 21:50 still logged in
reboot system boot 5.4.0-91-generic Tue Jan 18 20:44 still running
[...]
And this perfectly confirmed what we thought: Exactly one hour after the system boot, the network connection was lost.
From the logs, the only hint points to a problem in avahi-daemon. Research led to an Ubuntu bug #1586528, however as most comments point out (correctly), the problem is not avahi, but avahi rather logs the event as a symptom.
Another potential source of the problem is described in Red Hat bug 1093803, which points to a wrong system clock. To make sure we can rule this out, we decided to once again reboot the machine and follow the startup on the EC2 serial console.
While the system was booting, our eyes spotted a service with a big red FAILED during boot time:
To our surprise, it's the networking service which seemed to fail:
[FAILED] Failed to start Raise network interfaces.
See 'systemctl status networking.service' for details
Even though the networking service was marked as failed, the instance came up with network just fine. We could access the machine using SSH and continue troubleshooting.
The obvious next point to look at was the Systemd networking.service. Why was it marked as failed?
ck@focal:~$ systemctl status networking.service
- networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2022-01-19 07:40:25 CET; 4min 55s ago
Docs: man:interfaces(5)
Process: 639 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
Main PID: 639 (code=exited, status=1/FAILURE)
Jan 19 07:40:25 focal dhclient[814]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0xac4f3a09)
Jan 19 07:40:25 focal ifup[828]: RTNETLINK answers: File exists
Jan 19 07:40:25 focal dhclient[814]: bound to 10.10.1.41 -- renewal in 1442 seconds.
Jan 19 07:40:25 focal ifup[814]: bound to 10.10.1.41 -- renewal in 1442 seconds.
Jan 19 07:40:25 focal ifup[870]: /etc/network/if-up.d/ubuntu-fan: 29: /usr/sbin/fanctl: not found
Jan 19 07:40:25 focal ifup[840]: run-parts: /etc/network/if-up.d/ubuntu-fan exited with return code 127
Jan 19 07:40:25 focal ifup[639]: ifup: failed to bring up eth0
Jan 19 07:40:25 focal systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Jan 19 07:40:25 focal systemd[1]: networking.service: Failed with result 'exit-code'.
Jan 19 07:40:25 focal systemd[1]: Failed to start Raise network interfaces.
From this output we can see the current service status is still marked as failed. The logs below first show that the IP address (10.10.1.41) was successfully received from the DHCP server (10.10.1.1). A DHCP renewal was set for 1442 seconds (~ 24 minutes). That's not a lot, but seems to be standard in AWS' configuration.
But the more interesting follows right below the DHCP/address binding part: The ifup script (/etc/network/if-up.d/ubuntu-fan) failed to execute and claims a missing /usr/sbin/fanctl. The fact that this script didn't execute properly seems to have thrown off the complete networking service and marked it as failed.
With this additional clue, the if-up script in question was analyzed. From which package is this script actually coming from? Turns out, this script belongs to the ubuntu-fan package:
root@focal:~# dpkg -S /etc/network/if-up.d/ubuntu-fan
ubuntu-fan: /etc/network/if-up.d/ubuntu-fan
Is this package actually installed? We verified and ... no!
root@focal:~# dpkg -l|grep ubuntu-fan
rc ubuntu-fan 0.9.1 all Ubuntu FAN network support enablement
The package ubuntu-fan was marked as removed, yet a few configurations (such as the if-up script) still remained. This also explains why the /usr/sbin/fanctl script could not be found anymore.
The fact that the package was marked as "rc" means it was previously installed. That must have come from the initial AMI image we've used in the past. The distribution upgrade to Ubuntu 20.04 seems to have removed the package.
As we didn't need ubuntu-fan on this instance, we completely purged the package:
root@focal:~# apt-get purge ubuntu-fan
And the if-up script having caused the networking.service to fail is now gone, too:
root@focal:~# ls -l /etc/network/if-up.d/ubuntu-fan
ls: cannot access '/etc/network/if-up.d/ubuntu-fan': No such file or directory
After this fix, we rebooted this EC2 instance one last time (at 08:57) and watched the services all coming up just fine. And most importantly, the networking.service was correctly started:
ck@focal:~$ systemctl status networking.service
- networking.service - Raise network interfaces
Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2022-01-19 08:57:31 CET; 1min 46s ago
Docs: man:interfaces(5)
Process: 647 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
Main PID: 647 (code=exited, status=0/SUCCESS)
Jan 19 08:57:30 focal systemd[1]: Starting Raise network interfaces...
Jan 19 08:57:31 focal ifup[647]: ifup: waiting for lock on /run/network/ifstate.eth0
Jan 19 08:57:31 focal systemd[1]: Finished Raise network interfaces.
Now we "just" needed to wait. Will the machine again lose its IP address or will the DHCP renewal work this time. By following the logs over some time, the DHCP renewals now finally worked again:
ck@focal:~$ cat /var/log/syslog | grep dhclient
[...]
Jan 19 08:57:34 focal dhclient[618]: Internet Systems Consortium DHCP Client 4.4.1
Jan 19 08:57:34 focal dhclient[618]: Copyright 2004-2018 Internet Systems Consortium.
Jan 19 08:57:34 focal dhclient[618]: All rights reserved.
Jan 19 08:57:34 focal dhclient[618]: For info, please visit https://www.isc.org/software/dhcp/
Jan 19 08:57:34 focal dhclient[618]:
Jan 19 08:57:34 focal dhclient[618]: Listening on LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[618]: Sending on LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[618]: Sending on Socket/fallback
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 18 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 14 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPOFFER of 10.10.1.41 from 10.10.1.1
Jan 19 08:57:34 focal dhclient[618]: DHCPREQUEST for 10.10.1.41 on eth0 to 255.255.255.255 port 67 (xid=0x2f0bad84)
Jan 19 08:57:34 focal dhclient[618]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: bound to 10.10.1.41 -- renewal in 1449 seconds.
Jan 19 08:57:34 focal dhclient[682]: Internet Systems Consortium DHCP Client 4.4.1
Jan 19 08:57:34 focal dhclient[682]: Copyright 2004-2018 Internet Systems Consortium.
Jan 19 08:57:34 focal dhclient[682]: All rights reserved.
Jan 19 08:57:34 focal dhclient[682]: For info, please visit https://www.isc.org/software/dhcp/
Jan 19 08:57:34 focal dhclient[682]:
Jan 19 08:57:34 focal dhclient[682]: Listening on LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[682]: Sending on LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[682]: Sending on Socket/fallback
Jan 19 08:57:34 focal dhclient[682]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x54b7d922)
Jan 19 08:57:34 focal dhclient[682]: DHCPOFFER of 10.10.1.41 from 10.10.1.1
Jan 19 08:57:34 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 255.255.255.255 port 67 (xid=0x22d9b754)
Jan 19 08:57:34 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 08:57:34 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1362 seconds.
Jan 19 09:20:13 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 10.10.1.1 port 67 (xid=0x22d9b754)
Jan 19 09:20:13 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 09:20:13 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1593 seconds.
Jan 19 09:46:46 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 10.10.1.1 port 67 (xid=0x22d9b754)
Jan 19 09:46:46 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 09:46:46 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1419 seconds.
Jan 19 10:10:25 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 10.10.1.1 port 67 (xid=0x22d9b754)
Jan 19 10:10:25 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 10:10:25 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1418 seconds.
[...]
To sum this all up, the AWS EC2 instance lost its network connectivity because the DHCP lease was not renewed by the Systemd controlled networking.service. As the networking service entered a FAILED state, the service did not trigger any DHCP requests (DHCPREQUEST) anymore to renew the address lease. The failed state was caused by a remaining if-up script from the ubuntu-fan package, which was removed by the Ubuntu distribution upgrade.
Sometimes you just can't spot the problem. This is where external help can be a saver! An additional pair of eyes and different ideas might help you to find the error. We've seen and solved a lot of Linux server issues in the past - that's why Infiniroot might be able to help you and your IT team, too. Contact us for our remote analysis and troubleshooting services.
Erik Forsberg from Linköping, Sweden wrote on Aug 28th, 2024:
Very accurate description of exactly the same problem I encountered. Many thanks for publishing, this was very nice to read saved me time!
AWS Android Ansible Apache Apple Atlassian BSD Backup Bash Bluecoat CMS Chef Cloud Coding Consul Containers CouchDB DB DNS Database Databases Docker ELK Elasticsearch Filebeat FreeBSD Galera Git GlusterFS Grafana Graphics HAProxy HTML Hacks Hardware Icinga Influx Internet Java KVM Kibana Kodi Kubernetes LVM LXC Linux Logstash Mac Macintosh Mail MariaDB Minio MongoDB Monitoring Multimedia MySQL NFS Nagios Network Nginx OSSEC OTRS Observability Office OpenSearch PGSQL PHP Perl Personal PostgreSQL Postgres PowerDNS Proxmox Proxy Python Rancher Rant Redis Roundcube SSL Samba Seafile Security Shell SmartOS Solaris Surveillance Systemd TLS Tomcat Ubuntu Unix VMWare VMware Varnish Virtualization Windows Wireless Wordpress Wyse ZFS Zoneminder