Troubleshooting AWS EC2 instance with lost network connectivity (after upgrade to Ubuntu 20.04)

Written by - 1 comments

Published on - Listed in AWS Cloud Linux Network Systemd


A couple of weeks ago, we've experienced one of these strange cases, where you scratch your head more than once.

An EC2 instance which was upgraded from Ubuntu 18.04 to 20.04 a while ago, was all of a sudden marked as DOWN in our monitoring (on a Saturday night, of course!). No recent changes would explain the system going down, that is why we expected an issue with the underlying host - something we've already experienced a couple of times in AWS. Only a forced shut down and a new start of the EC2 instance in the AWS console helps in such a case.

The instance came back up again. Problem solved - we thought. We were wrong.

System down again, time for analysis

One hour after the boot (at 21:45), the monitoring alerted that the very same instance is down - again. Could it be coincidence? Did the EC2 machine start up on another faulty AWS host? It's not impossible, but the chances are slim.

After connecting to the EC2 Serial Console, this system was still running but it indeed lost its IP address:

EC2 instance ip address lost

As we knew that the EC2 instances get their internal IP addresses from the VPC using (reserved) DHCP, we manually launched dhclient and helas - the IP is back:

EC2 instance ip address back after running dhclient

This worked fine as a workaround - the system remained up. But we still needed to find out what actually happened.

By analyzing the system logs, we came across something very interesting at the time our monitoring alerted:

Jan 18 21:44:52 focal avahi-daemon[941]: Withdrawing address record for 10.10.1.41 on eth0.
Jan 18 21:44:52 focal avahi-daemon[941]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.10.1.41.
Jan 18 21:44:52 focal avahi-daemon[941]: Interface eth0.IPv4 no longer relevant for mDNS.

By using last, we can see the exact last system boot time:

ck@focal:~$ last | head
ck       pts/0        84.20.57.136     Tue Jan 18 22:11   still logged in
root     ttyS0                         Tue Jan 18 21:50   still logged in
reboot   system boot  5.4.0-91-generic Tue Jan 18 20:44   still running
[...]

And this perfectly confirmed what we thought: Exactly one hour after the system boot, the network connection was lost.

From the logs, the only hint points to a problem in avahi-daemon. Research led to an Ubuntu bug #1586528, however as most comments point out (correctly), the problem is not avahi, but avahi rather logs the event as a symptom.

Another potential source of the problem is described in Red Hat bug 1093803, which points to a wrong system clock. To make sure we can rule this out, we decided to once again reboot the machine and follow the startup on the EC2 serial console.

Systemd service networking.service fails

While the system was booting, our eyes spotted a service with a big red FAILED during boot time:

EC2 instance boot in console

To our surprise, it's the networking service which seemed to fail:

[FAILED] Failed to start Raise network interfaces.
See 'systemctl status networking.service' for details

Even though the networking service was marked as failed, the instance came up with network just fine. We could access the machine using SSH and continue troubleshooting.

The obvious next point to look at was the Systemd networking.service. Why was it marked as failed?

ck@focal:~$ systemctl status networking.service
- networking.service - Raise network interfaces
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2022-01-19 07:40:25 CET; 4min 55s ago
       Docs: man:interfaces(5)
    Process: 639 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=1/FAILURE)
   Main PID: 639 (code=exited, status=1/FAILURE)

Jan 19 07:40:25 focal dhclient[814]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0xac4f3a09)
Jan 19 07:40:25 focal ifup[828]: RTNETLINK answers: File exists
Jan 19 07:40:25 focal dhclient[814]: bound to 10.10.1.41 -- renewal in 1442 seconds.
Jan 19 07:40:25 focal ifup[814]: bound to 10.10.1.41 -- renewal in 1442 seconds.
Jan 19 07:40:25 focal ifup[870]: /etc/network/if-up.d/ubuntu-fan: 29: /usr/sbin/fanctl: not found
Jan 19 07:40:25 focal ifup[840]: run-parts: /etc/network/if-up.d/ubuntu-fan exited with return code 127

Jan 19 07:40:25 focal ifup[639]: ifup: failed to bring up eth0
Jan 19 07:40:25 focal systemd[1]: networking.service: Main process exited, code=exited, status=1/FAILURE
Jan 19 07:40:25 focal systemd[1]: networking.service: Failed with result 'exit-code'.
Jan 19 07:40:25 focal systemd[1]: Failed to start Raise network interfaces.

From this output we can see the current service status is still marked as failed. The logs below first show that the IP address (10.10.1.41) was successfully received from the DHCP server (10.10.1.1). A DHCP renewal was set for 1442 seconds (~ 24 minutes). That's not a lot, but seems to be standard in AWS' configuration.

But the more interesting follows right below the DHCP/address binding part: The ifup script (/etc/network/if-up.d/ubuntu-fan) failed to execute and claims a missing /usr/sbin/fanctl. The fact that this script didn't execute properly seems to have thrown off the complete networking service and marked it as failed.

ubuntu-fan removed, but not completely

With this additional clue, the if-up script in question was analyzed. From which package is this script actually coming from? Turns out, this script belongs to the ubuntu-fan package:

root@focal:~# dpkg -S /etc/network/if-up.d/ubuntu-fan
ubuntu-fan: /etc/network/if-up.d/ubuntu-fan

Is this package actually installed? We verified and ... no!

root@focal:~# dpkg -l|grep ubuntu-fan
rc  ubuntu-fan    0.9.1    all     Ubuntu FAN network support enablement

The package ubuntu-fan was marked as removed, yet a few configurations (such as the if-up script) still remained. This also explains why the /usr/sbin/fanctl script could not be found anymore.

The fact that the package was marked as "rc" means it was previously installed. That must have come from the initial AMI image we've used in the past. The distribution upgrade to Ubuntu 20.04 seems to have removed the package.

As we didn't need ubuntu-fan on this instance, we completely purged the package:

root@focal:~# apt-get purge ubuntu-fan

And the if-up script having caused the networking.service to fail is now gone, too:

root@focal:~# ls -l /etc/network/if-up.d/ubuntu-fan
ls: cannot access '/etc/network/if-up.d/ubuntu-fan': No such file or directory

Another system boot - DHCP renewal is back!

After this fix, we rebooted this EC2 instance one last time (at 08:57) and watched the services all coming up just fine. And most importantly, the networking.service was correctly started:

ck@focal:~$ systemctl status networking.service
- networking.service - Raise network interfaces
     Loaded: loaded (/lib/systemd/system/networking.service; enabled; vendor preset: enabled)
Active: active (exited) since Wed 2022-01-19 08:57:31 CET; 1min 46s ago
       Docs: man:interfaces(5)
    Process: 647 ExecStart=/sbin/ifup -a --read-environment (code=exited, status=0/SUCCESS)
   Main PID: 647 (code=exited, status=0/SUCCESS)

Jan 19 08:57:30 focal systemd[1]: Starting Raise network interfaces...
Jan 19 08:57:31 focal ifup[647]: ifup: waiting for lock on /run/network/ifstate.eth0
Jan 19 08:57:31 focal systemd[1]: Finished Raise network interfaces.

Now we "just" needed to wait. Will the machine again lose its IP address or will the DHCP renewal work this time. By following the logs over some time, the DHCP renewals now finally worked again:

ck@focal:~$ cat /var/log/syslog | grep dhclient
[...]
Jan 19 08:57:34 focal dhclient[618]: Internet Systems Consortium DHCP Client 4.4.1
Jan 19 08:57:34 focal dhclient[618]: Copyright 2004-2018 Internet Systems Consortium.
Jan 19 08:57:34 focal dhclient[618]: All rights reserved.
Jan 19 08:57:34 focal dhclient[618]: For info, please visit https://www.isc.org/software/dhcp/
Jan 19 08:57:34 focal dhclient[618]:
Jan 19 08:57:34 focal dhclient[618]: Listening on LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[618]: Sending on   LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[618]: Sending on   Socket/fallback
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 8 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 18 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 14 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: DHCPOFFER of 10.10.1.41 from 10.10.1.1
Jan 19 08:57:34 focal dhclient[618]: DHCPREQUEST for 10.10.1.41 on eth0 to 255.255.255.255 port 67 (xid=0x2f0bad84)
Jan 19 08:57:34 focal dhclient[618]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x84ad0b2f)
Jan 19 08:57:34 focal dhclient[618]: bound to 10.10.1.41 -- renewal in 1449 seconds.
Jan 19 08:57:34 focal dhclient[682]: Internet Systems Consortium DHCP Client 4.4.1
Jan 19 08:57:34 focal dhclient[682]: Copyright 2004-2018 Internet Systems Consortium.
Jan 19 08:57:34 focal dhclient[682]: All rights reserved.
Jan 19 08:57:34 focal dhclient[682]: For info, please visit https://www.isc.org/software/dhcp/
Jan 19 08:57:34 focal dhclient[682]:
Jan 19 08:57:34 focal dhclient[682]: Listening on LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[682]: Sending on   LPF/eth0/02:a1:e7:1c:80:c1
Jan 19 08:57:34 focal dhclient[682]: Sending on   Socket/fallback
Jan 19 08:57:34 focal dhclient[682]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3 (xid=0x54b7d922)
Jan 19 08:57:34 focal dhclient[682]: DHCPOFFER of 10.10.1.41 from 10.10.1.1
Jan 19 08:57:34 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 255.255.255.255 port 67 (xid=0x22d9b754)
Jan 19 08:57:34 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 08:57:34 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1362 seconds.
Jan 19 09:20:13 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 10.10.1.1 port 67 (xid=0x22d9b754)
Jan 19 09:20:13 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 09:20:13 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1593 seconds.
Jan 19 09:46:46 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 10.10.1.1 port 67 (xid=0x22d9b754)
Jan 19 09:46:46 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 09:46:46 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1419 seconds.
Jan 19 10:10:25 focal dhclient[682]: DHCPREQUEST for 10.10.1.41 on eth0 to 10.10.1.1 port 67 (xid=0x22d9b754)
Jan 19 10:10:25 focal dhclient[682]: DHCPACK of 10.10.1.41 from 10.10.1.1 (xid=0x54b7d922)
Jan 19 10:10:25 focal dhclient[682]: bound to 10.10.1.41 -- renewal in 1418 seconds.
[...]

Removed ubuntu-fan broke networking.service, which broke DHCP renewal

To sum this all up, the AWS EC2 instance lost its network connectivity because the DHCP lease was not renewed by the Systemd controlled networking.service. As the networking service entered a FAILED state, the service did not trigger any DHCP requests (DHCPREQUEST) anymore to renew the address lease. The failed state was caused by a remaining if-up script from the ubuntu-fan package, which was removed by the Ubuntu distribution upgrade.

Need troubleshooting help on your servers?

Sometimes you just can't spot the problem. This is where external help can be a saver! An additional pair of eyes and different ideas might help you to find the error. We've seen and solved a lot of Linux server issues in the past - that's why Infiniroot might be able to help you and your IT team, too. Contact us for our remote analysis and troubleshooting services.



Add a comment

Show form to leave a comment

Comments (newest first)

Erik Forsberg from Linköping, Sweden wrote on Aug 28th, 2024:

Very accurate description of exactly the same problem I encountered. Many thanks for publishing, this was very nice to read saved me time!


RSS feed

Blog Tags:

  AWS   Android   Ansible   Apache   Apple   Atlassian   BSD   Backup   Bash   Bluecoat   CMS   Chef   Cloud   Coding   Consul   Containers   CouchDB   DB   DNS   Database   Databases   Docker   ELK   Elasticsearch   Filebeat   FreeBSD   Galera   Git   GlusterFS   Grafana   Graphics   HAProxy   HTML   Hacks   Hardware   Icinga   Influx   Internet   Java   KVM   Kibana   Kodi   Kubernetes   LVM   LXC   Linux   Logstash   Mac   Macintosh   Mail   MariaDB   Minio   MongoDB   Monitoring   Multimedia   MySQL   NFS   Nagios   Network   Nginx   OSSEC   OTRS   Office   PGSQL   PHP   Perl   Personal   PostgreSQL   Postgres   PowerDNS   Proxmox   Proxy   Python   Rancher   Rant   Redis   Roundcube   SSL   Samba   Seafile   Security   Shell   SmartOS   Solaris   Surveillance   Systemd   TLS   Tomcat   Ubuntu   Unix   VMWare   VMware   Varnish   Virtualization   Windows   Wireless   Wordpress   Wyse   ZFS   Zoneminder