AWS EC2 instances not booting after Ubuntu distribution upgrade (grub install to wrong NVMe device)


Published on - Listed in AWS Cloud Linux


Ubuntu distribution upgrades, for example from 18.04 Bionic to 20.04 Focal, are usually fairly painless. Sure, a couple of major software upgrades sometimes require configuration adjustments, but in general the upgraded system should boot.

EC2 not booting after dist-upgrade

On AWS, however, the story is a little different. Over the past few weeks, a number of EC2 instances needed to be upgraded from Ubuntu 18.04 to 20.04 - yet three out of four machines did not come back up after the final reboot! The instance screenshot showed that the machine had landed in the GRUB rescue shell:

[Screenshot: EC2 instance not booting after Ubuntu distribution upgrade]

Because all of the affected EC2 instances were Kubernetes cluster nodes, it was faster to deploy a new EC2 instance and join it to the Kubernetes cluster than to try to fix the upgraded instance. And yes, connecting to an EC2 console is still very annoying and requires a relatively large amount of effort compared to other cloud providers (e.g. UpCloud).

GRUB failed to install

During the latest dist-upgrade of yet another EC2 instance, something different happened towards the end of the upgrade: GRUB asked which disk it should be installed on. It offered two disks: /dev/nvme0n1 and /dev/nvme1n1.

This instance and all the previously upgraded EC2 instances share the same configuration: a primary EBS disk and a secondary EBS disk, the latter dedicated to containers and mounted on /var/lib/docker.

Of course I went with the default choice, to install GRUB on the first disk, /dev/nvme0n1. But - big surprise - that didn't work:

[Screenshot: GRUB install failed on EC2 instance]

With the dist-upgrade done and the machine still up, the NVMe devices can be checked:

root@ubuntu:~# ll /dev/nvme*
crw------- 1 root root 246, 0 Nov 11 16:06 /dev/nvme0
brw-rw---- 1 root disk 259, 1 Nov 11 16:06 /dev/nvme0n1
crw------- 1 root root 246, 1 Nov 11 16:06 /dev/nvme1
brw-rw---- 1 root disk 259, 0 Nov 11 16:06 /dev/nvme1n1
brw-rw---- 1 root disk 259, 2 Nov 11 16:06 /dev/nvme1n1p1

So here we have the two NVMe drives. But which one is which?

root@ubuntu:~# fdisk -l /dev/nvme0n1
Disk /dev/nvme0n1: 50 GiB, 53687091200 bytes, 104857600 sectors
Disk model: Amazon Elastic Block Store              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Oh? There is no partition table on this drive, let alone a boot partition. And it's a 50 GB drive - which means this is the secondary EBS disk!

Let's check what this machine considers the second NVMe drive:

root@ubuntu:~# fdisk -l /dev/nvme1n1
Disk /dev/nvme1n1: 30 GiB, 32212254720 bytes, 62914560 sectors
Disk model: Amazon Elastic Block Store              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x11c9238f

Device         Boot Start      End  Sectors Size Id Type
/dev/nvme1n1p1 *     2048 62914526 62912479  30G 83 Linux

Here we go; this is our primary EBS disk of 30 GB with the boot partition.
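Disk size settles the question here only because the two volumes differ in size. Another way to tell EBS volumes apart is the NVMe serial number, which on EBS-backed NVMe devices carries the volume ID without the hyphen. A minimal sketch, assuming lsblk reports the SERIAL column on this system; serial_to_volume_id and list_ebs_volumes are hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helper: turn an NVMe serial such as "vol0123456789abcdef0"
# into the EBS volume ID form "vol-0123456789abcdef0".
# A serial that already contains the hyphen is left unchanged.
serial_to_volume_id() {
    printf '%s\n' "$1" | sed -E 's/^vol-?/vol-/'
}

# Print one line per disk: device path, size, EBS volume ID.
# Assumption: lsblk fills in SERIAL for the EBS NVMe devices.
list_ebs_volumes() {
    lsblk -dn -o NAME,SIZE,SERIAL | while read -r name size serial; do
        printf '/dev/%s %s %s\n' "$name" "$size" "$(serial_to_volume_id "$serial")"
    done
}
```

Calling list_ebs_volumes on the instance and comparing the volume IDs with the attachments shown in the EC2 console should identify the root volume without relying on device order.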

So the actual boot device is /dev/nvme1n1, not /dev/nvme0n1! It turns out the EC2 instance enumerated the drives in reverse order - which causes an automated dist-upgrade to install GRUB on the wrong device.
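Rather than trusting the device order, the target disk can be derived from whatever is mounted on /. A minimal sketch, assuming findmnt (util-linux) is available and that the root filesystem sits directly on a partition (no LVM or RAID in between); disk_of_partition and root_disk are hypothetical helper names:

```shell
#!/bin/sh
# Hypothetical helper: derive the parent disk from a partition path
# purely by its name.
disk_of_partition() {
    case "$1" in
        /dev/nvme*p[0-9]*)
            # NVMe partition (e.g. /dev/nvme1n1p1): strip the "p<number>" suffix
            printf '%s\n' "$1" | sed -E 's/p[0-9]+$//' ;;
        /dev/nvme*)
            # Already a whole NVMe disk (e.g. /dev/nvme1n1): keep as is
            printf '%s\n' "$1" ;;
        *)
            # sdX / xvdX style partition (e.g. /dev/xvda1): strip trailing digits
            printf '%s\n' "$1" | sed -E 's/[0-9]+$//' ;;
    esac
}

# Print the disk that holds the root filesystem, i.e. the GRUB target.
root_disk() {
    disk_of_partition "$(findmnt -n -o SOURCE /)"
}
```

On the instance shown here, root_disk would be expected to print /dev/nvme1n1. As an alternative to the name-based helper, lsblk -no PKNAME followed by the partition path asks the kernel for the parent device name directly.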

Manually install GRUB on the correct device

To fix this, the GRUB boot loader can be installed manually on the correct (boot) device - which is /dev/nvme1n1 in this situation:

root@ubuntu:~# grub-install /dev/nvme1n1
Installing for i386-pc platform.
Installation finished. No error reported.

The EC2 instance can now be rebooted and the machine comes back up again.
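The manual grub-install repairs the boot loader once, but the grub-pc package keeps its install target in the debconf key grub-pc/install_devices and may reuse that value on later package upgrades. A minimal sketch of persisting the correct target, assuming a BIOS-booted instance using the grub-pc package (which the i386-pc output above suggests); grub_selection is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: build the debconf selection line that tells the
# grub-pc package which device to install GRUB to on future upgrades.
grub_selection() {
    printf 'grub-pc grub-pc/install_devices multiselect %s\n' "$1"
}

# Usage (as root, on the affected instance):
#   debconf-show grub-pc | grep install_devices        # inspect the stored value
#   grub_selection /dev/nvme1n1 | debconf-set-selections
```

Because the NVMe ordering may flip again, a stable path under /dev/disk/by-id/ is probably a safer value than /dev/nvme1n1. Running dpkg-reconfigure grub-pc offers the same choice interactively.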

