
EFI Boot Launching Wrong Kernel Version On CentOS 7

Created On: 23 January 2023
Written by: Ben

Background

One of our customers had a strange issue when upgrading their kernel on CentOS 7. The update of the kernel appeared to be working fine and was showing no errors. Only after boot were they noticing that the kernel version was not changing.

Initial Troubleshooting

First we checked the installed kernel version on the system using yum:

[root@dedi1 ]# yum list installed | grep kernel
kernel.x86_64                          3.10.0-1160.49.1.el7           @updates  
kernel.x86_64                          3.10.0-1160.62.1.el7           @updates  
kernel.x86_64                          3.10.0-1160.66.1.el7           @updates  
kernel.x86_64                          3.10.0-1160.76.1.el7           @updates  
kernel.x86_64                          3.10.0-1160.81.1.el7           @updates  
kernel-tools.x86_64                    3.10.0-1160.81.1.el7           @updates  
kernel-tools-libs.x86_64               3.10.0-1160.81.1.el7           @updates

The latest kernel version installed was 3.10.0-1160.81.1.el7, yet when confirming the running kernel with uname we got something completely different:

[root@dedi1 ]# uname -r
3.10.0-1160.49.1.el7.x86_64
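
As an aside, rpm can cross-check the same information ordered by install date, which makes the newest installed kernel easy to spot:

rpm -q --last kernel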

Our first impression was that the grub.cfg file was not being updated after the kernel update, forcing an older kernel to load on boot.
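
For reference (not something we ran at this point), the config can also be regenerated by hand; on an EFI install of CentOS 7 the active copy normally lives on the EFI System Partition:

grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg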

Our first attempted fix was to reinstall the kernel package, forcing grub to regenerate its config. To do this we ran the following:

[root@dedi1 ]# yum reinstall kernel-3.10.0-1160.81.1.el7
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * centos-qemu-ev: uk.mirrors.clouvider.net
 * epel: lon.mirror.rackspace.com
 * soluslabs: repo-usa-dallas.solusvm.com
 * solusvm-2-software-base: repo-france-1.solusvm.com
 * solusvm-2-stack-base: repo-france-1.solusvm.com
ookla_speedtest-cli/x86_64/signature                                                                                       |  833 B  00:00:00     
ookla_speedtest-cli/x86_64/signature                                                                                       | 1.8 kB  00:00:00 !!! 
ookla_speedtest-cli-source/signature                                                                                       |  833 B  00:00:00     
ookla_speedtest-cli-source/signature                                                                                       |  951 B  00:00:00 !!! 
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:3.10.0-1160.81.1.el7 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

==================================================================================================================================================
 Package                        Arch                           Version                                      Repository                       Size
==================================================================================================================================================
Reinstalling:
 kernel                         x86_64                         3.10.0-1160.81.1.el7                         updates                          52 M

Transaction Summary
==================================================================================================================================================
Reinstall  1 Package

Total download size: 52 M
Installed size: 66 M
Is this ok [y/d/N]: y
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
kernel-3.10.0-1160.81.1.el7.x86_64.rpm                                                                                     |  52 MB  00:00:00     
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
  Installing : kernel-3.10.0-1160.81.1.el7.x86_64                                                                                             1/1 
  Verifying  : kernel-3.10.0-1160.81.1.el7.x86_64                                                                                             1/1 

Installed:
  kernel.x86_64 0:3.10.0-1160.81.1.el7                                                                                                            

Complete!

After re-installing the kernel the grub.cfg file was checked to ensure that kernel version entries were in the correct order:

[root@dedi1 ]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
0 : CentOS Linux (3.10.0-1160.81.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-1160.76.1.el7.x86_64) 7 (Core)
2 : CentOS Linux (3.10.0-1160.66.1.el7.x86_64) 7 (Core)
3 : CentOS Linux (3.10.0-1160.62.1.el7.x86_64) 7 (Core)
4 : CentOS Linux (3.10.0-1160.49.1.el7.x86_64) 7 (Core)
5 : CentOS Linux (0-rescue-cab9605edaa5484da7c2f02b8fd10762) 7 (Core)

Everything looked fine. The latest kernel version appears at the top of the list, so it should be the version which loads on boot (not the older version). So again we attempted another reboot.

When the server came back online we found that the same issue appeared. The running kernel version was still the older version:

[root@dedi1 ]# uname -r 
3.10.0-1160.49.1.el7.x86_64

At this stage we were confused as to why the wrong kernel was loading despite having checked the grub.cfg file and confirmed it was correct.
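
One thing worth ruling out at a stage like this (shown for completeness; it was not the cause here) is a stale saved default entry: when GRUB_DEFAULT=saved is in use, an old saved_entry in grubenv can pin the boot to an older kernel. Both are quick to check:

grub2-editenv list
grep GRUB_DEFAULT /etc/default/grub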

Monitoring Boot

We wanted to confirm that what we could see in the grub menu aligned with our previous checks of grub.cfg using awk. IPMI / KVMoIP was enabled on the server at a datacenter level so we could monitor the boot sequence of the server.

After connecting to the IPMI console another reboot of the server was triggered. We could see the server go through its standard boot process.

One important note is that when this server boots it uses something called rEFInd. If you do not know what rEFInd is, in basic terms it's a boot manager for UEFI. It's capable of scanning for GRUB configs and displaying them as options to boot from within its own boot menu.

rEFInd > GRUB > OS Boot
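
For context, rEFInd's scanning behaviour is driven by refind.conf on the EFI System Partition (typically under EFI/refind/). The excerpt below is purely illustrative of the options involved, not this server's actual config:

# Which locations rEFInd scans for boot loaders and GRUB configs
scanfor internal,external,optical,manual
# Which menu entry is pre-selected (1 = the first entry)
default_selection 1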

Within the rEFInd menu we could see that two different grub.cfg files were present. By default rEFInd will use the first selection (similar to GRUB), so we selected this as the grub file to boot from.

To our bemusement the grub menu which appeared did not show the correct kernel versions, but rather the old kernel versions which our customer was complaining about.

We rebooted the system again, this time selecting the secondary grub.cfg file from the rEFInd boot menu. This time the kernel order in grub appeared exactly as it should have: the kernel boot order aligned with what was seen when running the awk command during our initial diagnostics.

So what was going on?

Finding The Cause

The server was rebooted and allowed to load up without any manual selection of grub.cfg taking place within rEFInd. Obviously we would be on the wrong kernel, but this was not an issue for the sake of further troubleshooting.

Reading through the rEFInd documentation, we found that it looks for grub.cfg files on boot partitions. So we checked the currently mounted partitions with df -h:

[root@dedi1 ~]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G   34M   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/md2         48G   16G   30G  35% /
/dev/nvme1n1p1  511M  3.6M  508M   1% /boot/efi
tmpfs           3.1G     0  3.1G   0% /run/user/1003
tmpfs           3.1G     0  3.1G   0% /run/user/0

You can see that the active boot partition (evidently the wrong one, as it holds the stale grub.cfg) is:

/dev/nvme1n1p1
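
An optional extra check: findmnt confirms which device backs that mount point:

findmnt /boot/efi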

As this was already mounted to /boot/efi, the only thing we needed to do was replace the grub.cfg here with the one on the other, non-mounted boot partition, which contained the correct entries.

The easiest way to find this is to look for a pattern between disks when running fdisk -l. For example, you should see two different partitions on two separate disks which are (a shortcut using lsblk and blkid is sketched after this list):

  1. The same size
  2. The same partition type (EFI System)
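
As a shortcut, lsblk and blkid can surface the same information in one go, assuming the EFI System Partitions are FAT-formatted as is standard:

lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT
blkid | grep vfat

The manual fdisk route we used is shown below.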

Begin by running this to determine the names of the main disks on the server:

[root@dedi1 ~]# lsblk -dp | grep -o '^/dev[^ ]*'
/dev/nvme0n1
/dev/nvme1n1

Then run fdisk -l against each of the individual disks via SSH:

[root@dedi1 ~]# fdisk -l /dev/nvme0n1
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/nvme0n1: 450.1 GB, 450098159616 bytes, 879097968 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: FBC6F4EC-C7F0-4198-B258-F98490ADB481


#         Start          End    Size  Type            Name
 1         2048      1048575    511M  EFI System      primary
 2      1048576    101711871     48G  Linux RAID      primary
 3    101711872    118489087      8G  Linux filesyste primary
 4    118489088    879097934  362.7G  Linux RAID      
[root@dedi1 ~]# fdisk -l /dev/nvme1n1
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/nvme1n1: 450.1 GB, 450098159616 bytes, 879097968 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: DDC6CE3A-4B13-4D0C-BF72-E2E9B9686F58


#         Start          End    Size  Type            Name
 1         2048      1048575    511M  EFI System      primary
 2      1048576    101711871     48G  Linux RAID      primary
 3    101711872    118489087      8G  Linux filesyste primary
 4    118489088    879097934  362.7G  Linux RAID

You can see from the terminal extract that on both of the disks attached to the server, the EFI boot partition is number 1. This would make the boot partition for disk 1 (simply append p1 to the disk name):

  • /dev/nvme0n1p1

And the boot partition for disk 2:

  • /dev/nvme1n1p1

The Fix

As /dev/nvme1n1p1 is already mounted, the only one we need to mount is the other (/dev/nvme0n1p1). To do this we create a mount directory:

[root@dedi1 ]# mkdir /tmp/boot-part-test

Then we mount it:

[root@dedi1 ]# mount /dev/nvme0n1p1 /tmp/boot-part-test/

Now we can check the entries inside its grub.cfg file.

First change to that directory using:

[root@dedi1 ]# cd /tmp/boot-part-test/EFI/centos/

Now scan the grub.cfg file with the following command to ensure it contains the correct entries:

[root@dedi1 centos]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' grub.cfg
0 : CentOS Linux (3.10.0-1160.81.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-1160.62.1.el7.x86_64) 7 (Core)
2 : CentOS Linux (3.10.0-1160.66.1.el7.x86_64) 7 (Core)
3 : CentOS Linux (3.10.0-1160.76.1.el7.x86_64) 7 (Core)
4 : CentOS Linux (3.10.0-1160.49.1.el7.x86_64) 7 (Core)
5 : CentOS Linux (0-rescue-cab9605edaa5484da7c2f02b8fd10762) 7 (Core)

Before copying over the active grub.cfg file under /boot/efi, you should confirm that the kernels listed in that file are indeed incorrect:

[root@dedi1 centos]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' /boot/efi/EFI/centos/grub.cfg
0 : CentOS Linux (3.10.0-1160.49.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-1127.el7.x86_64) 7 (Core)
2 : CentOS Linux (0-rescue-cab9605edaa5484da7c2f02b8fd10762) 7 (Core)
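
As an optional extra check, diff the two files so the mismatch is explicit (empty output would mean the files already match and no copy is needed):

diff /tmp/boot-part-test/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg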

If the active grub.cfg file is incorrect, replace it with the one from the newly mounted partition.
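
As a precaution (our suggestion, not part of the original session), first keep a backup of the active file; the backup location here is arbitrary:

cp -a /boot/efi/EFI/centos/grub.cfg /root/grub.cfg.bak

Then copy the correct file across: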

cp -a /tmp/boot-part-test/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg

Once copied, both EFI partitions on both disks contain the same grub.cfg file, so the correct kernel will be loaded regardless of which partition rEFInd chooses to boot from.

Before rebooting, do not forget to unmount the partition:

[root@dedi1 ]# umount /tmp/boot-part-test 
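
After the reboot you can confirm the fix took effect; uname should now report the latest installed version (3.10.0-1160.81.1.el7.x86_64 in this case):

uname -r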

A Better Solution

The ideal solution to this problem is to have a boot partition that stays consistent across both disks. The dirtiest and easiest way to achieve this was the manual copy described above.

If you are looking for a more permanent solution, the best way would be to mirror the EFI partition with software RAID1 (md) so it stays in sync across both disks. This would also provide better protection in the case of disk failure.
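
A minimal sketch of that approach using this server's device names. This is illustrative only: creating the array is destructive, so the EFI files must be backed up first and restored afterwards, and /etc/fstab updated to mount the md device at /boot/efi. Metadata version 1.0 places the RAID superblock at the end of the partition, so the firmware still sees a plain FAT filesystem:

# Back up the contents of both EFI partitions before running these commands.
mdadm --create /dev/md/esp --level=1 --raid-devices=2 --metadata=1.0 /dev/nvme0n1p1 /dev/nvme1n1p1
mkfs.vfat /dev/md/esp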
