One of our customers ran into a strange issue when upgrading their kernel on CentOS 7. The kernel update appeared to complete without errors; only after a reboot did they notice that the running kernel version had not changed.
First we checked the installed kernel versions on the system using yum:
[root@dedi1 ]# yum list installed | grep kernel
kernel.x86_64 3.10.0-1160.49.1.el7 @updates
kernel.x86_64 3.10.0-1160.62.1.el7 @updates
kernel.x86_64 3.10.0-1160.66.1.el7 @updates
kernel.x86_64 3.10.0-1160.76.1.el7 @updates
kernel.x86_64 3.10.0-1160.81.1.el7 @updates
kernel-tools.x86_64 3.10.0-1160.81.1.el7 @updates
kernel-tools-libs.x86_64 3.10.0-1160.81.1.el7 @updates
The latest installed kernel version was 3.10.0-1160.81.1.el7, yet when we confirmed the running kernel with uname we got something completely different:
[root@dedi1 ]# uname -r
3.10.0-1160.49.1.el7.x86_64
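As a quick sanity check, the most recently installed kernel package can be compared against the running kernel in one go. These are generic commands rather than output captured from the server above:
rpm -q --last kernel | head -n 1
uname -r
If the two versions still differ after a reboot, the bootloader is loading an older kernel than the one yum installed.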
Our first impression was that the grub.cfg file was not being regenerated after the kernel update, so the old kernel kept loading on boot.
Our first attempted fix was to reinstall the latest kernel package, forcing grub.cfg to be regenerated. To do this we ran the following:
[root@dedi1 ]# yum reinstall kernel-3.10.0-1160.81.1.el7
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* centos-qemu-ev: uk.mirrors.clouvider.net
* epel: lon.mirror.rackspace.com
* soluslabs: repo-usa-dallas.solusvm.com
* solusvm-2-software-base: repo-france-1.solusvm.com
* solusvm-2-stack-base: repo-france-1.solusvm.com
ookla_speedtest-cli/x86_64/signature | 833 B 00:00:00
ookla_speedtest-cli/x86_64/signature | 1.8 kB 00:00:00 !!!
ookla_speedtest-cli-source/signature | 833 B 00:00:00
ookla_speedtest-cli-source/signature | 951 B 00:00:00 !!!
Resolving Dependencies
--> Running transaction check
---> Package kernel.x86_64 0:3.10.0-1160.81.1.el7 will be installed
--> Finished Dependency Resolution
Dependencies Resolved
==================================================================================================================================================
Package Arch Version Repository Size
==================================================================================================================================================
Reinstalling:
kernel x86_64 3.10.0-1160.81.1.el7 updates 52 M
Transaction Summary
==================================================================================================================================================
Reinstall 1 Package
Total download size: 52 M
Installed size: 66 M
Is this ok [y/d/N]: y
Downloading packages:
Delta RPMs disabled because /usr/bin/applydeltarpm not installed.
kernel-3.10.0-1160.81.1.el7.x86_64.rpm | 52 MB 00:00:00
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction
Installing : kernel-3.10.0-1160.81.1.el7.x86_64 1/1
Verifying : kernel-3.10.0-1160.81.1.el7.x86_64 1/1
Installed:
kernel.x86_64 0:3.10.0-1160.81.1.el7
Complete!
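Reinstalling the kernel package is one way of triggering a rebuild of grub.cfg; on a UEFI CentOS 7 install the same thing can be done by hand. The path below is the standard CentOS EFI location, so adjust it if your layout differs:
grub2-mkconfig -o /boot/efi/EFI/centos/grub.cfg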
After reinstalling the kernel, the grub.cfg file was checked to ensure that the kernel version entries were in the correct order:
[root@dedi1 ]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' /etc/grub2.cfg
0 : CentOS Linux (3.10.0-1160.81.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-1160.76.1.el7.x86_64) 7 (Core)
2 : CentOS Linux (3.10.0-1160.66.1.el7.x86_64) 7 (Core)
3 : CentOS Linux (3.10.0-1160.62.1.el7.x86_64) 7 (Core)
4 : CentOS Linux (3.10.0-1160.49.1.el7.x86_64) 7 (Core)
5 : CentOS Linux (0-rescue-cab9605edaa5484da7c2f02b8fd10762) 7 (Core)
Everything looked fine. The latest kernel version appears at the top of the list, so it should be the version that loads on boot (not the older one). So again we attempted another reboot.
When the server came back online the same issue appeared. The running kernel version was still the older one:
[root@dedi1 ]# uname -r
3.10.0-1160.49.1.el7.x86_64
At this stage we were confused as to why the wrong kernel was loading despite having checked the grub.cfg file and confirmed it was correct.
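One thing worth ruling out at this point (a generic check, not part of the original session) is GRUB's saved default entry being pinned to an old kernel, since CentOS 7 boots the saved entry rather than simply the first one in the menu:
grep GRUB_DEFAULT /etc/default/grub
grub2-editenv list
With GRUB_DEFAULT=saved and saved_entry unset or pointing at the new kernel, the first menu entry shown by the earlier awk command is the one that should boot.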
We wanted to confirm that what we could see in the grub menu aligned with our previous checks of grub.cfg using awk. IPMI / KVMoIP was enabled on the server at a datacenter level so we could monitor the boot sequence of the server.
After connecting to the IPMI console another reboot of the server was triggered. We could see the server go through its standard boot process.
One important note is that when this server boots it uses something called rEFInd. If you do not know what rEFInd is, in basic terms it's a boot manager for UEFI. It is capable of scanning for GRUB configs and displaying them as options to boot from within its own boot menu.
rEFInd > GRUB > OS Boot
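For reference, which entry rEFInd picks by default and how long it waits before booting are controlled by its refind.conf, normally found under EFI/refind/ on whichever EFI partition rEFInd was installed to. A hypothetical excerpt (the default_selection string must match part of the menu entry title shown on your own system):
timeout 5
default_selection "grub"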
Within the rEFInd menu we could see two different grub.cfg files. By default rEFInd boots the first selection (similar to GRUB), so we selected this as the grub file to boot from.
To our bemusement the grub menu which appeared did not show the correct kernel versions, but rather the old kernel versions which our customer was complaining about.
We rebooted the system again, this time selecting the secondary grub.cfg file from the rEFInd boot menu. This time the kernel order in grub appeared exactly as it should: the kernel boot order matched what we had seen when running the awk command during our initial diagnostics.
So what was going on?
The server was rebooted and allowed to load without any manual selection of grub.cfg taking place within rEFInd. Obviously this left us on the wrong kernel, but that was not an issue for the sake of further troubleshooting.
The rEFInd documentation states that it will look for grub.cfg files on boot partitions, so we checked the currently mounted partitions with df -h:
[root@dedi1 ~]# df -h
Filesystem Size Used Avail Use% Mounted on
devtmpfs 16G 0 16G 0% /dev
tmpfs 16G 0 16G 0% /dev/shm
tmpfs 16G 34M 16G 1% /run
tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/md2 48G 16G 30G 35% /
/dev/nvme1n1p1 511M 3.6M 508M 1% /boot/efi
tmpfs 3.1G 0 3.1G 0% /run/user/1003
tmpfs 3.1G 0 3.1G 0% /run/user/0
You can see that the active boot partition (and, since it loads the wrong grub.cfg, evidently the incorrect one) is:
/dev/nvme1n1p1
As this was already mounted at /boot/efi, the only thing we needed to do was replace the grub.cfg here with the one from the other, unmounted boot partition which contained the correct entries.
The easiest way to find that partition is to compare the partition layout of the two disks with fdisk -l; you should see a matching EFI System partition on each disk.
Begin by running this to determine the names of the main disks on the server:
[root@dedi1 ~]# lsblk -dp | grep -o '^/dev[^ ]*'
/dev/nvme0n1
/dev/nvme1n1
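Alternatively, lsblk can list the partitions, sizes, filesystems and mount points for both disks in one view, which makes the two EFI System partitions easy to spot (a generic command, output omitted):
lsblk -o NAME,SIZE,FSTYPE,MOUNTPOINT /dev/nvme0n1 /dev/nvme1n1
The two ~511M vfat partitions are the EFI System partitions; only one of them will show a mount point (/boot/efi).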
Then run fdisk -l against each of the individual disks via SSH:
[root@dedi1 ~]# fdisk -l /dev/nvme0n1
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
Disk /dev/nvme0n1: 450.1 GB, 450098159616 bytes, 879097968 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: FBC6F4EC-C7F0-4198-B258-F98490ADB481
# Start End Size Type Name
1 2048 1048575 511M EFI System primary
2 1048576 101711871 48G Linux RAID primary
3 101711872 118489087 8G Linux filesyste primary
4 118489088 879097934 362.7G Linux RAID
[root@dedi1 ~]# fdisk -l /dev/nvme1n1
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.
Disk /dev/nvme1n1: 450.1 GB, 450098159616 bytes, 879097968 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt
Disk identifier: DDC6CE3A-4B13-4D0C-BF72-E2E9B9686F58
# Start End Size Type Name
1 2048 1048575 511M EFI System primary
2 1048576 101711871 48G Linux RAID primary
3 101711872 118489087 8G Linux filesyste primary
4 118489088 879097934 362.7G Linux RAID
You can see from the terminal output that on both of the disks attached to the server the EFI boot partition is partition number 1. Appending p1 to the disk name gives the boot partition for disk 1, /dev/nvme0n1p1, and the boot partition for disk 2, /dev/nvme1n1p1.
As /dev/nvme1n1p1 is already mounted, the only one we need to mount is /dev/nvme0n1p1. To do this we create a mount directory:
[root@dedi1 ]# mkdir /tmp/boot-part-test
Then we mount it:
[root@dedi1 ]# mount /dev/nvme0n1p1 /tmp/boot-part-test/
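If you are unsure where grub.cfg lives on the newly mounted partition, a quick find will locate it; on this particular server it sat under efi/EFI/centos/:
find /tmp/boot-part-test -name grub.cfg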
Now we can check the entries inside its grub.cfg file.
First change to that directory using:
[root@dedi1 ]# cd /tmp/boot-part-test/efi/EFI/centos/
Now scan the grub.cfg file with the following command to ensure it contains the correct entries:
[root@dedi1 centos]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' grub.cfg
0 : CentOS Linux (3.10.0-1160.81.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-1160.62.1.el7.x86_64) 7 (Core)
2 : CentOS Linux (3.10.0-1160.66.1.el7.x86_64) 7 (Core)
3 : CentOS Linux (3.10.0-1160.76.1.el7.x86_64) 7 (Core)
4 : CentOS Linux (3.10.0-1160.49.1.el7.x86_64) 7 (Core)
5 : CentOS Linux (0-rescue-cab9605edaa5484da7c2f02b8fd10762) 7 (Core)
Before copying over and replacing the grub.cfg file inside the /boot/efi directory, you should confirm that the kernels listed in the currently active file are indeed incorrect:
[root@dedi1 centos]# awk -F\' '$1=="menuentry " {print i++ " : " $2}' /boot/efi/EFI/centos/grub.cfg
0 : CentOS Linux (3.10.0-1160.49.1.el7.x86_64) 7 (Core)
1 : CentOS Linux (3.10.0-1127.el7.x86_64) 7 (Core)
2 : CentOS Linux (0-rescue-cab9605edaa5484da7c2f02b8fd10762) 7 (Core)
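To see exactly how the two files differ, and to keep a copy of the active file before it is overwritten (the .bak suffix is just an example), something like the following would do:
diff /tmp/boot-part-test/efi/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg
cp -a /boot/efi/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg.bak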
If the active grub.cfg file is incorrect, replace it with the one from the mounted partition:
cp -a /tmp/boot-part-test/efi/EFI/centos/grub.cfg /boot/efi/EFI/centos/grub.cfg
Now the EFI partitions on both disks hold the same grub.cfg file, so the correct kernel will be loaded regardless of which partition rEFInd chooses to use.
Before rebooting, do not forget to unmount the partition:
[root@dedi1 ]# umount /tmp/boot-part-test
The ideal solution to this problem is to keep the boot partition contents consistent across both disks; the quickest (if somewhat dirty) way to achieve that is the manual copy described above.
If you are looking for a more permanent solution, the best approach would be to mirror the EFI partition across both disks with RAID 1 so it stays in sync automatically. This would also provide better protection in case of disk failure.
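A rough sketch of that approach, assuming the two EFI partitions can be re-created as a small mdadm RAID 1 with 1.0 metadata (the superblock sits at the end of the device, so the firmware still sees a plain FAT filesystem). This destroys the current contents of both partitions, so take a full copy of /boot/efi first; the device names are the ones from this server:
umount /boot/efi
mdadm --create /dev/md/esp --level=1 --raid-devices=2 --metadata=1.0 /dev/nvme0n1p1 /dev/nvme1n1p1
mkfs.vfat -F32 /dev/md/esp
mount /dev/md/esp /boot/efi
# restore the backed-up /boot/efi contents here, then update /etc/fstab and /etc/mdadm.conf to reference the new array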