Debugging Linux* hibernation issues on servers
Hibernation, also called suspend to disk, is the ultimate power-saving feature: it saves the system context to disk, enters a low-power state, and, once woken up, restores the system context to what it was before hibernation began. This mechanism is similar to suspend to RAM, except that the latter retains the system context in RAM rather than on disk. Of all the system sleep states, hibernation saves the most energy, and it retains context even when the system is powered off. Due to its efficiency and scalability, hibernation is deployed in desktops, laptops, and mobile devices.
We have seen an increased demand for running hibernation on server platforms. Users of certain high-performance graphics workstations need to hibernate their system when they change tasks and to resume their previous workspace quickly upon their return. Cloud users also need hibernation: infrequently used machines in a server farm can be put into hibernation dynamically to save energy, and they must be brought back up quickly if the machines currently running cannot satisfy the desired performance conditions.
The Kernel Power team in the Intel Open Source Technology Center (OTC) has worked on maintaining Linux* suspend (hibernate) for several years, improving its reliability and scalability, and publishing best practices to share the insights from our work. As an example, we published a blog, Best Practices to Debug Linux Suspend/Hibernate Issues, to help users handle common problems.
Although we have fixed many hibernation issues over the years, the fixes primarily target desktop, laptop, and mobile platforms, which are not designed for large-scale scenarios. Running hibernation on servers is typically more difficult than on those platforms, because servers usually have more CPUs, more DRAM, and more sophisticated peripherals. As a result, the hibernation framework must deal with a more complex situation on servers.
In this paper, we describe the overall hibernation framework and corresponding components, using Intel® Xeon® platforms as a model. We illustrate issues triggered when implementing the hibernation process on servers, share the methodology we used to find and troubleshoot hibernation issues, and provide strategies to fix these hibernation issues. We close the paper with suggested ways to continue to improve the Linux hibernation process on servers, with help from both users and developers.
Hibernation process overview
The hibernation process can be roughly summarized in the following diagram.
Three main subsystems are involved in the hibernation process: Memory subsystem, Device driver subsystem, and Multi-Processor subsystem. Servers contain a large number of CPUs, massive capacity of DRAM, and different peripherals when compared to desktop/laptop/mobile platforms. As a result, the server hibernation process is quite complex and the hibernation framework must handle potential race conditions and corner cases properly. To read more about hibernation implementation, see a low-level hibernation bug hunt.
This section describes issues that impact the hibernation process on Intel® Xeon® platforms, divided into the following categories:
- Huge amount of memory
- Sophisticated peripheral devices
- Massive number of CPUs (multi-processors)
Huge amount of memory
Typically, when the hibernation process is launched, the system iterates over all the pages and checks them, one by one, to see if they are used by kernel components. If so, the page is considered a proper candidate, saved in the hibernation snapshot, and finally written to disk. Later, during resume, the snapshot is read from the disk and restored to the original addresses.
In this scenario, with the increased memory capacity on servers, the iteration for all the pages in the system can be very time-consuming. This can result in the following error message, which occurs if the time exceeds the threshold of the watchdog:
NMI watchdog: Watchdog detected hard LOCKUP on cpu 27.
RIP: 0010:memory_bm_find_bit+0xf4/0x100
done (allocated 6590003 pages)
PM: Allocated 26360012 kbytes in 19.89 seconds (1325.28 MB/s)
This issue was triggered on a system with a large amount of DRAM (512 GB), which is unlikely to exist on an ordinary desktop or mobile device. According to the error message, the watchdog was triggered when the hibernation process was trying to allocate the snapshot buffer. The NMI watchdog normally fires on a CPU where local IRQs have been disabled for a long time, so the issue appeared because too many pages were being allocated in a loop while IRQs stayed disabled. The default threshold of the NMI watchdog is 20 seconds. From the log above, we can infer that the allocation process took nearly 20 seconds (on a 2.10 GHz CPU), thus the NMI lockup was triggered.
The fix is simple: feed the watchdog every 100k pages. We chose 100k based on the following estimation: it takes nearly 20 seconds to traverse 6590003 pages on a 2.10 GHz CPU and trigger the watchdog. If the watchdog timeout is set to its minimum of 1 second, a safe interval can be estimated as 6590003/20 ≈ 330k pages. However, to account for platforms running at a lower frequency, we decided to feed the watchdog every 100k pages.
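The estimate can be reproduced with a few lines of arithmetic. The sketch below is written in Python purely for illustration (the kernel change itself is in C). The callback touch_watchdog is a hypothetical stand-in for a kernel helper such as touch_nmi_watchdog(), and the loop only models the idea of feeding the watchdog at a fixed page interval.

```python
# Figures taken from the log in this section.
PAGES_ALLOCATED = 6_590_003
ELAPSED_SECONDS = 20       # "nearly 20 seconds" (19.89 s in the log)
FEED_INTERVAL = 100_000    # pages between watchdog feeds, as chosen in the text


def pages_per_second(pages: int, seconds: float) -> int:
    """Approximate number of pages processed per second."""
    return int(pages / seconds)


def walk_pages(total_pages: int, feed_interval: int, touch_watchdog) -> int:
    """Visit every page and feed the watchdog once per `feed_interval` pages.

    `touch_watchdog` is a hypothetical callback standing in for the kernel's
    watchdog reset. Returns how many times it was called.
    """
    feeds = 0
    for page in range(total_pages):
        # Per-page work (checking and marking the page) would happen here.
        if page % feed_interval == 0:
            touch_watchdog()
            feeds += 1
    return feeds


if __name__ == "__main__":
    rate = pages_per_second(PAGES_ALLOCATED, ELAPSED_SECONDS)
    print(f"about {rate} pages per second; feeding every {FEED_INTERVAL} pages")
```

With these figures the rate is roughly 330k pages per second, so an interval of 100k pages feeds the watchdog about three times per second on the reference machine, which leaves headroom for slower CPUs.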
Sophisticated peripheral devices
Sophisticated peripherals also lead to a high risk of hibernation failure. Here are two examples.
CPU offline failure:
On servers, there are usually many large, scalable peripheral devices, such as network cards or storage disks. Most of them are extremely sensitive to performance and throughput, so they usually rely on a multi-queue mechanism. The multi-queue mechanism is a framework that manages the peripheral’s interrupts gracefully: it divides each peripheral into several virtual sub-devices, each of which owns its own interrupt pool. As a result, there are massive numbers of interrupts in the system, which can cause problems when trying to start the hibernation process.
The hibernation process puts all non-boot CPUs (which are CPUs other than CPU0) offline before creating the snapshot. As a consequence, when each CPU is put offline, its interrupts must be redirected to other CPUs still online, so that the interrupts can be handled. However, as the number of online CPUs decreases, each online CPU must deal with more and more interrupt requests. Because each CPU has an upper limit for IRQ requests, the system finally reports a failure due to insufficient resources and aborts the CPU offline process. This is what the symptom looks like on a system with 32 CPUs and multiple network cards when attempting to start the hibernation process:
CPU 31 disable failed: CPU has 62 vectors assigned and there are only 0 available
According to the error message, the CPU offline process failed when trying to put CPU31 offline. To be more specific, the scenario is described below:
- Hibernation process launched.
- CPU1 to CPU31 should be put offline, one by one.
- CPU31 is the last online non-boot CPU. Putting CPU31 offline requires that all the IRQs bound to CPU31 are migrated to CPU0.
- There are not enough free IRQ slots available on CPU0, thus the IRQ migration failed, which causes the hibernation process to fail.
This process failure is illustrated below:
STEP 1. CPU0, CPU30, and CPU31 are online:
STEP 2. CPU30 is put offline, and IRQ vectors on CPU30 are migrated to CPU0:
STEP 3. Attempt to put CPU31 offline FAILED!
CPU31 could not be put offline, because there are no free IRQ vector slots on CPU0, thus the used IRQ vectors (total 62) on CPU31 could not be assigned to CPU0.
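The three steps above can be modeled with a small simulation. This is only a sketch under stated assumptions: the per-CPU capacity of 124 vectors and the starting count of 62 vectors per CPU are hypothetical numbers chosen so that the outcome matches the logged message, and the "fill the lowest-numbered CPU first" policy is a simplification, not the kernel's actual migration logic.

```python
from typing import Dict


class VectorExhausted(RuntimeError):
    """Raised when an outgoing CPU's vectors do not fit on the remaining CPUs."""


def offline_cpu(cpu: int, used: Dict[int, int], capacity: int) -> None:
    """Take `cpu` offline by migrating its vectors to the other online CPUs.

    `used` maps each online CPU to its number of assigned vectors and is
    updated in place. `capacity` is a hypothetical per-CPU vector limit.
    """
    needed = used[cpu]
    targets = sorted(c for c in used if c != cpu)
    available = sum(capacity - used[c] for c in targets)
    if needed > available:
        raise VectorExhausted(
            f"CPU {cpu} disable failed: CPU has {needed} vectors assigned "
            f"and there are only {available} available"
        )
    # Simplified policy: fill the lowest-numbered CPU first.
    for c in targets:
        moved = min(needed, capacity - used[c])
        used[c] += moved
        needed -= moved
    del used[cpu]


def run_scenario() -> str:
    capacity = 124                     # hypothetical per-CPU limit
    used = {0: 62, 30: 62, 31: 62}     # STEP 1: CPU0, CPU30, CPU31 online
    offline_cpu(30, used, capacity)    # STEP 2: CPU30's vectors move to CPU0
    try:
        offline_cpu(31, used, capacity)  # STEP 3: no free slots left on CPU0
    except VectorExhausted as err:
        return str(err)
    return "all non-boot CPUs offline"


if __name__ == "__main__":
    print(run_scenario())
```

In this model, CPU0 is already full after STEP 2, so the 62 vectors on CPU31 have nowhere to go and the offline request is rejected.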
This issue is not specific to hibernation: it can be reproduced every time with a simple CPU hot-plug operation. After investigation, we found two root causes: the network driver allocated too many IRQ resources, and the kernel's IRQ migration strategy was imperfect. The ideal solution is for the driver to use managed interrupts, because the kernel does not try to migrate managed interrupts when taking a CPU offline. However, it might be hard for an existing driver to switch to that mode.
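A CPU hot-plug reproduction of this kind can be scripted through the sysfs CPU interface, where writing 0 or 1 to /sys/devices/system/cpu/cpuN/online takes a CPU offline or brings it back. The following is a minimal sketch rather than the procedure we used: the helper names are hypothetical, the root parameter exists only so the code can be exercised against a scratch directory, and on a real system it must run as root.

```python
from pathlib import Path
from typing import Iterable, Optional

SYSFS_CPU_ROOT = Path("/sys/devices/system/cpu")


def set_cpu_online(cpu: int, online: bool, root: Path = SYSFS_CPU_ROOT) -> None:
    """Write 1 or 0 to cpuN/online to bring a CPU online or take it offline."""
    (root / f"cpu{cpu}" / "online").write_text("1" if online else "0")


def offline_non_boot_cpus(cpus: Iterable[int],
                          root: Path = SYSFS_CPU_ROOT) -> Optional[int]:
    """Take the given CPUs offline in order.

    Returns the first CPU whose offline request failed, or None if all
    requests succeeded.
    """
    for cpu in cpus:
        try:
            set_cpu_online(cpu, False, root)
        except OSError as err:
            print(f"cpu{cpu}: offline failed: {err}")
            return cpu
    return None
```

For example, offline_non_boot_cpus(range(1, 32)) mirrors what hibernation does on a 32-CPU system. If an offline request is rejected, the kernel log (dmesg) should contain the corresponding "disable failed" message.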
Our testing revealed another solution by exposing a drawback in the kernel: the IRQ vector allocation algorithm followed an aggressive strategy of allocating IRQ vectors on the same CPU, which worsened the IRQ resource shortage during CPU offline. After we reported the drawback to the community, the kernel's IRQ vector allocation algorithm was improved (commit 69cde0004a4b). A useful side effect of that improvement is that IRQ resources are spread across multiple CPUs, which avoids the issue discussed in this section.
Inconsistent device status:
Sometimes the following error is returned when writing the snapshot to the disk:
do_IRQ: 31.151 No irq handler for vector.
This symptom can be described as follows:
An interrupt is triggered and dispatched to CPU31. Its vector number on CPU31 is 151. However, there is no handler installed on CPU31 for vector 151, thus this IRQ will not get acked and will cause an IRQ flood which kills the system. To be more specific, in this case, the 31.151 is an interrupt from the AHCI host controller.
This problem is caused by a kernel bug, which was triggered when IRQ migration happened among CPUs. In the bug, the kernel code forgot to restore the PCI MSI interrupt destination after a CPU offline/online cycle.
The fix is simple: restore the PCI status for the device during the hibernation process (commit e60514bd4485). Otherwise, the status may be lost.
Large numbers of CPUs
One of the most significant features of servers is the astonishing number of CPUs. Nowadays it is common for systems to have more than 100 CPUs, usually comprised of a collection of sockets. Software must handle all CPUs carefully and make sure that there’s no potential risk to the system when multiple CPUs are running the same code in parallel. Software should also be designed with scalable code logic for the increase in CPU numbers.
Large numbers of CPUs increase the difficulty of running the hibernation process. Here’s an example to demonstrate the risk of hibernation failure on servers due to the massive number of CPUs failing to collaborate with each other during the hibernation resume process.
A hibernation stress test carried out on an Intel® Xeon® Processor-based server (codenamed Broadwell) showed that a non-boot CPU would occasionally hang during resume from hibernation. Further investigation showed that the hang occurred when the non-boot CPU was woken up incorrectly and tried to access memory, which triggered an exception due to an illegal virtual address (vaddr) access:
'Unable to handler kernel address of 0xffff8800ba800010...'
Further investigation showed that the direct cause of this exception was that the virtual address 0xffff8800ba800010 did not have a valid virtual-to-physical mapping. However, according to the kernel direct mapping implementation and the kernel usable DRAM map, 0xffff8800ba800010 was a valid virtual address. So, how did this exception happen?
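To see why the address looks valid, recall that a direct-mapped virtual address is simply the physical address plus a fixed base. The sketch below assumes the direct-mapping base used by x86-64 kernels of that era with 4-level paging and without memory-layout randomization (0xffff880000000000). That base is an assumption on our part and differs on other configurations.

```python
# Assumed base of the x86-64 kernel direct mapping (4-level paging, no
# memory-layout randomization). Other kernels and configurations differ.
PAGE_OFFSET = 0xFFFF_8800_0000_0000
PAGE_SIZE = 4096


def direct_map_to_phys(vaddr: int) -> int:
    """Translate a direct-mapped kernel virtual address to a physical address."""
    if vaddr < PAGE_OFFSET:
        raise ValueError("address is below the direct-mapping base")
    return vaddr - PAGE_OFFSET


faulting_vaddr = 0xFFFF8800BA800010
phys = direct_map_to_phys(faulting_vaddr)
print(hex(phys))              # 0xba800010
print(hex(phys % PAGE_SIZE))  # 0x10, the offset within the 4 KB page
```

Under that assumption the faulting address corresponds to physical address 0xba800010, just under 3 GB, which is ordinary usable DRAM on such a server. The address itself was therefore fine, and the problem had to be in the page table that was used to translate it.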
We discovered a race condition during the hibernation resume process. Here is how the problem happened: the page table for the direct mapping is allocated dynamically by kernel_physical_mapping_init in the Linux kernel. This means that the page table mapping for 0xffff8800ba800010 was set up during kernel boot. However, while the boot CPU was writing pages back to their original physical addresses, it overwrote the content of the monitored variable mwait_ptr, which is watched by all non-boot CPUs, and thereby incorrectly woke up one of the non-boot CPUs. Because the page table in use by that non-boot CPU was being modified by the boot CPU and was likely to be invalid, an exception occurred when the woken-up non-boot CPU tried to access memory. The whole process can be illustrated in the following diagram.
This issue can be easily reproduced on a platform with many CPUs, which means that servers are likely to suffer from this race condition. This scenario is not likely to be found on desktop or mobile devices.
After we root-caused the issue and went through several rounds of discussion on the mailing list, it was fixed in the upstream kernel (commit 406f992e4a37).
Debugging the hibernation process
For people who would like to debug the hibernation process on their own, there are many debugging methodologies to track hibernation issues. Two of the most frequently used methods are described below:
Using pm_test mode:
There are several levels of test mode for debugging purposes. As described in the basic debugging section of the kernel power management documentation, you can use different pm_test modes to check at which stage the hibernation process failed. One special test mode, test_resume, is the best way to determine whether a hibernation failure is caused by the firmware (BIOS) or by the kernel itself. Once it is enabled, the hibernation snapshot is created and written to the disk, and a resume is then launched immediately without rebooting the system. This eliminates the possibility that the BIOS has provided inconsistent system information across the hibernation process. If hibernation still fails in test_resume mode, the cause is likely a bug in the kernel hibernation logic. Otherwise, the failure might be due to BIOS issues.
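Both mechanisms are driven through files under /sys/power: pm_test selects the test level (available when the kernel is built with CONFIG_PM_DEBUG), disk selects the hibernation mode (including test_resume), and writing disk to state starts hibernation. The helpers below are a minimal sketch of that sequence. The function names are hypothetical, the root parameter exists only so the code can be tried against a scratch directory, and on a real system the calls need root privileges and will actually hibernate the machine.

```python
from pathlib import Path

SYS_POWER = Path("/sys/power")

# Test levels listed in the kernel's basic PM debugging documentation.
PM_TEST_LEVELS = ("none", "core", "processors", "platform", "devices", "freezer")


def hibernate_in_pm_test_mode(level: str, root: Path = SYS_POWER) -> None:
    """Select a pm_test level, then start a hibernation cycle in that mode."""
    if level not in PM_TEST_LEVELS:
        raise ValueError(f"unknown pm_test level: {level!r}")
    (root / "pm_test").write_text(level)
    (root / "state").write_text("disk")


def hibernate_with_test_resume(root: Path = SYS_POWER) -> None:
    """Write the snapshot to disk, then resume from it without a reboot."""
    (root / "disk").write_text("test_resume")
    (root / "state").write_text("disk")
```

A reasonable order is to start with hibernate_in_pm_test_mode("freezer") and move toward "core". Once all levels pass, hibernate_with_test_resume() helps separate kernel bugs from firmware issues.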
We recommend collecting as many logs as possible during hibernation debugging, including but not limited to serial port logs (which are especially important for servers) and dmesg. We also recommend that you append the following to the kernel command line:
no_console_suspend ignore_loglevel earlyprintk=serial[,ttySn[,baudrate]]
From the bugs described in the prior sections, it can be inferred that hibernation on servers exposes kernel defects that are not encountered on desktop or mobile devices, mostly because those devices do not stress the kernel to the same degree: servers have more CPUs, more memory, and so on.
For users and testers, we encourage you to experiment with hibernation on your servers, especially on systems with a huge number of CPUs and enormous memory capacity. We also recommend that you run some corner-case verification tests, such as filling as much of the memory as possible, or loading a driver that declares a massive number of multi-queue interrupts. If you see anything suspicious, please file a bug report in the Linux kernel Bugzilla with Product = Power Management and Component = Hibernation/Suspend.
For driver developers, the hibernation process can be easily broken by minor changes. It would be great if you could verify that your driver works correctly across hibernation. Please feel free to share your own methods for improving hibernation stability.
 Commit 69cde0004a4b (“x86/vector: Use matrix allocator for vector assignment”)
 Commit e60514bd4485 (“PCI/PM: Restore the status of PCI devices across hibernation”)
 Commit 406f992e4a37 (“x86 / hibernate: Use hlt_play_dead() when resuming from hibernation”)