I must admit that I am fascinated by unexpected software behaviors and bugs. They provide comic relief and drama in the sometimes rather dry world of software development. Some bugs are trivial and embarrassing to talk about once they’ve been resolved. Some bugs turn out to be as elusive as the mythical yeti, and are never found. And then some bugs give us heroic narratives in which a courageous programmer slays the dragon through brilliant inspiration and sheer determination. This is the story of one such bug — a very tough bug that we found in the Linux* kernel. We found it by running the Intel® Simics® simulator as a workload inside another Simics simulation session and using its debug facilities. Simply triggering and reproducing the bug required a network with two separate machines, and at least three simulated processor cores. It required true concurrency to trigger, not just multitasking on a single processor.
Background
In order to run Intel® Architecture (IA) targets quickly on IA-based hosts, the Simics simulator makes use of Intel® Virtualization Technology (Intel® VT-x). This requires installing a Linux kernel module known as VMXMON on the host running the simulator. The bug described here appeared after a one-line change to VMXMON. The change fixed a previous problem; it had been reviewed and was considered both correct and necessary. In testing before final release to users, the bug-fixed version was solid and never exhibited any problems.
The Bug Hits
However, when the new version of VMXMON was used in a setup where the Simics simulator was integrated with another simulator, failures started to appear. The combination of the new, modified VMXMON with the same host kernel and the same version of the external simulator as before gave rise to an error that had not occurred previously. The change had, after all, either introduced a new bug or revealed a latent one, which was very surprising given the nature of the change.
The setup that caused the issue was this… almost:
The Simics simulator and the external simulator communicated over a file on the host. This file was mapped into memory by both simulators using the Linux mmap system call. When we used the newer VMXMON, we began seeing regular crashes with certain types of simulator workloads. The error was the same each time: the host Linux operating system stopped on an assert in the network file system (NFS) driver. The assert stated that a virtual memory page was considered both clean and dirty at the same time. This state was fundamentally broken, and there was nothing the kernel could do except assert.
The result of the failure was a crashed host machine, which is obviously not a desired behavior. That the assert happened in NFS made it clear that the setup was more complicated than initially thought. As long as the whole setup was local, the bug did not hit.
To trigger the bug, the shared file had to be stored on a remote NFS server.
The kernel assert provided a stack trace showing where the inconsistent state was detected. However, this had little to do with the root cause of the issue. The real problem was to find the code that created the inconsistent state in the first place. Clearly, this was a perfect case for debugging using the Simics simulator itself, since it required reproducing and analyzing a kernel-level issue involving concurrency.
Reproducing the Bug
To get the bug replicated inside the Simics simulator for analysis, we first simplified the setup by leaving out the complex external simulator software. Fortunately, reproducing the bug turned out to be rather easy. All that was necessary was a program outside of the Simics simulator that opened a file using mmap; no actual simulation work was required. Creating the simplified reproduction setup provided some additional insights into the necessary preconditions for the bug (a sketch of such a reproducer follows the list):
- The NFS server had to be a separate host on the local network.
- There had to be changes happening to the shared file, along with sync requests to make sure changes were written to disk and communicated to the other process.
- The host Linux on the machine running the Simics simulator and VMXMON had to be a specific version. The bug did not trigger on all versions tested; it specifically required the Linux kernel version and distribution used on the test servers.
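The reproducer itself can be as small as the following sketch: a program that maps a file living on an NFS mount, dirties it, and forces it back to the server with msync. The path, mapping size, and loop structure here are illustrative assumptions, not the exact test program we used.

```c
/* Minimal reproducer sketch (illustrative, not the exact test program):
 * repeatedly dirty a page of a file stored on an NFS mount and force it
 * back to the server with msync(MS_SYNC). Path and size are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define FILE_SIZE 4096

int main(void)
{
    /* The file must live on an NFS mount served by a separate host. */
    int fd = open("/mnt/nfs/shared-file", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, FILE_SIZE) < 0) { perror("ftruncate"); return 1; }

    /* MAP_SHARED makes the stores below dirty the page cache pages that
     * back the file, so the kernel and NFS have write-back work to do. */
    unsigned char *buf = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    for (unsigned i = 0; ; i++) {
        buf[0] = (unsigned char)i;            /* dirty the page         */
        if (msync(buf, FILE_SIZE, MS_SYNC))   /* push the change to the */
            perror("msync");                  /* NFS server right away  */
    }
}
```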
This setup reliably reproduced the same kernel stack trace as the original bug, and it was then brought up inside a Simics simulator session. The real-world hardware was replaced with simulated hardware, but the software stack was kept the same – including the precise version of Linux required to reproduce the bug.
Trying to replicate the bug in the simulator, we found that there was one additional prerequisite: the machine the simulators were running on had to have at least two processor cores. Without true concurrency between the software threads involved, the bug would not trigger; just switching between the two programs on a single processor core was not sufficient. At this point, simply running the setup would trigger the bug.
Analyzing the Bug
The bug was analyzed following a simple workflow:
- Start a new Simics simulation with the above setup.
- Boot the simulated systems and set up the shared file and external software.
- Stop and save a checkpoint in the outer Simics simulator (to quickly get back to the interesting point in case we messed up in the following steps).
- Enable reverse execution.
- Start the simulation again, and run the inner Simics simulator.
- Run the test until the kernel crashed, which happened within a short period of time.
At this point, the Simics simulator’s built-in debugger was activated. Debugging this on a physical machine would have been virtually impossible, since the kernel crashed as a result of the bug. A hardware debugger attached to the server would have been useful, had the production machine supported one. However, even with a debug port, the ease of replication and the power of reverse debugging would not have been available.
With the debugger attached and inspecting the state of the simulated machine, we investigated the kernel data involved in the crash. It turned out to be the metadata for memory pages maintained by the NFS file system and the kernel. There are two sets of flags for each page. The Linux kernel itself maintains a set of flags for all pages, used among other things to track whether a page is dirty or being written back to disk. In addition, when NFS is used, NFS maintains its own flags, including a separate NFS “clean/dirty” flag. In the problem case, the NFS flag was set to clean but the kernel flag indicated a dirty page. The assert triggered when the kernel detected this clean/dirty inconsistency.
To find the root cause of the inconsistency, it was a simple matter to:
- Set a breakpoint on the NFS flag for the page.
- Reverse the execution to find the last writer.
What this revealed was a race condition between threads in the Linux kernel.
The NFS clean flag was being set from one kernel thread of execution, as part of completing an NFS write-back of pages. The thread would clear the kernel’s dirty flag on the page and set the write-back flag. Once the write-back had completed, the thread set the NFS flag to clean and cleared the write-back flag. The time it took for the write-back to complete left a window open, in which a different thread could modify the metadata and create an inconsistency. That’s what happened here.
At the same time, running within the context of an ioctl call from a user thread, the VMXMON driver was marking pages as dirty, with the wrinkle that the pages were also locked, which caused the kernel code to take an unusual path. Finally, a different kernel thread was looking for dirty pages to write back. This thread found a page that NFS had marked clean, pending the completion of a write-back, but which VMXMON had marked as dirty in the meantime. Having found such a page, it triggered the assert that crashed the kernel.
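To make the timing window concrete, here is a small user-space sketch of the race, using three POSIX threads and plain flags (kernel_dirty, nfs_clean, writeback) as hypothetical stand-ins for the real kernel and NFS page metadata. It is only an illustration of the interleaving described above, not the actual kernel or VMXMON code; the sleeps simply force the unlucky timing.

```c
/* Illustration of the race window, modeled in user space. The flags and
 * thread bodies are hypothetical stand-ins for the real kernel and NFS
 * page metadata and code paths; the sleeps force the unlucky timing. */
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static atomic_bool kernel_dirty = true;   /* kernel's per-page dirty flag  */
static atomic_bool nfs_clean    = false;  /* NFS's own per-page clean flag */
static atomic_bool writeback    = false;  /* page is being written back    */

/* Thread 1: NFS write-back completion, roughly as described above. */
static void *nfs_writeback(void *arg)
{
    kernel_dirty = false;          /* clear the kernel dirty flag       */
    writeback = true;              /* mark write-back as in progress    */
    usleep(1000);                  /* ...window while the I/O completes */
    nfs_clean = true;              /* NFS now considers the page clean  */
    writeback = false;
    return NULL;
}

/* Thread 2: VMXMON marking the (locked) page dirty during the window. */
static void *vmxmon_mark_dirty(void *arg)
{
    usleep(500);                   /* lands inside the window above */
    kernel_dirty = true;
    return NULL;
}

/* Thread 3: write-back scanner looking for dirty pages to flush. */
static void *writeback_scan(void *arg)
{
    usleep(2000);
    if (kernel_dirty && nfs_clean) {
        /* The same kind of invariant the kernel asserted on: a page must
         * not be dirty for the kernel and clean for NFS at the same time. */
        assert(!"page is both clean (NFS) and dirty (kernel)");
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2, t3;
    pthread_create(&t1, NULL, nfs_writeback, NULL);
    pthread_create(&t2, NULL, vmxmon_mark_dirty, NULL);
    pthread_create(&t3, NULL, writeback_scan, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_join(t3, NULL);
    puts("no inconsistency observed this run"); /* only if timing differs */
    return 0;
}
```

Compiled with -pthread, this sketch aborts on the assert in the scanner thread, mirroring the kernel's behavior; on a single core with different scheduling, the window may simply never be hit, which is why true concurrency mattered.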
If we take a step back to consider what happened here, we see that the bug was a race involving three different concurrent threads, and some really unlucky timing, coupled with an unusual combination of mmap and page locking. This is what happens in software, unfortunately: Software components that are robust and proven in use can still fail due to unforeseen or novel circumstances and interactions.
The Linux vendor that provided the server Linux was notified and fixed the issue. It also appears that more recent Linux kernel code has undergone extensive changes that appear to avoid the race entirely.
Conclusions
If we look at this bug as a specimen, it is clearly not a simple “comic relief” bug. Rather, it was a crippling bug for what we were trying to do, and it called for some serious debugging. It was a heroic quest, tracking the dragon bug through the badlands of the kernel and the long trails of the network. In the end, the dragon was found, confronted, and slain.
When doing this, having the right tool was critical, and we were fortunate to be able to leverage the Simics simulator. Two properties of the Simics simulator were indispensable: 1) a debugger that allowed debugging a crashed OS, and 2) the ability to repeat and reverse the execution of an entire networked system containing multiple machines, each with multiple processor cores and a complete software stack.
Finding and fixing this bug is a good example of the unique debug power of Simics full-system virtual platforms and tools. It’s also an example of how critical it is to simulate actual hardware concurrency in order to reveal bugs. Some bugs just do not trigger if tasks are forced to multitask on a single processor.