
How the Intel® Simics® Simulator Executes Instructions

The instruction set simulator (ISS) is a core component of any virtual platform. The instruction set that the ISS simulates determines which software will work (and not work) on a virtual platform (VP) containing it. Most of the VP execution time tends to be spent in the ISS, even though the number of ISS objects in the VP is typically much smaller than the number of peripheral device models. In most cases, the performance of the ISS is critical to the overall performance of the virtual platform.

The Intel® Simics® simulator features a very high-performance instruction-set simulator framework that has been used to simulate more than two hundred different processor core variants from a dozen or so instruction-set families. To achieve this speed, the instruction-set simulators in the Intel Simics simulator use a mix of a classic interpreter, a just-in-time (JIT) compiler, and virtualization.

Multiple Ways of Simulating Instructions

The ISS for a particular processor core can use all three execution modes within a single simulation run. The strategy is to always use the fastest mode possible for the current target code, falling back to slower modes when necessary. A single simulation run will most likely feature a mix of execution modes over time – it is not typically the case that everything is run in just a single mode.

[Figure: Different ways to run instructions, falling back towards slower modes when needed]

Falling back to a “slower” mode can happen for several reasons, including the following (a mode-selection sketch follows the list):

  • The instructions or execution modes used by the target software are not available in the faster mode.
  • The simulator determines that it is actually faster to run in the nominally slower mode, typically due to execution overhead and switching costs.
  • The user is requesting operations that are only available in a slower mode, such as observing the execution using instrumentation.
  • The user is single-stepping or running code for very short time durations.  
  • There is not enough virtual time before the next event or the end of the time quantum to justify entering virtualization or running a whole JIT-compiled block.
  • The user has turned off the JIT compiler or virtualization modes.
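
To make the fallback logic concrete, here is a minimal mode-selection sketch in Python. All names and thresholds (ExecMode, choose_mode, JIT_MIN_QUANTUM, VT_MIN_QUANTUM) are invented for illustration and do not reflect the actual Intel Simics simulator implementation.

    from enum import Enum, auto

    class ExecMode(Enum):
        VIRTUALIZATION = auto()
        JIT = auto()
        INTERPRETER = auto()

    # Hypothetical cut-over points; the real values depend on the
    # target architecture and workload (see the measurements below).
    JIT_MIN_QUANTUM = 100        # instructions
    VT_MIN_QUANTUM = 500_000     # instructions

    def choose_mode(quantum_left, vt_available, jit_enabled,
                    instrumentation_active, single_stepping):
        """Pick the fastest usable execution mode, falling back
        toward the interpreter when necessary."""
        if single_stepping or instrumentation_active:
            # Observation and stepping need per-instruction control.
            return ExecMode.INTERPRETER
        if vt_available and quantum_left >= VT_MIN_QUANTUM:
            # Long uninterrupted runs amortize the cost of entering
            # and exiting hardware virtualization.
            return ExecMode.VIRTUALIZATION
        if jit_enabled and quantum_left >= JIT_MIN_QUANTUM:
            return ExecMode.JIT
        return ExecMode.INTERPRETER

    print(choose_mode(1_000_000, True, True, False, False))  # ExecMode.VIRTUALIZATION
    print(choose_mode(5_000, True, True, False, False))      # ExecMode.JIT
    print(choose_mode(5_000, True, True, True, False))       # ExecMode.INTERPRETER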

Interpreter

The interpreter provides the base of the ISS. It covers all the instructions and all execution modes of the target processor core. The original implementation of the Intel Simics simulator from the 1990s used a fast instruction-set interpreter as a core component. The interpreter is highly optimized, applying standard interpreter techniques such as only decoding the instructions in a basic block once. Even with all optimizations, however, the interpreter calls a service routine for each instruction it executes, which makes execution comparatively slow.
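
To illustrate the decode-once technique, here is a toy Python interpreter that caches decoded instructions per address, so that repeated execution of a block skips the decode step and goes straight to a service routine. The instruction format and service routines are invented for the example; the real interpreter works on target machine code.

    def op_addi(state, rd, rs, imm):
        state["regs"][rd] = state["regs"][rs] + imm

    def op_jump(state, target, _a, _b):
        state["pc"] = target
        return True  # signals that the PC was set explicitly

    SERVICE_ROUTINES = {"addi": op_addi, "jump": op_jump}

    def interpret(program, state, max_steps):
        decode_cache = {}  # pc -> (service routine, operands)
        for _ in range(max_steps):
            pc = state["pc"]
            if pc not in decode_cache:      # decode each address only once
                mnemonic, *ops = program[pc]
                decode_cache[pc] = (SERVICE_ROUTINES[mnemonic], ops)
            routine, ops = decode_cache[pc]
            if not routine(state, *ops):    # one service-routine call per instruction
                state["pc"] = pc + 1        # default fall-through

    # A two-instruction loop: r1 += 5, then jump back.
    program = {
        0: ("addi", 1, 1, 5),
        1: ("jump", 0, None, None),
    }
    state = {"pc": 0, "regs": [0] * 8}
    interpret(program, state, max_steps=100)
    print(state["regs"][1])  # 250 -- fifty iterations of r1 += 5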

The interpreter is still a very useful component of the ISS. It is good for running code one instruction at a time with minimal overhead, such as when single-stepping target code (it is much simpler to do that with an interpreter than to force a JIT compiler or virtualization setup to execute only a single instruction at a time). It is also easier to handle special cases like breakpoints and magic instructions in the interpreter.

Just-in-Time Compiler

JIT compilation typically provides a significant speed-up compared to an interpreter. By converting blocks of target code to blocks of host code, most of the overhead of the interpreter is removed. Furthermore, the JIT compiler can apply optimizations to the generated code blocks, reaching across the code for multiple target instructions. The introduction of a JIT compiler in the Intel Simics Simulator provided a very significant increase in simulator speed. It pushed the simulator across the 1 billion target instructions per second mark in 2004.

However, JIT compilation does carry a cost. If a block of code is only executed a few times, the cost and latency of converting it to host code will slow down the execution overall. Basically, the cost of compilation outweighs the time gained from faster simulation. To avoid wasted effort, the JIT compiler is only applied to blocks of target code that have been executed a certain number of times. In addition, JIT compilation is performed in the background using separate threads without interrupting the interpreter.
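
A minimal sketch of the "compile only hot blocks" policy, assuming a per-block execution counter and a background worker thread: interpretation continues while compilation happens asynchronously. The threshold value and the queue-based design are assumptions for illustration, not the simulator's actual mechanism.

    import collections, queue, threading

    HOT_THRESHOLD = 32        # assumed value; the real threshold is tuned
    compile_queue = queue.Queue()
    compiled_blocks = {}      # block address -> "host code" (stubbed)
    exec_counts = collections.Counter()

    def jit_worker():
        # Background compilation: the interpreter keeps running while
        # blocks are translated, hiding the compilation latency.
        while True:
            block_addr = compile_queue.get()
            compiled_blocks[block_addr] = f"<host code for {block_addr:#x}>"
            compile_queue.task_done()

    threading.Thread(target=jit_worker, daemon=True).start()

    def execute_block(block_addr):
        host_code = compiled_blocks.get(block_addr)
        if host_code is not None:
            return f"ran JIT-compiled {host_code}"
        exec_counts[block_addr] += 1
        if exec_counts[block_addr] == HOT_THRESHOLD:
            compile_queue.put(block_addr)   # schedule compilation, do not wait
        return f"interpreted block {block_addr:#x}"

    for _ in range(40):                     # block becomes hot on run 32
        execute_block(0x1000)
    compile_queue.join()
    print(execute_block(0x1000))            # now runs the (stub) compiled version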

JIT compilation works best when running complete basic blocks. Thus, it is not used when the time quantum length is short. Entering and exiting JIT-compiled code is more expensive than entering and exiting the interpreter. The cut-over point where the gains outweigh the overhead depends on the particular instruction set and workload being executed, and optimal performance sometimes requires setting parameters like the time quantum length appropriately.

Historically, a drawback of JIT compilers was the code size. Compared to an interpreter, JIT compilers produce larger code and thus increase the instruction cache pressure. In the early days of the Intel Simics simulator, instruction caches were typically small, and it was not a given that using a JIT compiler would bring any benefit in practice. Today that is no longer an issue.

Time Quantum Effects

One important factor in the performance of an ISS is the length of the time quantum. A longer time quantum lets the ISS execute more instructions for each fixed overhead of switching between target cores and entering and leaving the ISS. To put some numbers on it, experience indicates that the JIT compiler is faster than the interpreter beyond something like 100 instructions. To get maximum benefit from the JIT compiler, a time quantum of at least 10,000 instructions is required.
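
The underlying arithmetic is simple: if every quantum pays a roughly fixed switching cost, the useful fraction of the work grows as quantum/(quantum + overhead). The overhead value in this Python sketch is invented purely to show the shape of the curve.

    # Toy model: useful fraction = q / (q + c), where q is the number of
    # instructions per quantum and c is a fixed per-quantum overhead in
    # instruction-equivalents. The value c = 200 is invented.
    def useful_fraction(quantum, overhead=200):
        return quantum / (quantum + overhead)

    for q in (1, 100, 10_000, 1_000_000):
        print(f"quantum {q:>9,}: {useful_fraction(q):7.2%} useful work")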

 

[Figure: Interpreter and JIT compiler performance versus time quantum length (risc-v-time-quantum-performance.png)]

The graph above shows the results of a single simple experiment that compares the performance of the interpreter and the JIT compiler. The target system is a simple RISC-V platform with four RISC-V* cores, running four parallel Coremark* instances (the same experimental setup as discussed in this previous blog post). This means that the target processor cores are basically loaded to 100 percent and that instruction-set simulation absolutely dominates the simulation time. The performance of the interpreter and the JIT compiler is compared as the time quantum varies from 1 to 1 million cycles (which, for all practical purposes, is equivalent to the same number of instructions per quantum). The performance is measured as the speedup over the time it takes to run the test with a time quantum length of 1 cycle. The times for the JIT compiler and the interpreter at 1 cycle are the same, since the JIT falls back to the interpreter at very short time quanta.

The measurements indicate that the JIT compiler provides most of its performance benefits at 10k cycles. Setting the quantum to 100k provides some incremental gains. When using the interpreter, there is a small improvement in performance with longer time quanta, but it is not nearly as marked. The JIT compiler is noticeably faster than the interpreter beyond 100 cycles. These numbers are very similar across architectures, including both ARM* and Intel Architecture, indicating that the benefits of longer time quanta are independent of the target architecture.

Virtualization

The fastest way to run instructions is to run them directly on the host using hardware virtualization technology. This can provide a 5x or greater performance improvement over JIT compilation, depending on the precise instructions executed and how hard they are to handle in the JIT compiler. Virtualization requires that the simulated processor has the same general instruction-set architecture as the host processor. The Intel Simics simulator uses virtualization to accelerate Intel Architecture targets by applying Intel® Virtualization Technology for IA-32, Intel® 64, and Intel® Architecture (Intel® VT-x).

There is no free lunch, however, and the cost of entering and exiting virtualization is significant and usually higher than the cost of entering code produced by the JIT compiler or using the interpreter. To be profitable, virtualization requires even longer uninterrupted sequences of instructions than the JIT compiler. In terms of temporal decoupling, optimal time quanta are usually on the order of 500k to 1 million instructions.
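
The trade-off can be sketched with a toy break-even model: virtualization runs each instruction faster but pays a much higher cost per entry, so it only wins above some quantum length. All numbers below are invented to illustrate the shape of the trade-off, not measured values.

    # Toy break-even model: time = entry_cost + n * per_instruction, in
    # units of one JIT-executed instruction. All numbers are invented.
    def run_time(n, entry_cost, per_instruction):
        return entry_cost + n * per_instruction

    for n in (1_000, 10_000, 100_000, 1_000_000):
        jit = run_time(n, entry_cost=100, per_instruction=1.0)
        vt = run_time(n, entry_cost=8_000, per_instruction=0.2)  # ~5x faster, costlier entry
        winner = "virtualization" if vt < jit else "JIT"
        print(f"{n:>9,} instructions per quantum: {winner} is faster")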

[Figure: Virtualization and JIT compiler performance versus time quantum length (qsp-vmp-time-quantum-performance.png)]

The graph above shows the result of an experiment comparing the performance of virtualization-based execution and JIT compiler-based execution for a four-way parallel test program running on a four-core Intel Simics Quick-Start Platform (QSP). The JIT compiler is faster for short time quanta, but the virtualization-based execution is faster beyond roughly 10k cycles. The precise numbers are obviously specific to this particular workload on this particular target, but the general trend is the same as seen for other workloads and targets.

Virtualizing a Future or Different Instruction Set

Virtualization faces an interesting challenge when it comes to the instruction set being executed. It is quite likely that the simulated machine differs somewhat from the host, even when both are Intel Architecture processors. For example, the virtual platform could be simulating a server platform on a client host (or vice versa), and the two have slightly different hardware features and instruction-set extensions. When simulating future processor cores, the simulation has to deal with new instructions that are not yet available in the host hardware.

The Intel Simics simulator ISS handles these cases by falling back to the JIT compiler or interpreter when particular instructions or execution modes are unavailable on the host. It pulls maximum value from virtualization without affecting the instruction-set semantics. Thus, it is usually a good idea to run simulations on the newest available hardware – as it is more likely to have any particular feature implemented.

Another potential issue when using virtualization to run code in a virtual platform is time control. The virtual platform is event-driven, and it is crucial for the simulator semantics that events fire at the same point in time regardless of whether code is run using the interpreter, JIT compiler, or virtualization. However, when running using virtualization, the simulator hands over control to the processor and its virtualization features. The solution is to use the host performance monitor features to exit from virtualization after a certain number of instructions have been executed. This provides simulation timing that aligns with the other simulator modes. Using virtualization in this way requires that a specific driver is installed on the host. Standard virtualization drivers for simple virtual machines do not provide the features required to support precise timing. 
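
Conceptually, the contract is "run inside virtualization, but never past the next event or the end of the quantum." The Python sketch below models that bounded-execution idea with an event queue; in the real simulator the bound is enforced by host performance-monitoring hardware and the dedicated driver, not by anything like this code.

    import heapq

    def run_virtualized(budget):
        # In the real simulator, a host performance counter is armed to
        # force an exit from virtualization after 'budget' retired
        # instructions; here the bound is just modeled directly.
        return budget

    events = [(1_500_000, "timer interrupt"), (3_000_000, "DMA completion")]
    heapq.heapify(events)

    now, QUANTUM = 0, 500_000
    while events:
        next_event_time = events[0][0]
        budget = min(QUANTUM, next_event_time - now)   # never run past an event
        now += run_virtualized(budget)
        if now == next_event_time:
            _, what = heapq.heappop(events)
            # The event fires at the same virtual time as it would in
            # interpreter or JIT mode.
            print(f"t={now:,}: firing {what}")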

Hypersimulation

The best-case scenario for an instruction-set simulator is when it does not have to run any instructions at all, which happens more often than not. It is very common to find target processor cores idling since the operating system on the target has no work for them to do. Software typically indicates idleness to the hardware by running specific instructions or putting the processor into a sleep or wait state. The hardware takes advantage of idling to save power. The simulator takes advantage of idling to fast-forward the execution.

Based on observations running various operating systems on the simulator, idle code on Intel Architecture typically uses MWAIT instructions (a long time ago, the instruction to use was HLT). On ARM cores, WFI (Wait For Interrupt) is typical. RISC-V also provides an instruction called WFI for the same purpose.

When a processor core is idling, the instruction-set simulator can simply skip ahead in time to the next posted event or the end of the time quantum. Given standard serial temporally decoupled simulator semantics, this fast-forwarding is equivalent to stepping through the same time quantum one cycle at a time. This mode of execution is dubbed hypersimulation.
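
In code terms, hypersimulation replaces a cycle-by-cycle loop with a single constant-time skip, as in this invented Python sketch:

    def advance(core_state, quantum_left, next_event_delta):
        """Toy hypersimulation step: an idling core skips straight to
        the next posted event or the end of its time quantum instead
        of stepping one cycle at a time. Names are invented."""
        if core_state == "idle":               # e.g. after MWAIT or WFI
            return min(quantum_left, next_event_delta)   # O(1) skip ahead
        return 1                               # busy core: one step at a time

    print(advance("idle", quantum_left=1_000_000, next_event_delta=250_000))  # 250000
    print(advance("busy", quantum_left=1_000_000, next_event_delta=250_000))  # 1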

Hypersimulation can provide extremely fast time progress. The “performance” can be 100x real-time (i.e., a slow-down of 1/100) or more. It is not uncommon to forget about a running Intel Simics simulator session overnight and come back the next day to find that it has simulated months into the virtual future.

Hypersimulation can make idle cores in a system essentially free from a simulation performance perspective. For example, the quad-core quad-Coremark experiment on the RISC-V platform discussed above was repeated on an ARM*-based platform configured with four or eight cores. The execution time of four parallel Coremark instances on a four-core and an eight-core setup was essentially identical. Having four additional idling cores did not have a noticeable impact on the simulator performance – that is the magic of hypersimulation.

[Figure: Screenshot of a quad-Coremark run on an octa-core target, showing how four cores are idling waiting for interrupts while four cores are running code]

It should be noted that target software can prevent hypersimulation from working by busy-looping instead of using proper wait primitives. Some busy loops can be identified by the simulator automatically, using a functionality known as “autohyper”. To work around really bad software, the Intel Simics simulator lets you write “idle patterns” that the ISS will recognize and treat as idling.
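
Conceptually, an idle pattern is just a short instruction sequence the ISS matches against, as in this toy Python sketch. The tuple-based pattern representation is invented; the real feature matches target machine code.

    # Toy idle-pattern matcher: recognize a two-instruction busy loop
    # (poll a flag, branch back while it is zero) and treat it as idle.
    IDLE_PATTERNS = {
        (("load", "r1", "flag_addr"), ("branch_if_zero", "r1", -1)),
    }

    def looks_idle(last_two_instructions):
        return tuple(last_two_instructions) in IDLE_PATTERNS

    trace = [("load", "r1", "flag_addr"), ("branch_if_zero", "r1", -1)]
    if looks_idle(trace):
        print("busy-wait loop recognized: fast-forward until the flag can change")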

Busy loops used to be more common in the past, when processors used the same amount of power whether they were running code or idling, and software thus might not bother to use idle modes. Today, not using proper idling is bound to have negative effects on the power consumption of the system and is thus fairly rare in operating systems. Autohyper still finds the occasional loop in Linux, but it is far less common than it used to be. Busy idle loops are still common in boot firmware, which can hurt ISS performance during boots.

Disturbances in the Force

An ISS in a virtual platform must handle more than just simple register-to-register arithmetic instructions, and these other kinds of operations can have a significant impact on the overall simulation performance.

There are interactions with the world outside the processor core, like memory accesses, device accesses, and interrupts. While accesses to standard memory can be optimized using direct memory interfaces, device accesses are more expensive. The simulator has to drop out of instruction execution and run code in the device models. Triggering events has the same effect – the ISS must leave its core loop and call an event callback.
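
The cost difference can be pictured as a two-path dispatch: a fast path that touches host memory directly, and a slow path that leaves the core loop to call a device model. The address map and the uart_read stub in this Python sketch are invented for illustration, not the simulator's memory interface.

    RAM = bytearray(1 << 16)                   # plain memory: fast path
    DEVICE_BASE, DEVICE_SIZE = 0xF000, 0x100   # invented device window

    def uart_read(offset):
        # Slow path: the ISS has left its core loop and is running
        # device-model code (stubbed out here).
        print(f"device model invoked for offset {offset:#x}")
        return 0

    def load_byte(addr):
        if DEVICE_BASE <= addr < DEVICE_BASE + DEVICE_SIZE:
            return uart_read(addr - DEVICE_BASE)   # expensive: device model call
        return RAM[addr]                           # cheap: direct memory access

    load_byte(0x1234)   # fast path, no device call
    load_byte(0xF004)   # slow path, prints the device-model message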

[Figure: Factors that affect the speed of instruction-set simulation]

Calls to device code can also result in calls back into the instruction-set simulator to signal hardware events like interrupts. This will force the ISS to check the current interrupt settings and then possibly redirect the execution from the current basic block to an appropriate interrupt handler.

Instructions that cause page faults, illegal instructions, exceptions, and transitions between privilege levels (like user/operating system/hypervisor/system management) also require more simulation work.

Applying instrumentation and tracing to the execution (for example, to count instruction types) also results in lower performance since more work is needed for each instruction executed. Instrumentation works with the JIT compiler, but it is not possible to use it with virtualization since the hardware does not offer the ability to hook into instructions and memory operations in the way that is needed.
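
The cost of instrumentation is easy to picture: every executed instruction pays for at least one extra callback, as in this invented Python sketch of an instruction-type histogram (the callback API is not the simulator's instrumentation framework).

    import collections

    histogram = collections.Counter()

    def on_instruction(mnemonic):
        histogram[mnemonic] += 1      # extra work on every single instruction

    def run(trace, callback=None):
        for mnemonic in trace:
            if callback is not None:
                callback(mnemonic)    # instrumentation hook
            # ... execute the instruction itself ...

    run(["addi", "load", "addi", "branch"], callback=on_instruction)
    print(histogram)                  # Counter({'addi': 2, 'load': 1, 'branch': 1})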

All of these factors can result in a noticeable drop in simulation performance. The behavior of the software being simulated and use cases like instrumentation and tracing can have a huge impact on the performance of the ISS and the virtual platform overall.

This is the nature of the technology – performance will vary widely even for “the same” target.

About the Author
Jakob Engblom works at Intel in Stockholm, Sweden, as director of simulation technology ecosystem. He has been working with virtual platforms and full-system simulation in support of software and system acceleration, development, debug, and test since 2002. At Intel, he helps users and customers succeed with the Intel Simics simulator and related simulation technologies – with everything from packaged training to hands-on coding and solution architecture. He participates in the ecosystem for virtual platforms, including in Accellera working groups and the DVCon Europe conference.