Given the compute-intensive nature of virtual platforms, it is natural to want to use multiple host cores to accelerate their execution. The Intel® Simics® Simulator provides several mechanisms to leverage multiple host cores to run a simulation in parallel, at different levels of granularity. In this blog post we will take a look at an example of a single device model using threading to accelerate the simulator. This is, in essence, a high-level description of the threading part of “workshop-02”, as found in the Intel Simics Simulator Training Package. The training package is included in the Public Release of the Intel Simics Simulator.
The Target System
The system that we want to parallelize is a compute accelerator subsystem containing several parallel compute units. In addition to the compute units there is a control unit that is responsible for starting jobs on the compute units and waiting for them to complete. The accelerator is driven by software running on a processor that sets up descriptors in the accelerator memory and interacts with the control registers of the compute units and control unit.
The system looks like this:
The Job Structure
The compute accelerator executes jobs, running each from start to end before accepting a new job. Each job consists of a serial setup phase, in which software prepares work descriptors and starts the computation via the control unit; a parallel compute phase; and a collection phase. From the perspective of the software and hardware in the virtual platform, the compute phase is parallel: all the compute units run simultaneously. The virtual execution time of the work depends on the size of the data being processed and should not change regardless of how the virtual platform is run.
The simplest way to run this in the simulator is the classic serial mode, in which a single host thread runs each compute unit in turn. Like this (the image says “conceptual thread” for a reason; what is actually happening is a bit more complicated, as explained in a previous blog post):
This is obviously not very efficient for a case like this, but it is robust and, most importantly, very easy to write models for. So, what can be done to parallelize this?
Standard Simulator Parallelism
The Intel Simics simulator provides some ways to parallelize a simulation that apply almost automatically to any virtual platform. The simplest method from a model perspective is to parallelize across different simulated machines in a virtual network. This is kind of like creating multiple virtual instances of the simulator inside a single simulator process. Each simulated machine (or target) runs as a serial simulation, in parallel with other internally serial simulations. All communication between the machines is done using special link objects, typically corresponding to networks.
The beauty of this way of doing parallelism is that it is very easy to support in models. It basically amounts to avoiding process-global data (i.e., global variables). There is never any chance of two execution threads simultaneously accessing the same simulation object data. This was the first type of parallelism introduced in the Intel Simics Simulator more than 15 years ago, as a fairly easy step from the purely serial simulator that existed before then.
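In model code, avoiding process-global data simply means keeping all state inside the device object. A generic C illustration of the difference (not tied to any particular simulator API):

```c
/* Unsafe under multi-machine parallelism: one counter shared by
   every device instance in the process. */
static int transfer_count;

/* Safe: each device instance owns its own state, so two machines
   can run their instances on different threads without ever
   touching the same data. */
struct my_device {
    int transfer_count;
    /* ... registers, buffers, and other per-instance state ... */
};
```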
This does not help in this case, as all the compute units are inside a single target machine.
A more advanced form of parallelism is to let processor cores run in parallel while keeping devices serialized. This works well for parallel software workloads, as discussed in a previous blog on threading in the simple RISC-V* virtual platform.
This model makes the implementation of the processor cores and memory simulation more complex, as they have to deal with parallelism and true concurrency. Device models are left untouched, working exactly like they do in a serial simulation.
This does not really help in the case of a device model either, unless we make the device into a “processor” by implementing the simulator processor API. That is likely overkill and overly complicated.
The Real Details
To understand how to make device models run in parallel, some more details on the Intel Simics Simulator threading model are needed. Fundamentally, the simulator uses the concept of “thread domains” to express potential parallelism within a single target system. Every target contains a pseudo-serial world where all simulation objects that expect the classic serial execution model are put. Only one actual thread will execute code in the serial world at any moment in time.
Only “thread-aware” objects are allowed in thread domains outside the pseudo-serial world. Processor cores are, in general, thread-aware, along with critical helper objects like models of interrupt controllers, per-core timers, and memory-management units. Simple device models like our compute unit would rather not be thread-aware, as that just adds implementation pain.
Parallelizing with a Thread
Most of the actual interaction between software and the compute units does not need parallelization. Software will write the configuration registers and read the status registers: cheap operations that do not benefit from parallelization. The solution is to thread out only the core compute job, which is a small local fix in the device model.
Conceptually, the solution looks like this:
The serial case and the threaded case run the same compute kernel code on a work buffer in the same way. In the serial case, the compute kernel runs in the context of the device model itself, on the main simulation thread, with a buffer that lives inside the device model. In the threaded offload case, the compute work is handed off to a thread “on the side”, with a work buffer that is also outside of the device model. Note that a work buffer is always used, as it avoids the overhead of reading and writing individual words from memory.
For most accelerator models it is easy to pick up all the data that will be worked on from memory in a single large read to a buffer, do the compute, and deposit the results using a single large write. In a virtual platform, a read or write to memory can be arbitrarily large and still happen in a single atomic simulation step. Maybe not exactly what a direct memory access (DMA) operation looks like in the actual hardware, but it usually does not matter.
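As a sketch of this pattern in C: the `dma_read` and `dma_write` helpers below are hypothetical stand-ins for the simulator's memory-access API, and the kernel is a placeholder, but the shape (one large read, local compute, one large write) is the point.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helpers standing in for the simulator's memory API;
   each call moves the whole block in one atomic simulation step. */
extern void dma_read(uint64_t addr, void *dst, size_t len);
extern void dma_write(uint64_t addr, const void *src, size_t len);

/* Placeholder compute kernel. */
static uint32_t compute(uint32_t v) { return v * v; }

static void run_job(uint64_t src, uint64_t dst, size_t words)
{
    uint32_t *buf = malloc(words * sizeof *buf);

    /* One large read into a local work buffer... */
    dma_read(src, buf, words * sizeof *buf);

    /* ...compute on the buffer without touching simulated memory... */
    for (size_t i = 0; i < words; i++)
        buf[i] = compute(buf[i]);

    /* ...and one large write to deposit the results. */
    dma_write(dst, buf, words * sizeof *buf);
    free(buf);
}
```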
If there are enough host threads available to run all the compute threads at once, you can expect a much shorter host execution time. Ideally, something like this:
Note that the compute threads are host operating system threads, explicitly created for the purposes of running work in parallel. The simulation work in the thread domains, on the other hand, is processed by the simulation execution threads managed by the simulator core scheduler. The thread domains do not have their own dedicated host threads.
Virtual Platform Semantics and Threading
Implementing a threaded compute kernel requires some care with respect to simulation semantics. The controller device has to indicate the status of the current compute operation to software – is it idle and ready to accept a new job, is it processing, or is it done? It should also issue a completion interrupt.
In the serial execution case, this is easy. The status update is made visible to software based on the size of the work provided to the accelerator: a delay is computed from the size of the work, and a timed event is posted at that point in virtual time. The actual work can be done either when the job starts or at the end. It is done on the main execution thread and is synchronous with the code handling the completion.
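A minimal sketch of that serial flow, reusing the `run_job` sketch from above and assuming a hypothetical `post_event` helper for the simulator's event API plus a simple per-word cost model:

```c
#include <stddef.h>
#include <stdint.h>

enum { STATUS_IDLE, STATUS_BUSY, STATUS_DONE };

struct device {
    uint64_t src, dst;   /* descriptor: source and destination */
    size_t   words;      /* job size */
    int      status;     /* exposed via a status register */
};

/* Hypothetical helpers: post a callback after `delay` virtual
   seconds, and raise the completion interrupt line. */
extern void post_event(double delay, void (*cb)(void *), void *arg);
extern void raise_completion_interrupt(struct device *dev);
extern void run_job(uint64_t src, uint64_t dst, size_t words);

#define SECONDS_PER_WORD 1e-6   /* assumed per-word cost model */

static void job_done(void *arg)
{
    struct device *dev = arg;
    dev->status = STATUS_DONE;          /* now visible to software */
    raise_completion_interrupt(dev);
}

static void start_job(struct device *dev)
{
    dev->status = STATUS_BUSY;
    run_job(dev->src, dev->dst, dev->words);  /* do the work up front */

    /* Completion depends only on the job size, so virtual timing is
       the same regardless of how fast the host executes. */
    post_event(dev->words * SECONDS_PER_WORD, job_done, dev);
}
```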
However, once we go to a threaded model, the compute becomes asynchronous to the main simulator thread. Thus, the completion time of the compute job is unknown ahead of time. One way of handling this would be to update the visible status in the model whenever the compute actually completes. However, that makes the simulation non-deterministic and different from the serial case.
Instead, a better solution is to start the compute operation as soon as possible and post an event based on the same virtual time as in the serial case. Ideally the compute thread has finished by the time the event is fired, but if not, the completion event handler has no choice but to wait for the compute to complete while blocking progress of the simulation.
The complete logic is shown in the illustration above.
Note that it is necessary to include a “done flag” that communicates the state of the computation. Accesses to this flag must be done in a thread-safe manner – but that is actually the only part of the setup that requires classic thread programming. Everything else is standard simple serial code.
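A sketch of how small that thread-safe part can be, using POSIX threads (the names and structure here are illustrative, not the actual workshop code):

```c
#include <pthread.h>
#include <stdbool.h>

/* State shared between the compute thread and the simulation
   thread; lock and cond are assumed to be initialized when the
   job object is created. */
struct compute_job {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;        /* the "done flag" */
    /* ... work buffer and sizes would live here too ... */
};

extern void do_compute(struct compute_job *job);  /* hypothetical kernel */

/* Entry point of the side thread: pure computation, no simulator calls. */
static void *compute_thread(void *arg)
{
    struct compute_job *job = arg;
    do_compute(job);

    pthread_mutex_lock(&job->lock);
    job->done = true;            /* the only shared, guarded state */
    pthread_cond_signal(&job->cond);
    pthread_mutex_unlock(&job->lock);
    return NULL;
}

/* Called when software starts the operation. */
static void start_compute(struct compute_job *job)
{
    pthread_t tid;
    job->done = false;
    pthread_create(&tid, NULL, compute_thread, job);
    pthread_detach(tid);
    /* The completion event is posted at the same virtual time as in
       the serial case (not shown). */
}

/* Runs on the simulation thread when the timed event fires. */
static void completion_event(struct compute_job *job)
{
    pthread_mutex_lock(&job->lock);
    while (!job->done)           /* usually already true; if not, */
        pthread_cond_wait(&job->cond, &job->lock);  /* block until it is */
    pthread_mutex_unlock(&job->lock);

    /* From here on, plain serial model code: write back results,
       update the status register, raise the completion interrupt. */
}
```

In the common case the compute thread has long since finished when the event fires, so `completion_event` takes the lock, sees the flag set, and proceeds immediately; only a slow host thread forces it to block.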
Concrete Implementation: Mandelbrot Computation
The “workshop-02” example that is included in the Training Package in the Public Release of the Intel Simics Simulator implements this threading pattern. It provides full source code and instructions for how to run and evaluate various optimizations to a compute accelerator, including the use of threading.
The short video below shows the effectiveness of the threading. Each frame shown is rendered using eight compute units in parallel, from the perspective of the target system. The rendering time for each frame goes down significantly when using threading, which is visible in just how much faster the animation zooms in.
Note that the virtual time of the two runs is identical; it is just that the threaded variant executes faster in host time.