The Intel® Simics® simulator can use multiple host cores to accelerate the simulation in a few different ways. When simulating a network, each simulated machine can run on its own thread. When simulating a target machine with multiple cores, the simulation of the cores can be spread across multiple host threads to simulate parallel hardware in parallel. Both these mechanisms can be combined in a single simulation.
How Does Multicore Simulation Work?
Threaded simulation for the multicore case is illustrated in the picture below. Multiple CoreMark* processes are used as an example load, since each CoreMark run will use one processor core fully. The key point is that pretty much everything here is dynamically scheduled; there is no direct connection between target software and host processor cores.
Starting from the top, target software is scheduled on target cores by the target operating system(s). As the operating system code and scheduled user processes run, a stream of instructions and other operations is processed by the instruction-set simulators in the virtual platform. The instruction-set simulators are scheduled based on the work that they queue up with the Simics simulator scheduler.
The scheduler assigns the queued work to the execution threads of the simulator. There is not a one-to-one relationship between target cores and host threads, but instead any execution thread can pick up work from any instruction-set simulator.
The execution threads are scheduled by the host operating system onto the physical cores of the host. The number of execution threads available depends on the target system, host system, and current threading settings. The default thread count is set according to performance-optimizing heuristics, but the user can override this thread count if desired.
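As a rough mental model (a minimal Python sketch of my own, not the Simics scheduler's actual implementation), the arrangement resembles a thread pool: work queued up on behalf of each target core can be picked up by whichever execution thread happens to be free.
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch only -- not the real Simics scheduler. Work items queued
# by the instruction-set simulators go into a pool, and any execution thread
# may pick up work for any target core; there is no fixed core-to-thread mapping.
NUM_TARGET_CORES = 4
NUM_EXEC_THREADS = 2  # typically capped by host core count and any manual limit

def simulate_core_slice(core):
    # Stand-in for advancing the instruction-set simulator of one target core
    # through its next chunk of queued work.
    return f"advanced target core {core}"

with ThreadPoolExecutor(max_workers=NUM_EXEC_THREADS) as pool:
    for result in pool.map(simulate_core_slice, range(NUM_TARGET_CORES)):
        print(result)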
When simulating a multicore target in parallel, there is generally no value in having more execution threads than the number of cores in the target system. Using fewer execution threads often works well since most targets are not going to load all target cores to 100% all the time. There is also no point in having more execution threads than host cores – the host core count provides an upper limit on the possible parallel execution.
The parallelism in the simulator is basically limited to min(target cores, execution thread count), with the execution thread count in turn limited to min(manual thread limit, host core count).
When there are fewer execution threads than target cores, the execution threads share the work. If there is a single execution thread, the result is the classic temporally decoupled simulation where the simulator switches between the target cores on a time-quantum basis.
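Written out as a small back-of-the-envelope helper (illustrative only; the 12-CPU host figure comes from the performance summaries shown later in this post):
def effective_parallelism(target_cores, host_cores, thread_limit=None):
    # Execution threads are capped by the host core count and any manual limit;
    # achievable parallelism is further capped by the number of target cores.
    threads = host_cores if thread_limit is None else min(thread_limit, host_cores)
    return min(target_cores, threads)

print(effective_parallelism(4, 12))                  # 4 -- one thread per target core
print(effective_parallelism(4, 12, thread_limit=2))  # 2 -- threads share the cores
print(effective_parallelism(4, 12, thread_limit=1))  # 1 -- classic temporal decoupling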
Simulator Threading and the Simple RISC-V Platform
The simple RISC-V virtual platform provides a good test environment to explore how threading is used to run multiple target cores. There is not much going on in the hardware, and the default software stack is rather minimalistic. In contrast, most standard Linux distros running on real systems tend to run update daemons and many services in the background by default, with the resulting load complicating measurements and interpretation of results.
It is easy to add the CoreMark benchmark to the software stack using the Buildroot build tool (mentioned briefly in my previous blog post). Running parallel instances of CoreMark on the target system results in a highly parallel load. There is no communication or synchronization between the processes, and each CoreMark process can load a single target processor core to 100%. Thus, the target operating system should be able to schedule the CoreMark processes in a way that basically runs each on its own core, providing an ideal test case for parallelization.
Running a Parallel Workload
The following is a command-line session from the Simics simulator that runs multiple CoreMark binaries in parallel on the target and measures the performance of the simulator as it executes the workload.
Caption: Running a single-threaded simulation with a four-way parallel workload (on a server with two Intel Xeon® Platinum 8480+ processors).
The starting point is a RISC-V simple virtual platform booted to Linux prompt. The simulator is stopped. The performance measurement is set up, and serial console input to the target Linux is queued. Finally, the simulator is instructed to run until the target system Linux prompt reappears on the serial console. When the simulator starts running, the serial input is sent to its serial port. The performance measurement results are shown when the simulator stops.
# Wait for the boot to finish before manually stopping the run
running> stop
# Start performance measurement
simics> system-perfmeter mode = minimum
# Send input to target system via its serial console – kicking off 4 coremarks
simics> board.console.con.input "coremark & coremark & coremark & coremark\n"
# Run the simulation until all the target programs have finished
# ...
simics> bp.console_string.run-until board.console.con "# "
# Output from the measurement tool, after the run completed:
…
SystemPerf: Performance summary:
--------------------------------
SystemPerf: Target: 4 CPUs in 1 cells [4]
SystemPerf: Running on: Simics build-id 6213 linux64 on 12 CPUs with 31811 MiB RAM
SystemPerf: Threads: 1 execution threads, 5 compilation threads
SystemPerf: Virtual (target) time elapsed: 14.20
SystemPerf: Real (host) time elapsed: 87.90
SystemPerf: Slowdown: 6.19
SystemPerf: Host CPU utilization: 99.85%
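The reported slowdown is simply the ratio of host time to target time, which is easy to verify from the summary above:
# Slowdown = real (host) time / virtual (target) time, numbers from the output above.
virtual_time = 14.20  # seconds of simulated time
real_time = 87.90     # seconds of host wall-clock time
print(f"slowdown: {real_time / virtual_time:.2f}")  # slowdown: 6.19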
Since there is no attempt to control the scheduling of the tasks on the target system, it is not a given that each program instance gets scheduled on its own target core. The Linux kernel is allowed to run all the instances interleaved on a single core, leaving the other three target cores unused if it decided that was the best policy for some reason. However, from the look of things, Linux is actually doing a pretty good job of running the CoreMark instances in parallel (on the target system).
This run was performed using a single execution thread (“1 execution threads”), and took about 90 seconds to finish. See below for performance disclaimers and notes on the host on which this was run. Your results will differ, and the absolute time taken is not relevant.
Checking the Target Software View
It is always good to check your assumptions when doing performance measurements. In this case, a key assumption is that the target Linux manages to spread the load evenly across all the target cores.
To get an understanding of what the target software thinks is going on, I updated the Buildroot configuration to include the htop program. Running htop alongside the CoreMark instances shows that the target system is indeed loading all four of its cores, and that the only significant load comes from the CoreMark processes.
There are some “imperfections” in the reported data – this is a live Linux system, and the scheduler might move some tasks around between cores in order to let other programs run, like htop itself and background services.
Checking the Simulator View
Another way to check the load balance is to use the simulator to get a low-level view of how many instructions are executed on each core. This is easy to do using the Simics simulator standard instruction-count tool.
Like this:
running> stop
# Set up an instruction counter tool, and connect it to all the processor cores
simics> new-instruction-count -connect-all
Created icount0 (connected to 4 processors)
# Same input and run logic as above
simics> board.console.con.input "coremark & coremark & coremark & coremark\n"
simics> bp.console_string.run-until board.console.con "# "
# ... wait for run to stop
# Check the results
simics> icount0.icount
┌─────┬─────────────┬────────────┬───────┐
│Row #│ Processor │ Count │Count% │
├─────┼─────────────┼────────────┼───────┤
│ 1│board.hart[1]│ 42586963092│ 25.00%│
│ 2│board.hart[0]│ 42584905639│ 25.00%│
│ 3│board.hart[3]│ 42581704695│ 25.00%│
│ 4│board.hart[2]│ 42581690318│ 25.00%│
├─────┼─────────────┼────────────┼───────┤
│Sum │ │170335263744│100.00%│
└─────┴─────────────┴────────────┴───────┘
This runs the simulator in the same way as in the performance test shown above, but instead of measuring the performance, the simulator counts the number of instructions executed on each core. The result is clear: the load balance is very close to perfect when running this benchmark setup. Thus, both the simulator and the target software agree that the load is well-balanced and should provide a good basis for evaluating the parallelism of the simulator.
Running a Parallel Workload in Parallel
To test the efficiency of multithreading in the Simics simulator, the same experiment as above is performed after switching simulator execution mode to multicore:
# Simulator is stopped
simics> set-threading-mode mode = multicore
Switching threading mode to 'multicore'
# Then the exact same sequence as in the single-threaded case
simics> system-perfmeter mode = minimum
simics> board.console.con.input "coremark & coremark & coremark & coremark\n"
simics> bp.console_string.run-until board.console.con "# "
# After the simulator stops
…
SystemPerf: Performance summary:
--------------------------------
SystemPerf: Target: 4 CPUs in 1 cells [4]
SystemPerf: Running on: Simics build-id 6213 linux64 on 12 CPUs with 31811 MiB RAM
SystemPerf: Threads: 4 execution threads, 5 compilation threads
SystemPerf: Virtual (target) time elapsed: 14.20
SystemPerf: Real (host) time elapsed: 23.79
SystemPerf: Slowdown: 1.68
SystemPerf: Host CPU utilization: 400.05%
Both runs took the same amount of target time (14.20 seconds), which indicates that the target software ran in the same way and performed the same work (which should not be blindly assumed). In host time, the simulation ran about 3.7 times faster – not quite four times, but close. This shows that using multiple simulation threads does speed up the simulation when the target software is compute-bound and well-parallelized.
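The speedup follows directly from the two host times reported above:
# Speedup of the four-thread (multicore) run over the single-threaded run,
# using the host times from the two perfmeter summaries above.
single_thread_time = 87.90  # host seconds, 1 execution thread
multicore_time = 23.79      # host seconds, 4 execution threads
print(f"speedup: {single_thread_time / multicore_time:.2f}x")  # speedup: 3.69x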
Running a Parallel Workload in a Slightly Less Parallel Way
The number of execution threads that the simulator uses can be configured. Reducing it to two threads from four would be expected to result in about twice the execution time. The experiment is the same as above, with an added limit to the number of host threads to use:
# Simulator is stopped
simics> set-threading-mode mode = multicore
Switching threading mode to 'multicore'
# New: Limit the simulator parallelism to 2 threads
simics> set-thread-limit 2
# Same setup as before
simics> system-perfmeter mode = minimum
simics> board.console.con.input "coremark & coremark & coremark & coremark\n"
simics> bp.console_string.run-until board.console.con "# "
# Results, after the simulator stops:
…
SystemPerf: Performance summary:
--------------------------------
SystemPerf: Target: 4 CPUs in 1 cells [4]
SystemPerf: Running on: Simics build-id 6213 linux64 on 12 CPUs with 31811 MiB RAM
SystemPerf: Threads: 2 execution threads, 5 compilation threads
SystemPerf: Virtual (target) time elapsed: 14.20
SystemPerf: Real (host) time elapsed: 45.42
SystemPerf: Slowdown: 3.20
SystemPerf: Host CPU utilization: 200.47%
Indeed, using two execution threads results in a host runtime that is roughly twice as high as the four-thread case. Essentially, the same amount of simulation work is spread across half as many execution threads, leading to double the execution time.
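Putting the three runs of the fully parallel workload side by side shows the near-linear scaling (host times taken from the perfmeter summaries above):
# Host runtimes for the four-way parallel CoreMark workload at 1, 2, and 4
# execution threads, and the speedup relative to the single-threaded run.
runtimes = {1: 87.90, 2: 45.42, 4: 23.79}
for threads, seconds in runtimes.items():
    print(f"{threads} thread(s): {runtimes[1] / seconds:.2f}x faster than 1 thread")
# 1 thread(s): 1.00x, 2 thread(s): 1.94x, 4 thread(s): 3.69x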
Running a Less Parallel Workload
Reducing the target workload to two parallel CoreMark instances (instead of four) shows what happens when the target system is not loaded to 100%. Two CoreMark instances should use at most two of the four target system processor cores at the same time.
Running two CoreMark instances using two Simics simulator execution threads:
# Limit the Simics simulator to two execution threads:
simics> set-thread-limit 2
# The standard setup, except only two CoreMark instances:
simics> system-perfmeter mode = minimum
simics> board.console.con.input "coremark & coremark\n"
simics> bp.console_string.run-until board.console.con "# "
# Abbreviated output, showing the relevant information:
…
SystemPerf: Threads: 2 execution threads, 5 compilation threads
SystemPerf: Virtual (target) time elapsed: 14.20
SystemPerf: Real (host) time elapsed: 23.48
Running the same workload using four host threads takes roughly the same amount of host time:
# More threads to run on
simics> set-thread-limit 4
simics> system-perfmeter mode = minimum
simics> board.console.con.input "coremark & coremark \n"
simics> bp.console_string.run-until board.console.con "# "
…
SystemPerf: Threads: 4 execution threads, 5 compilation threads
SystemPerf: Virtual (target) time elapsed: 14.20
SystemPerf: Real (host) time elapsed: 24.05
That is very similar to the runtime with two execution threads, since there is no actual work for the two additional execution threads to pick up. For completeness, the result with a single execution thread is, as expected, about twice the time of the two-host-thread case:
SystemPerf: Threads: 1 execution threads, 5 compilation threads
SystemPerf: Virtual (target) time elapsed: 14.20
SystemPerf: Real (host) time elapsed: 45.08
Summary of Parallel Workloads
To complete the picture, I also ran three CoreMark instances using 1, 2, and 4 host threads. The result is summarized in the graph below. Note that all the experimental runs take the same amount of virtual time – but different amounts of host time depending on how quickly the simulator executes.
This simple experiment demonstrates that adding more execution threads makes the simulator complete a workload faster, but only up to the parallelism available in the workload. For example, for two parallel CoreMark instances on the target, going from 2 to 4 host threads provides no benefit. Running a single CoreMark takes the same time regardless of how many host threads are used, since it only loads a single core and its work is serial by nature (Amdahl’s law holds).
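To make the Amdahl's law point concrete, here is a tiny illustration; the parallel fraction p is an assumed modeling parameter, not something measured from these runs:
# Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n), where p is the fraction of
# the work that can run in parallel and n is the number of execution threads.
def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# One CoreMark instance: essentially serial from the simulator's perspective.
print(amdahl_speedup(0.0, 4))            # 1.0 -- extra threads do not help
# Four independent CoreMark instances: close to fully parallel.
print(round(amdahl_speedup(1.0, 4), 2))  # 4.0 -- ideal four-way speedup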
One Last Check
What does the operating system do when running a partial load, such as three CoreMark instances on four cores? One possibility is that it keeps each thread on the same core for the duration of the run; another is that it moves the threads around. The latter behavior is commonly seen on laptops and servers when you run a heavy process like the Simics simulator itself – the load tends to migrate between cores over time, as the complex operating system schedulers make decisions based on the overall system state.
This is easy to check, using the instruction count tool. A quick experiment running three CoreMark processes results in an instruction count like this:
simics> icount0.icount
┌─────┬─────────────┬────────────┬───────┐
│Row #│ Processor │ Count │Count% │
├─────┼─────────────┼────────────┼───────┤
│ 1│board.hart[0]│ 42586066577│ 33.33%│
│ 2│board.hart[2]│ 42585441542│ 33.33%│
│ 3│board.hart[3]│ 42581920148│ 33.33%│
│ 4│board.hart[1]│ 2161017│ 0.00%│
├─────┼─────────────┼────────────┼───────┤
│Sum │ │127755589284│100.00%│
└─────┴─────────────┴────────────┴───────┘
Looks like the operating system kept each process on the same core for the duration. Testing the extreme case of a single CoreMark instance also shows a strong tendency to stick to one core:
simics> icount0.icount
┌─────┬─────────────┬───────────┬───────┐
│Row #│ Processor │ Count │Count% │
├─────┼─────────────┼───────────┼───────┤
│ 1│board.hart[2]│42585764730│100.00%│
│ 2│board.hart[1]│ 893726│ 0.00%│
│ 3│board.hart[0]│ 757813│ 0.00%│
│ 4│board.hart[3]│ 186245│ 0.00%│
├─────┼─────────────┼───────────┼───────┤
│Sum │ │42587602514│100.00%│
└─────┴─────────────┴───────────┴───────┘
Disclaimer and Details
In general, measuring and optimizing performance is a tricky business. The simulator performance might vary even over short periods of time due to the ambient temperature, other software running on the host at the same time, and other factors. Your experience will vary in absolute terms, but the trend for how parallelism speeds up the execution should be similar.
The experiments above were performed in a single interactive run of the Simics simulator. The host machine was an Intel Next Unit of Computing (NUC) computer standing on my desk, sporting an Intel® Core™ i7-10710U laptop chip. It is easy to find host computers that are significantly faster than this NUC – and thus the absolute time required for the runs is not very interesting. The relevant data is the effect of parallelism in the target and host software stacks.