dyken__christopher · ‎01-16-2019

Hi,

I'm trying to understand the "Instructions Retired" column (not the "Retiring"-class under locators) in Microarchitecture Exploration in VTune.

From the bottom-up view, I've found a function that I'm interested in, and I'm looking at a block of assembly. There is no control flow, so I would guess that each of these instructions would successfully retire the same number of times.

However, the values under "Instructions Retired" vary quite a lot for this sequence of instructions. One explanation might be sampling noise, but I wonder whether my assumption is broken that this column should correlate with the number of times that particular instruction retires, or whether I have the wrong mental model here.

If anyone could shed some light on this, I would be grateful.

McCalpinJohn · ‎01-16-2019

VTune uses a "sampling" approach to performance analysis, which can produce many surprises....

Sampling is based on interrupting the program execution and recording information (including the program counter). An out-of-order processor typically has some instructions that have executed, but not retired, when an interrupt arrives, which means that the program counter being used to fetch instructions may be far ahead of the address of the most recently retired instruction(s).

Section 6.6 of Volume 3 of the Intel Architectures SW Developer's Manual states that the return instruction pointer points to the first instruction to be executed when the program returns, which means that all prior instructions must have been retired. I seem to recall that it is common for sampling software to report the instruction immediately preceding the one pointed to by the return instruction pointer as the "cause" of the interrupt. (This makes sense if the interrupt is due to overflow of a performance counter, for example, though the lag between instruction execution and the generation of a performance monitor interrupt means that this is almost never exactly correct.)
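
A toy simulation can make the effect of that lag concrete. The sketch below is only an illustration under stated assumptions, not a description of how VTune or the hardware actually attributes samples: the function name `sample_block`, the per-instruction retirement gaps, the overflow period, and the fixed skid are all hypothetical. It models a straight-line block executed in a loop, triggers a sample every `period` retired instructions, delivers the interrupt `skid_cycles` later, and records the next instruction waiting to retire at that moment.

```python
from bisect import bisect_right
from itertools import accumulate


def sample_block(retire_gaps, period, n_samples, skid_cycles):
    """Toy model: count which instruction each overflow sample is attributed to.

    retire_gaps[i] is the (hypothetical) number of cycles between the
    retirement of instruction i-1 and instruction i within one pass.
    """
    cum = list(accumulate(retire_gaps))  # retire time of each instruction in one pass
    block_cycles = cum[-1]               # cycles for one pass through the block
    n_instr = len(retire_gaps)
    counts = [0] * n_instr
    for k in range(1, n_samples + 1):
        n = k * period - 1               # global index of the instruction that overflows the counter
        iteration, idx = divmod(n, n_instr)
        t_overflow = iteration * block_cycles + cum[idx]
        t_interrupt = t_overflow + skid_cycles
        offset = t_interrupt % block_cycles
        # First instruction whose retire time is later than the interrupt:
        # this plays the role of the return instruction pointer.
        recorded = bisect_right(cum, offset)
        counts[recorded] += 1
    return counts


gaps = [1, 1, 8, 1, 1]  # assumption: instruction 2 waits 8 cycles before retiring
no_skid = sample_block(gaps, period=1009, n_samples=1000, skid_cycles=0)
with_skid = sample_block(gaps, period=1009, n_samples=1000, skid_cycles=3)
print("no skid:  ", no_skid)
print("with skid:", with_skid)
```

In this model every instruction retires exactly once per pass, yet with a nonzero skid most samples land on the instruction that waits longest to retire. Real out-of-order hardware is far more complicated, but the qualitative picture is the same: the per-instruction sample counts reflect where the interrupt tends to land, not how often each instruction retired.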

The architectural manuals are surprisingly vague about the detailed timing of interrupts. This allows the designers much more flexibility in implementation, while still satisfying the architectural requirements.

dyken__christopher · ‎01-17-2019

Thank you for an excellent reply!

So, just to check that my mental model is appropriate:

In principle, "Instructions Retired" should be constant for a code block without any branching, but it is challenging to attribute performance events to precisely the right instruction because of out-of-order processing with multiple instructions in flight, in addition to quantization artifacts that arise because the approach is based on sampling?

If so, can one interpret the variance of "Instructions Retired" over a code block as an indication of how well performance events have been attributed to the right instruction?

McCalpinJohn · ‎01-18-2019

In the terminology of compiler writers, a "basic block" is a section of code with one entrance and one exit, so you are guaranteed that all instructions in the "basic block" execute exactly the same number of times. Because of systematic biases, sampling won't capture the instructions uniformly, even for very large sample sizes. This is nothing to worry about -- any sample in the basic block means all of those instructions were executed when the sample was taken.

The interpretation of sampling results depends on what is used to trigger the sampling.

The most common approach is to sample based on wall-clock time. Then the sample counts are proportional to how much time the code spent in each of the basic blocks where the samples occurred.
Another commonly used approach is to sample on the overflow of hardware performance counters. In this case, the sample counts are proportional to the hardware event counts occurring in each basic block where the sample occurred.

In any sampling technique, there is the possibility of systematic error due to a correlation between activity of the code and the triggering of the samples. For example, a code might have a periodicity of exactly 1 millisecond, so a 1 millisecond sampling trigger would always hit in the same location in the code's execution path. The same can happen for sampling based on hardware performance counters. The effects can be minimized by careful choice of sampling interval (or event count) and/or by introducing additional randomization to the sampling interval. I have not seen evidence of these sorts of problems with VTune, but I only rarely use sampling for performance analysis, and even when I use it, I am only looking for very large-scale patterns (e.g., identifying the top 5 "hot spots" in the code).
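
The 1 millisecond example can be reproduced with a small simulation. This is a sketch under assumptions: the phase names and durations in `PHASES`, the function `sample_phases`, and the chosen intervals are hypothetical and only illustrate the aliasing effect. The loop takes exactly 1000 microseconds per iteration, and each sample is attributed to the phase that is running when the sample fires.

```python
import random
from bisect import bisect_right
from itertools import accumulate

# Hypothetical loop with a period of exactly 1000 microseconds (1 ms).
PHASES = [("setup", 200), ("compute", 700), ("writeback", 100)]


def sample_phases(phases, interval_us, n_samples, jitter_us=0, seed=0):
    """Count how many time-based samples land in each phase of a periodic loop."""
    names = [name for name, _ in phases]
    ends = list(accumulate(duration for _, duration in phases))  # end time of each phase
    loop_us = ends[-1]
    rng = random.Random(seed)
    counts = dict.fromkeys(names, 0)
    t = 0.0
    for _ in range(n_samples):
        # Optional randomization of the sampling interval.
        t += interval_us + rng.uniform(-jitter_us, jitter_us)
        counts[names[bisect_right(ends, t % loop_us)]] += 1
    return counts


aliased = sample_phases(PHASES, interval_us=1000, n_samples=1000)
detuned = sample_phases(PHASES, interval_us=997, n_samples=1000)
jittered = sample_phases(PHASES, interval_us=1000, n_samples=1000, jitter_us=500)
print("1000 us interval:", aliased)
print(" 997 us interval:", detuned)
print("1000 us +/- 500: ", jittered)
```

With the 1000 microsecond interval every sample hits the same phase, so the profile says nothing about where the time goes. With an interval that does not divide evenly into the loop period (997 here), or with a randomized interval, the counts come out close to the 20/70/10 split of the actual time spent.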

As a more "quantitative" alternative to sampling, I always use performance counter measurements over intervals for my detailed performance work -- e.g., either instrumenting the code to read the counters at the beginning and end of functions or loops, or running a separate program to read the counters on all the processors at regular intervals. An example code that reads (almost) all the performance counters on a Skylake Xeon system is https://github.com/jdmccalpin/periodic-performance-counters. This is configured to run concurrently with the code under test and read all the core and uncore performance counters at regular intervals. When I want to analyze a code that is fairly easily made self-contained, I sometimes replace the "sleep()" function with the code that I want to test, so I get counts before and after the code execution (instead of at regular time intervals).
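
The "read before, read after" pattern can be sketched in a few lines. This is not the code from the repository above; `measure` and `read_counter` are hypothetical names, and `time.perf_counter_ns` is only a stand-in that returns elapsed nanoseconds, since Python's standard library does not expose hardware performance counters. In real use the reader would be whatever routine returns the hardware event count of interest.

```python
import time


def measure(region, read_counter=time.perf_counter_ns):
    """Run region() and return (its result, counter delta across the call).

    read_counter is a placeholder: any zero-argument callable that returns
    a monotonically increasing count can be plugged in here.
    """
    before = read_counter()
    result = region()
    after = read_counter()
    return result, after - before


# Stand-in workload in place of the code under test.
total, elapsed_ns = measure(lambda: sum(i * i for i in range(100_000)))
print(total, elapsed_ns)
```

Because the delta covers the whole region, it is an exact count for that interval (up to the overhead of the two reads) and does not depend on which instruction a sample happens to be attributed to.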

Understanding "Instructions Retired" in Microarch expl in VTune