Software Tuning, Performance Optimization & Platform Monitoring
Discussion around monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform monitoring

How does out-of-order execution cause PMI skid?


How does out-of-order execution cause PMI skid?
What is the root cause of PMI skid?

Black Belt

It is important to understand the difference between exceptions and interrupts. These terms are sometimes used interchangeably, but it is helpful to keep them distinct. Chapter 6 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual (document 325384) provides an extensive discussion of exceptions and interrupts in the context of the Intel architecture (which is more complex than the simple description I provide here).

In general,

  • Exceptions are raised as a part of the execution of an instruction in whatever functional unit is executing the instruction, so identifying the instruction that caused the exception is straightforward, and making the exceptions "precise" and restartable is well understood.  (Not easy, but well studied.)  Intel refers to this class of exceptions as "Faults", and examples include page faults, divide by zero, etc.
  • Interrupts come from sources that are external to the processor pipeline.  There may be no relationship between the timing of the interrupt and the timing of the execution of the instructions in a particular core.  External interrupts must be handled in a way that respects the sequential order of execution of the program, but not with any particular timing relationship between the arrival of the interrupt and the instructions being executed.

If the performance monitoring interrupt were an "exception" of the "fault" class, there would be no problem with skid.

Implementing the performance monitoring interrupt in this way is not practical, however.  The performance monitoring unit aggregates data from all of the functional units of the core.  The latency of sending data to the performance monitoring unit will vary by unit and may be several cycles for the more distant units.  So an interrupt is used instead of an exception, and this breaks the precise mapping back to instructions.  There are cases where the skid can be reduced, but to eliminate it completely the performance monitoring unit would have to be replicated in every processor functional unit so that a "fault exception" could be generated instead.   I don't know of any design that has taken this approach....
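As a rough illustration of why an interrupt loses the precise mapping while a fault keeps it, here is a toy model in Python. Everything in it is an assumption made for illustration (the function names, the `retire_cycles` trace, the `signal_latency` parameter); it does not describe how any real Intel pipeline or PMU is built. A fault is reported by the instruction that raised it, while the interrupt is taken at the first instruction that has not yet retired once the delayed signal arrives.

```python
from bisect import bisect_right


def fault_reported_index(overflow_index):
    """A fault is raised by the instruction itself, so the report is exact."""
    return overflow_index


def interrupt_reported_index(retire_cycles, overflow_index, signal_latency):
    """Index of the instruction an interrupt handler would see (toy model).

    retire_cycles[i] is the hypothetical cycle in which instruction i retires;
    the list must be non-decreasing. The PMU is assumed to need
    `signal_latency` cycles to notice the overflow and deliver the interrupt,
    which is then taken at the first instruction not yet retired.
    """
    interrupt_cycle = retire_cycles[overflow_index] + signal_latency
    return bisect_right(retire_cycles, interrupt_cycle)


if __name__ == "__main__":
    # Made-up trace: several instructions can retire in the same cycle.
    retire_cycles = [0, 0, 1, 1, 1, 2, 3, 3, 4, 5]
    overflow_index = 2  # instruction whose retirement overflows the counter
    for latency in (0, 2):
        seen = interrupt_reported_index(retire_cycles, overflow_index, latency)
        print(
            f"latency={latency}: fault reports {fault_reported_index(overflow_index)}, "
            f"interrupt reports {seen}, skid={seen - overflow_index}"
        )
```

Even with zero signaling latency the model reports a later instruction, because other instructions retire in the same cycle as the one that overflowed the counter; adding latency only pushes the reported position further away.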


Thanks John.

I think the PMC is incremented in the pipeline, and the PMU somehow checks it (I don't know whether that is callback-like or a periodic check).

If it overflows, the PMU then issues an external interrupt through the APIC.

Meanwhile, the pipeline keeps processing the remaining instructions and moving forward even after the interrupt has fired (the processor's pipeline doesn't care about that).

So the processor cannot predict or guarantee how far execution has moved by the time the interrupt arrives, which is why the distance between the instruction at which the interrupt arrives and the instruction that overflowed the PMC varies so much.

I have a few questions, along with some of my own thoughts and assumptions:

(1) If the assumption described above is basically correct, how does out-of-order execution affect this phenomenon?

(2) Is it because OoOE lets the processor have more than one instruction executing in the pipeline at any given time? The instruction that caused the overflow and other instructions are very likely executing at the same time, so the overflowing instruction may not have enough time to wait for a PMI before it retires and execution moves on (a race).

With in-order execution, an instruction has to wait for the previous instruction to retire, so an instruction always has enough time to receive the PMI before execution moves on to the next instruction.

(3) Does an in-order architecture eliminate skid, or just greatly mitigate it?


Black Belt

The implementation of the Performance Monitoring Unit (PMU) can't be "tightly coupled" to the functional unit pipelines for many reasons:

  1. The processor is too large for zero-cycle signaling from all of the units to the PMU and back again.
    1. Recall that the PMU monitors not just the instruction pipelines, but also the L1I and L1D caches, the L2 cache, and the interface between the L2 and the "uncore".
  2. Delaying the pipeline (at any stage) by even a single cycle is unacceptable.
    1. Recall that the PMU can monitor events at many different parts of the pipeline -- instruction fetch, instruction decode, instruction issue, instruction dispatch, instruction retirement.  These happen lots of cycles apart, which means that they happen at physically separate locations, which means that it is not possible for the PMU to be "close" to all of them -- even for a single functional unit.
  3. Not all PMU events are caused by instructions! 
    1. The cache hierarchy is run by a complex, adaptive state machine, not by instructions.  (The caches interact with the instruction stream, but they are not directly controlled by the instruction stream.)
    2. For example, cache accesses can be caused by the HW prefetchers, which operate somewhat autonomously.
    3. Cache writebacks can be caused by external intervention requests.  These can be from the shared cache, from other cores, or from IO devices.
    4. Interrupts from external sources can also be counted by the PMU, and these are clearly not attributable to any specific local instruction.

These issues apply to both in-order and out-of-order processors.   The issues are more complex in out-of-order processors because they are typically physically larger (more cycles of latency).  Attempting to compensate for the delay is more complex because it is not (in general) possible to know what was actually executed between the instruction that caused the final increment to the performance counter and the time that the PMU was able to signal the interrupt.
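The point that both kinds of processor see skid can be sketched with a toy comparison. This is only a simplified model under loud assumptions: retirement width (1 versus 4 instructions per cycle) stands in for "narrow in-order" versus "wide out-of-order", the signaling latency of 5 cycles is made up, and the helper names `make_retire_cycles` and `skids` are hypothetical. It says nothing about the actual latencies of any real core.

```python
import random
from bisect import bisect_right


def make_retire_cycles(n_instructions, max_retire_per_cycle, seed):
    """Build a toy retirement trace: each cycle retires 0..max instructions."""
    rng = random.Random(seed)
    cycles, cycle = [], 0
    while len(cycles) < n_instructions:
        # Retiring 0 instructions models a stall cycle.
        cycles.extend([cycle] * rng.randint(0, max_retire_per_cycle))
        cycle += 1
    return cycles[:n_instructions]


def skids(retire_cycles, signal_latency):
    """Skid, in instructions, for an overflow at each instruction in the trace."""
    last_cycle = retire_cycles[-1]
    result = []
    for i, cycle in enumerate(retire_cycles):
        interrupt_cycle = cycle + signal_latency
        if interrupt_cycle >= last_cycle:
            break  # the interrupt would land past the end of the trace
        # The interrupt is taken at the first instruction not yet retired.
        result.append(bisect_right(retire_cycles, interrupt_cycle) - i)
    return result


if __name__ == "__main__":
    latency = 5  # assumed PMU signaling latency in cycles
    for name, width in (("narrow (1/cycle)", 1), ("wide (4/cycle)", 4)):
        s = skids(make_retire_cycles(5000, width, seed=1), latency)
        print(f"{name}: min skid={min(s)}, max skid={max(s)}")
```

In this model neither configuration reaches zero skid. The narrow machine's skid is bounded by `latency + 1` instructions, while the wide machine's can reach `4 * (latency + 1)` and varies much more from sample to sample, which is the part that makes compensation hard.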

There are limited cases for which the designers have worked very hard to make skid predictable, so that it can be exactly compensated. For the Sandy Bridge through Skylake cores, the "PEBS-PDIR" feature (discussed in Chapter 18 of Volume 3 of the Intel 64 and IA-32 Architectures Software Developer's Manual, document 325384) provides a known skid for exactly one event (INST_RETIRED.PREC_DIST), and only when using PMC1 (with counting disabled on the other PMCs). There are lots of other related topics in Chapter 18....