How to interpret pipeline stalls with the PMU events {frontend_stall, backend_stall, retired_slots}?

Tyree · ‎09-19-2024

Hi,

I am learning how to use TMA to profile my program.

After reading a lot of material, I understand that Intel TMA divides pipeline slots into three categories (simplifying by ignoring bad speculation from branch misprediction): retirement, frontend stall, and backend stall. As far as I can tell, the corresponding events on Ice Lake CPUs are UOPS_RETIRED.SLOTS, IDQ_UOPS_NOT_DELIVERED.CYCLES_0_UOPS_DELIV.CORE, and TOPDOWN.BACKEND_BOUND_SLOTS.

However, I can't find any material that explains how these three categories are actually counted in the processor pipeline.

Take the following 4-stage pipeline diagram as an example, assuming the pipeline width is 1:

Cycle    1 2 3 4 5 6 7 8 9 10
Inst1:   F D E R                    
Inst2:     F D E E E R
Inst3:       F D - - E R
Inst4:         F F - D E R
Inst5:           - F - D E R

What should the values for retirement, frontend_stall, and backend_stall be?

yuzhang3_intel · ‎09-19-2024

In TMA, all metrics are calculated from PMU events. You can refer to the perfmon repository below for the event lists and metric formulas:

https://github.com/intel/perfmon
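
As a rough illustration of how the level-1 metrics fall out of the raw counts, here is a minimal Python sketch. It assumes a simplified form of the Ice Lake formulas, with TOPDOWN.SLOTS as the denominator and the slot-based IDQ_UOPS_NOT_DELIVERED.CORE for the frontend (as far as I understand, the CYCLES_0_UOPS_DELIV.CORE variant counts cycles rather than slots). The function name and the sample counts are hypothetical, and the metric files in the repository above may apply extra correction terms, so treat them as the reference.

```python
def tma_level1(slots, retired_slots, fe_undelivered_slots, be_bound_slots):
    """Simplified level-1 TMA breakdown from raw slot counts.

    Assumed event mapping (Ice Lake names):
      slots                -> TOPDOWN.SLOTS
      retired_slots        -> UOPS_RETIRED.SLOTS
      fe_undelivered_slots -> IDQ_UOPS_NOT_DELIVERED.CORE
      be_bound_slots       -> TOPDOWN.BACKEND_BOUND_SLOTS
    """
    if slots <= 0:
        raise ValueError("slots must be positive")
    retiring = retired_slots / slots
    frontend_bound = fe_undelivered_slots / slots
    backend_bound = be_bound_slots / slots
    # Whatever is left over is attributed to bad speculation.
    bad_speculation = max(0.0, 1.0 - (retiring + frontend_bound + backend_bound))
    return {
        "retiring": retiring,
        "frontend_bound": frontend_bound,
        "backend_bound": backend_bound,
        "bad_speculation": bad_speculation,
    }


# Made-up counts, only to show the arithmetic.
example = tma_level1(slots=1000, retired_slots=400,
                     fe_undelivered_slots=250, be_bound_slots=300)
```

With these made-up counts the four fractions sum to 1, and the part not covered by the three events you listed ends up as bad speculation.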

For details on how the events you mentioned are counted, see the 'Intel® 64 and IA-32 Architectures Software Developer’s Manual'.
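
For the question about the diagram, it may help to look at the top-level classification rule from the published Top-Down method: each issue slot is examined at the allocation point in every cycle and placed in exactly one category. The sketch below is a toy model of that rule, not a description of how the hardware counters are implemented. The function names and the sample trace are hypothetical, and mapping your F/D/E/R stages onto "allocated" and "backend stalled" is an assumption you would have to make for your own pipeline model.

```python
from collections import Counter


def classify_slot(uop_allocated, backend_stalled, uop_retires):
    """Classify one issue slot using the top-level Top-Down rule."""
    if uop_allocated:
        # A uop was issued in this slot: useful work only if it later retires.
        return "retiring" if uop_retires else "bad_speculation"
    # No uop was issued: blame the backend if it could not accept one,
    # otherwise blame the frontend for not delivering one.
    return "backend_bound" if backend_stalled else "frontend_bound"


def tally(trace):
    """Count categories over (uop_allocated, backend_stalled, uop_retires) tuples."""
    return Counter(classify_slot(*slot) for slot in trace)


# Hypothetical width-1 trace: one slot per cycle, values invented for illustration.
trace = [
    (True, False, True),    # uop issued and later retired
    (False, True, False),   # nothing issued, backend was stalled
    (False, False, False),  # nothing issued, backend was ready
    (True, False, False),   # uop issued but later flushed
]
counts = tally(trace)
```

With a pipeline width of 1 there is one slot per cycle, so the slot counts and cycle counts coincide. On a wider machine the same cycle can contribute slots to several categories.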