Accounting for CYCLE_ACTIVITY.CYCLES_NO_EXECUTE

Pradeep_R_ · 08-26-2015

Hi all,

I am using VTune's bandwidth profile to look at the fraction of time my software spends waiting on cache accesses on my Haswell (HSW) i7 processor. The CYCLE_ACTIVITY.CYCLES_NO_EXECUTE event gives this time. To break it down into the fraction of time waiting on L1, L2, and L3+Mem, I am trying to use CYCLE_ACTIVITY.STALLS_L1D_PENDING, ...STALLS_L2_PENDING, and STALLS_LDM_PENDING. However, the sum of these three counts is always greater than the CYCLES_NO_EXECUTE count.

Can someone please clarify what other events are being counted in these counters which CYCLES_NO_EXECUTE doesn't count?

Thanks,

Pradeep.

McCalpinJohn · 08-26-2015

Some of the descriptions of this event are oversimplifications that can be confusing....

The processor can "stall" at many places in the pipeline, so it is important to be clear about what is being measured. The performance counter events mentioned here are all based on hardware event 0xA3 "CYCLE_ACTIVITY". The description of this event in Volume 3 of the Intel Software Developer's Manual does not specify which part of the pipeline is being monitored, but it is pretty clear from the "nearby" events 0xA1 and 0xA6 that this event is measuring the "dispatch" of uops from the Reservation Station to the execution ports. (This corresponds to the "Scheduler" block in Figure 2-1 of the Intel Optimization Reference Manual, document 248966-030.) At this point in the pipeline, a "stall" is a cycle in which no uop is sent to any of the 8 execution ports.

In general it is not possible to unambiguously assign "blame" for dispatch stalls, especially since a stall may be due to several causes simultaneously. So this event does something different: since many stalls are caused by failure to tolerate memory latency, it counts cycles in which there is both a dispatch stall and a pending (demand) load miss. That is not an exact attribution, but it is helpful.

Other causes of dispatch stalls can include:

No uops in the Reservation Station due to bottlenecks earlier in the pipeline (instruction fetch/decode/rename). Intel calls these "front-end stalls". Note that Event 0xA6 "EXE_ACTIVITY" specifically *excludes* cycles in which the Reservation Station was empty, but this extra bit of logic does not appear to apply to event 0xA3.
No independent uops in the Reservation Station available to be dispatched due to long-latency instructions. For example, if the core is computing a dependent sequence of FMA instructions, each with a 5-cycle latency, the throughput is only one FMA instruction per 5 cycles, so 4 of every 5 cycles are dispatch stalls.
There are many other examples discussed in Appendix B of the Intel Optimization Reference Manual, and many of these don't require that there be a demand load miss pending.
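The FMA example above can be worked through as simple arithmetic. The sketch below is only an illustration: the function name and the chain length of 1000 are made up, and it assumes a fully dependent chain in which nothing else is available to dispatch while each FMA waits for its predecessor.

```python
def dependent_chain_stalls(chain_length, latency_cycles):
    """Return (total_cycles, stall_cycles) for a fully dependent chain.

    Assumption: each instruction dispatches exactly one uop, and no other
    independent uop is available while it waits for the previous result.
    """
    total_cycles = chain_length * latency_cycles
    dispatch_cycles = chain_length  # one dispatching cycle per instruction
    return total_cycles, total_cycles - dispatch_cycles


# Hypothetical chain of 1000 dependent FMAs with the 5-cycle latency
# quoted in the text.
total, stalls = dependent_chain_stalls(chain_length=1000, latency_cycles=5)
stall_fraction = stalls / total
print(total, stalls, stall_fraction)
```

None of these stall cycles involves a cache miss, yet in this model all of them would show up in a "no uop dispatched" count.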

So in summary:

CYCLE_ACTIVITY.CYCLES_NO_EXECUTE counts *all* cycles in which no uops are dispatched to the execution ports, no matter what the cause.
CYCLE_ACTIVITY.STALLS_*_PENDING only count such stall cycles if there is *also* a demand load miss pending at the L1, L2, or L3 level of the cache hierarchy.
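To make the summary concrete, here is a toy per-cycle model. It is a sketch of the counting rules as described above, not of the actual hardware: the `Cycle` class, the trace, and the assumption that a load still outstanding beyond L2 is also counted as a pending L1D miss are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    uops_dispatched: int    # uops sent to execution ports this cycle
    l1d_miss_pending: bool  # a demand load that missed L1D is outstanding
    l2_miss_pending: bool   # a demand load that missed L2 is outstanding


def count_events(cycles):
    counts = {
        "CYCLES_NO_EXECUTE": 0,
        "STALLS_L1D_PENDING": 0,
        "STALLS_L2_PENDING": 0,
    }
    for c in cycles:
        if c.uops_dispatched > 0:
            continue  # not a dispatch stall, so none of these events count
        counts["CYCLES_NO_EXECUTE"] += 1
        if c.l1d_miss_pending:
            counts["STALLS_L1D_PENDING"] += 1
        if c.l2_miss_pending:
            counts["STALLS_L2_PENDING"] += 1
    return counts


# Hypothetical trace (assumption: an L2 miss is also a pending L1D miss).
trace = (
    [Cycle(2, False, False)] * 4   # busy cycles
    + [Cycle(0, False, False)] * 3 # stall with no miss pending
    + [Cycle(0, True, False)] * 2  # stall, L1D miss that hit in L2
    + [Cycle(0, True, True)] * 5   # stall, load outstanding beyond L2
    + [Cycle(1, True, True)] * 2   # miss pending, but a uop was dispatched
)
counts = count_events(trace)
pending_sum = counts["STALLS_L1D_PENDING"] + counts["STALLS_L2_PENDING"]
print(counts, pending_sum)
```

In this model each STALLS_*_PENDING count is a subset of CYCLES_NO_EXECUTE, but the subsets overlap, so adding them together counts some stall cycles more than once and the sum can exceed the total.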

Other caveats:

The processor core can execute instructions out of order, so it is often possible to hide the latency of L1 misses that hit in the L2 cache. It is less likely that the processor can hide the latency of L2 misses or L3 misses. So the STALLS_L2_PENDING and STALLS_L3_PENDING events are more likely than the STALLS_L1D_PENDING event to be associated with a stall *caused by* the cache miss.
Some instructions appear to be dispatched multiple times while waiting for their data to return. (The floating-point counters on Sandy Bridge and Ivy Bridge are affected by this kind of repeated dispatch.) If this happens you are likely to *miss* true stall cycles with these events: they will not be incremented if a uop is dispatched to any port, even if that uop gets rejected and tried again. Most people would call that a "stall", but the hardware does not count it that way. I have not verified this on Haswell, but it is something to watch out for.
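The re-dispatch caveat can be illustrated with another toy model. This is a sketch under stated assumptions, not measured behavior: the trace is invented, and the split of each cycle into "useful" and "replayed" uops is a hypothetical bookkeeping device that the hardware counters do not expose.

```python
def stall_cycles(trace):
    """trace: list of (useful_uops, replayed_uops) tuples, one per cycle.

    Returns (hardware_view, effective):
      hardware_view: cycles with no uop dispatched at all, which is what a
                     "no uop dispatched" event would count in this model.
      effective:     cycles with no *useful* uop dispatched, which is what
                     most people would call a stall.
    """
    hardware_view = sum(1 for useful, replayed in trace if useful + replayed == 0)
    effective = sum(1 for useful, replayed in trace if useful == 0)
    return hardware_view, effective


# Hypothetical 7-cycle trace: a load misses, and a dependent uop is
# dispatched and rejected twice while the data is in flight.
trace = [(1, 0), (0, 1), (0, 0), (0, 1), (0, 0), (0, 0), (2, 0)]
hardware_view, effective = stall_cycles(trace)
print(hardware_view, effective)
```

The gap between the two numbers is the set of cycles in which only a rejected uop was dispatched, which is exactly the case the stall events would not see.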