Ivy Bridge, counting cycles stalled due to LLC cache load misses?

T_C · ‎03-21-2015

Hi,

On Ivy Bridge there are the following counters:

CYCLE_ACTIVITY.CYCLES_L1D_PENDING
CYCLE_ACTIVITY.CYCLES_L2_PENDING
CYCLE_ACTIVITY.CYCLES_LDM_PENDING

but no CYCLE_ACTIVITY.CYCLES_LLC_PENDING. I have done some profiling, and my results suggest you cannot simply subtract the first two counters from the third to get the LLC value. There are three counters for the number of cache misses, but I want to know how many cycles are lost to stalls.

How can I measure the number of CPU cycles stalled due to LLC cache load misses?

McCalpinJohn · ‎03-22-2015

The wording of the documentation is a bit strange, but the event CYCLE_ACTIVITY.CYCLES_LDM_PENDING is described as counting cycles with pending "memory loads". This sounds like it means LLC misses, but one would have to do careful directed testing to be sure.

It is important to note that, despite the wording in the VTune configuration files, the events named CYCLE_ACTIVITY.STALLS_*_PENDING do not mean that the stalls were *caused* by the cache miss. They count cycles in which there is both a "dispatch stall" and a "pending demand data load" to the corresponding level of the memory hierarchy. It is certainly a common case for the demand load miss to cause stalls -- especially for L2 and LLC misses -- but there are also two other important cases:

(1) There is a demand load outstanding that would not cause a dispatch stall, but there is another condition occurring at that same time that actually causes a dispatch stall (e.g., dependent arithmetic operations).

(2) There is a demand load outstanding that would cause a dispatch stall, and there is another condition occurring at the same time that also causes a dispatch stall.

In the first case this event incorrectly attributes stalls to demand load misses, while in the second case the event could easily be incorrectly interpreted as suggesting that eliminating the demand load miss would eliminate the stalls.
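To make the distinction concrete, here is a toy cycle-by-cycle sketch in Python. It is not a model of the real hardware: the `Cycle` flags, the function names, and the trace are all made up for illustration. It only shows how a counter defined as "dispatch stall AND pending load" can differ from the number of stall cycles that would actually disappear if the misses were eliminated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cycle:
    # All three flags are hypothetical; real hardware exposes nothing like this.
    load_pending: bool   # a demand load miss is outstanding this cycle
    load_blocks: bool    # that load, by itself, would stall dispatch
    other_blocks: bool   # some unrelated condition would stall dispatch

    @property
    def dispatch_stall(self) -> bool:
        return self.load_blocks or self.other_blocks


def counted_by_event(trace):
    """Cycles a STALLS_*_PENDING-style event would count: stall AND pending load."""
    return sum(c.dispatch_stall and c.load_pending for c in trace)


def removable_by_fixing_loads(trace):
    """Stall cycles that vanish without the miss: the load is the only blocker."""
    return sum(c.load_blocks and not c.other_blocks for c in trace)


# Invented trace, chosen only to exercise each case described above.
trace = (
    [Cycle(True, True, False)] * 10      # common case: the miss causes the stall
    + [Cycle(True, False, True)] * 5     # case (1): miss pending, something else stalls
    + [Cycle(True, True, True)] * 3      # case (2): the miss and something else both stall
    + [Cycle(False, False, True)] * 4    # stall with no load outstanding
    + [Cycle(False, False, False)] * 8   # no stall
)

counted = counted_by_event(trace)
removable = removable_by_fixing_loads(trace)
print(f"counted by event: {counted}, removable by fixing loads: {removable}")
```

With this invented trace the event-style count is 18 cycles, but only 10 of those stall cycles would go away if the load misses were removed. The other 8 are the case (1) and case (2) cycles.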

T_C · ‎03-22-2015

Hi,

So which counters should I use to determine how much L1/L2/LLC cache misses are affecting performance? Just the counters which measure the number of L1/L2/LLC misses?

Also, could you tell me why some counters have a _PS suffix and some do not? In other words, what is the point of offering the non-_PS counter when the _PS counter is more accurate?

Peter_W_Intel · ‎03-22-2015

@ T C

I have an Ivy Bridge box and list the useful events below. Please use the events with the "_PS" suffix.

Please also use the events recommended by the Tuning Guide and the Performance Analysis papers. The penalties of L1/L2/LLC misses are also covered in those articles.

~$ amplxe-runss -event-list | grep MEM_LOAD
MEM_LOAD_UOPS_RETIRED.L1_HIT
MEM_LOAD_UOPS_RETIRED.L2_HIT
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_RETIRED.L1_MISS
MEM_LOAD_UOPS_RETIRED.L2_MISS
MEM_LOAD_UOPS_RETIRED.LLC_MISS
MEM_LOAD_UOPS_RETIRED.HIT_LFB
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS
MEM_LOAD_UOPS_RETIRED.L1_MISS_PS
MEM_LOAD_UOPS_RETIRED.L2_MISS_PS
MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS
MEM_LOAD_UOPS_RETIRED.HIT_LFB_PS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_MISS_PS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HIT_PS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_HITM_PS
MEM_LOAD_UOPS_LLC_HIT_RETIRED.XSNP_NONE_PS
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
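
If you want to pull just the "_PS" events out of a listing like the one above, a minimal Python sketch could look like this. The `precise_events` helper is hypothetical (not part of any Intel tool), and the sample text is only a few lines copied from the listing; in practice you would feed it the saved output of the command.

```python
def precise_events(listing: str) -> list[str]:
    """Return the event names ending in _PS, in order, without duplicates."""
    names = (line.strip() for line in listing.splitlines())
    precise = (name for name in names if name.endswith("_PS"))
    # dict.fromkeys drops repeated names while keeping first-seen order.
    return list(dict.fromkeys(precise))


# A few lines copied from the listing above, used here as sample input.
sample_listing = """\
MEM_LOAD_UOPS_RETIRED.LLC_HIT
MEM_LOAD_UOPS_RETIRED.LLC_MISS
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS
MEM_LOAD_UOPS_RETIRED.LLC_MISS_PS
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
MEM_LOAD_UOPS_LLC_MISS_RETIRED.LOCAL_DRAM
"""

selected = precise_events(sample_listing)
for name in selected:
    print(name)
```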