On Ivy Bridge there are the following counters:
but no CYCLE_ACTIVITY.CYCLES_LLC_PENDING. I have done some profiling, and my results suggest that you cannot simply subtract the first two counters from the third to get the LLC value. There are three counters for the number of cache misses, but I want to know the effect of stalling.
How can I measure the number of CPU cycles stalled due to LLC cache load misses?
The wording of the documentation is a bit strange, but the event CYCLE_ACTIVITY.CYCLES_LDM_PENDING is described as counting cycles with pending "memory loads". This sounds like it means LLC misses, but one would have to do careful directed testing to be sure.
It is important to note that, despite the wording in the VTune configuration files, the events named CYCLE_ACTIVITY.STALLS_*_PENDING do not mean that the stalls were *caused* by the cache miss. They count cycles in which there is both a "dispatch stall" and a "pending demand data load" to the corresponding level of the memory hierarchy. It is certainly a common case for the demand load miss to cause stalls -- especially for L2 and LLC misses, but there are also two other important cases. (1) There is a demand load outstanding that would not cause a dispatch stall, but there is another condition occurring at that same time that actually causes a dispatch stall (e.g., dependent arithmetic operations). (2) There is a demand load outstanding that would cause a dispatch stall, and there is another condition occurring at the same time that also causes a dispatch stall.
In the first case this event incorrectly attributes stalls to demand load misses, while in the second case the event could easily be incorrectly interpreted as suggesting that eliminating the demand load miss would eliminate the stalls.
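To make the two cases concrete, here is a toy model, not a description of the real hardware. It treats each cycle as three flags and compares what a STALLS_*_PENDING-style event would count with the number of stall cycles that would actually go away if the load miss were eliminated. The class, the function names, and the cycle counts in the trace are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cycle:
    """One simulated cycle (hypothetical model, not real PMU state)."""
    load_pending: bool      # a demand load miss is outstanding
    load_would_stall: bool  # that load alone would block dispatch
    other_stall: bool       # an unrelated condition blocks dispatch

    @property
    def dispatch_stall(self) -> bool:
        return (self.load_pending and self.load_would_stall) or self.other_stall


def counter_value(trace):
    """Cycles with both a dispatch stall and a pending demand load."""
    return sum(c.dispatch_stall and c.load_pending for c in trace)


def stalls_removed_by_fixing_load(trace):
    """Stall cycles where the pending load is the only cause of the stall."""
    return sum(
        c.load_pending and c.load_would_stall and not c.other_stall
        for c in trace
    )


# Invented trace of 100 cycles.
trace = (
    [Cycle(True, True, False)] * 40     # load miss is the sole cause
    + [Cycle(True, False, True)] * 25   # case (1): load is innocent, other cause stalls
    + [Cycle(True, True, True)] * 15    # case (2): load and other cause overlap
    + [Cycle(False, False, True)] * 10  # stall with no load pending
    + [Cycle(False, False, False)] * 10 # no stall
)

print(counter_value(trace))                  # 80
print(stalls_removed_by_fixing_load(trace))  # 40
```

In this made-up trace the event reports 80 cycles, but removing the load miss recovers only 40 of them: the 25 case (1) cycles were never caused by the load, and the 15 case (2) cycles would still stall on the other condition.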
So which counters should I use to determine how much L1/L2/LLC cache misses are affecting performance? Just the counters which measure the number of L1/L2/LLC misses?
Also, could you tell me why some counters have a _PS suffix and some do not? In other words, what is the point of offering the non-_PS counter when the _PS counter is more accurate?
@ T C
I have an Ivy Bridge box; the useful events are listed below. Please use the events with the "_PS" suffix.
Please also use all the events recommended by the Tuning Guide and the Performance Analysis papers; those articles also cover the penalties of L1/L2/LLC misses.
~$ amplxe-runss -event-list | grep MEM_LOAD