CYCLE_ACTIVITY.STALLS_MEM_ANY is an event for Intel(R) Microarchitecture Code Name Broadwell.
However, in the paper “Yasin, A. (2014, March). A top-down method for performance analysis and counters architecture. In Performance Analysis of Systems and Software (ISPASS), 2014 IEEE International Symposium on (pp. 35-44). IEEE.”,
the author describes CYCLE_ACTIVITY.STALLS_MEM_ANY as part of Intel's implementation of the Top-Down metrics on Ivy Bridge, but I can't find which Ivy Bridge event exactly corresponds to "CYCLE_ACTIVITY.STALLS_MEM_ANY". Thank you.
I'm interested in this problem as well. Did you maybe find some answers in the meantime?
I wanted to measure stall cycles due to the back end, specifically due to main memory.
I followed the Top-Down method; more can be found here:
There is also a chapter B.4.2 in the Intel Optimization reference manual.
When monitoring with perf, it is easy to get back-end stall cycles. However, back-end stalls can be either core-bound or memory-bound (as explained in the manual).
The manual says that there is a counter, CYCLE_ACTIVITY.STALLS_LDM_PENDING, on the Ivy Bridge architecture that counts "cycles when there is a non-completed in-flight memory demand load coincident with execution starvation. Note we account only for demand load operations as uops do not typically wait for (direct) completion of stores or HW prefetches."
Further in the manual it uses this counter to derive stalls due to L1, L2, L3 and main memory.
Is there a similar way to measure these things on Sandy Bridge?
I'm using a 2-socket machine with Xeon E5-2620 processors.
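For reference, the Ivy Bridge derivation mentioned above can be sketched roughly as follows. This is only my reading of the manual's memory-bound breakdown, written from memory, so please check each formula against section B.4.2 before relying on it. The function name `memory_bound_breakdown` and the parameter names are hypothetical, and the default L3 miss weight of 7 is an assumption about the manual's recommended value. The inputs are raw counts for CPU_CLK_UNHALTED.THREAD, the three CYCLE_ACTIVITY stall events, and MEM_LOAD_UOPS_RETIRED.LLC_HIT / LLC_MISS.

```python
def memory_bound_breakdown(clocks, stalls_ldm_pending, stalls_l1d_pending,
                           stalls_l2_pending, llc_hit, llc_miss,
                           l3_miss_weight=7.0):
    """Split memory-bound stall cycles into L1/L2/L3/DRAM shares of all cycles.

    l3_miss_weight is an assumed weighting of an LLC miss relative to an
    LLC hit; verify the value against the optimization manual.
    """
    weighted_loads = llc_hit + l3_miss_weight * llc_miss
    # With no LLC traffic at all, attribute the remaining stalls to memory.
    l3_hit_fraction = llc_hit / weighted_loads if weighted_loads else 0.0
    return {
        # Stalled with a load pending, but not on an L1D miss.
        "l1_bound": (stalls_ldm_pending - stalls_l1d_pending) / clocks,
        # Stalled on an L1D miss that was not also an L2 miss.
        "l2_bound": (stalls_l1d_pending - stalls_l2_pending) / clocks,
        # L2-miss stalls are split between L3 and DRAM by the hit fraction.
        "l3_bound": stalls_l2_pending * l3_hit_fraction / clocks,
        "mem_bound": stalls_l2_pending * (1.0 - l3_hit_fraction) / clocks,
    }


if __name__ == "__main__":
    # Made-up counter values, purely to show the call.
    for name, share in memory_bound_breakdown(
            clocks=1000, stalls_ldm_pending=500, stalls_l1d_pending=400,
            stalls_l2_pending=300, llc_hit=30, llc_miss=10).items():
        print(f"{name}: {share:.2%}")
```

The four shares add up to STALLS_LDM_PENDING / CLOCKS, so the breakdown only redistributes the stall cycles that were already counted as memory-bound.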
The event CYCLE_ACTIVITY.STALLS_MEM_ANY is listed in Chapter 19 of Volume 3 of the SW Developer's Manual for Skylake. The event uses the same Event 0xA3 that Intel has used for similar purposes on Sandy Bridge, Ivy Bridge, and Haswell, but adds new Umasks to allow counting stall cycles while there is an L3 demand load miss pending and while there is a memory load pending. (I don't understand the distinction between these two cases!) Oddly, Intel does not list any of the 0xA3 events for Broadwell. This could be due to an oversight or due to bugs in the implementation. I don't have any Broadwell processors to test....
Although Intel does not list it as a separate event in Chapter 19 of the SW Developer's Manual, you can combine Umask 0x01 (CYCLES_L2_PENDING) and Umask 0x04 (CYCLES_NO_DISPATCH) to count cycles in which there is a dispatch stall and at least one L2 demand load miss pending. It is used for this purpose in Intel's VTune product, where the event is referred to as CYCLE_ACTIVITY.STALLS_L2_PENDING. There is no assurance that the dispatch stall was actually caused by the L2 demand miss, but there is likely to be a correlation between L2 cache demand misses and dispatch stalls, so this event should be useful for pointing to pieces of code that may be worth further examination.
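As a small illustration of combining the two Umasks, the sketch below builds an event string in perf's `cpu/event=...,umask=...,cmask=.../` syntax. The helper name `perf_event_spec` is hypothetical. The `cmask` value is an assumption that is not stated above: I believe the named STALLS_L2_PENDING event is encoded with a counter mask equal to the combined Umask, but please verify that against the event list for your processor before trusting the numbers.

```python
EVENT_CYCLE_ACTIVITY = 0xA3   # event select shared by the CYCLE_ACTIVITY.* events
CYCLES_L2_PENDING = 0x01      # Umask: L2 demand load miss pending
CYCLES_NO_DISPATCH = 0x04     # Umask: no uops dispatched in this cycle


def perf_event_spec(event, umask, cmask=0):
    """Format an event for `perf stat -e`, using perf's cpu/.../ syntax."""
    spec = f"event={event:#x},umask={umask:#x}"
    if cmask:
        spec += f",cmask={cmask}"
    return f"cpu/{spec}/"


if __name__ == "__main__":
    # OR the two Umask bits together: 0x01 | 0x04 == 0x05.
    umask = CYCLES_L2_PENDING | CYCLES_NO_DISPATCH
    # ASSUMPTION: cmask equal to the combined umask; check your CPU's event list.
    print(perf_event_spec(EVENT_CYCLE_ACTIVITY, umask, cmask=umask))
```

The printed string can then be passed to perf, for example `perf stat -e <string> ./your_program`, where `./your_program` is a placeholder for the workload being measured.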
Thanks for the insights!
As for L3 miss pending vs. memory load pending on Skylake... I'm not an expert, but maybe there is some difference if the L3 miss data is found on the other socket...
I will try with L2 pending stalls.