measuring L3 cache misses on i7

user123 · ‎03-15-2011

Could somebody please explain the differences between the following i7 counters: MEM_LOAD_RETIRED.LLC_MISS, LONGEST_LAT_CACHE.MISS, and MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_128?
For background information, I would like to measure LLC misses in a way that includes LLC misses due to software prefetches.
Thanks!

Rob5 · ‎03-16-2011

User123,

The differences between these events are as follows.

MEM_LOAD_RETIRED.LLC_MISS

LLC_MISS - Retired loads that miss the LLC cache (Precise Event)
Counts number of retired loads that miss the LLC cache. The load was satisfied by a remote socket, local memory or an IOH.

LONGEST_LAT_CACHE.MISS

MISS - Longest latency cache miss
Counts uncore Last Level Cache misses.

MEM_INST_RETIRED.LATENCY_ABOVE_THRESHOLD_128

LATENCY_ABOVE_THRESHOLD_128 - Load instructions retired above 128 cycles

The Nehalem processor has a "latency event" which is very similar to the Itanium(R) Processor Family Data EAR event. This event samples loads, recording the number of cycles between the execution of the instruction and the actual delivery of the data. If the measured latency is larger than the minimum latency programmed into MSR 0x3f6, bits 15:0, then the counter is incremented. Counter overflow arms the PEBS mechanism, and on the next event satisfying the latency threshold, the measured latency, the virtual or linear address, and the data source are copied into 3 additional registers in the PEBS buffer. Because the virtual address is captured into a known location, the sampling driver could also execute a virtual-to-physical translation and capture the physical address. The physical address identifies the NUMA home location and in principle allows an analysis of the details of the cache occupancies.

Further, because the address is captured before retirement, even pointer-chasing encodings such as

MOV RAX, [RAX + RBX]

have their addresses captured.

Because an MSR is used to program the latency, only one minimum latency value can be sampled on a core during a given period. To enforce this, the Intel performance tools restrict the programming of this event to counter 3, which simplifies the scheduling.
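To make the threshold rule described above concrete, here is a rough sketch in Python. It is only a model of the comparison, not driver code: the function names and the sample latencies are made up for illustration, and the only facts taken from this post are the MSR address (0x3f6), the bits 15:0 field, and the "larger than the minimum latency" condition.

```python
# Illustrative model of the load-latency threshold check; not a driver.
LATENCY_MSR = 0x3F6       # MSR address quoted in the post above
THRESHOLD_MASK = 0xFFFF   # bits 15:0 hold the minimum latency


def encode_threshold(cycles: int) -> int:
    """Return the value for bits 15:0 of the latency MSR (hypothetical helper)."""
    if not 0 <= cycles <= THRESHOLD_MASK:
        raise ValueError("threshold must fit in 16 bits")
    return cycles & THRESHOLD_MASK


def count_loads_above(latencies, msr_value: int) -> int:
    """Count sampled loads whose latency is strictly larger than the threshold."""
    threshold = msr_value & THRESHOLD_MASK
    return sum(1 for latency in latencies if latency > threshold)


# Made-up latencies in cycles; only 129 and 300 exceed a threshold of 128.
sample_latencies = [4, 40, 129, 300, 128]
msr_value = encode_threshold(128)
print(f"MSR {LATENCY_MSR:#x} <- {msr_value}")
print(count_loads_above(sample_latencies, msr_value))
```

Since there is a single MSR per core, changing the threshold (say from 128 to 32) means reprogramming it, which is why only one minimum latency can be sampled at a time.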

The following document provides details about performance analysis and core events. It is titled "Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors" and can be found at: http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf

Also, if you have Intel VTune Amplifier XE 2011 installed, additional information about events can be found in the .chm help documentation located in the Intel VTune Amplifier XE 2011 install directory, under \documentation\en\.

With regard to including LLC misses due to software prefetches, I am not certain this granularity is available. I will research this to see what the possibilities are and update the thread as soon as I have additional information.

Thanks
Rob
Intel Support

user123 · ‎03-17-2011

Hi Rob,

Thank you for your prompt response! Yes, I have read the counter descriptions in SDM V3B, Dr. Levinthal's i7 Performance Analysis Guide, and the VTune Amplifier XE documentation. The definition of LONGEST_LAT_CACHE.MISS in SDM 3B Table A3 (called there L3_LAT_CACHE.MISS) states that this event counts each cache miss condition for references to the last level cache. The event count may include speculative traffic but excludes cache line fills due to L2 hardware prefetches. I have interpreted this description to imply that LONGEST_LAT_CACHE.MISS would count all the events that MEM_LOAD_RETIRED.LLC_MISS counts, plus more. Among the additional events counted, LONGEST_LAT_CACHE.MISS would include LLC misses due to software prefetches, but not hardware prefetches. Do you agree with my interpretation?

On the other hand, Table A5 (Westmere) in SDM 3B no longer mentions speculative traffic or hardware prefetch, just uncore LLC misses. Does that mean that the counter definition has changed in Westmere, so that LONGEST_LAT_CACHE.MISS now includes all uncore LLC misses, among them those due to both software and hardware prefetches? (I also noticed that the Umask value of the LONGEST_LAT_CACHE.MISS event has changed as well: it is 41H in Table A3 and 01H in Table A5.)
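For anyone who wants to try both Umask values, here is a small Python sketch of how an event select and a Umask combine into a raw event code, following the IA32_PERFEVTSELx layout (event select in bits 7:0, Umask in bits 15:8). The helper name is made up, and the event select value 2EH is my assumption; it is not quoted anywhere in this thread, so please check it against the SDM tables before relying on it.

```python
def raw_event(event_select: int, umask: int) -> int:
    """Pack event select (bits 7:0) and Umask (bits 15:8) into one raw code."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in 8 bits")
    return (umask << 8) | event_select


# Assumption: 2EH as the event select for LONGEST_LAT_CACHE (not from this thread).
LLC_EVENT_SELECT = 0x2E

# The two Umask values mentioned above for the two SDM tables.
for umask in (0x41, 0x01):
    code = raw_event(LLC_EVENT_SELECT, umask)
    print(f"umask {umask:#04x} -> raw {code:#06x}")
```

Counting with both encodings on the same workload might show whether the two table entries really describe different sets of misses.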

Thank you so much in advance!