The differences between these events are as follows.
LLC_MISS - Retired loads that miss the LLC cache (Precise Event)
Counts number of retired loads that miss the LLC cache. The load was satisfied by a remote socket, local memory or an IOH.
MISS - Longest latency cache miss
Counts uncore Last Level Cache misses.
LATENCY_ABOVE_THRESHOLD_128 - Load instructions retired above 128 cycles
The Nehalem processor has a "latency event" which is very similar to the Itanium(R) Processor Family Data EAR event. This event samples loads, recording the number of cycles between the execution of the instruction and the actual delivery of the data. If the measured latency is larger than the minimum latency programmed into MSR 0x3f6, bits 15:0, then the counter is incremented. Counter overflow arms the PEBS mechanism, and on the next event satisfying the latency threshold, the measured latency, the virtual or linear address, and the data source are copied into 3 additional registers in the PEBS buffer. Because the virtual address is captured into a known location, the sampling driver could also execute a virtual-to-physical translation and capture the physical address. The physical address identifies the NUMA home location and in principle allows an analysis of the details of the cache occupancies.
Further, because the address is captured before retirement, even pointer-chasing encodings such as
MOV RAX, [RAX + RBX] have their addresses captured.
Because an MSR is used to program the latency, only one minimum latency value can be sampled on a core during a given period. To enable this, the Intel performance tools restrict the programming of this event to counter 3 to simplify the scheduling.
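As a rough illustration of the threshold mechanism described above, here is a minimal Python sketch. It only models the arithmetic: placing a minimum latency in bits 15:0 of the MSR value and applying the "larger than" comparison. The function names are hypothetical, and actually writing the MSR requires a privileged driver or tool, which is not shown.

```python
# Sketch only: models the bit layout described in the text, not a real driver.
LATENCY_THRESHOLD_MSR = 0x3F6   # MSR address given in the text
THRESHOLD_MASK = 0xFFFF         # bits 15:0 hold the minimum latency


def encode_latency_threshold(cycles: int, current_value: int = 0) -> int:
    """Return the MSR value with bits 15:0 set to `cycles`.

    `current_value` is the (hypothetical) existing MSR content; bits above
    15 are preserved unchanged.
    """
    if not 0 <= cycles <= THRESHOLD_MASK:
        raise ValueError("threshold must fit in 16 bits")
    return (current_value & ~THRESHOLD_MASK) | cycles


def load_is_counted(measured_latency: int, threshold: int) -> bool:
    """Per the text, the counter increments when the measured latency is
    larger than the programmed minimum."""
    return measured_latency > threshold


# Value corresponding to LATENCY_ABOVE_THRESHOLD_128
msr_value = encode_latency_threshold(128)
```

Since there is a single MSR per core, calling `encode_latency_threshold` with a different value replaces the previous threshold, which is why only one minimum latency can be sampled at a time.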
The following document provides detail about performance analysis and core events. The document is titled "Performance Analysis Guide for Intel Core i7 Processor and Intel Xeon 5500 processors" and can be found at: http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf
Also, if you have Intel VTune Amplifier XE 2011 installed, then additional information about events can be found in the .chm help documentation located in the Intel VTune Amplifier XE 2011 install directory under \documentation\en\.
With regard to including LLC misses due to software prefetches, I am not certain this granularity is available. I will research this to see what the possibilities are and update the thread as soon as I have additional information.
Thank you for your prompt response! Yes, I have read the counter descriptions in SDM V3B, Dr. Leventhal's i7 Performance Analysis Guide, and the VTune Amplifier XE documentation. The definition of LONGEST_LAT_CACHE.MISS in SDM 3B Table A3 (called there L3_LAT_CACHE.MISS) states that this event counts each cache miss condition for references to the last level cache. The event count may include speculative traffic but excludes cache line fills due to L2 hardware prefetches. I have interpreted this description to imply that LONGEST_LAT_CACHE.MISS would count all the events that MEM_LOAD_RETIRED.LLC_MISS counts, plus more. Among the additional events counted, LONGEST_LAT_CACHE.MISS would include LLC misses due to software prefetches, but not hardware prefetches. Do you agree with my interpretation?
On the other hand, Table A5 (Westmere) in SDM 3B no longer mentions speculative traffic or hardware prefetch, just uncore LLC misses. Does that mean that the counter definition has changed in Westmere, and that on Westmere LONGEST_LAT_CACHE.MISS now includes all uncore LLC misses, including those due to both software and hardware prefetches? (I also noticed that the Umask value of the LONGEST_LAT_CACHE.MISS event has changed as well: it is 41H in Table A3 and 01H in Table A5.)
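To make the Umask difference concrete, here is a small Python sketch that composes the raw event code for each Umask value. It assumes the usual core performance-event-select layout (event select in bits 7:0, Umask in bits 15:8) and an event select of 2EH for this event. The event select is not stated in this thread, so treat it as an assumption to check against the SDM tables.

```python
# Sketch only: composes raw event codes; it does not program any counter.
LLC_EVENT_SELECT = 0x2E  # assumption: event select for LONGEST_LAT_CACHE.*


def raw_event_code(event_select: int, umask: int) -> int:
    """Pack event select (bits 7:0) and Umask (bits 15:8) into one code."""
    if not (0 <= event_select <= 0xFF and 0 <= umask <= 0xFF):
        raise ValueError("event select and umask must each fit in 8 bits")
    return (umask << 8) | event_select


# Umask values as quoted from the two SDM tables in the post above
umask_41h_code = raw_event_code(LLC_EVENT_SELECT, 0x41)
umask_01h_code = raw_event_code(LLC_EVENT_SELECT, 0x01)

print(f"Umask 41H -> raw code {umask_41h_code:#06x}")
print(f"Umask 01H -> raw code {umask_01h_code:#06x}")
```

Under these assumptions the two Umask values yield different raw codes, so a tool given one code would not be programming the same event encoding as a tool given the other.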
Thank you so much in advance!