Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

How to understand LLC_misses, LLC-load-misses and LLC-store-misses

LY
Beginner

Hi Everyone,

One issue has confused me for a long time. I use the perf subsystem in the Linux kernel to monitor last-level-cache (LLC) events. I know that on Intel machines perf works by programming and reading specific MSRs to count architectural or non-architectural core events and uncore events. Some perf events, however, are implemented in the kernel in software rather than through MSRs, because the PMU cannot detect them.

Here is my confusion. LLC_misses is one of the architectural events, and Intel provides a dedicated MSR for monitoring it. What I do not understand is whether the PMU can detect and differentiate load misses and store misses in the last-level cache. As I see it, every store miss causes a load miss, so from the cache's point of view each load miss and each store miss looks like a load miss, and the two cannot be told apart. I also checked Table 19-11 in Chapter 19 of Intel manual Volume 3B and did not find events for llc-load-misses or llc-store-misses. There is, however, an event called MEM_LOAD_UOPS_RETIRED.LLC_MISS (possibly equivalent to llc-load-misses), which counts retired load uops whose data source is an LLC miss, from the core's view rather than the cache's view. Why is there no corresponding MEM_STORE_UOPS_RETIRED.LLC_MISS event for retired store uops whose data source is an LLC miss? And does the architectural LLC_misses event include both last-level load misses and store misses?

Thanks.

McCalpinJohn
Honored Contributor III

It is certainly possible to distinguish between store misses and load misses at the hardware level.  Although both cause data to be loaded, they have very different requirements for cache coherence, so they are implemented using different transactions. 

Transactions through the memory hierarchy can often be monitored from multiple locations.   As you noted, Intel does not include an explicit event in the core counters for differentiating between load and store accesses to the LLC, but they do include this capability (and much more) in the OFFCORE_RESPONSE event (described in Chapter 18 of Volume 3 of the SW Developer's Manual).   For the OFFCORE_RESPONSE events, you can count load misses with the DMND_DATA_RD request type and you can count store misses with the DMND_RFO request type.   You can differentiate between LLC hits and LLC misses using combinations of the "response supplier" or "snoop info" fields.
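As a rough illustration of the request-type selection described above, here is a minimal Python sketch that builds an OFFCORE_RESPONSE value and a matching perf event string. It assumes that DMND_DATA_RD and DMND_RFO are bits 0 and 1 of the auxiliary MSR and that OFFCORE_RESPONSE_0 is event B7H with umask 01H, which should be confirmed against the manual for the processor in question. The "response supplier" and "snoop info" bit positions differ between processor generations, so they are passed in by the caller; `EXAMPLE_RESPONSE_BITS` and the helper names are hypothetical placeholders, not values from the manual.

```python
# Request-type bits in the low end of the OFFCORE_RESPONSE auxiliary MSR.
# Assumption: bit 0 and bit 1, to be checked against the SDM for your CPU.
DMND_DATA_RD = 1 << 0  # demand data reads (load misses)
DMND_RFO = 1 << 1      # demand reads-for-ownership (store misses)

# Placeholder only: the real "response supplier" / "snoop info" mask that
# means "LLC miss" is processor-specific and must come from the manual.
EXAMPLE_RESPONSE_BITS = 1 << 22


def offcore_rsp_value(request_bits: int, response_bits: int) -> int:
    """Combine request-type bits with response-type bits into one value."""
    return request_bits | response_bits


def perf_offcore_event(name: str, request_bits: int, response_bits: int) -> str:
    """Format an event string using perf's cpu PMU 'offcore_rsp' field."""
    value = offcore_rsp_value(request_bits, response_bits)
    return f"cpu/event=0xb7,umask=0x01,offcore_rsp={value:#x},name={name}/"


if __name__ == "__main__":
    print(perf_offcore_event("llc_load_miss", DMND_DATA_RD, EXAMPLE_RESPONSE_BITS))
    print(perf_offcore_event("llc_store_miss", DMND_RFO, EXAMPLE_RESPONSE_BITS))
```

Each printed string could then be passed to `perf stat -e`, once the placeholder response mask is replaced with the combination documented for the target processor.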

Intel has been careful not to put a very specific definition in place for the "architectural" versions of the LLC reference and LLC miss events, but the processor-specific descriptions of the event/umask combinations 2EH_4FH (LLC reference) and 2EH_41H (LLC miss) have slightly different wording that can provide some indication of what is counted on each specific processor. For the Xeon E5 (Sandy Bridge) and Xeon E5 v3 (Haswell) systems that I have tested, these events appear to count both demand load misses and demand store misses, but definitely not L2 hardware prefetches. They probably also count L1 hardware prefetches that miss in the L2, but that is harder to test.
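For readers who want to program these architectural events directly, the following Python sketch shows how an event/umask pair such as 2EH_41H maps to a perf raw event code and to an IA32_PERFEVTSELx value. It assumes the standard layout (event select in bits 0-7, umask in bits 8-15, USR bit 16, OS bit 17, EN bit 22); the function names are hypothetical and only for illustration.

```python
LLC_EVENT = 0x2E            # architectural LLC event select
UMASK_LLC_MISS = 0x41       # LLC misses
UMASK_LLC_REFERENCE = 0x4F  # LLC references

USR = 1 << 16  # count while in user mode
OS = 1 << 17   # count while in kernel mode
EN = 1 << 22   # enable the counter


def perf_raw_event(event: int, umask: int) -> str:
    """perf raw syntax: 'r' followed by umask and event select in hex."""
    return f"r{umask:02x}{event:02x}"


def perfevtsel_value(event: int, umask: int, flags: int = USR | OS | EN) -> int:
    """Value to write into an IA32_PERFEVTSELx MSR for this event."""
    return (event & 0xFF) | ((umask & 0xFF) << 8) | flags


if __name__ == "__main__":
    print(perf_raw_event(LLC_EVENT, UMASK_LLC_MISS))
    print(hex(perfevtsel_value(LLC_EVENT, UMASK_LLC_MISS)))
```

The raw code can be used as, for example, `perf stat -e r412e`, while the second value is what a tool writing MSRs directly would program into a counter's event-select register.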

LY
Beginner

Many thanks, Dr. McCalpin! I got the idea!

