Franco_M_ · ‎02-15-2018

Dear all,

I am trying to learn about cache optimization, but I am confused by the number of hardware events I find regarding cache misses.

For every level, I find a number of events I could monitor, for instance (see here) for L1 cache I could use L1D.REPLACEMENT or L1D_PEND_MISS.PENDING, for L2 L2_RQSTS.ALL_DEMAND_MISS and many more.

However, I would like to monitor the so-called "cache misses" for L1, L2, and L3, and I find everywhere just one single number with no reference to what hardware counter they used.

Is there a reference that I could read about this issue?

Thanks!
Franco

Thomas_G_4 · ‎02-15-2018

Hi,
It depends on which software you use, because some tools define their own high-level names for events, e.g. PAPI with PAPI_L1/2/3_DCM or perf with 'cache-misses'. For PAPI you can get the underlying event with papi_avail -e <eventname>.
Another source of this information is the LIKWID performance groups (https://github.com/RRZE-HPC/likwid/tree/master/groups), which are defined and mostly validated for the supported architectures (Validation: https://github.com/RRZE-HPC/likwid/wiki/TestAccuracy#tested-microarchitectures). The group L2 handles L1<->L2 traffic and L3 handles L2<->L3 traffic. The cache miss event for L3 is in the L3CACHE group.

Best,
Thomas

Franco_M_ · ‎02-15-2018

Thanks, Thomas. I am at the moment using Xcode with Instruments.

There is no documentation that I know of about this tool regarding counting cache events.

Thank you,
Franco

Thomas_G_4 · ‎02-15-2018

You probably need to find the settings for the hardware events and then add the hex codes for the events directly (either search for them at https://download.01.org/perfmon or read them out of LIKWID https://github.com/RRZE-HPC/likwid/wiki/TutorialLikwidPerf#using-likwids-information-with-perf). I have never used Instruments, but it might be that Apple allows only predefined analyses like memory usage and energy consumption.
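As a rough sketch of what "adding the hex codes directly" can look like: on Intel cores an event is identified by an 8-bit event select plus an 8-bit unit mask, and perf's raw format packs them as (umask << 8) | event. The helper names below are hypothetical, and the L1D.REPLACEMENT code (event 0x51, umask 0x01) is an assumption that should be checked against the perfmon JSON file for the exact processor model.

```python
def perf_raw_config(event_select: int, umask: int) -> int:
    """Pack an event select and unit mask into the low 16 bits of a raw config."""
    if not 0 <= event_select <= 0xFF or not 0 <= umask <= 0xFF:
        raise ValueError("event select and umask must each fit in 8 bits")
    return (umask << 8) | event_select


def perf_raw_string(event_select: int, umask: int) -> str:
    """Format the packed value the way perf expects a raw event: r<hex>."""
    return f"r{perf_raw_config(event_select, umask):04x}"


# Assumed encoding for L1D.REPLACEMENT; verify it for your processor model.
L1D_REPLACEMENT = perf_raw_string(0x51, 0x01)
print(L1D_REPLACEMENT)
```

Whether a given tool (Instruments included) accepts such raw codes at all is a separate question that this sketch does not answer.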

McCalpinJohn · ‎02-15-2018

It is also important to understand how the particular cache works (1) and what the specific event is measuring (2).

(1) Most Intel processors in the last decade have used a memory hierarchy composed of

a private L1 Data Cache,
a private Unified, non-inclusive L2 cache, and
a shared, inclusive L3 cache.

There are some exceptions, including the Silvermont and Knights Landing (both have a shared L2 with no L3), and the Skylake Xeon (with a non-inclusive L3), and possibly a few others.

With an inclusive L3 cache, all demand load or demand "read for ownership" (i.e., stores that miss in the cache) or L1 hardware prefetch requests or L2 hardware prefetch requests that miss in the L3 move data into the L3. If the request originated in the core or L1 prefetcher, the data will also be copied to the L2 cache and to the L1 data cache. If the request originated from the L2 hardware prefetchers, the data might be copied into the L2 cache, or might be loaded only into the L3, depending on undocumented heuristics related to how busy the caches are and possibly other factors.

With this cache structure, when a "clean" cache line is chosen as the "victim" to be replaced, the old cache entry ("entry" == data + tags) is simply overwritten with the new cache entry. When a "dirty" cache line is chosen as the victim to be replaced, the entry must be copied out to a higher-numbered level of the cache. L1 victims are typically transferred to the L2 cache, but since the L2 is non-inclusive, it is possible that the entry for this cache line has been evicted from the L2 cache before being evicted from the L1. In the implementations I have studied, it looks like the L1 writeback bypasses the L2 and sends the dirty cache line entry to the L3. The L3 is inclusive, so if the line is in the L1 (or L2), there must be a valid entry for the line in the L3.

Skylake Xeon processors implement a non-inclusive L3, which results in a very different flow. A big topic for another day.

(2) Performance counter events have very specific meanings, but often have vague descriptions. Things to pay attention to:

Some events only count activity due to demand loads (i.e., not demand stores or any hardware prefetches), such as MEM_LOAD_RETIRED.* An increment to MEM_LOAD_RETIRED.L2_HIT, for example, means that either the cache line was found in the L2 cache because it was still there due to a previous use, OR because it had been moved into the L2 by a hardware prefetch. Because of this, a very high hit rate at a particular cache level with this counter does not imply a low amount of traffic into that cache. It might be low traffic (because you are re-using data), or it might be high traffic (because the data is getting into that cache via hardware prefetches), or it might be anywhere in between.
Some events count all transfers (loads or stores, demand or HW PF), such as L1D.REPLACEMENT.
Some events count only transactions that complete (i.e., are not rejected), while other events count every time that the transaction is attempted. These are not always easy to distinguish, and the details probably vary by processor model. In my testing, it looks like the events named L2_RQSTS.* (Event 0x24) count only successful transactions, while the corresponding events L2_TRANS.* (Event 0xF0) count both successful transactions and rejected attempted transactions.
Some performance counter events related to caches have published errata in the "Specification Update" document for the processor model.
Some performance counter events related to caches have bugs that are not published, or are published in unexpected places (such as the Intel Optimization Reference Manual). Because of this, it is a good idea to have test codes that have predictable cache behavior to do sanity checking on the counts on your processor before you trust the values. Running each of the tests with hardware prefetchers enabled and disabled can be very enlightening (https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors)
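A minimal sketch of the "predictable test code" idea, under simple assumptions: a single strided read sweep over an array much larger than the cache level under test, 64-byte lines, and no reuse, so every distinct line touched should be filled once per pass. The function names are hypothetical; the expected value is what a counter such as L1D.REPLACEMENT could be compared against.

```python
def expected_line_fills(array_bytes: int, stride_bytes: int,
                        line_bytes: int = 64, passes: int = 1) -> int:
    """Distinct cache lines touched by a strided sweep, times the number of passes."""
    if array_bytes <= 0 or stride_bytes <= 0:
        raise ValueError("array size and stride must be positive")
    accesses = (array_bytes - 1) // stride_bytes + 1
    if stride_bytes >= line_bytes:
        # Every access lands in a different line.
        lines = accesses
    else:
        # Steps are smaller than a line, so no line is skipped
        # between the first and the last access.
        last_offset = (accesses - 1) * stride_bytes
        lines = last_offset // line_bytes + 1
    return lines * passes


def relative_error(measured: int, expected: int) -> float:
    """Signed deviation of a measured count from the expected count."""
    return (measured - expected) / expected


# Example: 64 MiB array of doubles read once, element by element.
expected = expected_line_fills(64 * 1024 * 1024, 8)
print(expected, relative_error(1_060_000, expected))
```

Running the same comparison with the hardware prefetchers enabled and then disabled shows which events include prefetch traffic and which do not.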

Franco_M_ · ‎02-16-2018

Thanks to both of you for your answers.

This discussion is making me quite puzzled about any claim about "cache misses" that I can read.

My question is then: what should I assume is being reported when I see "cache miss" numbers? For instance, may I trust that the numbers come from events like these, or should I assume otherwise? I ask because some of the events listed in the link are not present on my system.

Thanks for your clarifications!
Franco

McCalpinJohn · ‎02-16-2018

The folks who developed and maintain likwid do a good job of choosing events, but I don't think that they have comprehensively tested all of their event sets.

For example, for the Ivy Bridge EP processor, test results at https://github.com/RRZE-HPC/likwid/wiki/AccuracyIvyBridgeEP appear to show that the "L2 group" is accurate for most of the tests. BUT, I can't figure out which of the events described in the "L2 Group" are being plotted, and the event called L2_RQSTS_MISS does not have an unambiguous correlation to the events described in Volume 3 of the Intel SWDM or the events listed at https://download.01.org/perfmon/IVB/ivybridge_core_v20.json. This is the main reason why I program the counters myself -- I always get frustrated trying to reverse engineer the mappings that various performance monitoring packages use....

Also note all of the tests in that group use data set sizes that are much larger than the L2 cache, so all of the expected accesses will be misses -- so they have verified that the counters are correct when the memory references miss in the L2 cache, but they have not (in this set of results) verified that the counters are correct when the memory references hit in the L2 cache. Because I can't tell which results are being plotted, I can't speculate on whether the small differences (2%-3% for the "triad" test) tell us anything useful about the counters.
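One way to cover both the hit case and the miss case is to run each test at a working-set size well below and well above the capacity of the cache being checked. The sketch below is only an illustration: the function name is hypothetical, and the 256 KiB L2 capacity is an assumed value that must be replaced with the real one for the processor under test.

```python
def sweep_sizes(cache_bytes: int, fractions=(0.5, 4.0), line_bytes: int = 64):
    """Working-set sizes at the given fractions of a cache, rounded down to whole lines."""
    sizes = []
    for fraction in fractions:
        size = int(cache_bytes * fraction)
        size -= size % line_bytes  # keep the size a multiple of the line size
        sizes.append(size)
    return sizes


L2_BYTES = 256 * 1024  # assumed L2 capacity; check your processor
fits_in_l2, overflows_l2 = sweep_sizes(L2_BYTES)
print(fits_in_l2, overflows_l2)
```

With these assumptions the smaller size is still larger than a 32 KiB L1 Data Cache, so repeated sweeps should miss in L1 and hit in L2, while the larger size should miss in both.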

Franco_M_ · ‎02-16-2018

Thank you John, it's really interesting.

Do you recommend some library or software that I could try? My objective is to be as cross-platform as possible (my colleagues and I work on macOS, Windows, and Linux, but mobile OSs would also be great), since LIKWID is Linux-only.

Thanks!
Franco

McCalpinJohn · ‎02-19-2018

Cross-platform is really hard -- even if the processors are the same.

I don't have any recommendations for cross-platform libraries -- I almost always write my own low-level code to interact with the hardware performance counters.

Thomas_G_4 · ‎02-21-2018

The accuracy tests for LIKWID do not use the events directly but derived metrics, in this case the L1 <-> L2 bandwidth. The L2 group for IvyBridge uses the two events L1D_REPLACEMENT (in SDM L1D.REPLACEMENT) and L1D_M_EVICT. The latter is indeed not published by Intel for IvyBridge, but it is published for SandyBridge as L1D.EVICTION and works on IvyBridge, Haswell, Broadwell, Skylake, and Kabylake. I certainly tested the event L2_TRANS.L1D_WB for evictions from L1 to L2, but in the groups I choose the events that provide more accurate results in the test cases. The name L1D_M_EVICT originates from Nehalem, where the L1D event group was introduced. The sizes are selected to get misses in L1. I'm well aware that the pages should contain more information about what was measured, how it was evaluated, how the events behave for different sizes and so forth.
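As an illustration of such a derived metric, the L1 <-> L2 bandwidth can be sketched as the number of lines loaded into L1 plus the number of lines evicted from L1, multiplied by the line size and divided by the runtime. The function name and the 64-byte line size are assumptions here; the exact formula used by a given LIKWID version should be read from its group file.

```python
def l1_l2_bandwidth_mbytes_per_s(l1d_replacement: int, l1d_m_evict: int,
                                 runtime_s: float, line_bytes: int = 64) -> float:
    """Traffic between L1 and L2 in MBytes/s, derived from two event counts."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    bytes_moved = (l1d_replacement + l1d_m_evict) * line_bytes
    return bytes_moved / runtime_s / 1.0e6


# Example with made-up counts: 1.0e6 line fills and 0.5e6 evictions in one second.
print(l1_l2_bandwidth_mbytes_per_s(1_000_000, 500_000, 1.0))
```

Because the metric sums two events, a small relative error in the plotted bandwidth does not show which of the two counters is responsible.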

The only tool I know that works on Linux, Windows and macOS is Intel's PCM https://github.com/opcm/pcm

Franco_M_ · ‎02-21-2018

Thanks Thomas, I will look into Intel's PCM.

Thank you,
Franco

Travis_D_ · ‎09-28-2018

McCalpin, John wrote:

With this cache structure, when a "clean" cache line is chosen as the "victim" to be replaced, the old cache entry ("entry" == data + tags) is simply overwritten with the new cache entry. When a "dirty" cache line is chosen as the victim to be replaced, the entry must be copied out to a higher-numbered level of the cache. L1 victims are typically transferred to the L2 cache, but since the L2 is non-inclusive, it is possible that the entry for this cache line has been evicted from the L2 cache before being evicted from the L1. In the implementations I have studied, it looks like the L1 writeback bypasses the L2 and sends the dirty cache line entry to the L3. The L3 is inclusive, so if the line is in the L1 (or L2), there must be a valid entry for the line in the L3.

Dr. McCalpin - thanks very much for yet another very informative post. One question:

When you say "it looks like the L1 writeback bypasses the L2 and sends the dirty cache line entry to the L3" you are talking about only in the scenario where the line has been evicted from L2 already, not the scenario where the line is still in L2, right?

In the case that the line is still in L2 I imagine it is updated there and the eviction stops at that point (the L3 is not updated). Do you agree?

Skylake Xeon processors implement a non-inclusive L3, which results in a very different flow. A big topic for another day.

If you've written it up anywhere at this point, I'd love to see it.

McCalpinJohn · ‎10-05-2018

The scenario that I described (L1 victim bypassing L2 and going to L3) is my interpretation of the action taken on SNB-BDW processors in the rare case that the line was evicted from the L2 before being evicted from the L1. (If the line is still contained in the L2 cache, the L1 victim will update the L2 entry, with no change to the L3 entry.) On these processors, the L2 is 8x larger than the L1 Data Cache, and both have the same 8-way associativity, but the L2 cache is unified, so it also caches L1 Instruction Cache entries. Allocation of Instruction Cache lines into the L2 can cause L2 victims without overflowing the corresponding locations in the L1 Data Cache.
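The geometry described above can be made concrete with a small sketch. It assumes a 32 KiB L1 Data Cache and a 256 KiB L2 (consistent with the 8x ratio mentioned, but still an assumption to verify), 64-byte lines, and plain modulo indexing on the line address; the function names are hypothetical.

```python
def num_sets(capacity_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Number of sets in a set-associative cache."""
    return capacity_bytes // (ways * line_bytes)


def set_index(addr: int, capacity_bytes: int, ways: int, line_bytes: int = 64) -> int:
    """Set selected by an address, assuming simple modulo indexing."""
    return (addr // line_bytes) % num_sets(capacity_bytes, ways, line_bytes)


L1D_BYTES, L2_BYTES, WAYS = 32 * 1024, 256 * 1024, 8  # assumed sizes

print(num_sets(L1D_BYTES, WAYS), num_sets(L2_BYTES, WAYS))
addr = 0x12345
print(set_index(addr, L1D_BYTES, WAYS), set_index(addr, L2_BYTES, WAYS))
```

Under these assumptions each L2 set holds 8 lines shared between data and instruction lines, which is why instruction fetches can push a data line out of L2 while it is still resident in the L1 Data Cache.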

When there are repeated accesses to data found at different levels of the memory hierarchy, the updates of the LRU pointers in the caches will diverge. For example, a load that hits in the L1 Data Cache will update the L1 LRU pointer to point away from that line, but will not update the L2 LRU pointer for that same line. After 8 accesses to the same cache set, this line that is very active in the L1 Data Cache will be chosen as the victim in the L2 cache (because the L2 cannot "see" that the line has been accessed recently in the L1 Data Cache). This is allowed because the L2 is not inclusive. A minor complication arises if the line in the L1 Data Cache is dirty -- when it is eventually chosen as the victim by the L1 Data Cache, there will not be an entry for that line in the L2 cache. An implementation might allocate an entry in the L2 cache for the dirty line, but this is complex. It is much easier to simply send the L1 victim to the L3. The L3 is inclusive on SNB-BDW, so it is guaranteed that there is an entry for the L1 victim to occupy in the L3 (without needing to allocate a new location in the L3, which might result in an L3 to memory victim).
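The LRU divergence can be reproduced with a toy model of one L1 set and one L2 set. This is a sketch under strong assumptions: true LRU replacement (real hardware typically uses a pseudo-LRU approximation), 8 ways at both levels, and every L1 miss also allocating in L2. All names are hypothetical.

```python
from collections import OrderedDict


class LruSet:
    """One cache set with true LRU replacement (oldest entry first)."""

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()

    def access(self, tag):
        """Touch a line; return the evicted tag, or None if nothing was evicted."""
        if tag in self.lines:
            self.lines.move_to_end(tag)
            return None
        victim = None
        if len(self.lines) >= self.ways:
            victim, _ = self.lines.popitem(last=False)
        self.lines[tag] = True
        return victim

    def __contains__(self, tag):
        return tag in self.lines


def hot_line_residency(ways: int = 8, other_lines: int = 8):
    """Return (hot line in L1?, hot line in L2?) after interleaved accesses."""
    l1, l2 = LruSet(ways), LruSet(ways)
    l2.access("hot")
    l1.access("hot")
    for i in range(other_lines):
        # A new line misses in L1 and is allocated in both levels.
        l2.access(f"other{i}")
        l1.access(f"other{i}")
        # The hot line hits in L1, so only the L1 LRU state is refreshed.
        l1.access("hot")
    return "hot" in l1, "hot" in l2


print(hot_line_residency(other_lines=7))
print(hot_line_residency(other_lines=8))
```

In this model the hot line stays in L1 throughout, but the eighth competing line pushes it out of L2, which is the situation where a dirty L1 victim later finds no L2 entry to update.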

The same issue of divergence between LRU pointers is more important when considering the inclusive L3 cache in Nehalem through Broadwell. If a line is repeatedly accessed in the L1 while other lines are flowing through the L3, the address will become "old" in the L3 and will soon be chosen as the L3 victim. In this case, the inclusive nature of the L3 requires that this line be evicted from the L1 and L2 caches before being dropped from the L3. This is bad -- if the line is still in active use it will have to be fetched from memory. In one of our Nehalem/Westmere systems there was a BIOS option to enable "replacement hints" -- messages from the L1 and/or L2 cache to the L3 cache that a cache line was being actively used. If I recall correctly, the BIOS option was recommended to be enabled for HPC workloads and disabled otherwise. I have not seen the same BIOS option in later processors (Sandy Bridge through Broadwell), but my interpretation is that these "replacement hints" are still there, and that they are enabled by default. This could be tested easily enough with a carefully designed microbenchmark....
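A toy model can show why replacement hints matter with an inclusive L3. The sketch assumes one 16-way L3 set with true LRU, a single hot line held in L1, and other lines streaming through the same L3 set; the hint mechanism is modeled simply as refreshing the hot line's L3 LRU position on every L1 hit. The real hardware behavior is not documented, so this is only an illustration of the interpretation given above, and all names are hypothetical.

```python
from collections import OrderedDict


def back_invalidations(l3_ways: int = 16, streamed_lines: int = 16,
                       replacement_hints: bool = False) -> int:
    """Count how often the hot line is forced out of L1 by an L3 eviction."""
    l3 = OrderedDict()  # one inclusive L3 set, oldest entry first
    l1 = set()          # lines currently held in the core's L1
    count = 0

    def l3_fill(tag):
        nonlocal count
        if tag in l3:
            l3.move_to_end(tag)
            return
        if len(l3) >= l3_ways:
            victim, _ = l3.popitem(last=False)
            if victim in l1:
                # Inclusion: the line must leave L1 before leaving L3.
                l1.discard(victim)
                count += 1
        l3[tag] = True

    l3_fill("hot")
    l1.add("hot")
    for i in range(streamed_lines):
        l3_fill(f"stream{i}")
        if "hot" in l1:
            if replacement_hints:
                l3.move_to_end("hot")  # hint: the line is still in active use
        else:
            l3_fill("hot")             # L1 miss: the line is fetched again
            l1.add("hot")
    return count


print(back_invalidations(replacement_hints=False))
print(back_invalidations(replacement_hints=True))
```

A microbenchmark along the same lines would time repeated accesses to one line while another thread streams data through the L3, and look for the periodic refetch that the model predicts when hints are absent.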

For the Xeon Scalable Processors (Skylake Xeon), the L3 cache is no longer inclusive. A cache line will be held in either a core (L1 or L2) or in the L3 cache, but not in both. (I don't know if Intel allows exceptions to this in the way that AMD has in the past with its "mostly-exclusive" L3 cache.) In Xeon Scalable Processors, the L3 acts primarily as a victim cache for lines evicted from the L1 and L2 caches. The behavior is complex because there are dynamic prediction mechanisms in the L2 caches that decide whether a victim should be sent to L3 or to memory, and the heuristics controlling these predictors are not documented. From my testing, it appears that if data is dirty, or remotely homed, or evicted due to a snoop filter eviction, it has a very high probability of being sent to the L3. If data is clean and locally homed, the probability of being sent to the L3 appears to be based on the history of whether subsequent accesses to that address typically hit in the L3. I have not attempted to determine what sorts of state are tracked by the hardware to make these decisions.

Xeon Scalable Processors don't have an inclusive L3, but still need to track lines held in L1 and L2 caches. This is done using a structure called a "snoop filter", which is essentially the tags+directory part of an inclusive L3 cache, but without room to store the actual data for the cache line. Like the earlier inclusive L3 caches, the snoop filter is inclusive, so lines chosen as victims in the snoop filter must be evicted from all L1 and L2 caches before the snoop filter entry can be freed to track a different cache line. Information in the Xeon Scalable Processor Uncore Performance Monitoring Guide (plus information on core performance counter events in Chapter 19 of Volume 3 of the Intel Architectures Software Developers Guide) suggests that the snoop filter receives eviction notifications from the L1 and L2 caches for many (but not all) evictions, and probably receives replacement hints from the L1 and L2 caches as well.

In Nehalem through Broadwell, the inclusive L3 cache meant that all lines fetched by the L2 HW prefetchers had to be fetched into the L3 cache (if the line was not already in the L3), and could be optionally sent to the L2. In the Xeon Scalable Processors, the default appears to be for the L2 HW prefetchers to fetch only into the L2 cache. Some of our systems have a BIOS option to enable prefetches into the L3 cache. Our initial experiments with enabling this feature showed a mix of small gains and small losses, so we reverted to the default (disabled) and have not tested it in more detail.

Cache misses events: what to choose