Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Trouble finding cache miss rate on Skylake Xeon (Silver 4114)

Fan__Steven
Beginner

Dear all,

 

I am currently studying cache misses and trying to collect cache miss rate data.  The machine I am using has a Xeon Silver 4114 processor with the Skylake-SP microarchitecture.

Some sources suggest that the miss rate can be calculated by dividing OFFCORE_RESPONSE_0_DATA_IN_LOCAL_DRAM by the sum of MEM_INST_RETIRED_ALL_LOADS and MEM_INST_RETIRED_ALL_STORES, but I could not find OFFCORE_RESPONSE_0_DATA_IN_LOCAL_DRAM on this machine. I did find a counter called OFFCORE_RESPONSE_0_OPTIONS and the MSR value for DATA_IN_LOCAL_DRAM, but the result I get is always zero. I am using STREAM as the test program and likwid to collect the hardware counters.

I checked several forums, and some suggest that Skylake Xeon has a very different flow because of the non-inclusive L3, so I am also wondering whether I have to find some other way to get the miss rate.

Is there any reference I can read or any tools I can use?

 

Thank you!

Steven

 

McCalpinJohn
Honored Contributor III

You need to be very clear about what you mean by "cache miss rate".  There are lots of complications....

  • Which cache?
    • L1 Instruction Cache
    • L1 Data Cache
    • L2 Unified Cache
    • L3 Shared Cache
  • What transaction types?
    • Local core instruction cache misses
      • demand misses
      • HW prefetch misses
    • Local core loads
      • demand load misses
      • software prefetch load misses
      • L1 HW prefetch load misses
      • L2 HW prefetch load misses
    • Local core stores
      • demand store misses
      • software prefetch store misses
      • L2 HW prefetch store misses
    • Remote core accesses
    • Remote cache accesses
    • IO (DMA) accesses
  • What definition of "rate"?
    • For L1 accesses, there can be anywhere between 1 and 64 load instructions that miss in the L1 Data Cache for a single cache line.  How many of these should be counted?   Even with something as simple as STREAM, minor changes to compiler options can cause the generation of code that has anywhere between 8 loads per cache line (non-vectorized code) and 1 load per cache line (AVX-512-vectorized code).  (See the sketch after this list.)
    • "Miss Rate" implies some number of things related to "misses", divided by some other thing.  Both the numerator and denominator can be confusing.
      • The numerator must include at least the number of cache lines that the code expects to load, but it may also include additional load instructions to other parts of those same lines, it may include stores, it may include software prefetches, it may include hardware prefetches.
      • The choice of denominator is not always obvious.  "Miss Rate" can mean misses divided by loads, misses divided by time, misses divided by instructions. 
        • Using "loads" in the denominator opens up all the many types of "load-like" transactions referred to above, and may refer to load instructions, or cache lines that are expected to be accessed by the load instructions, and may or may not include store misses, software prefetches, and/or hardware prefetches.
        • Using "time" in the denominator appears to be the least confusing, but can have its own complexities for parallel codes that have load imbalances.
        • Using "instructions" in the denominator is common in academic papers, but this assumes that the number of instructions executed has a simple relationship to the work being done.  This can be seriously biased in parallel codes that have load imbalances and spin-waiting, and it can be fairly strongly biased by changes in the compilers code generation decisions (since compilers try to minimize execution time, not instruction count, the number of instructions executed can vary by a quite a bit between compilations even if the code changes or compiler option changes appear insignificant).
        • Any of these definitions can lead to confusion, but not knowing which one is being used is always a problem.
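
The sketch referred to above: a minimal Copy kernel (an illustration only, assuming 8-byte doubles and 64-byte cache lines) showing how the same cache-line traffic can produce very different per-load "miss rates":

/* Illustration (not STREAM itself): the same Copy kernel generates the same
 * cache-line traffic, but the "miss rate per load instruction" depends
 * entirely on code generation.  Assumes 8-byte doubles and 64-byte lines. */
#define N 20000000                 /* much larger than any cache level */
static double a[N], c[N];

void copy(void)
{
    for (long i = 0; i < N; i++)
        c[i] = a[i];
    /* Scalar code: 8 load instructions touch each 64-byte line of a[], so at
     * most 1 in 8 loads can miss the L1 Data Cache -> "miss rate" <= 12.5%.
     * AVX-512 code: one 64-byte zmm load covers a whole line, so essentially
     * every load misses the L1 Data Cache -> "miss rate" near 100%.
     * The number of lines actually read from memory is identical. */
}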

When I am working with STREAM and cache hits and misses, I usually just read the memory controller counters for cache line reads and writes and compare the numbers to the values I expect to see (assuming no re-use of the main STREAM arrays in any level of the cache).  These are not "rates"; they are just counts of cache line transfers.   This has the advantage of not requiring counting on all cores, but has the disadvantage of not providing any insight into what is causing the memory accesses.  (Not a problem with STREAM, since I know what causes the memory accesses, but often a problem when dealing with someone else's code.)
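
A minimal sketch of this approach on Linux, using the perf_event interface for one uncore_imc PMU; the CAS_COUNT encodings below (event 0x04, umask 0x03 for reads, 0x0C for writes) are taken from the SKX uncore tables and should be verified against the documentation for your processor:

/* Sketch: count DRAM CAS reads/writes on one SKX memory controller via the
 * Linux perf_event interface.  Assumes the "uncore_imc_0" PMU is exposed and
 * that CAS_COUNT uses event 0x04 with umask 0x03 (RD) / 0x0C (WR) -- verify
 * these encodings against the uncore tables for your processor. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static int open_uncore(uint32_t pmu_type, uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = pmu_type;     /* from /sys/bus/event_source/devices/uncore_imc_0/type */
    attr.config = config;     /* umask<<8 | event */
    attr.disabled = 1;
    /* Uncore events are socket-wide: pid = -1, cpu = a core on that socket. */
    return syscall(SYS_perf_event_open, &attr, -1, 0, -1, 0);
}

int main(void)
{
    uint32_t type;
    FILE *f = fopen("/sys/bus/event_source/devices/uncore_imc_0/type", "r");
    if (!f || fscanf(f, "%u", &type) != 1) { perror("uncore_imc_0"); return 1; }
    fclose(f);

    int rd = open_uncore(type, 0x0304);   /* CAS_COUNT.RD */
    int wr = open_uncore(type, 0x0C04);   /* CAS_COUNT.WR */
    if (rd < 0 || wr < 0) { perror("perf_event_open (needs root or low perf_event_paranoid)"); return 1; }

    ioctl(rd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(wr, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run STREAM (or the region of interest) here ... */

    uint64_t reads = 0, writes = 0;
    read(rd, &reads, sizeof(reads));
    read(wr, &writes, sizeof(writes));
    printf("IMC0 CAS reads:  %llu lines (%.1f MiB)\n",
           (unsigned long long)reads,  reads  * 64.0 / (1 << 20));
    printf("IMC0 CAS writes: %llu lines (%.1f MiB)\n",
           (unsigned long long)writes, writes * 64.0 / (1 << 20));
    return 0;
}

Each CAS count corresponds to one 64-byte cache line transferred at the memory controller; summing over all of the uncore_imc_* PMUs on the socket gives the totals to compare against the expected values.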

I have not tested the OFFCORE_RESPONSE events on SKX processors.  They are generally tricky to program.   The three places I look for examples are:

  1. The model-specific tables of Chapter 19 of Volume 3 of the Intel Architectures SW Developer's Manual.
  2. The model-specific configuration tables used by VTune.
  3. The model-specific tables at the 01.org web site (e.g., https://download.01.org/perfmon/SKX/)

From the last site, the table at https://download.01.org/perfmon/SKX/skylakex_offcore_v1.10.tsv lists potentially useful values for the auxiliary MSRs to use with the offcore response events.   One set of events that seems close to what you want would be the following (a programming sketch follows the list):

  • Program one core performance counter with OFFCORE_RESPONSE Event 0xB7, Umask 0x01, and program the auxiliary MSR 0x1a6 with the event OFFCORE_RESPONSE.ALL_DATA_RD.L3_MISS.SNOOP_MISS_OR_NO_FWD (0x063fc00491).
    • This will count all demand and prefetch data reads that miss in the L2 and L3 caches, for which the data is returned from local or remote DRAM.
  • Program another core performance counter with OFFCORE_RESPONSE Event 0xBB, Umask 0x01, and program the auxiliary MSR 0x1a7 with the event OFFCORE_RESPONSE.ALL_RFO.L3_MISS.SNOOP_MISS_OR_NO_FWD (0x063fc00122).
    • This will count all demand and prefetch stores (RFO = "Read For Ownership" is the transaction generated by a store that misses in a cache) that miss in the L2 and L3 caches, for which the data is returned from local or remote DRAM.
  • Note that neither of these events counts data being written back from L2 to L3, from L2 to DRAM, or from L3 to DRAM.
    • With properly sized arrays (i.e., much larger than the caches), STREAM will generate one writeback to DRAM for each RFO.
    • Other applications will have different behavior.  For example, it is not uncommon for a "dirty" cache line to be evicted from the L2 to the L3, then brought back into the L2 for additional updates many times before the cache line is finally written back to memory. 
    • Understanding traffic through the cache hierarchy in more general programs requires use of the "uncore" counters for the L3 cache in addition to the core counters.
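
The sketch mentioned above: one way to request these two events through the Linux perf_event interface, which programs the PERFEVTSEL and auxiliary MSRs itself (with PERF_TYPE_RAW, the event/umask go in config and the auxiliary MSR value goes in config1 via the "offcore_rsp" field).  This assumes the kernel's offcore response support behaves as documented for SKX, so the counts should still be checked against expected values:

/* Sketch: count the two offcore response events described above, letting
 * perf_event program MSR 0x1A6 / 0x1A7.  The encodings assume the values
 * listed in the 01.org SKX table; verify them against that table. */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static int open_offcore(uint64_t event_umask, uint64_t offcore_rsp)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config  = event_umask;   /* umask<<8 | event */
    attr.config1 = offcore_rsp;   /* value for the auxiliary OFFCORE_RSP MSR */
    attr.disabled = 1;
    attr.exclude_kernel = 1;      /* user-space only; counts this thread only */
    return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    /* OFFCORE_RESPONSE.ALL_DATA_RD.L3_MISS.SNOOP_MISS_OR_NO_FWD */
    int rd_miss  = open_offcore(0x01B7, 0x063fc00491ULL);
    /* OFFCORE_RESPONSE.ALL_RFO.L3_MISS.SNOOP_MISS_OR_NO_FWD */
    int rfo_miss = open_offcore(0x01BB, 0x063fc00122ULL);
    if (rd_miss < 0 || rfo_miss < 0) { perror("perf_event_open"); return 1; }

    ioctl(rd_miss,  PERF_EVENT_IOC_ENABLE, 0);
    ioctl(rfo_miss, PERF_EVENT_IOC_ENABLE, 0);

    /* ... run the STREAM kernels (or any region of interest) here ... */

    uint64_t rd = 0, rfo = 0;
    read(rd_miss,  &rd,  sizeof(rd));
    read(rfo_miss, &rfo, sizeof(rfo));
    printf("data-read lines returned from DRAM: %llu\n", (unsigned long long)rd);
    printf("RFO lines returned from DRAM:       %llu\n", (unsigned long long)rfo);
    return 0;
}

The equivalent whole-program measurement from the command line should be expressible as something like "perf stat -e cpu/event=0xb7,umask=0x01,offcore_rsp=0x063fc00491/" (and similarly for the RFO event), using the offcore_rsp field exposed by the core PMU.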

 

Fan__Steven
Beginner

Thank you very much for your advice Dr. Bandwidth! I tried the offcore_response.all_data_rd.l3_miss.snoop_miss_or_no_fwd counter you mentioned. I ran your STREAM program and used perf to check the counters. However, the results seem to vary a lot from run to run (sometimes by as much as 50%). In comparison, the OFFCORE_RESPONSE_0_DATA_IN_LOCAL_DRAM counter used on the previous machine has little variation. The load and store counts are similar from run to run. The running time also varies, but not as much. Do you know the reasons for this variation? Is it because there is non-determinism in the cache policy on the new machine?

McCalpinJohn
Honored Contributor III

I have not experimented with any of the offcore response counter options on SKX, so I don't have any expectations about whether they should be trusted....

There is certainly a great deal of non-determinism in Intel processors (especially with regard to hardware prefetching), and it is likely that this is increased in the SKX generation.  I would not expect this to have much influence on the offcore response event listed above, but it is not uncommon for performance counter events to have bugs whose magnitude depends on dynamically-adjusted behavior. 

The most common example of dynamic variation in counts occurs with "demand read" events, which can be very low (if prefetching is aggressive), or high (if prefetching is disabled or if the access pattern is one that cannot be prefetched). 

The event that you selected should be counting both demand and prefetch loads, so the split between the two should not matter.  

An alternative dynamic mechanism that may apply here relates to bypass paths -- some data transfers can use either the "normal" (queued) data path or a "bypass" path, and in some cases these must be counted separately.  There have been cases where no event to count traffic on the bypass path was provided (e.g., Sandy Bridge EP), requiring disabling the bypass path to obtain accurate measurements.

You mentioned that you see large variations in this count --- how do the counts compare to the expected values?

When I want to use external (whole-program) counters with STREAM, I typically increase the NTIMES variable from the default of 10 to 100.  This reduces the relative overhead of the setup and validation steps, so the simple estimate of expected traffic is closer.

When compiled with streaming stores, the expected numbers of cache line reads and writes for a run of STREAM are:

CAS.READS = STREAM_ARRAY_SIZE * (6 * NTIMES + 1 + 3) * sizeof(STREAM_TYPE) / 64

CAS.WRITES = STREAM_ARRAY_SIZE * (4 * NTIMES + 3) * sizeof(STREAM_TYPE) / 64

Here the "6" is the number of explicit reads in the four STREAM kernels, "1" is the number of reads in the initial timing granularity check, "3" is the number of reads in the result validation code.  There are "4" explicit stores in the four STREAM kernels, and 3 explicit stores in the initialization section.  (The latter may or may not actually happen, depending on horrible details of how the OS uses sneaky hardware features to instantiate pages.)
