I am currently studying cache misses and trying to collect cache miss rate data. The machine I am using has a Xeon Silver 4114 processor with the Skylake-SP microarchitecture.
Some sources suggest that the miss rate can be calculated by dividing OFFCORE_RESPONSE_0_DATA_IN_LOCAL_DRAM by the sum of MEM_INST_RETIRED_ALL_LOADS and MEM_INST_RETIRED_ALL_STORES, but I failed to find OFFCORE_RESPONSE_0_DATA_IN_LOCAL_DRAM. I did find a counter called OFFCORE_RESPONSE_0_OPTIONS and the MSR value for DATA_IN_LOCAL_DRAM, but the result I get is always zero. I am using STREAM as the test program and likwid to collect the hardware counters.
I checked several forums for an answer, and some suggest that Skylake Xeon has a very different flow because of its non-inclusive L3, so I am also wondering whether I have to find some other way to get the miss rate.
Is there any reference I can read or any tools I can use?
You need to be very clear about what you mean by "cache miss rate". There are lots of complications....
When I am working with STREAM and cache hits and misses, I usually just read the memory controller counters for cache line reads and writes and compare the numbers to the values I expect to see (assuming no re-use of the main STREAM arrays in any level of the cache). These are not "rates", they are just counts of cache line transfers. This has the advantage of not requiring counting on all cores, but has the disadvantage of not providing any insight into what is causing the memory accesses. (Not a problem with STREAM, since I know what causes the memory accesses, but often a problem when dealing with someone else's code.)
I have not tested the OFFCORE_RESPONSE events on SKX processors. They are generally tricky to program. The three places I look for examples are:
From the last site, the table at https://download.01.org/perfmon/SKX/skylakex_offcore_v1.10.tsv lists potentially useful values for the auxiliary MSRs to use with the offcore response events. One set of events that seems close to what you want would be:
Thank you very much for your advice Dr. Bandwidth! I tried the offcore_response.all_data_rd.l3_miss.snoop_miss_or_no_fwd counter you mentioned. I tested your STREAM program and used perf to check the counters. However, the results seem to vary a lot from run to run (sometimes by as much as 50%). In comparison, the OFFCORE_RESPONSE_0_DATA_IN_LOCAL_DRAM counter used on the previous machine has little variation. The load and store counts are similar from run to run. The running time also varies, but not as much. Do you know the reason for this variation? Is it because there is non-determinism in the cache policy on the new machine?
I have not experimented with any of the offcore response counter options on SKX, so I don't have any expectations about whether they should be trusted....
There is certainly a great deal of non-determinism in Intel processors (especially with regard to hardware prefetching), and it is likely that this is increased in the SKX generation. I would not expect this to have much influence on the offcore response event listed above, but it is not uncommon for performance counter events to have bugs whose magnitude depends on dynamically-adjusted behavior.
The most common example of dynamic variation in counts occurs with "demand read" events, which can be very low (if prefetching is aggressive), or high (if prefetching is disabled or if the access pattern is one that cannot be prefetched).
The event that you selected should be counting both demand and prefetch loads, so the split between the two should not matter.
An alternative dynamic mechanism that may apply here relates to bypass paths -- some data transfers can use either the "normal" (queued) data path or a "bypass" path, and in some cases these must be counted separately. There have been cases where no event to count traffic on the bypass path was provided (e.g., Sandy Bridge EP), requiring disabling the bypass path to obtain accurate measurements.
You mentioned that you see large variations in this count --- how do the counts compare to the expected values?
When I want to use external (whole-program) counters with STREAM, I typically increase the NTIMES variable from the default of 10 to 100. This reduces the relative overhead of the setup and validation steps, so the simple estimate of expected traffic is closer.
When compiled with streaming stores, the expected number of cache line reads and writes for a run of STREAM is
CAS.READS = STREAM_ARRAY_SIZE * (6 * NTIMES + 1 + 3) * sizeof(STREAM_TYPE) / 64
CAS.WRITES = STREAM_ARRAY_SIZE * (4 * NTIMES + 3) * sizeof(STREAM_TYPE) / 64
Here the "6" is the number of explicit reads in the four STREAM kernels, "1" is the number of reads in the initial timing granularity check, "3" is the number of reads in the result validation code. There are "4" explicit stores in the four STREAM kernels, and 3 explicit stores in the initialization section. (The latter may or may not actually happen, depending on horrible details of how the OS uses sneaky hardware features to instantiate pages.)