Software Tuning, Performance Optimization & Platform Monitoring
Discussion regarding monitoring and software tuning methodologies, Performance Monitoring Unit (PMU) of Intel microprocessors, and platform updating.

Why is the cache miss ratio so high on an Intel(R) Xeon(R) Gold 6138 CPU?

shang__xiaowei
Beginner

Dear experts,

I measured the cache miss ratio with the following experiment on an Intel(R) Xeon(R) Gold 6138 CPU.

In the experiment, I use Intel CAT (Cache Allocation Technology) to allocate 5MB of L3 (i.e., 2 ways) to a core on one socket, and run a thread on that core which sequentially accesses an array of size 1MB, 2MB, 3MB, 4MB and 5MB respectively. The cache miss ratios I measure are as follows.

Array size    Cache miss ratio
1MB           0.01%
2MB           5.55%
3MB           15.43%
4MB           24.35%
5MB           34.49%

I cannot understand: 1) why is the cache miss ratio so high? 2) why does the cache miss ratio keep increasing as the array size grows, even though the array stays within the cache partition size (i.e., 5MB)?

After the above experiment, I also conducted two more experiments and saw similar cache miss ratio trends. In the first, I do not use Intel CAT to allocate cache to the core (i.e., the core can use the full L3 cache), and the cache miss ratio trend is similar as the array size increases from 1MB to 27.5MB.

In the second, I redo both of the above experiments using huge pages (i.e., 1GB huge pages), and I again see similar cache miss ratio trends.

Would you please help me understand these results? Thanks much in advance.

 

Our CPU hardware parameters:

Core(s) per socket:  20

Socket(s):           4

Model name:          Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz

L1d cache:           32K

L1i cache:           32K

L2 cache:            1024K

L3 cache:            28160K

Cache information

   L3 Cache

       Num ways: 11

       Way size: 2621440 bytes

       Num sets: 40960

       Line size: 64 bytes

       Total size: 28835840 bytes

   L2 Cache

       Num ways: 16

       Way size: 65536 bytes

       Num sets: 1024

       Line size: 64 bytes

       Total size: 1048576 bytes

HadiBrais
New Contributor III

How exactly are you calculating the cache miss ratio? Can you also show the code that allocates the array, initializes it, the loop that sequentially accesses the array, and the compile options used to compile the code? Are hardware prefetchers enabled?

shang__xiaowei
Beginner

Hi Hadi Brais,

Thank you for your reply. I will answer your questions one by one below.

Q1: Can you also show the code that allocates the array, initializes it, the loop that sequentially accesses the array, and the compile options used to compile the code?

A1: Please see my code here (https://github.com/xiaoweishang/tmp). "seq.c" is the main program that accesses the array sequentially with different sizes (you may want to edit the "THREAD1_SIZE" macro to change the array size). "compile.sh" compiles "seq.c". Even when huge pages (i.e., 1GB huge pages) are used, I see similar cache miss ratio trends to those shown in the previous post.

Q2: How exactly are you calculating the cache miss ratio?

A2: I use "perf record -e LLC-load-misses" to get the cache miss count (i.e., miss_num). The cache miss ratio is calculated as miss_num/total_access_cache_line_number.

Q3: Are hardware prefetchers enabled?

A3: Yes, hardware prefetchers are enabled.

HadiBrais
New Contributor III

I think the loop that is relevant to this discussion is the following:

for (j = 0; j < 60000; j++) {
    for (i = 0; i < THREAD1_SIZE*1024*16; i++) {
        _j = thread1_data.str[0];
    }
}

This means that the total number of cache lines loaded is 60000*THREAD1_SIZE*1024*16 (4,915,200,000 for the 5MB array). Note that since you are compiling the code without optimizations, there are other loads in this loop nest, but that is not a problem in this case, as I'll show.

"perf record -e LLC-load-misses" is the wrong way to count L3 load misses. First, "perf record" samples events rather counts the total number of events. Second, since you didn't qualify the event with ":u", kernel-mode events would also be sampled, which I don't think you care about. Instead, you should use "perf stat -e LLC-load-misses:u". I ran this command on a Haswell processor with an 8MB L3 cache for array sizes of 1MB, 2MB, 3MB, 4MB, and 5MB. In all cases, the number of L3 load misses is fewer than 0.01%, as expected. Also I noticed that the standard deviation (across 5 runs) is over 10%. This indicates that these 0.01% misses are occurring due to other system activity that is not under my control.

"perf record" can be used to determine on which instructions the L3 miss events are occurring. LLC-load-misses doesn't support PEBS. Instead, we can use MEM_LOAD_RETIRED.L3_MISS as follows:

perf record -e cpu/event=0xd1,umask=0x20/ppu ./seq 0

perf report

This shows that over 99% of L3 load misses are attributed to the instruction that loads from the array in the innermost loop.

shang__xiaowei
Beginner

Hi Hadi Brais,

Thanks much for your answers. However, even with your suggestions, I still see many cache misses. I ran the experiments step by step as follows. Please advise.

1. Allocate a 5MB cache partition to core 0 with the pqos_2way_dedicated.sh script (https://github.com/xiaoweishang/tmp/blob/master/pqos_2way_dedicated.sh). You may need to install the RDT (Resource Director Technology) tool (https://github.com/intel/intel-cmt-cat) for cache partitioning.

2. Run the program with a 5MB array size (https://github.com/xiaoweishang/tmp/blob/master/seq.c), and profile cache misses with different tools and commands as follows.

RDT tool
    cache miss per second (k)   cache miss ratio   bandwidth (GB/s)   exe. time (s)
    111789                      40.91%             16.678733          17.507325

# perf record -e LLC-load-misses ./seq 0
    cache miss #.   total cache access #.   cache miss ratio
    322310346       4915200000              6.56%

# perf stat -e LLC-load-misses ./seq 0
    cache miss #.   total cache access #.   cache miss ratio
    380,625,494     4915200000              7.74%

# perf stat -e LLC-load-misses:u ./seq 0
    cache miss #.   total cache access #.   cache miss ratio
    352,701,985     4915200000              7.18%

Note that the RDT tool can be used with "$ sudo pqos -I" to monitor cache misses dynamically, as shown below. I calculated the cache miss ratio from the RDT results as (111789*1024)/(16.678733*1024*1024*1024/64), i.e., misses per second divided by cache lines accessed per second; a small check of this arithmetic is sketched after the output.

TIME 2019-03-30 12:39:33
    CORE   IPC   MISSES     LLC[KB]   MBL[MB/s]   MBR[MB/s]
       0  0.70  111789k      3200.0      5034.1         0.0
       1  0.17       0k         0.0         0.0         0.0
       2  0.12       0k       160.0         0.0         0.0
       3  0.27       0k         0.0         0.0         0.0
       4  0.30       0k         0.0         0.0         0.0
       5  0.28       0k        80.0         0.0         0.0
       6  0.15       0k         0.0         0.0         0.0
       7  0.19       0k         0.0         0.0         0.0
...
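 

For reference, a minimal check (not part of the RDT tool itself) that the formula above reproduces the 40.91% in the table; the constants are copied from the measurements in this post:

#include <stdio.h>

int main(void)
{
    /* Misses per second reported by pqos (111789k), divided by the cache
       lines accessed per second implied by the measured bandwidth
       (16.678733 GB/s at 64 bytes per line). */
    double misses_per_s = 111789.0 * 1024;
    double lines_per_s  = 16.678733 * 1024 * 1024 * 1024 / 64;
    printf("miss ratio = %.2f%%\n", 100.0 * misses_per_s / lines_per_s);
    return 0;
}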

 

HadiBrais
New Contributor III

RDT measures LLC misses using the LONGEST_LAT_CACHE.MISS event, and that count covers the entire program. It includes data reads, code reads, RFOs, software prefetches, and hardware prefetches. An event is counted even if the instruction that generated the access did not retire. It appears to me that you calculated the 40.91% figure by dividing LONGEST_LAT_CACHE.MISS by 4915200000, which doesn't make sense because you'd be counting misses that did not occur in the region of interest, making the miss ratio appear much larger than it actually is. If this is how you calculated the numbers shown in your first post, then those numbers are misleading and not useful.

You should use LLC-load-misses or MEM_LOAD_RETIRED.L3_MISS to count L3 misses. The latter is known to be very accurate on all microarchitectures that support it, and it seems to me that this is the event that you need. Can you calculate the L3 miss ratio using "perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq" for different array sizes and also share the standard deviation reported by perf?

shang__xiaowei
Beginner

Sure. The profiling results are as follows (note: core 0 is allocated a 5MB cache partition with the RDT tool).

# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
Time diff is 2.514546 seconds
bandwidth is 23.065794 GB/s.
thread is on core 0
Time diff is 2.351004 seconds
bandwidth is 24.670311 GB/s.
thread is on core 0
Time diff is 2.425095 seconds
bandwidth is 23.916589 GB/s.
thread is on core 0
Time diff is 2.403081 seconds
bandwidth is 24.135682 GB/s.
thread is on core 0
Time diff is 2.360191 seconds
bandwidth is 24.574282 GB/s.

 Performance counter stats for './seq 0' (5 runs):

           731,794      cpu/event=0xd1,umask=0x20/u                                     ( +- 47.45% )

       2.447486563 seconds time elapsed                                          ( +-  1.25% )


# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
Time diff is 6.011606 seconds
bandwidth is 19.462353 GB/s.
thread is on core 0
Time diff is 5.881326 seconds
bandwidth is 19.893473 GB/s.
thread is on core 0
Time diff is 5.846925 seconds
bandwidth is 20.010518 GB/s.
thread is on core 0
Time diff is 5.816047 seconds
bandwidth is 20.116756 GB/s.
thread is on core 0
Time diff is 5.821555 seconds
bandwidth is 20.097723 GB/s.

 Performance counter stats for './seq 0' (5 runs):

        22,784,016      cpu/event=0xd1,umask=0x20/u                                     ( +-  4.71% )

       5.954286855 seconds time elapsed                                          ( +-  0.65% )

 
# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
Time diff is 9.650672 seconds
bandwidth is 18.133452 GB/s.
thread is on core 0
Time diff is 9.551177 seconds
bandwidth is 18.322349 GB/s.
thread is on core 0
Time diff is 9.613556 seconds
bandwidth is 18.203462 GB/s.
thread is on core 0
Time diff is 9.670079 seconds
bandwidth is 18.097060 GB/s.
thread is on core 0
Time diff is 9.640903 seconds
bandwidth is 18.151827 GB/s.

 Performance counter stats for './seq 0' (5 runs):

        96,222,465      cpu/event=0xd1,umask=0x20/u                                     ( +-  1.42% )

       9.752804448 seconds time elapsed                                          ( +-  0.21% )

 
# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
Time diff is 13.397798 seconds
bandwidth is 17.465557 GB/s.
thread is on core 0
Time diff is 13.325796 seconds
bandwidth is 17.559927 GB/s.
thread is on core 0
Time diff is 13.243609 seconds
bandwidth is 17.668900 GB/s.
thread is on core 0
Time diff is 13.322083 seconds
bandwidth is 17.564821 GB/s.
thread is on core 0
Time diff is 13.225328 seconds
bandwidth is 17.693323 GB/s.

 Performance counter stats for './seq 0' (5 runs):

       172,009,375      cpu/event=0xd1,umask=0x20/u                                     ( +-  1.48% )

      13.481374305 seconds time elapsed                                          ( +-  0.24% )

# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
Time diff is 17.324234 seconds
bandwidth is 16.855002 GB/s.
thread is on core 0
Time diff is 17.344714 seconds
bandwidth is 16.835100 GB/s.
thread is on core 0
Time diff is 17.392101 seconds
bandwidth is 16.789231 GB/s.
thread is on core 0
Time diff is 17.344088 seconds
bandwidth is 16.835708 GB/s.
thread is on core 0
Time diff is 17.295725 seconds
bandwidth is 16.882785 GB/s.

 Performance counter stats for './seq 0' (5 runs):

       307,467,022      cpu/event=0xd1,umask=0x20/u                                     ( +-  0.51% )

      17.582115998 seconds time elapsed                                          ( +-  0.09% )

 

The cache miss ratios for the different array sizes are as follows.

array size (MB)    cache misses       cache miss ratio
1                  731,794            0.01%
2                  22,784,016         0.46%
3                  96,222,465         1.96%
4                  172,009,375        3.50%
5                  307,467,022        6.26%

 

HadiBrais
New Contributor III

These L3 miss ratios look good to me. Now, to determine why some of the miss ratios are larger than 1%, run the same experiments, but this time without cache partitioning, i.e., make the whole L3 available to the program. The results will essentially tell us the impact of cache partitioning on the miss ratio.

shang__xiaowei
Beginner

When there is no 5MB cache partition for core 0, the miss ratio decreases significantly as long as the array size is less than or equal to 5MB. This is because the full LLC size (i.e., 27.5MB) is much larger than 5MB.

However, when the array size is increased to 27.5MB without cache partitioning, the cache miss ratio is 24.66%. The following experiments show the results for array sizes of 5MB, 27MB and 27.5MB without cache partitioning.

array_size=5MB
# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
No sharing data - Thread1: Time diff is 13.704025 seconds
No sharing data - Thread1: bandwidth is 21.307609 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 13.625132 seconds
No sharing data - Thread1: bandwidth is 21.430985 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 13.621745 seconds
No sharing data - Thread1: bandwidth is 21.436314 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 13.657914 seconds
No sharing data - Thread1: bandwidth is 21.379546 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 13.622117 seconds
No sharing data - Thread1: bandwidth is 21.435728 GB/s.

 Performance counter stats for './seq 0' (5 runs):

               331      cpu/event=0xd1,umask=0x20/u                                     ( +- 13.99% )

      13.820940359 seconds time elapsed                                          ( +-  0.14% )


array_size=27MB
# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
No sharing data - Thread1: Time diff is 89.669023 seconds
No sharing data - Thread1: bandwidth is 17.642659 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 89.601210 seconds
No sharing data - Thread1: bandwidth is 17.656012 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 89.667037 seconds
No sharing data - Thread1: bandwidth is 17.643050 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 89.688417 seconds
No sharing data - Thread1: bandwidth is 17.638844 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 89.631560 seconds
No sharing data - Thread1: bandwidth is 17.650033 GB/s.

 Performance counter stats for './seq 0' (5 runs):

     1,002,955,588      cpu/event=0xd1,umask=0x20/u                                     ( +-  0.17% )

      90.858200299 seconds time elapsed                                          ( +-  0.02% )


array_size=27.5MB
# perf stat -r 5 -e cpu/event=0xd1,umask=0x20/u ./seq 0
thread is on core 0
No sharing data - Thread1: Time diff is 94.254427 seconds
No sharing data - Thread1: bandwidth is 17.095232 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 94.286443 seconds
No sharing data - Thread1: bandwidth is 17.089427 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 94.470349 seconds
No sharing data - Thread1: bandwidth is 17.056159 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 94.337323 seconds
No sharing data - Thread1: bandwidth is 17.080210 GB/s.
thread is on core 0
No sharing data - Thread1: Time diff is 94.702035 seconds
No sharing data - Thread1: bandwidth is 17.014431 GB/s.

 Performance counter stats for './seq 0' (5 runs):

     1,212,181,937      cpu/event=0xd1,umask=0x20/u                                     ( +-  0.18% )

      95.628169725 seconds time elapsed                                          ( +-  0.08% )

 

array size (MB)    cache misses        cache miss ratio
5                  331                 0.00%
27                 1,002,955,588       20.41%
27.5               1,212,181,937       24.66%

 

HadiBrais
New Contributor III

In the case of cache partitioning, the miss ratios larger than 0.00% could be explained by hardware prefetching into the L3 (which may evict useful cache lines), sub-optimal cache placement or replacement policies, or other interference factors, all of which get exacerbated as the working set approaches the partition size.

shang__xiaowei
Beginner

How should I interpret the experiment where the array size is 27.5MB without cache partitioning? In that case, the cache miss ratio is quite high (i.e., 24.66%).

HadiBrais
New Contributor III

If you want to investigate this, then it's very important to use 1GB huge pages for all loads in the loop nest (not just the array), including the memory that holds the instructions. Other than that, the L3 cache is used by the whole system; it's not like your program is the only thing using the L3 cache, even if your system *appears* to be idle. It would be interesting to run the same test on a "bare-metal" system, i.e., without an OS, to isolate the impact of OS factors from the L3 design factors.

shang__xiaowei
Beginner

When I use huge pages (i.e., 1GB) for the same experiments, the results are as follows.

array size (MB)    cache misses       cache miss ratio
27                 411,514,220        8.37%
27.5               516,231,008        10.50%

 

Why do you think the OS contributes to the cache misses? From my understanding, apart from core 0, which runs the program, the other cores are idle, so their impact on cache misses should be negligible. Can you please explain why the OS is a key factor in this case?

HadiBrais
New Contributor III

The larger the number of 4KB pages allocated, the more likely it is for two or more pages to conflict on the same cache sets rather than being evenly distributed over the whole cache. Your results show that 10-15 percentage points of the L3 miss ratio are due to this very reason, which is significant, as expected. I think you have only used a 1GB page for the array, while the code, stack, and all other variables are still allocated from 4KB pages, right?

Regarding the OS, you can run htop on a seemingly idle system and observe that the utilization of all cores is very low, but not really zero. I don't know how much this activity impacts the contents of the L3 cache. The only way to know is by doing the experiment on a bare-metal system.

shang__xiaowei
Beginner

Yes, the 1GB huge page is only for the array. The code, stack, and all other variables are allocated from 4KB pages, because I think their impact on cache misses, compared to the array, can be ignored. Do you agree?

BTW, do you know of any easy approaches for 1) running the experiments with a 1GB huge page backing the array, the code, the stack, and all other variables, and 2) running the experiments on bare metal without an OS?

HadiBrais
New Contributor III

1) running the experiments with 1GB huge page for the array, the code, stack, and all other variables?

There is a library called libhugetlbfs (https://linux.die.net/man/7/libhugetlbfs) that can back any or all segments of an object file with huge pages. I have never used it, so I don't know how well it works. I'm not aware of a way to back the stack with a huge page, but your loop nest does not seem to include any accesses to the stack, so I don't think it matters.
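
For reference, the array itself can be placed on a 1 GiB page without libhugetlbfs. Here is a minimal sketch using mmap; it assumes 1 GiB pages have been reserved for the system (e.g., hugepagesz=1G hugepages=1 on the kernel command line), and the names and sizes are illustrative rather than taken from seq.c.

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB 0x40000
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)   /* log2(1 GiB) encoded in the mmap flags */
#endif

int main(void)
{
    size_t len = 1UL << 30;       /* one 1 GiB page; the array uses the first 27.5 MiB */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap(1GiB)"); return 1; }

    buf[0] = 1;                   /* touch the page; run the sequential scan over buf here */
    munmap(buf, len);
    return 0;
}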

2) how to run the experiments on bare-metal without OS?

To my knowledge, there is no such framework, unfortunately. But it's not very difficult to develop one. One could start with an open-source minimal-OS ("bare-metal") project and add code to program and read the performance counters.
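
As a rough illustration of what "program and read the performance counters" would involve at ring 0, here is a sketch only: MSR 0x186 is IA32_PERFEVTSEL0 and MSR 0x38F is IA32_PERF_GLOBAL_CTRL per the SDM, but counter freezing, overflow handling, and serialization are all omitted.

#include <stdint.h>

/* Sketch: program and read one core counter from ring 0 (bare metal).
   PERFEVTSEL bits: [7:0] event, [15:8] umask, 16 USR, 17 OS, 22 EN. */
static inline void wrmsr(uint32_t msr, uint64_t val)
{
    __asm__ volatile("wrmsr" : : "c"(msr), "a"((uint32_t)val), "d"((uint32_t)(val >> 32)));
}

static inline uint64_t rdpmc(uint32_t ctr)
{
    uint32_t lo, hi;
    __asm__ volatile("rdpmc" : "=a"(lo), "=d"(hi) : "c"(ctr));
    return ((uint64_t)hi << 32) | lo;
}

void count_l3_misses_example(void)
{
    /* MEM_LOAD_RETIRED.L3_MISS: event 0xD1, umask 0x20, count in all rings. */
    wrmsr(0x186, 0xD1 | (0x20ULL << 8) | (1ULL << 16) | (1ULL << 17) | (1ULL << 22));
    wrmsr(0x38F, 1);                 /* enable general-purpose counter 0 globally */

    uint64_t before = rdpmc(0);
    /* ... run the array scan here ... */
    uint64_t after = rdpmc(0);
    (void)(after - before);          /* L3 miss count for the scanned region */
}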

shang__xiaowei
Beginner

Thank you. But I still cannot understand why the cache miss ratio is as high as 10.5% when 1GB huge pages are used. I don't think any noise could raise the cache miss ratio that high. Do you know why?

McCalpinJohn
Honored Contributor III

This topic is much more complicated than a reasonable person might expect....

(1) Intel processors use a "distributed" L3, with physical (cacheline) addresses assigned to L3 slices using an undocumented hash function.  In my SC18 paper on performance variability in the Xeon Platinum 8160 processor, I included some additional data measured on all Xeon Scalable processors with 14 to 28 cores.   Slides 26 and 27 of my presentation (link below) show that the 20-core processor experiences "snoop filter conflicts" (causing L2 misses) for contiguous physical addresses that span most 32 MiB boundaries.   The cache lines that are evicted from the L2 caches by "snoop filter evictions" usually hit in the L3, but this extra traffic will displace other lines from the L3, making its net effective size smaller. 

http://sites.utexas.edu/jdm4372/2019/01/07/sc18-paper-hpl-and-dgemm-performance-variability-on-intel-xeon-platinum-8160-processors/

(2) Slides 18-19 of my presentation at the IXPUG Fall 2018 Conference (link below) provide a few bits of information on the hash functions used by the various Intel Xeon Scalable Processors.   One relevant point from slide 19 is that on the 20-core processor, each naturally-aligned 256-cacheline block of addresses assigns 13 cache lines to each of the first 16 CHA+L3 blocks and 12 cache lines to each of the last 4 CHA+L3 blocks.   This alone makes it impossible to use the entire L3 cache for contiguous addresses.

https://www.ixpug.org/components/com_solutionlibrary/assets/documents/1538092216-IXPUG_Fall_Conf_2018_paper_20%20-%20John%20McCalpin.pdf

(3) I have used libhugetlbfs to get the stack on 1GiB pages.  It was not particularly difficult, but requires studying a fair amount of documentation to figure out what it is doing and how to use it for only the features you want.  If you are going to that level of trouble, you probably want to add inline core and uncore performance counter instrumentation.   There are two codes at https://github.com/jdmccalpin/SKX-SF-Conflicts that I developed to measure "snoop filter conflicts" for repeatedly loading (nominally) L2-containable arrays on Xeon Platinum 8160 (24-core) processors.  It is not difficult to modify these to run on 20-core processors, and it is easy to increase the array size to cover the (nominally) (L2+L3)-containable array sizes that you are interested in.  These codes include inline measurements of all fixed-function and programmable core counters, all CHA counters, and all IMC counters.  One of the codes is configured to work on 2MiB pages (with each run getting a different set of physical addresses) and the other code is set up to run at various offsets in one or more 1 GiB page(s).

(4) There are other complications that I don't have time/space to discuss today.  Some of these include:

  • The L3 is used as the target for Intel's "Direct Cache Access" technology, which sends all I/O DMA traffic to the L3 instead of to system memory.  This could displace user data and/or change the LRU policy.
  • There are inclusivity issues related to instruction caching that may impact the apparent L2 and/or L3 capacity.
  • Using the L2+L3 requires that clean L2 capacity victims be sent to the L3 cache.  This decision is made by the hardware using undocumented heuristics, and sometimes the hardware decides *not* to send the clean L2 capacity victims to the L3 (so the line must be re-loaded from memory).   The core performance counter event IDI_MISC.WB_UPGRADE measures L2 capacity victims that are sent to the L3, while the core performance counter event IDI_MISC.WB_DOWNGRADE measures L2 capacity victims that are *not* sent to the L3.  The "downgrade" counts vary from execution to execution for reasons that are not obvious....
HadiBrais
New Contributor III

Xiaowei, I think it is easy to investigate some of the factors that John has pointed out. I think you can start with factor 2 and the third bullet of factor 4.

Due to the way the L3 hash function works, a contiguous array of size 27.5MiB, i.e., equal to the L3 size, does not actually fit in the L3 alone (for simplicity, let's disregard the L2 capacity even though the L3 is exclusive of the L2). We have to calculate the maximum array size that fits in the L3, taking the hash function into account.

The total number of L3 cache lines is 27.5*1024*1024/64 = 450560.

The total number of L3 cache lines per slice is 450560/20 = 22528 (because there are 20 physical cores and one slice per core).

The total number of 256-cacheline blocks to fill the whole L3 cache is 450560/256 = 1760.

For an array of size 27.5MiB allocated from a single 1GiB huge page:

  • The number of cache lines mapped to each of slices 0-15 is 13*1760 = 22880. This exceeds the slice capacity by 352 lines.
  • The number of cache lines mapped to each of slices 16-19 is 12*1760 = 21120. This is fewer than the slice capacity by 1408 lines.

What is the maximum X such that 13*X <= 22528? The answer is X = 1732. So with an array size of 1732*256 = 443392 lines, the number of lines mapped to each L3 slice does not exceed the slice capacity. Under ideal system behavior, and no matter how the hardware prefetchers perform, there should be basically no L3 misses (MEM_LOAD_RETIRED.L3_MISS).
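
A quick way to reproduce this calculation (a sketch only, with the 13-lines-per-slice figure for the first 16 slices taken from John's slide 19 above):

#include <stdio.h>

int main(void)
{
    const long long l3_bytes = 28160LL * 1024;        /* 27.5 MiB L3 */
    const long long line = 64, slices = 20;
    long long total_lines     = l3_bytes / line;      /* 450560 */
    long long lines_per_slice = total_lines / slices; /* 22528 */
    /* Each aligned 256-line block maps 13 lines to each of slices 0-15
       and 12 lines to each of slices 16-19 (from slide 19). */
    long long max_blocks = lines_per_slice / 13;      /* 1732 */
    long long max_lines  = max_blocks * 256;          /* 443392 */
    printf("largest contiguous array that fits: %lld lines = %.4f MiB\n",
           max_lines, max_lines * 64.0 / (1024 * 1024));
    return 0;
}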

You can repeat the experiment with an array size of 27.0625MiB (443392*64/1024/1024) and measure MEM_LOAD_RETIRED.L3_MISS, IDI_MISC.WB_UPGRADE (event 0xFE and umask 0x02), and IDI_MISC.WB_DOWNGRADE (event 0xFE and umask 0x04).

But remember that your loop nest accesses a couple of cache lines other than the array (examine the generated assembly code). It's probably better to enable compiler optimizations so that the loop nest contains only loads from the array.
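
One way to do that, sketched here with GCC/Clang-specific inline asm to keep the loads from being optimized away; the function and variable names are illustrative, not the ones in seq.c:

/* Scan one load per 64-byte line, compiled with -O2 or -O3. */
void scan(const char *array, long bytes, int passes)
{
    for (int j = 0; j < passes; j++) {
        for (long i = 0; i < bytes; i += 64) {
            char v = array[i];
            __asm__ volatile("" : : "r"(v));  /* keep the load; add no other memory traffic */
        }
    }
}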

The next experiment should be designed to investigate the snoop filter impact on the miss ratio (factor 1).

The L3 is used as the target for Intel's "Direct Cache Access" technology, which sends all I/O DMA traffic to the L3 instead of to system memory.  This could displace user data and/or change the LRU policy.

To my knowledge, I/O traffic can only occur if initiated by the OS (to read from a file or a network socket, for example). I'm not sure if there is any easy way to completely eliminate this factor other than running the experiment on a bare-metal system.

There are inclusivity issues related to instruction caching that may impact the apparent L2 and/or L3 capacity.

I think this factor can only be eliminated by allocating the array and the code from the same GiB huge page (using libhugetlbfs) and making sure that the total size does not exceed 27.0625MiB. Doing this may require some effort.

shang__xiaowei
Beginner

Hi John and Brais, thanks much for your explanations.

I have measured IDI_MISC.WB_DOWNGRADE as follows. The count reported by the IDI_MISC.WB_DOWNGRADE counter (2,863,348,453) is much greater than the number of L3 misses (516,231,008). When the array size is 27.5MB, I would expect the following to hold roughly: (# of cache lines loaded from memory into L2) + (# of cache lines loaded from memory into L3) = # of IDI_MISC.WB_DOWNGRADE.

Would you please help me with two questions: 1) why is the count reported by the IDI_MISC.WB_DOWNGRADE counter much greater than the number of L3 misses when repeatedly scanning an array, and 2) which performance counter event should I use to get the number of cache lines loaded from memory into L2?

# perf stat -r 5 -e cpu/event=0xfe,umask=0x04/u ./seq 0
thread is on core 0
Time diff is 87.627047 seconds
bandwidth is 18.388173 GB/s.
thread1 is on core 0
Time diff is 87.730192 seconds
bandwidth is 18.366554 GB/s.
thread1 is on core 0
Time diff is 87.694799 seconds
bandwidth is 18.373966 GB/s.
thread is on core 0
Time diff is 87.647871 seconds
bandwidth is 18.383804 GB/s.
thread is on core 0
Time diff is 87.653892 seconds
bandwidth is 18.382541 GB/s.

 Performance counter stats for './seq 0' (5 runs):

     2,863,348,453      cpu/event=0xfe,umask=0x04/u                                     ( +-  0.32% )

      89.035905332 seconds time elapsed                                          ( +-  0.02% )

 

McCalpinJohn
Honored Contributor III

IDI_MISC.WB_DOWNGRADE has two effects: (1) The cache line is dropped from the L2, so when it is re-loaded it must come from memory; and (2) The cache line does *not* displace a line in the L3, so the L3 cache is not fully "flushed" and you can get L3 hits that you did not expect.

Testing/validating these counters is challenging, but there are some events that seem (to me) to be reliable.  For your test, these events seem reliable and useful:

  • Core: L1D.REPLACEMENT (Event 0x51, Umask 0x01) -- counts all lines moved into L1
  • Core: L2_LINES_IN.ALL (Event 0xF1, Umask 0x1F) -- counts all lines moved into L2
  • Core: L2_REQUESTS.REFERENCES (Event 0x24, Umask 0xFF)
  • Core: L2_REQUESTS.MISS (Event 0x24, Umask 0x3F)
  • CHA: LLC_LOOKUP.DATA_READ (Event 0x34, Umask 0x03)
    • Set Cn_MSR_PMON_BOX_FILTER0 to 0x01e20000 to measure all LLC accesses (hits and misses), or
    • Set Cn_MSR_PMON_BOX_FILTER0 to 0x00020000 to measure LLC misses only
    • Set Cn_MSR_PMON_BOX_FILTER1 to 0x0000003c to disable packet-matching
  • IMC: CAS_COUNT.READS (Event 0x03, Umask 0x04) -- all DRAM reads

Lots of ugly things happen when you approach full cache capacity, and lots of noise is added to the measurements if you run them on the whole program (instead of before and after the specific code that you are trying to measure), but these events seem consistent on SKX.   I have not found any reliable events for L2 or L3 writebacks on SKX (except for IMC:CAS_COUNT.WRITES for the cases that go all the way to DRAM).
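
For the core events in that list, a minimal user-space sketch of inline measurement via perf_event_open (an assumption on my part, not the method used in the SKX-SF-Conflicts code; raw Intel core events encode as config = (umask << 8) | event, and the CHA/IMC uncore events need the uncore PMU drivers and are not shown):

#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

static int open_raw_counter(uint64_t event, uint64_t umask)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = (umask << 8) | event;  /* e.g. L2_LINES_IN.ALL = event 0xF1, umask 0x1F */
    attr.disabled = 1;
    attr.exclude_kernel = 1;             /* user mode only, like the :u qualifier */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd = open_raw_counter(0xF1, 0x1F);   /* L2_LINES_IN.ALL */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... region of interest: the array-scanning loop goes here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) != sizeof(count)) return 1;
    printf("L2_LINES_IN.ALL = %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}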
