What is the replacement policy of L2 and L3 on Broadwell CPUs?

ZWang45 · 12-14-2017

Hi All,

Recently, I tried to measure the read bandwidth of the L1, L2, and L3 caches on my Broadwell CPU.

My method is to control the buffer size: 16 KB for L1, 128 KB for L2, and 1 MB for L3. The results seem reasonable on one CPU core: 50 GB/s for L1, 44 GB/s for L2, and 25 GB/s for L3.
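For readers who want to reproduce this kind of measurement, here is a minimal sketch of a buffer-size-controlled read test. It is not the code used in this post, which was never shown. The names `read_pass` and `measure_read_gbps` are hypothetical, the timer is POSIX `clock_gettime`, and the empty `__asm__` statement is a GCC/Clang-specific compiler barrier, so both are assumptions about the build environment.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Sum every 64-bit word so that each pass reads the whole buffer.
   Returning the sum keeps the compiler from discarding the loads. */
static uint64_t read_pass(const uint64_t *buf, size_t words)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < words; i++)
        sum += buf[i];
    return sum;
}

/* Hypothetical helper: read a buffer of `bytes` bytes `passes` times and
   return the observed read bandwidth in GB/s (10^9 bytes per second).
   Returns a negative value if the allocation fails. */
static double measure_read_gbps(size_t bytes, int passes, uint64_t *checksum)
{
    size_t words = bytes / sizeof(uint64_t);
    uint64_t *buf = malloc(words * sizeof(uint64_t));
    if (buf == NULL)
        return -1.0;
    for (size_t i = 0; i < words; i++)   /* touch every page before timing */
        buf[i] = 1;

    uint64_t sum = read_pass(buf, words); /* untimed warm-up pass */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int p = 0; p < passes; p++) {
        sum += read_pass(buf, words);
        /* Assumption: GCC/Clang. This barrier stops the compiler from
           hoisting the loop-invariant sum out of the timing loop. */
        __asm__ volatile("" : : "r"(buf) : "memory");
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    *checksum = sum;
    free(buf);
    return (double)words * sizeof(uint64_t) * passes / secs / 1e9;
}
```

Calling `measure_read_gbps(16 * 1024, 1000, &checksum)` would exercise an L1-sized buffer, and `128 * 1024` an L2-sized one. A scalar loop like this depends heavily on how well the compiler vectorizes it, so its numbers are a lower bound rather than a hardware limit.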

At the same time, I used Intel PCM to measure the cache misses at all levels. With the 128 KB buffer, the number of L2 cache misses is much higher than the number of cold misses, so I guess the replacement policy of the L2 cache is not LRU.
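To make "the number of cold misses" concrete, a small sketch of the expected count follows. It assumes a 64-byte cache line, a buffer that starts on a line boundary, and that 128 KB means 128 × 1024 bytes. The helper name `cold_miss_lines` is hypothetical.

```c
#include <stddef.h>

/* Cache line size in bytes (64 on Intel Core and Xeon processors). */
#define CACHE_LINE_BYTES 64

/* Number of distinct cache lines covered by a line-aligned buffer.
   Each line must miss once the first time it is touched, so this is
   the cold-miss count to compare a measured miss count against. */
static size_t cold_miss_lines(size_t buffer_bytes)
{
    return (buffer_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}
```

Under these assumptions a 128 KB buffer covers 2048 lines, so a measured L2 miss count far above roughly 2048 per run is what prompts the question here.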

Could anybody comment on it?

Thanks, Zeke.

McCalpinJohn · 12-18-2017

Intel typically describes the cache replacement policies for L1 Data Cache and L2 Cache as pseudo-LRU, but without more detail.

I am confused about your comment on the 128KiB case. The array should be fully L2-contained, so should have no L2 misses. Two things that could cause L2 misses are:

Failure to pin the process to a single core. If the OS migrates the process, the first iteration on the new core will have 100% L2 misses.
Bad luck with page colors using 4KiB pages. This should be relatively rare when using only 1/2 of the L2 cache, and should generally disappear on repeated runs. Using 2MiB hugepages will eliminate this possibility.
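As an illustration of the first point, here is a minimal sketch of pinning the benchmark to one core from inside the program. It assumes Linux with glibc, since the OS was not stated in this thread, and the helper name `pin_to_first_allowed_cpu` is hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to the first CPU it is currently allowed to
   run on. Returns that CPU number, or -1 on failure. Linux-specific. */
static int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed;
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            /* pid 0 means "the calling thread" */
            if (sched_setaffinity(0, sizeof(one), &one) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```

Calling this once at the start of `main`, before the buffer is allocated and touched, keeps the OS from migrating the test mid-run. Launching the program under `taskset -c 0` should have a similar effect without any code changes.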

Lots more details are necessary to understand your observations:

Specific processor
OS
compiler, compile options, runtime options (e.g. process pinning)
description of array allocation code and test code
description of specific performance counter events measured

Travis_D_ · 12-18-2017

You should show your code, but your L1 bandwidth figure is way too low. Using a simple 32-byte "add all bytes" kernel like so:

top:
vpaddb ymm0,ymm0,YMMWORD PTR [rdx]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x20]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x40]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x60]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x80]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0xa0]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0xc0]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0xe0]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x100]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x120]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x140]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x160]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x180]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x1a0]
vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x1c0]
vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x1e0]
add    rdx,0x200
sub    rdi,0x1
jne    top

I get a timing of about 0.504 cycles per load, which corresponds to a bandwidth of 3.4 GHz / 0.504 cycles per load * 32 bytes per load = ~216 GB/s.

That corresponds well to what we know already: modern Intel can sustain 2 loads per cycle, including 32-byte AVX/AVX2 loads. With SKL-X that's doubled to 2 x 64-byte loads per cycle!
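The arithmetic behind the ~216 GB/s figure can be written as a small helper, shown here as a sketch. The function name `bandwidth_gbps` is hypothetical; the inputs are the 3.4 GHz clock, 0.504 cycles per load, and 32 bytes per load quoted above.

```c
/* Bandwidth in GB/s from clock frequency (GHz), measured cycles per
   load, and bytes moved per load:
   (cycles/ns) / (cycles/load) * (bytes/load) = bytes/ns = GB/s. */
static double bandwidth_gbps(double clock_ghz, double cycles_per_load,
                             double bytes_per_load)
{
    return clock_ghz / cycles_per_load * bytes_per_load;
}
```

With the measured values, `bandwidth_gbps(3.4, 0.504, 32)` gives about 215.9 GB/s. At exactly 2 loads per cycle (0.5 cycles per load) the same formula gives 217.6 GB/s, so the measurement is within about 1% of that peak.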