Hi All,
Recently, I tried to measure the read bandwidth of the L1, L2, and L3 caches on my Broadwell CPU.
My method is to control the buffer size: 16KB for L1, 128KB for L2, and 1MB for L3. The results seem reasonable on one CPU core: 50GB/s for L1, 44GB/s for L2, and 25GB/s for L3.
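For context, the read loop is along these lines (a simplified sketch of the kind of test I'm describing, with illustrative buffer size and iteration count, not the exact test code):

```cpp
// Simplified sketch of a cache read-bandwidth test: repeatedly sweep a
// buffer sized to fit the target cache level and time the reads.
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const size_t buf_size = 128 * 1024;   // 128KB buffer -> targets the L2
    const size_t iters    = 100000;       // many sweeps to amortize timing overhead
    std::vector<uint64_t> buf(buf_size / sizeof(uint64_t), 1);

    uint64_t sum = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; ++i)
        for (size_t j = 0; j < buf.size(); ++j)
            sum += buf[j];                // sequential reads; buffer stays cache-resident
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    printf("sum=%llu  read bandwidth: %.1f GB/s\n",
           (unsigned long long)sum,
           (double)buf_size * iters / secs / 1e9);
    return 0;
}
```

Depending on how the compiler vectorizes the reduction, a loop like this can be limited by the add dependency chain rather than by the load ports, so the figures above are not necessarily the hardware peak.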
At the same time, I use Intel PCM to measure the cache misses at all levels. For the case with the 128KB buffer, the number of L2 cache misses is much larger than the number of cold misses, so I guess the replacement policy of the L2 cache is not LRU.
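The counter readout follows PCM's usual before/after pattern, roughly like this (a sketch; run_test_loop is a placeholder for the sweep above, and the exact headers/namespaces depend on the PCM version):

```cpp
// Sketch of reading L2/L3 miss counts around a test loop with Intel PCM,
// using its before/after counter-state pattern. Newer PCM versions wrap
// everything in the pcm:: namespace.
#include "cpucounters.h"
#include <iostream>

void run_test_loop() { /* placeholder: the buffer sweep shown above */ }

int main()
{
    PCM *m = PCM::getInstance();
    if (m->program() != PCM::Success) {
        std::cerr << "PCM could not program the counters\n";
        return 1;
    }

    SystemCounterState before = getSystemCounterState();
    run_test_loop();
    SystemCounterState after = getSystemCounterState();

    std::cout << "L2 misses: " << getL2CacheMisses(before, after) << "\n"
              << "L3 misses: " << getL3CacheMisses(before, after) << "\n";

    m->cleanup();
    return 0;
}
```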
Could anybody comment on it?
Thanks, Zeke.
Intel typically describes the cache replacement policies for the L1 Data Cache and the L2 Cache as pseudo-LRU, but without providing further detail.
I am confused about your comment on the 128KiB case. The array should be fully L2-contained, so it should have no L2 misses beyond the initial cold misses. Two things that could cause additional L2 misses are:
- Failure to pin the process to a single core. If the OS migrates the process, the first iteration on the new core will have 100% L2 misses.
- Bad luck with page colors when using 4KiB pages. This should be relatively rare when using only half of the L2 cache, and should generally disappear on repeated runs. Using 2MiB hugepages will eliminate this possibility (see the sketch after this list).
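A minimal sketch of both fixes on Linux (the core number and sizes are illustrative; it assumes hugepages have been reserved, e.g. via vm.nr_hugepages):

```cpp
// Pin the process to one core and back the test buffer with a 2MiB
// hugepage, removing both sources of unexpected L2 misses listed above.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            // for CPU_ZERO/CPU_SET with glibc
#endif
#include <sched.h>
#include <sys/mman.h>
#include <cstdio>

int main()
{
    // 1. Pin to core 0 so the OS cannot migrate the process mid-run.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    // 2. Allocate the buffer from a 2MiB hugepage so 4KiB page coloring
    //    cannot create conflict misses in the L2.
    const size_t len = 2u << 20;     // one 2MiB hugepage (holds the 128KiB array)
    void *buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }

    /* ... initialize buf and run the bandwidth/miss measurement here ... */

    munmap(buf, len);
    return 0;
}
```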
Lots more details are necessary to understand your observations:
- Specific processor
- OS
- Compiler, compile options, runtime options (e.g., process pinning)
- Description of the array allocation code and test code
- Description of the specific performance counter events measured
You should show your code, but your L1 bandwidth figure is way too low. Using a simple 32-byte "add all bytes" kernel like so:
    top:
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x20]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x40]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x60]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x80]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0xa0]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0xc0]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0xe0]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x100]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x120]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x140]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x160]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x180]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x1a0]
        vpaddb ymm0,ymm0,YMMWORD PTR [rdx+0x1c0]
        vpaddb ymm1,ymm1,YMMWORD PTR [rdx+0x1e0]
        add    rdx,0x200
        sub    rdi,0x1
        jne    top
I get a timing of about 0.504 cycles per load, which corresponds to a bandwidth of 3.4 GHz / 0.504 cycles × 32 bytes ≈ 216 GB/s.
That corresponds well to what we already know: modern Intel cores can sustain 2 loads per cycle, including 32-byte AVX/AVX2 loads, which puts the peak at 2 × 32 bytes × 3.4 GHz ≈ 218 GB/s. With SKL-X that's doubled to 2 × 64-byte loads per cycle!
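For anyone who wants to reproduce this, the kernel above corresponds roughly to the following AVX2 intrinsics (a sketch; the function name and signature are mine, and the compiler won't necessarily emit the exact assembly shown, so check the disassembly after building with -mavx2):

```cpp
// Rough intrinsics equivalent of the unrolled vpaddb kernel above:
// two accumulators to hide vpaddb latency, 16 x 32-byte loads per
// iteration (512 bytes), matching the "add rdx,0x200" stride.
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// buf must be 32-byte aligned and hold iters * 512 bytes
// (or be swept repeatedly to stay inside the cache level under test).
uint8_t sum_bytes(const uint8_t *buf, size_t iters)
{
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    for (size_t i = 0; i < iters; ++i) {
        const __m256i *p = (const __m256i *)(buf + i * 512);
        for (int j = 0; j < 16; j += 2) {
            acc0 = _mm256_add_epi8(acc0, _mm256_load_si256(p + j));
            acc1 = _mm256_add_epi8(acc1, _mm256_load_si256(p + j + 1));
        }
    }
    // Fold the accumulators and reduce so the loads can't be optimized away.
    __m256i acc = _mm256_add_epi8(acc0, acc1);
    uint8_t out[32];
    _mm256_storeu_si256((__m256i *)out, acc);
    uint8_t r = 0;
    for (int k = 0; k < 32; ++k)
        r = (uint8_t)(r + out[k]);
    return r;
}
```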