So, I have recently been writing a Linux perf tutorial, and in the process of discussing performance monitoring counters, I discovered some interesting CPU behavior that I'm not sure I fully understand. To keep the tutorial accurate, I'd appreciate some clarification from someone on the more microarchitecturally enlightened side, if that's alright.
Let's say I'm running a basic memory-bound microbenchmark along the lines of STREAM add (which needs at least two loads and one store per cycle to saturate L1 bandwidth). The main computation loop is this...
for (size_t i = 0; i < buffer_size / 4; ++i) {
    output[i] = input1[i] + input2[i];
}
...and after hinting GCC a bit about the importance of loop unrolling, it gets compiled into this sort of reasonable-looking AVX assembly (a sketch of the kind of hints I mean follows the listing):
232: vmovups (%rcx,%r9,1),%ymm14
vaddps (%rdi,%r9,1),%ymm14,%ymm15
vmovups %ymm15,(%rsi,%r9,1)
vmovups 0x20(%rcx,%r9,1),%ymm6
vaddps 0x20(%rdi,%r9,1),%ymm6,%ymm0
vmovups %ymm0,0x20(%rsi,%r9,1)
vmovups 0x40(%rcx,%r9,1),%ymm5
vaddps 0x40(%rdi,%r9,1),%ymm5,%ymm1
vmovups %ymm1,0x40(%rsi,%r9,1)
vmovups 0x60(%rcx,%r9,1),%ymm4
vaddps 0x60(%rdi,%r9,1),%ymm4,%ymm2
vmovups %ymm2,0x60(%rsi,%r9,1)
vmovups 0x80(%rcx,%r9,1),%ymm3
vaddps 0x80(%rdi,%r9,1),%ymm3,%ymm7
vmovups %ymm7,0x80(%rsi,%r9,1)
vmovups 0xa0(%rcx,%r9,1),%ymm8
vaddps 0xa0(%rdi,%r9,1),%ymm8,%ymm9
vmovups %ymm9,0xa0(%rsi,%r9,1)
vmovups 0xc0(%rcx,%r9,1),%ymm10
vaddps 0xc0(%rdi,%r9,1),%ymm10,%ymm11
vmovups %ymm11,0xc0(%rsi,%r9,1)
vmovups 0xe0(%rcx,%r9,1),%ymm12
vaddps 0xe0(%rdi,%r9,1),%ymm12,%ymm13
vmovups %ymm13,0xe0(%rsi,%r9,1)
add $0x100,%r9
cmp %r9,%rbx
↑ jne 232
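For completeness, the "hints" are nothing exotic. A minimal sketch of the kind of thing I mean (illustrative, not my exact code; the restrict qualifiers promise GCC that the buffers don't alias, and the unroll pragma, or -funroll-loops at build time, nudges it towards the unrolled form above):

#include <stddef.h>

void stream_add(float *restrict output,
                const float *restrict input1,
                const float *restrict input2,
                size_t n_floats)
{
    /* Ask GCC to unroll this loop; together with -O3 and -mavx2 (or
       -march=native) it vectorizes and unrolls into roughly the shape
       of the listing above. */
    #pragma GCC unroll 8
    for (size_t i = 0; i < n_floats; ++i)
        output[i] = input1[i] + input2[i];
}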
Furthermore, I'm running this on one thread per physical CPU core of my i9-10900 CPU, just to make sure I'm saturating the underlying memory bus, without needing to reason through the extra complexity of hyperthreading.
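(In case the run configuration matters: the threading is plain OpenMP, with the loop above inside an omp parallel for, and I get one thread per physical core through thread binding. Roughly this kind of invocation, though the exact environment variables in my scripts may differ:)

OMP_NUM_THREADS=10 OMP_PROC_BIND=spread OMP_PLACES=cores ./mem_scaler.bin 36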
And finally, I'm allocating overaligned memory buffers using posix_memalign, so I don't expect evil torn loads and stores crossing cache line boundaries to muddy up the picture.
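(Again for completeness, the allocation is along these lines; a sketch only, and the alignment in the real code may be larger than one cache line:)

#include <stdlib.h>

/* Allocate a float buffer aligned to (at least) a 64-byte cache line. */
static float *alloc_aligned_floats(size_t bytes)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, 64, bytes) != 0)
        return NULL;
    return (float *)ptr;
}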
---
In this very simple scenario, my expectation would be that as buffer_size grows large, such a benchmark would eventually thrash all cache levels and go straight to RAM, so every load should be an L1 miss, an L2 miss, and an L3 miss.
However, this is not what happens. Even at very large buffer sizes, I get this sort of picture:
$ OMP_NUM_THREADS=10 perf stat -e mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit,mem_load_retired.l2_hit,mem_load_retired.l2_miss,mem_load_retired.l3_hit,mem_load_retired.l3_miss ./mem_scaler.bin 36
Performance counter stats for './mem_scaler.bin 36':
4374030955 mem_inst_retired.all_loads (49.98%)
479044494 mem_load_retired.l1_hit (49.99%)
1980286400 mem_load_retired.l1_miss (50.00%)
1908543494 mem_load_retired.fb_hit (50.02%)
766938995 mem_load_retired.l2_hit (50.03%)
1210933909 mem_load_retired.l2_miss (50.03%)
544807902 mem_load_retired.l3_hit (50.01%)
669426801 mem_load_retired.l3_miss (49.99%)
15.705905408 seconds time elapsed
149.391597000 seconds user
0.243986000 seconds sys
Now, from my understanding of what the line fill buffer (LFB) is about, I can make sense of the fact that there are about as many "fb_hit" events as there are "l1_miss" events. I'm doing AVX operations, which manipulate data in 32-byte chunks, whereas the cache manipulates data in 64-byte chunks. The LFB is there to handle this "impedance mismatch" efficiently: the first AVX load that targets a given cache line gets counted as an L1 miss, and the second load that targets the other 32-byte half of that line (which is still in the process of being fetched) is serviced by the same LFB entry and counted as an fb_hit. So far so good.
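(Quick sanity check on that reading, with my own arithmetic on the numbers above: 1,980,286,400 l1_miss + 1,908,543,494 fb_hit ≈ 3.89 billion, i.e. about 89% of the 4,374,030,955 retired loads, and the two counts are nearly equal, which is what two 32-byte loads per incoming 64-byte cache line would predict.)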
What is not clear, however, is why I have plenty of l1_hits, l2_hits, and l3_hits, together accounting for around 40% of my loads, where I would expect none or close to none. After all, given the bandwidth difference between the L1, L2, and L3 caches, I would expect the core to consistently "outrun" the L2 and L3, which as a result should essentially never be hit, even if they make the right prefetching decisions (as they should in this simple streaming case). From this perspective, the hit rate in at least those two outer cache levels should be near zero.
Taking a step back and looking at the basic output of perf stat provides a possible clue as to what could be happening:
$ OMP_NUM_THREADS=10 perf stat ./mem_scaler.bin 36
Performance counter stats for './mem_scaler.bin 36':
149556.67 msec task-clock # 9.533 CPUs utilized
257 context-switches # 1.718 /sec
36 cpu-migrations # 0.241 /sec
11029 page-faults # 73.745 /sec
185266674638 cycles # 1.239 GHz
7492907961 instructions # 0.04 insn per cycle
316367561 branches # 2.115 M/sec
799964 branch-misses # 0.25% of all branches
15.688555495 seconds time elapsed
149.364548000 seconds user
0.192005000 seconds sys
While the very low IPC certainly doesn't surprise me, that average clock rate of about 1.24 GHz (185,266,674,638 cycles over ~149.6 s of task time) is suspiciously low, when the base clock of that CPU is 2.8 GHz and I frequently see it reach well above that!
From this observation, my intuition would be that when the CPU detects this sort of heavily thrashing, memory-bound scenario, it might not simply poll the L1 cache in a loop waiting for data to arrive, as I would have naively expected, but could instead use some kind of "blocking" halt-and-wakeup notification mechanism to save power.
As a result of this mechanism, the core would issue memory requests more slowly, so the L2 and L3 caches have time to "catch up" and prefetch more data from RAM. Hence the nontrivial L2 and L3 hit rates.
Is that the correct explanation?