So, I have recently been writing a Linux perf tutorial, and in the process of discussing performance monitoring counters, I discovered some interesting CPU behavior that I'm not sure I fully understand. To keep the tutorial accurate, I'd appreciate some clarification from someone on the more microarchitecturally enlightened side, if that's alright.
Let's say I'm running a basic memory-bound microbenchmark along the lines of STREAM add (which needs at least two loads and one store per cycle to saturate L1 bandwidth). The main computation loop is this...
for (size_t i = 0; i < buffer_size / 4; ++i) {
    output[i] = input1[i] + input2[i];
}
...and after hinting GCC a bit about the importance of loop unrolling, it gets compiled into this sort of reasonable-looking AVX assembly (a sketch of the kind of hints I mean follows the listing):
232: vmovups (%rcx,%r9,1),%ymm14
vaddps (%rdi,%r9,1),%ymm14,%ymm15
vmovups %ymm15,(%rsi,%r9,1)
vmovups 0x20(%rcx,%r9,1),%ymm6
vaddps 0x20(%rdi,%r9,1),%ymm6,%ymm0
vmovups %ymm0,0x20(%rsi,%r9,1)
vmovups 0x40(%rcx,%r9,1),%ymm5
vaddps 0x40(%rdi,%r9,1),%ymm5,%ymm1
vmovups %ymm1,0x40(%rsi,%r9,1)
vmovups 0x60(%rcx,%r9,1),%ymm4
vaddps 0x60(%rdi,%r9,1),%ymm4,%ymm2
vmovups %ymm2,0x60(%rsi,%r9,1)
vmovups 0x80(%rcx,%r9,1),%ymm3
vaddps 0x80(%rdi,%r9,1),%ymm3,%ymm7
vmovups %ymm7,0x80(%rsi,%r9,1)
vmovups 0xa0(%rcx,%r9,1),%ymm8
vaddps 0xa0(%rdi,%r9,1),%ymm8,%ymm9
vmovups %ymm9,0xa0(%rsi,%r9,1)
vmovups 0xc0(%rcx,%r9,1),%ymm10
vaddps 0xc0(%rdi,%r9,1),%ymm10,%ymm11
vmovups %ymm11,0xc0(%rsi,%r9,1)
vmovups 0xe0(%rcx,%r9,1),%ymm12
vaddps 0xe0(%rdi,%r9,1),%ymm12,%ymm13
vmovups %ymm13,0xe0(%rsi,%r9,1)
add $0x100,%r9
cmp %r9,%rbx
↑ jne 232
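For completeness, the "hints" are nothing exotic. A minimal sketch of the kind of thing I mean (illustrative, not my exact code; the restrict qualifiers promise GCC that the buffers don't alias, and the unroll pragma, or -funroll-loops at build time, nudges it towards the unrolled form above):

#include <stddef.h>

void stream_add(float *restrict output,
                const float *restrict input1,
                const float *restrict input2,
                size_t n_floats)
{
    /* Ask GCC to unroll this loop; together with -O3 and -mavx2 (or
       -march=native) it vectorizes and unrolls into roughly the shape
       of the listing above. */
    #pragma GCC unroll 8
    for (size_t i = 0; i < n_floats; ++i)
        output[i] = input1[i] + input2[i];
}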
Furthermore, I'm running this on one thread per physical CPU core of my i9-10900 CPU, just to make sure I'm saturating the underlying memory bus, without needing to reason through the extra complexity of hyperthreading.
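(In case the run configuration matters: the threading is plain OpenMP, with the loop above inside an omp parallel for, and I get one thread per physical core through thread binding. Roughly this kind of invocation, though the exact environment variables in my scripts may differ:)

OMP_NUM_THREADS=10 OMP_PROC_BIND=spread OMP_PLACES=cores ./mem_scaler.bin 36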
And finally, I'm allocating overaligned memory buffers using posix_memalign, so I don't expect evil torn loads and stores crossing cache line boundaries to muddy up the picture.
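(Again for completeness, the allocation is along these lines; a sketch only, and the alignment in the real code may be larger than one cache line:)

#include <stdlib.h>

/* Allocate a float buffer aligned to (at least) a 64-byte cache line. */
static float *alloc_aligned_floats(size_t bytes)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, 64, bytes) != 0)
        return NULL;
    return (float *)ptr;
}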
---
In this very simple scenario, my expectation would be that as buffer_size grows large, such a benchmark would eventually thrash all cache levels and go straight to RAM, so every load should be an L1 miss, an L2 miss, and an L3 miss.
However, this is not what happens. Even at very large buffer sizes, I get this sort of picture:
$ OMP_NUM_THREADS=10 perf stat -e mem_inst_retired.all_loads,mem_load_retired.l1_hit,mem_load_retired.l1_miss,mem_load_retired.fb_hit,mem_load_retired.l2_hit,mem_load_retired.l2_miss,mem_load_retired.l3_hit,mem_load_retired.l3_miss ./mem_scaler.bin 36
Performance counter stats for './mem_scaler.bin 36':
4374030955 mem_inst_retired.all_loads (49.98%)
479044494 mem_load_retired.l1_hit (49.99%)
1980286400 mem_load_retired.l1_miss (50.00%)
1908543494 mem_load_retired.fb_hit (50.02%)
766938995 mem_load_retired.l2_hit (50.03%)
1210933909 mem_load_retired.l2_miss (50.03%)
544807902 mem_load_retired.l3_hit (50.01%)
669426801 mem_load_retired.l3_miss (49.99%)
15.705905408 seconds time elapsed
149.391597000 seconds user
0.243986000 seconds sys
Now, from my understanding of what the line fill buffer (LFB) is about, I can make sense of the fact that there are about as many "fb_hit" events as there are "l1_miss" events. I'm doing AVX operations, which manipulate data in 32-byte chunks, whereas the cache manipulates data in 64-byte chunks. The LFB is there to handle this "impedance mismatch" efficiently: the first AVX load that targets a given cache line gets counted as an L1 miss, and the second load that targets the other 32-byte half of that line (which is still in the process of being fetched) is serviced by the same LFB entry and counted as an fb_hit. So far so good.
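(Quick sanity check on that reading, with my own arithmetic on the numbers above: 1,980,286,400 l1_miss + 1,908,543,494 fb_hit ≈ 3.89 billion, i.e. about 89% of the 4,374,030,955 retired loads, and the two counts are nearly equal, which is what two 32-byte loads per incoming 64-byte cache line would predict.)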
What is not clear, however, is why I have plenty of l1_hits, l2_hits, and l3_hits, together accounting for around 40% of my loads, where I would expect none or close to none. After all, given the bandwidth difference between the L1, L2, and L3 caches, I would expect the core to consistently "outrun" the L2 and L3, which as a result should essentially never be hit, even if they make the right prefetching decisions (as they should in this simple streaming case). From this perspective, the hit rate in at least those two outer cache levels should be near zero.
Taking a step back and looking at the basic output of perf stat provides a possible clue as to what could be happening:
$ OMP_NUM_THREADS=10 perf stat ./mem_scaler.bin 36
Performance counter stats for './mem_scaler.bin 36':
149556.67 msec task-clock # 9.533 CPUs utilized
257 context-switches # 1.718 /sec
36 cpu-migrations # 0.241 /sec
11029 page-faults # 73.745 /sec
185266674638 cycles # 1.239 GHz
7492907961 instructions # 0.04 insn per cycle
316367561 branches # 2.115 M/sec
799964 branch-misses # 0.25% of all branches
15.688555495 seconds time elapsed
149.364548000 seconds user
0.192005000 seconds sys
While the very low IPC certainly doesn't surprise me, that average clock rate of about 1.24 GHz (185,266,674,638 cycles over ~149.6 s of task time) is suspiciously low, when the base clock of that CPU is 2.8 GHz and I frequently see it reach well above that!
From this observation, my intuition would be that when the CPU detects this sort of heavily thrashing, memory-bound scenario, it might not simply poll the L1 cache in a loop waiting for data to arrive, as I would have naively expected, but could instead use some kind of "blocking" halt-and-wakeup notification mechanism to save power.
As a result of this mechanism, the core would issue memory requests more slowly, so the L2 and L3 caches have time to "catch up" and prefetch more data from RAM. Hence the nontrivial L2 and L3 hit rates.
Is that the correct explanation?