
[SPEC CPU 2017] When I increased rates, L2 bound jumped up.


Hello, I am a student studying computer architecture in South Korea.

I profiled the SPEC CPU2017 rate integer and floating-point suites with VTune Amplifier.

I ran the experiment expecting that increasing the rate (the number of concurrent copies) would create a bottleneck at the DRAM bandwidth limit.

As expected, run time increased and DRAM Bound also increased in several benchmarks.

Unexpectedly, however, increasing the rate also increased L2 Bound.

I expected only the L3 cache and DRAM to be affected by a higher rate, because they are shared among cores. The L2 cache is private, and each rate copy gets its own input files, so the copies share no data. I therefore assumed that increasing the rate would not affect L2 Bound. Yet L2 Bound increased significantly, so I developed two hypotheses to explain the result.

  1. Super Queue full
    • Because the cores share the L3 cache, increasing the rate could fill the super queue (SQ), which sits between the L2 cache and the L3 cache.
    • But according to VTune Amplifier, SQ Full is a sub-metric of L3 Bound, not of L2 Bound.
  2. Coherence problem
    • As the amount of data grows, more stall cycles are spent on cache coherency, which could be why L2 Bound is measured high.
    • But I have no accurate measurements for this; it is just a guess.

So I collected hardware events and compared them for one benchmark (SPEC CPU2017 fprate, 549.fotonik3d_r).


Event Type | rate8 | rate16 | rate16/rate8 (rounded)
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS | 6794170191240 | 13608020412000 | 2
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS | 46129383840 | 85988579580 | 2
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS | 27323471040 | 63521667900 | 2
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS | 51033572100 | 101641114380 | 2
CYCLE_ACTIVITY.STALLS_MEM_ANY | 7406651109960 | 43659065488500 | 6
CYCLE_ACTIVITY.STALLS_L1D_MISS | 7263130894680 | 43181344771920 | 6
CYCLE_ACTIVITY.STALLS_L2_MISS | 5272207908300 | 25061437592100 | 5
stall on L1 (MEM_ANY - L1D_MISS) | 143520215280 | 477720716580 | 3
stall on L2 (L1D_MISS - L2_MISS) | 1990922986380 | 18119907179820 | 9
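The derived rows and the ratio column in the table above can be recomputed from the raw counts. The following is a minimal sketch in Python; the dictionary layout and the helper name `diff` are illustrative choices, not anything VTune produces, and the ratios are printed unrounded so that deviations from exactly 2x are visible.

```python
# Raw counts copied from the table: event -> (rate8, rate16).
counts = {
    "MEM_LOAD_UOPS_RETIRED.L1_HIT_PS": (6794170191240, 13608020412000),
    "MEM_LOAD_UOPS_RETIRED.L2_HIT_PS": (46129383840, 85988579580),
    "MEM_LOAD_UOPS_RETIRED.L3_HIT_PS": (27323471040, 63521667900),
    "MEM_LOAD_UOPS_RETIRED.L3_MISS_PS": (51033572100, 101641114380),
    "CYCLE_ACTIVITY.STALLS_MEM_ANY": (7406651109960, 43659065488500),
    "CYCLE_ACTIVITY.STALLS_L1D_MISS": (7263130894680, 43181344771920),
    "CYCLE_ACTIVITY.STALLS_L2_MISS": (5272207908300, 25061437592100),
}


def diff(a, b):
    """Element-wise difference of two events, for (rate8, rate16)."""
    return tuple(x - y for x, y in zip(counts[a], counts[b]))


# Derived rows, using the same definitions as the table.
counts["stall on L1"] = diff("CYCLE_ACTIVITY.STALLS_MEM_ANY",
                             "CYCLE_ACTIVITY.STALLS_L1D_MISS")
counts["stall on L2"] = diff("CYCLE_ACTIVITY.STALLS_L1D_MISS",
                             "CYCLE_ACTIVITY.STALLS_L2_MISS")

# Unrounded rate16/rate8 ratio for every row.
ratios = {name: r16 / r8 for name, (r8, r16) in counts.items()}
for name, ratio in ratios.items():
    print(f"{name:34s} {ratio:6.2f}")
```

Printing two decimals shows what the rounded column hides: the load counts are close to 2x but not identical across cache levels, while the stall counts grow several times faster.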


The number of retired load uops increased eightfold when I raised the rate from 1 to 8 and roughly doubled when I raised it from 8 to 16, so it scales in proportion to the rate. The stall cycles, however, grew much faster: about 6x for memory stalls overall and about 9x for the stalls attributed to L2.

So L2 Bound increased because more stall cycles are spent on loads served by the L2 cache, not because the number of L2 hits grew disproportionately.
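One rough way to quantify this is to divide the stall cycles attributed to L2 (STALLS_L1D_MISS minus STALLS_L2_MISS) by the number of retired loads that hit in L2. The sketch below does that with the counts from the table. It assumes that this quotient is a meaningful "stall cycles per L2 hit" figure, which is only an approximation: stall cycles are not assigned per load, and hardware prefetches are not counted in the hit events.

```python
# Counts from the table, as (rate8, rate16).
stalls_l1d_miss = (7263130894680, 43181344771920)
stalls_l2_miss = (5272207908300, 25061437592100)
l2_hits = (46129383840, 85988579580)

# Stall cycles attributed to L2: L1D missed, but L2 did not miss.
stall_on_l2 = tuple(a - b for a, b in zip(stalls_l1d_miss, stalls_l2_miss))

# Approximate stall cycles per retired load that hit in L2 (an assumption,
# see the caveats in the text above).
per_hit_rate8 = stall_on_l2[0] / l2_hits[0]
per_hit_rate16 = stall_on_l2[1] / l2_hits[1]

print(f"rate8 : {per_hit_rate8:7.1f} stall cycles per L2 hit")
print(f"rate16: {per_hit_rate16:7.1f} stall cycles per L2 hit")
print(f"growth: {per_hit_rate16 / per_hit_rate8:7.2f}x")
```

Under this approximation the cost per L2 hit rises from roughly 43 to roughly 211 cycles, which is far above a typical L2 hit latency and suggests the counter is capturing something other than plain L2 access time.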


Finally, these are my questions.

  1. Why do the stall cycles for loads that miss L1 and hit L2 increase when the processes do not share any data with each other?
  2. Is there any way to analyze L2 Bound in more detail?



Experimental environment

CPU | Intel(R) Xeon(R) E5-2698 v4 @ 2.2GHz (code name Broadwell)

Total # of Cores | 20

L3 Cache | 50MB

L2 Cache | 256KB

Memory | DDR4 2400MHz, 4 channels

OS | Ubuntu 16.04.2 LTS

Hyper-Threading | OFF

Turbo Boost | OFF

Profiler | Intel VTune Amplifier 2019
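For reference, the theoretical peak DRAM bandwidth implied by the memory configuration above can be worked out as follows. This is a small sketch that assumes "2400MHz" means 2400 MT/s, a standard 64-bit (8-byte) data bus per DDR4 channel, and a single socket with all four channels populated; sustained bandwidth in practice is lower.

```python
# Assumptions (not stated explicitly in the environment table):
# - DDR4-2400 performs 2400 million transfers per second per channel
# - each channel has a 64-bit data bus, i.e. 8 bytes per transfer
# - one socket, four populated channels
transfers_per_second = 2400e6
bytes_per_transfer = 8
channels = 4

peak_bytes_per_second = transfers_per_second * bytes_per_transfer * channels
peak_gb_per_second = peak_bytes_per_second / 1e9

print(f"theoretical peak DRAM bandwidth: {peak_gb_per_second:.1f} GB/s")
```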




1 Reply

My first thought was back-invalidations from conflicts in the inclusive L3 cache. There is some indication of this: MEM_LOAD_UOPS_RETIRED.L2_HIT in the 16p case is only 1.86 times the value in the 8p case, so the L2 hit rate is about 7% lower in the 16p case. This could be due to "active" L2 lines being invalidated. This does not look like a large enough mechanism, however. Each extra L2 miss would have to add several thousand stall cycles to account for the difference. These evictions can probably be measured directly using the "opcode match" functionality of the uncore performance counters, but it would take some testing to figure out which opcode(s) are used by the L3 to invalidate L1+L2 caches.
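The "several thousand stall cycles" estimate can be reproduced from the posted counts. The sketch below assumes that perfect scaling would double every count from 8p to 16p, treats the shortfall in L2 hits as the "extra" L2 misses, and charges all STALLS_L1D_MISS cycles beyond 2x to those misses. Which stall counter to use here is an assumption; the reply does not say.

```python
# Counts from the question, as (8p, 16p).
l2_hits = (46129383840, 85988579580)
stalls_l1d_miss = (7263130894680, 43181344771920)

# Under perfect scaling, the 16p counts would be exactly twice the 8p counts.
expected_l2_hits = 2 * l2_hits[0]
extra_l2_misses = expected_l2_hits - l2_hits[1]
shortfall = extra_l2_misses / expected_l2_hits

# Stall cycles beyond what perfect scaling would predict.
excess_stalls = stalls_l1d_miss[1] - 2 * stalls_l1d_miss[0]

# Cost each extra L2 miss would need to carry to explain the excess.
cycles_per_extra_miss = excess_stalls / extra_l2_misses

print(f"L2 hit shortfall       : {shortfall:.1%}")
print(f"extra L2 misses        : {extra_l2_misses}")
print(f"cycles per extra miss  : {cycles_per_extra_miss:.0f}")
```

With these assumptions the shortfall is just under 7% and each extra L2 miss would have to cost on the order of 4,500 cycles, which is why back-invalidation alone does not look like a sufficient explanation.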

These performance counter events can be confusing because they only report the location where the demand load found the data, not the location that the data was moved from (by hardware prefetches). So I like to augment these values with total counts of cache line transfers at L2, L3, and DRAM.
