Hello, I am a student studying computer architecture in South Korea.
I profiled the SPEC CPU2017 rate integer and floating-point suites with VTune Amplifier.
I ran the experiment expecting that increasing the rate (the number of concurrent copies) would create a bottleneck at the DRAM bandwidth limit.
As expected, both run time and DRAM Bound increased in several benchmarks.
Unexpectedly, however, L2 Bound also increased as the rate went up.
I expected the L3 cache and DRAM to be affected by a higher rate because they are shared by several cores. The L2 cache, however, is private, and each copy gets its own set of input files. Since the copies share no data, I assumed that increasing the rate would not affect L2 Bound. Yet L2 Bound increased significantly, so I came up with several hypotheses to explain the result.
- Super Queue full
  - Because the cores share the L3 cache, increasing the rate could fill the super queue, which sits between the L2 cache and the L3 cache.
  - However, according to VTune Amplifier, SQ Full is a sub-level of L3 Bound, not of L2 Bound.
- Coherence problem
  - As the amount of data grows, more stall cycles may be spent on cache coherency, which could be why L2 Bound is measured as high.
  - However, I have no accurate measurements for this; it is just a guess.
So I collected hardware events and compared them for rate 8 and rate 16 (SPEC CPU2017 fprate, 549.fotonik3d_r).
Event Type | rate8 | rate16 | rate16/rate8 (rounded)
MEM_LOAD_UOPS_RETIRED.L1_HIT_PS | 6794170191240 | 13608020412000 | 2
MEM_LOAD_UOPS_RETIRED.L2_HIT_PS | 46129383840 | 85988579580 | 2
MEM_LOAD_UOPS_RETIRED.L3_HIT_PS | 27323471040 | 63521667900 | 2
MEM_LOAD_UOPS_RETIRED.L3_MISS_PS | 51033572100 | 101641114380 | 2
CYCLE_ACTIVITY.STALLS_MEM_ANY | 7406651109960 | 43659065488500 | 6
CYCLE_ACTIVITY.STALLS_L1D_MISS | 7263130894680 | 43181344771920 | 6
CYCLE_ACTIVITY.STALLS_L2_MISS | 5272207908300 | 25061437592100 | 5
stall on L1 (MEM_ANY - L1D_MISS) | 143520215280 | 477720716580 | 3
stall on L2 (L1D_MISS - L2_MISS) | 1990922986380 | 18119907179820 | 9
The number of retired load uops increased eightfold when I raised the rate from 1 to 8 and roughly doubled when I raised it from 8 to 16, in proportion to the rate itself. The stall cycles, however, grew much faster: about sixfold overall and about ninefold for the stalls I attribute to L2 (L1D_MISS - L2_MISS).
So L2 Bound increased because of more stall cycles on loads served by the L2 cache, not because a larger share of loads hit in L2.
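Because the ratio column in the table is rounded to whole numbers, a small sketch like the following can recompute the derived stall rows and the unrounded rate16/rate8 ratios. The counter values are copied from the table; the names `counters`, `rows`, `difference`, and `ratio` are illustrative only and are not part of VTune.

```python
# Counter values copied from the table above: (rate8, rate16).
counters = {
    "MEM_LOAD_UOPS_RETIRED.L1_HIT_PS": (6794170191240, 13608020412000),
    "MEM_LOAD_UOPS_RETIRED.L2_HIT_PS": (46129383840, 85988579580),
    "MEM_LOAD_UOPS_RETIRED.L3_HIT_PS": (27323471040, 63521667900),
    "MEM_LOAD_UOPS_RETIRED.L3_MISS_PS": (51033572100, 101641114380),
    "CYCLE_ACTIVITY.STALLS_MEM_ANY": (7406651109960, 43659065488500),
    "CYCLE_ACTIVITY.STALLS_L1D_MISS": (7263130894680, 43181344771920),
    "CYCLE_ACTIVITY.STALLS_L2_MISS": (5272207908300, 25061437592100),
}


def difference(outer, inner):
    """Subtract one stall counter from another, separately for each rate."""
    outer8, outer16 = counters[outer]
    inner8, inner16 = counters[inner]
    return (outer8 - inner8, outer16 - inner16)


# Derived rows, using the same definitions as the table.
rows = dict(counters)
rows["stall on L1 (MEM_ANY - L1D_MISS)"] = difference(
    "CYCLE_ACTIVITY.STALLS_MEM_ANY", "CYCLE_ACTIVITY.STALLS_L1D_MISS"
)
rows["stall on L2 (L1D_MISS - L2_MISS)"] = difference(
    "CYCLE_ACTIVITY.STALLS_L1D_MISS", "CYCLE_ACTIVITY.STALLS_L2_MISS"
)


def ratio(name):
    """Unrounded rate16/rate8 ratio for one row."""
    rate8, rate16 = rows[name]
    return rate16 / rate8


for name in rows:
    print(f"{name:36s} {ratio(name):5.2f}")
```

With two decimals, the L2 hit count turns out to scale slightly less than 2x, while the derived L2 stall row scales by a little more than 9x.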
Finally, here are my questions.
- Why do stall cycles on loads that miss L1 but hit L2 increase when the processes do not share any data with each other?
- Is there a way to dig into L2 Bound in more detail?
CPU | Intel(R) Xeon(R) E5-2698 v4 @2.2GHz code named Broadwell
Total # of Cores | 20
L3 Cache | 50MB
L2 Cache | 256KB
Memory | DDR4 2400MHz, 4 channels
OS | Ubuntu 16.04.2 LTS
Hyper-Threading | OFF
Turbo Boost | OFF
Profiler | Intel VTune Amplifier 2019
My first thought was back-invalidations from conflicts in the inclusive L3 cache. There is some indication of this -- MEM_LOAD_UOPS_RETIRED.L2_HIT in the 16p case is only 1.86 times the value in the 8p case, so the L2 hit rate is about 7% lower in the 16p case. This could be due to "active" L2 lines being invalidated. This does not look like a large enough mechanism, however. Each extra L2 miss would have to add several thousand stall cycles to account for the difference. These evictions can probably be measured directly using the "opcode match" functionality of the uncore performance counters, but it would take some testing to figure out which opcode(s) are used by the L3 to invalidate L1+L2 caches.
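As a rough check of that estimate, the sketch below redoes the arithmetic with the counts from the question. It assumes that "the difference" means the L2-attributed stall cycles (L1D_MISS - L2_MISS) in excess of a plain 2x scaling from 8p to 16p, and that perfect scaling would also have doubled the L2 hits. The variable names are illustrative.

```python
# Counts taken from the question's table (8p and 16p runs of fotonik3d).
l2_hit_8p, l2_hit_16p = 46129383840, 85988579580
stall_l2_8p, stall_l2_16p = 1990922986380, 18119907179820

# Assumption: with perfect scaling, both quantities would simply double.
expected_hits_16p = 2 * l2_hit_8p
expected_stalls_16p = 2 * stall_l2_8p

# L2 hits that "went missing" and stall cycles beyond the 2x expectation.
missing_hits = expected_hits_16p - l2_hit_16p
extra_stalls = stall_l2_16p - expected_stalls_16p

hit_scaling = l2_hit_16p / l2_hit_8p             # about 1.86
hit_shortfall = missing_hits / expected_hits_16p  # about 0.07
cycles_per_missing_hit = extra_stalls / missing_hits

print(f"L2 hit scaling 8p -> 16p: {hit_scaling:.2f}x")
print(f"L2 hit shortfall:         {hit_shortfall:.1%}")
print(f"extra stall cycles per missing L2 hit: {cycles_per_missing_hit:.0f}")
```

Under these assumptions each missing L2 hit would have to cost on the order of two thousand stall cycles, far more than a single trip to L3 or DRAM would normally take, which is why lost L2 hits alone do not look like a sufficient explanation.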
These performance counter events can be confusing because they only report the location where the demand load found the data, not the location that the data was moved from (by hardware prefetches). So I like to augment these values with total counts of cacheline transfers at L2, L3, and DRAM.