I am analyzing the “memory bound” metric in my code with VTune. According to section B.3.2.3 of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual":
%L2 Bound =(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING)/CLOCKS
But in my VTune results, CYCLE_ACTIVITY.STALLS_L1D_PENDING is smaller than CYCLE_ACTIVITY.STALLS_L2_PENDING. Why?
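To make the arithmetic concrete, here is a minimal Python sketch of the formula quoted above. The function names and the sample numbers are illustrative assumptions, not VTune output. The sketch only makes explicit that the formula yields a non-negative value when STALLS_L1D_PENDING is at least STALLS_L2_PENDING, which is the relationship the question says is violated.

```python
def l2_bound_fraction(stalls_l1d_pending, stalls_l2_pending, clocks):
    """L2 Bound as a fraction of clocks, per the formula quoted above."""
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    return (stalls_l1d_pending - stalls_l2_pending) / clocks


def counters_consistent(stalls_l1d_pending, stalls_l2_pending):
    """True when the two counts can give a non-negative L2 Bound."""
    return stalls_l1d_pending >= stalls_l2_pending


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(l2_bound_fraction(500, 200, 1000))  # 0.3
    print(counters_consistent(200, 500))      # False
```

Multiply the fraction by 100 to express it as a percentage.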
Which analysis type did you use, or did you collect the event counters directly?
I recommend the general-exploration analysis that VTune Amplifier provides, which includes LLC Miss, LLC Hit, DTLB, etc. LLC Hit counts accesses that missed the L2 cache but hit in the LLC.
Remember that L2 activity can be measured with different events; what these events count overlaps.
Back to your questions:
1. There is no L2 Bound metric in general-exploration because its overhead is very small, approximately 6 CPU cycles.
2. You can use the events from the optimization manual directly, like this:
# amplxe-cl -collect-with runsa -knob event-config=CYCLE_ACTIVITY.STALLS_L1D_PENDING,CYCLE_ACTIVITY.STALLS_L2_PENDING -duration 30 -- ./cache_test
My data was from Sandy Bridge processor:
CYCLE_ACTIVITY.STALLS_L1D_PENDING 5168007752 2584 2000003
CYCLE_ACTIVITY.STALLS_L2_PENDING 6186009279 3093 2000003
These counts are incorrect: CYCLE_ACTIVITY.STALLS_L1D_PENDING should include L2 hits and L2 pending (LLC hits and LLC pending), so it should not be smaller than CYCLE_ACTIVITY.STALLS_L2_PENDING.
3. So I used other events to measure, like this:
# amplxe-cl -collect-with runsa -knob event-config=L1D_PEND_MISS.PENDING,MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS,MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS -duration 30 -- ./cache_test
L1D_PEND_MISS.PENDING 289268433902 144634 2000003
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS 102107147 1021 100007
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS 2810629969 56189 50021
L2 Bound = L1D_PEND_MISS.PENDING - MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS - MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS
Does it make sense?
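As a sanity check on the numbers above, here is a minimal Python sketch. It assumes the three numeric columns are event count, sample count, and sample-after value; that reading is inferred from the arithmetic (count = samples x sample-after value holds for every row), not stated in the output itself. The helper names are hypothetical.

```python
# Rows copied from the post: (event, count, samples, sample-after value).
# The column meanings are an assumption inferred from the arithmetic.
ROWS = [
    ("CYCLE_ACTIVITY.STALLS_L1D_PENDING", 5168007752, 2584, 2000003),
    ("CYCLE_ACTIVITY.STALLS_L2_PENDING", 6186009279, 3093, 2000003),
    ("L1D_PEND_MISS.PENDING", 289268433902, 144634, 2000003),
    ("MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS", 102107147, 1021, 100007),
    ("MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS", 2810629969, 56189, 50021),
]


def count_matches_samples(count, samples, sample_after):
    """True when the event count equals samples times sample-after value."""
    return count == samples * sample_after


def l1d_minus_l2(rows):
    """Numerator of the manual's L2 Bound formula, from the posted counts."""
    counts = {name: count for name, count, _, _ in rows}
    return (counts["CYCLE_ACTIVITY.STALLS_L1D_PENDING"]
            - counts["CYCLE_ACTIVITY.STALLS_L2_PENDING"])


if __name__ == "__main__":
    for name, count, samples, sav in ROWS:
        print(name, count_matches_samples(count, samples, sav))
    print(l1d_minus_l2(ROWS))  # -1018001527
```

The negative difference is the inconsistency discussed above: with these counts the manual's formula would give a negative L2 Bound.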
Hi Peter Wang,
Thank you for your reply.
Yes, I am using general-exploration in VTune Amplifier, with the hardware event counts viewpoint, on a Sandy Bridge CPU.
1. Why is the L2 overhead very small? In Table 2-11 of the optimization manual, I find that the L2 latency is ~12 CPU cycles.
2&3. So you mean CYCLE_ACTIVITY.STALLS_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING can't be used to calculate L2 Bound? I should use L1D_PEND_MISS.PENDING, MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS, and MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS instead, right?
Another question: in another piece of my code, CYCLE_ACTIVITY.STALLS_L1D_PENDING is 0, which should mean there are neither L2 hits nor L2 misses, but CYCLE_ACTIVITY.STALLS_L2_PENDING is not 0. Why? Currently, I am a little confused about these metrics.
What I can tell you is that you should use the performance counters recommended by VTune. I don't know why the events you mentioned give *wrong* data...but using VTune's metrics avoids this.
1. The optimization manual gives ~12 cycles of overhead for L2 Bound, which is for *commonly* supported processors; on newer processors the value should be lower, depending on which processor you use. For example, you can use a penalty of ~6 cycles for Haswell, ~8 cycles for Ivy Bridge, ~10 cycles for Sandy Bridge, and ~12 cycles for Nehalem. Remember that these are approximate values.
2. Please wait: I found that the L1D_PEND_MISS.PENDING result was also incorrect (it is not recommended by VTune). Actually, I think you can use the event MEM_LOAD_UOPS_RETIRED.L2_HIT directly to measure. The penalty of L2 Bound is: 10 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS (I worked on Sandy Bridge).
Again, please try the MEM_LOAD_*** events, which are recommended by VTune Amplifier.
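The fixed-latency estimate described above can be sketched in a few lines of Python. The penalty table uses the approximate per-architecture values quoted in this reply. The function name is hypothetical, and which MEM_LOAD_* event count to pass in is left to the reader. The example call reuses the 10 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS formula and the count posted earlier in the thread.

```python
# Approximate L2 penalties in cycles, as quoted in the reply above.
L2_PENALTY_CYCLES = {
    "Haswell": 6,
    "IvyBridge": 8,
    "SandyBridge": 10,
    "Nehalem": 12,
}


def fixed_latency_cost(event_count, arch):
    """Estimated cycles: per-event penalty times the retired-load count."""
    if arch not in L2_PENALTY_CYCLES:
        raise KeyError(f"no penalty value listed for {arch!r}")
    return L2_PENALTY_CYCLES[arch] * event_count


if __name__ == "__main__":
    # Count posted earlier for MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS.
    print(fixed_latency_cost(102107147, "SandyBridge"))  # 1021071470
```

This is only an estimate: it charges every counted load the full penalty and ignores any overlap with other work.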
Hmmm.... Both CYCLE_ACTIVITY.STALLS_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING (both 0xA3 events) are included in the "snbep_db.txt" configuration file for VTune Amplifier XE 2013 Update 17 (Build Number: 353306). The event L1D_PEND_MISS.PENDING is also included in that database, but that is a different event (0x48) and does not have options for counting stall cycles.
The CYCLE_ACTIVITY.STALLS_L*_PENDING (0xA3) events are new with Sandy Bridge. Since they are both new and fairly tricky, it would not be surprising to find bugs.
They are very interesting events to monitor -- counting stall cycles with pending demand misses does not guarantee that the stalls were due to the misses, but it is (at least intuitively, which may be a mistake) a step in the right direction. If the events are even approximately correct, they should be more useful than "cost" estimates based on fixed latencies.
Hi Peter and John,
Thank you for your comments.
I will try to use the MEM_LOAD_*** events to analyze memory bound. Still, I think CYCLE_ACTIVITY.STALLS_L*_PENDING is more meaningful for optimization, because we may not need to care about cache misses whose latency can be hidden by the execution pipeline.