Beginner
29 Views

“Memory bound” metric in VTune

Hello,

I am analyzing the “memory bound” metric in my code with VTune. According to Section B.3.2.3 of the "Intel® 64 and IA-32 Architectures Optimization Reference Manual":

%L2 Bound =(CYCLE_ACTIVITY.STALLS_L1D_PENDING - CYCLE_ACTIVITY.STALLS_L2_PENDING)/CLOCKS

But in my VTune results, CYCLE_ACTIVITY.STALLS_L1D_PENDING is smaller than CYCLE_ACTIVITY.STALLS_L2_PENDING. Why?
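
A minimal Python sketch of the formula above may make the problem concrete. The function names and the sample counts are hypothetical and chosen only for illustration. The formula implicitly assumes that the L1D-pending stall count is at least as large as the L2-pending stall count; when it is not, the result is negative and cannot be read as a fraction of cycles.

```python
def l2_bound_fraction(stalls_l1d_pending, stalls_l2_pending, clocks):
    """%L2 Bound = (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS."""
    if clocks <= 0:
        raise ValueError("clocks must be positive")
    return (stalls_l1d_pending - stalls_l2_pending) / clocks


def counts_are_consistent(stalls_l1d_pending, stalls_l2_pending):
    """True when L1D-pending stalls are at least the L2-pending stalls,
    which the formula needs in order to give a non-negative result."""
    return stalls_l1d_pending >= stalls_l2_pending


# Hypothetical counts, for illustration only.
fraction = l2_bound_fraction(400, 100, 1000)
print(f"L2 Bound: {fraction:.1%}")       # L2 Bound: 30.0%
print(counts_are_consistent(100, 400))   # False: the situation asked about
```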

5 Replies
Employee

I don't know which analysis type you used. Or did you collect the event counters directly?

I recommend the general-exploration analysis that VTune Amplifier provides, which includes LLC Miss, LLC Hit, DTLB, etc. LLC Hit indicates an access that missed in the L2 cache but hit in the LLC.

Remember that L2 activity can be measured with different events; I mean that what these events count overlaps.

Back to your questions:

1. There is no L2 Bound metric in general-exploration, because the L2 overhead is very small, approximately 6 CPU cycles.

2. You can use the events from the optimization manual directly, like this:

# amplxe-cl -collect-with runsa -knob event-config=CYCLE_ACTIVITY.STALLS_L1D_PENDING,CYCLE_ACTIVITY.STALLS_L2_PENDING -duration 30 -- ./cache_test 

My data is from a Sandy Bridge processor:

CYCLE_ACTIVITY.STALLS_L1D_PENDING                 5168007752                              2584  2000003          
CYCLE_ACTIVITY.STALLS_L2_PENDING                  6186009279                              3093  2000003      

These results are incorrect: CYCLE_ACTIVITY.STALLS_L1D_PENDING should include L2 hits and L2 pending (LLC hits and LLC pending).
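
As a quick arithmetic check, here is a small Python sketch using only the two counts reported above. The variable names are chosen for illustration. It shows that the numerator of the manual's formula is negative for this run, so no L2 Bound percentage can be derived from it.

```python
# Counts reported above for the Sandy Bridge run.
stalls_l1d_pending = 5_168_007_752
stalls_l2_pending = 6_186_009_279

# Numerator of the manual's %L2 Bound formula.
difference = stalls_l1d_pending - stalls_l2_pending

# How much larger the L2-pending count is than the L1D-pending count.
ratio = stalls_l2_pending / stalls_l1d_pending

print(difference)      # -1018001527
print(f"{ratio:.3f}")  # 1.197
```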

3. So I used other events to measure, like this:

# amplxe-cl -collect-with runsa -knob event-config=L1D_PEND_MISS.PENDING,MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS,MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS -duration 30 -- ./cache_test

L1D_PEND_MISS.PENDING                                289268433902                            144634  2000003          
MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS                  102107147                              1021  100007           
MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS                       2810629969                             56189  50021            

L2 Bound = L1D_PEND_MISS.PENDING - MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS - MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS

Does it make sense? 
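
The column meanings in the two listings above are not labeled. A plausible reading, stated here as an assumption, is event name, event count, sample count, and sample-after value. The Python sketch below checks that reading: for every row, the event count equals the sample count multiplied by the sample-after value. The helper name is hypothetical.

```python
# Rows copied from the two listings above.
# Assumed column meaning: (event, event count, sample count, sample-after value).
ROWS = [
    ("CYCLE_ACTIVITY.STALLS_L1D_PENDING", 5_168_007_752, 2584, 2_000_003),
    ("CYCLE_ACTIVITY.STALLS_L2_PENDING", 6_186_009_279, 3093, 2_000_003),
    ("L1D_PEND_MISS.PENDING", 289_268_433_902, 144634, 2_000_003),
    ("MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS", 102_107_147, 1021, 100_007),
    ("MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS", 2_810_629_969, 56189, 50_021),
]


def count_matches_samples(count, samples, sample_after_value):
    """True when the event count equals samples * sample-after value."""
    return count == samples * sample_after_value


for event, count, samples, sav in ROWS:
    status = "ok" if count_matches_samples(count, samples, sav) else "mismatch"
    print(f"{event:42s} {status}")
```

If that reading is right, the counts are extrapolated from samples rather than exact, which is worth keeping in mind when subtracting one event from another.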

Beginner

Hi Peter Wang,

Thank you for your reply.

Yes, I am using general-exploration in VTune Amplifier, in the hardware event counts viewpoint, on a Sandy Bridge CPU.

1. Why is the L2 overhead very small? In Table 2-11 of the optimization manual, I find that the L2 latency is ~12 CPU cycles.

2 & 3. So you mean that CYCLE_ACTIVITY.STALLS_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING cannot be used to calculate L2 Bound, and that I should use L1D_PEND_MISS.PENDING, MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS, and MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS instead, right?

Another question: in another piece of my code, CYCLE_ACTIVITY.STALLS_L1D_PENDING is 0, which should mean there were neither L2 hits nor L2 misses, but CYCLE_ACTIVITY.STALLS_L2_PENDING is not 0. Why? I am currently a little confused by these metrics.

Employee

Hello,

What I can tell you is that you should use the performance counters recommended by VTune. I don't know why the events you mentioned give *wrong* data, but using VTune's metrics avoids this.

1. The optimization manual gives an L2 overhead of ~12 cycles, which applies to *commonly* supported processors; on newer processors the value should be lower, depending on the processor you use. For example, you can use a penalty of ~6 cycles for Haswell, ~8 cycles for IvyBridge, ~10 cycles for SandyBridge, and ~12 cycles for Nehalem. Remember that these are approximate values.

2. Please wait, I found that the L1D_PEND_MISS.PENDING result was also incorrect (it is not recommended by VTune). Actually, I think you can use the event MEM_LOAD_UOPS_RETIRED.L2_HIT directly to measure this. The penalty of L2 Bound is: 10 * MEM_LOAD_UOPS_MISC_RETIRED.LLC_MISS_PS; I worked on SandyBridge.

Again, please try the MEM_LOAD_*** events, which are recommended by VTune Amplifier.
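
The fixed-latency estimate described in this reply can be sketched in Python as follows. The per-microarchitecture penalties are the approximate values quoted above. The function name and the event count are hypothetical, and which MEM_LOAD_* event count to feed in is left to the caller, since the reply mentions more than one.

```python
# Approximate L2 penalties in cycles, as quoted in the reply above.
L2_PENALTY_CYCLES = {
    "Haswell": 6,
    "IvyBridge": 8,
    "SandyBridge": 10,
    "Nehalem": 12,
}


def fixed_latency_cost(event_count, microarchitecture):
    """Estimated cycles = per-event penalty * number of events."""
    return L2_PENALTY_CYCLES[microarchitecture] * event_count


# Hypothetical event count, for illustration only.
print(fixed_latency_cost(1_000_000, "SandyBridge"))  # 10000000
```

Because the penalty is a fixed constant, this estimate cannot tell whether a given access actually stalled the pipeline or was overlapped with other work.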

Black Belt

Hmmm....  Both CYCLE_ACTIVITY.STALLS_L1D_PENDING and CYCLE_ACTIVITY.STALLS_L2_PENDING (both 0xA3 events) are included in the "snbep_db.txt" configuration file for VTune Amplifier XE 2013 Update 17 (Build Number: 353306).  The event L1D_PEND_MISS.PENDING is also included in that database, but that is a different event (0x48) and does not have options for counting stall cycles.

The CYCLE_ACTIVITY.STALLS_L*_PENDING (0xA3) events are new with Sandy Bridge.  Since they are both new and fairly tricky, it would not be surprising to find bugs.  

They are very interesting events to monitor -- counting stall cycles with pending demand misses does not guarantee that the stalls were due to the misses, but it is (at least intuitively, which may be a mistake) a step in the right direction.   If the events are even approximately correct, they should be more useful than "cost" estimates based on fixed latencies.

"Dr. Bandwidth"
Beginner

Hi Peter and John,

Thank you for your comments.

I will try the MEM_LOAD_*** events to analyze memory bound. Anyway, I think CYCLE_ACTIVITY.STALLS_L*_PENDING is more meaningful for optimization, because we may not need to care about cache misses whose latency can be hidden by the execution pipeline.
