Yunqi_Z_

Beginner


03-04-2013
02:56 PM

83 Views

Memory bound characterization on Ivy Bridge

Hi all,

I'm confused by the results I get when I try to characterize memory boundedness on Ivy Bridge as described in the *Intel 64 and IA-32 Architectures Optimization Reference Manual Appendix B.3.2.3*: I get a larger number for STALLS_L2_PENDING than for STALLS_L1D_PENDING. Consequently, if I calculate *%L2 Bound* as the manual describes, I get a **negative number for %L2 Bound.** Could anyone help me with this, please?

This is the code segment I tried to characterize:

    #define N 1024

    double A

    void code_to_monitor() {
        int i, j, k;

        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                A
                C
            }
        }

        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                for (k = 0; k < N; k++) {
                    C
                }
            }
        }
    }

And these are the numbers I got from the experiments.

CYCLE_ACTIVITY:STALLS_LDM_PENDING : 25129701285

CYCLE_ACTIVITY:STALLS_L1D_PENDING : 22822968083

CYCLE_ACTIVITY:STALLS_L2_PENDING : 24375543727

TOTAL CYCLES: 43885183166


7 Replies

SergeyKostrov

Valued Contributor II


03-04-2013
09:26 PM

83 Views

Yunqi_Z_

Beginner


03-04-2013
09:32 PM

83 Views

Hi Sergey,

Thanks for your reply. Actually, I'm not trying to optimize the matrix multiplication. It's just a piece of sample code to check whether the memory bound characterization works well, and it gave me negative numbers.

Bernard

Black Belt


03-04-2013
10:58 PM

83 Views

>>>I will get negative number for *%L2 Bound*>>>

This is the formula used to calculate %L2 Bound: (CYCLE_ACTIVITY:STALLS_L1D_PENDING - CYCLE_ACTIVITY:STALLS_L2_PENDING) / CLOCKS

Plugging your values into the formula, STALLS_L1D_PENDING is less than STALLS_L2_PENDING, so the numerator is negative and you get a negative result.
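To make the arithmetic concrete, here is a minimal sketch in C that applies the formula quoted above to the counter values from the original post. The function name `l2_bound_fraction` and the constant names are hypothetical, chosen only for illustration; the counts and the formula come from this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Counter values copied from the original post. */
static const int64_t STALLS_L1D_PENDING = 22822968083LL;
static const int64_t STALLS_L2_PENDING  = 24375543727LL;
static const int64_t TOTAL_CYCLES       = 43885183166LL;

/* %L2 Bound as a fraction of total cycles, using the formula quoted above:
 * (STALLS_L1D_PENDING - STALLS_L2_PENDING) / CLOCKS.
 * Signed arithmetic is used on purpose so that a negative numerator
 * stays visible instead of wrapping around. */
static double l2_bound_fraction(int64_t l1d_pending,
                                int64_t l2_pending,
                                int64_t clocks)
{
    assert(clocks > 0);
    return (double)(l1d_pending - l2_pending) / (double)clocks;
}
```

Calling `l2_bound_fraction(STALLS_L1D_PENDING, STALLS_L2_PENDING, TOTAL_CYCLES)` with these numbers gives roughly -0.035, that is, about -3.5% of total cycles, which is the negative %L2 Bound being asked about.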

Yunqi_Z_

Beginner


03-05-2013
12:44 AM

83 Views

Yes, iliyapolak. That's why I'm confused.

Bernard

Black Belt


03-05-2013
03:16 AM

83 Views

Maybe it should be this way.

SergeyKostrov

Valued Contributor II


03-05-2013
05:44 AM

83 Views

perfwise

Beginner


03-06-2013
04:45 AM

83 Views

The issue here is that you have 3 arrays, all coming from different cache locations. B is definitely in the L3: it attenuates the associativity of the L1 and the L2, and there is no way it can fit into the L1 or L2. C is coming from the L1, or if not there then from the L2. It is accessed sequentially, and you'd have to evict every set it maps to in the cache to find it in the L2, which is possible but unlikely. A is partly in the L1 and the rest is in the L2. It's reused over and over and accessed sequentially.

These pending stats are only partially accurate in my experience. If I want to know where I'm bound, I measure the HW prefetcher activity from the L1 as well as all the L2 stats, which tell me about I-cache, L1D and HW prefetcher activity. You'll know then if you're L2 bound, and you might measure the demand request stream from the L3, just to get an idea of whether requests are not getting serviced by the L2 HW prefetcher and are making their way to the L3. Still, SB can deliver 2.5 upc operating out of its L3 with 40-50 requests per thousand getting there (though this is with the HW prefetcher picking up on that pattern). The problem for you is that B is striding by 8192 B and the HW prefetchers don't handle that pattern, so your demand requests are definitely getting to the L3. Every 8 iterations on the K loop you need to fetch 1024 cachelines from the L3.
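The 8192 B stride and the 1024 cache lines mentioned above can be checked with a little arithmetic. The sketch below assumes a row-major `double B[N][N]` walked down a column by the inner k loop (the loop bodies in the original post were lost, so this access pattern is inferred from the description above) and assumes 64-byte cache lines. All function names are hypothetical.

```c
#include <stddef.h>

#define N 1024           /* matrix dimension from the original post */
#define CACHE_LINE 64    /* ASSUMPTION: cache-line size in bytes */

/* Bytes between B[k][j] and B[k+1][j] in a row-major double B[N][N]. */
static size_t column_stride_bytes(void)
{
    return (size_t)N * sizeof(double);
}

/* How many consecutive doubles share one cache line. */
static size_t doubles_per_line(void)
{
    return CACHE_LINE / sizeof(double);
}

/* Distinct cache lines touched by one full sweep of k down a column.
 * When the stride is at least one line, every access hits a new line. */
static size_t lines_per_column_sweep(void)
{
    size_t stride = column_stride_bytes();
    if (stride >= CACHE_LINE)
        return N;
    return ((size_t)N * stride + CACHE_LINE - 1) / CACHE_LINE;
}

/* Total footprint of one N x N array of doubles, in bytes. */
static size_t array_bytes(void)
{
    return (size_t)N * N * sizeof(double);
}
```

With 8-byte doubles, the stride is 8192 bytes, one sweep of k touches 1024 distinct lines, and 8 neighbouring columns share each line, which is where the factor of 8 comes from. Each array occupies 8 MiB, which is why B cannot be held in the L1 or L2.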

perfwise
