- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi all,

I found it confusing when I tried to characterize the memory bound on Ivy Bridge as it is mentioned in the *Intel 64 and IA-32 Architectures Optimization Reference Manual Appendix B.3.2.3*, that I got larger number on STALLS_L2_PENDING than STALLS_L1D_PENDING. Consequently, If I do the calculation for *%L2 Bound* as the manual tells, I will get **negative number for %L2 Bound.** Could anyone help me with this please?

This it the code segment I tried to characterize:

#define N 1024

double A

void code_to_monitor() {

int i, j, k;

for (i = 0; i < N; i++) {

for (j = 0; j < N; j++) {

A

C

}

}

for (i = 0; i < N; i++) {

for (j = 0; j < N; j++) {

for (k = 0; k < N; k++) {

C

}

}

}

}

And these are the numbers I got from the experiments.

CYCLE_ACTIVITY:STALLS_LDM_PENDING : 25129701285

CYCLE_ACTIVITY:STALLS_L1D_PENDING : 22822968083

CYCLE_ACTIVITY:STALLS_L2_PENDING : 24375543727

TOTAL CYCLES: 43885183166

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

**Intel Core i7-3840QM**( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ): Size of

**L3**Cache = 8MB ( shared between all cores for data & instructions ) Size of

**L2**Cache = 1MB ( 256KB per core / shared for data & instructions ) Size of

**L1**Cache = 256KB ( 32KB per core for data & 32KB per core for instructions ) >>#define N 1024 >> >>double A

**A**matrix is 4MB ( 4,194,304 bytes ) - Size of

**B**matrix is 4MB ( 4,194,304 bytes ) - Size of

**C**matrix is 4MB ( 4,194,304 bytes ) As you can see only two matricies, for example

**A**and

**B**, could "fit" into

**L3**Cache at the same time in the best case (!). But, the "core" of your calculations uses

**C**matrix as well: ... C

+= A

* B;
...
and I think it creates a problem you're observing.
It is Not clear why you got a negative number for

**L2**Bound. Of course, none of these matricies could "fit" into

**L2**and

**L1**Caches. By the way, that is why

**Loop-Blocking**optimization technique is recommended in such cases and it is described in the manual.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi Sergey,

Thanks for you reply. Actually I'm not trying to optimize the matrix multiplication, it's just a piece of sample code to check if the memorgy bound characterization work well which had given me negative numbers.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

>>> I will get negative number for *%L2 Bound>>>*

This is the formula used to calculate %L2 Bound : (CYCLE_ACTIVITY:STALLS_L1D_PENDING - CYCLE_ACTIVITY:STALLS_L2_PENDING) / CLOCKS

Now by looking at the formula values STALLS_L1D_PENDING is less than STALLS_L2D_PENDING so you are getting a negative result.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Yes, iliyapolak. That's why I'm confused.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Maybe it should be this way.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

The issue here is you have 3 arrays all coming from different cache locations. B is definitely in the L3, it attenuates the associativity of the L1 and the L2, no way it can fit into the L1 or L2. C is coming from L1, if not there then the L2, it is sequentially accessed and you'd have to evict every set it fits into the cache to find it in the L2, possible but unlikely. A is partly in the L1 and the rest is in the L2, it's reused over and over and accessed sequentially.

These pending stats are only partially accurate in my experience. If I want to know where I'm bound I measure the hw pref activity from the L1 as well as all the L2 stats which tell me about I-cache, L1D and HW pref activity. You'll know then if you're L2 bound, and you might measure the demand request stream from the L3, just to get an idea if they're not getting serviced by the L2 hw pref and making their way to the L3. Still, SB can deliver 2.5 upc operating out of it's L3 with 40-50 requests per thousand getting there (though this is with the HW pref picking up on that pattern). Problem for you is B is striding by 8192 B and the HW pref don't handle that pattern, so you're demand req are definitely getting to the L3. Every 8 iterations on the K loop you need to fetch 1024 cachelines from the L3.

perfwise

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page