Memory bound characterization on Ivy Bridge

Yunqi_Z_ · ‎03-04-2013

Hi all,

I found it confusing when I tried to characterize the memory bound on Ivy Bridge as it is mentioned in the Intel 64 and IA-32 Architectures Optimization Reference Manual Appendix B.3.2.3, that I got larger number on STALLS_L2_PENDING than STALLS_L1D_PENDING. Consequently, If I do the calculation for %L2 Bound as the manual tells, I will get negative number for %L2 Bound. Could anyone help me with this please?

This it the code segment I tried to characterize:

#define N 1024

double A, B, C;

void code_to_monitor() {
int i, j, k;

for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
A = B = i + j;
C = 0.0;
}
}

for (i = 0; i < N; i++) {
for (j = 0; j < N; j++) {
for (k = 0; k < N; k++) {
C += A * B;
}
}
}
}

And these are the numbers I got from the experiments.

CYCLE_ACTIVITY:STALLS_LDM_PENDING : 25129701285
CYCLE_ACTIVITY:STALLS_L1D_PENDING : 22822968083
CYCLE_ACTIVITY:STALLS_L2_PENDING : 24375543727
TOTAL CYCLES: 43885183166

SergeyKostrov · ‎03-04-2013

Let's consider Intel Core i7-3840QM ( Ivy Bridge / 4 cores / 8 logical CPUs / ark.intel.com/compare/70846 ): Size of L3 Cache = 8MB ( shared between all cores for data & instructions ) Size of L2 Cache = 1MB ( 256KB per core / shared for data & instructions ) Size of L1 Cache = 256KB ( 32KB per core for data & 32KB per core for instructions ) >>#define N 1024 >> >>double A, B, C; >>... It means, that: - Size of A matrix is 4MB ( 4,194,304 bytes ) - Size of B matrix is 4MB ( 4,194,304 bytes ) - Size of C matrix is 4MB ( 4,194,304 bytes ) As you can see only two matricies, for example A and B, could "fit" into L3 Cache at the same time in the best case (!). But, the "core" of your calculations uses C matrix as well: ... C += A * B; ... and I think it creates a problem you're observing. It is Not clear why you got a negative number for L2 Bound. Of course, none of these matricies could "fit" into L2 and L1 Caches. By the way, that is why Loop-Blocking optimization technique is recommended in such cases and it is described in the manual.

Yunqi_Z_ · ‎03-04-2013

Hi Sergey,

Thanks for you reply. Actually I'm not trying to optimize the matrix multiplication, it's just a piece of sample code to check if the memorgy bound characterization work well which had given me negative numbers.

Bernard · ‎03-04-2013

>>> I will get negative number for %L2 Bound>>>

This is the formula used to calculate %L2 Bound : (CYCLE_ACTIVITY:STALLS_L1D_PENDING - CYCLE_ACTIVITY:STALLS_L2_PENDING) / CLOCKS

Now by looking at the formula values STALLS_L1D_PENDING is less than STALLS_L2D_PENDING so you are getting a negative result.

Yunqi_Z_ · ‎03-05-2013

Yes, iliyapolak. That's why I'm confused.

Bernard · ‎03-05-2013

Maybe it should be this way.

SergeyKostrov · ‎03-05-2013

>>...Thanks for you reply. Actually I'm not trying to optimize the matrix multiplication... I understood this and I think Intel software engineers should review that formula in the Intel 64 and IA-32 Architectures Optimization Reference manual.

perfwise · ‎03-06-2013

The issue here is you have 3 arrays all coming from different cache locations. B is definitely in the L3, it attenuates the associativity of the L1 and the L2, no way it can fit into the L1 or L2. C is coming from L1, if not there then the L2, it is sequentially accessed and you'd have to evict every set it fits into the cache to find it in the L2, possible but unlikely. A is partly in the L1 and the rest is in the L2, it's reused over and over and accessed sequentially.

These pending stats are only partially accurate in my experience. If I want to know where I'm bound I measure the hw pref activity from the L1 as well as all the L2 stats which tell me about I-cache, L1D and HW pref activity. You'll know then if you're L2 bound, and you might measure the demand request stream from the L3, just to get an idea if they're not getting serviced by the L2 hw pref and making their way to the L3. Still, SB can deliver 2.5 upc operating out of it's L3 with 40-50 requests per thousand getting there (though this is with the HW pref picking up on that pattern). Problem for you is B is striding by 8192 B and the HW pref don't handle that pattern, so you're demand req are definitely getting to the L3. Every 8 iterations on the K loop you need to fetch 1024 cachelines from the L3.

perfwise