Hi all,
I ran into something confusing while trying to characterize memory bound on Ivy Bridge as described in the Intel 64 and IA-32 Architectures Optimization Reference Manual, Appendix B.3.2.3: I measured a larger count for STALLS_L2_PENDING than for STALLS_L1D_PENDING. Consequently, if I calculate %L2 Bound the way the manual tells me to, I get a negative number. Could anyone help me with this, please?
This is the code segment I tried to characterize:
#define N 1024
double A[N][N], B[N][N], C[N][N];

void code_to_monitor() {
    int i, j, k;
    /* initialization pass (the exact right-hand sides were lost
       when the forum formatting ate the array subscripts) */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            A[i][j] = ...;
            C[i][j] = ...;
        }
    }
    /* C += A * B, plain matrix multiplication */
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
And these are the numbers I got from the experiments.
CYCLE_ACTIVITY:STALLS_LDM_PENDING : 25129701285
CYCLE_ACTIVITY:STALLS_L1D_PENDING : 22822968083
CYCLE_ACTIVITY:STALLS_L2_PENDING : 24375543727
TOTAL CYCLES: 43885183166
Hi Sergey,
Thanks for your reply. Actually, I'm not trying to optimize the matrix multiplication; it's just a piece of sample code to check whether the memory-bound characterization works correctly, and it's what gave me the negative numbers.
>>>I will get negative number for %L2 Bound>>>

This is the formula used to calculate %L2 Bound: (CYCLE_ACTIVITY:STALLS_L1D_PENDING - CYCLE_ACTIVITY:STALLS_L2_PENDING) / CLOCKS.
Looking at the formula, your STALLS_L1D_PENDING value is less than STALLS_L2_PENDING, so you are getting a negative result.
Yes, iliyapolak. That's why I'm confused.
Maybe it should be this way.
The issue here is that you have 3 arrays all coming from different cache levels. B is definitely in the L3: it overwhelms the associativity and capacity of the L1 and the L2, so there's no way it can fit into either. C comes from the L1, or if not there then the L2; it is accessed sequentially, and every set it maps into would have to be evicted for it to be found only in the L2, which is possible but unlikely. A is partly in the L1 with the rest in the L2; it's reused over and over and accessed sequentially.
These pending stats are only partially accurate in my experience. If I want to know where I'm bound, I measure the HW prefetch activity from the L1 as well as all the L2 stats, which tell me about I-cache, L1D, and HW prefetch activity. You'll know then if you're L2 bound, and you might also measure the demand-request stream from the L3, just to get an idea whether requests are not getting serviced by the L2 HW prefetcher and are making their way to the L3. Still, SB can deliver 2.5 upc operating out of its L3 with 40-50 requests per thousand getting there (though this is with the HW prefetcher picking up on that pattern). The problem for you is that B strides by 8192 B, and the HW prefetcher doesn't handle that pattern, so your demand requests are definitely getting to the L3. Every 8 iterations on the k loop you need to fetch 1024 cachelines from the L3.
perfwise