- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi,
I have a puzzling finding that changing the order of the outer loops led to significant performance increase. I am playing with the following two versions of a small code piece:
Version 1: ii, k, j, i
1529 do ii = iis, iie 1530 value = vals(ii) 1531 do k = ks, ke 1532 do j = js, je 1533 ind_offset = ( (k-1)*N2 + (j-1) ) * N1g 1534 !DIR$ SIMD 1535 do i = is, ie 1536 l0 = ind_offset + i 1537 dF(l0) = dF(l0) + value * F(l0 + ii) 1538 end do 1539 end do 1540 end do 1541 end do
Version 2: k, ii, j, i
1529 do k = ks, ke 1530 do ii = iis, iie 1531 value = vals(ii) 1532 do j = js, je 1533 ind_offset = ( (k-1)*N2 + (j-1) ) * N1g 1534 !DIR$ SIMD 1535 do i = is, ie 1536 l0 = ind_offset + i 1537 dF(l0) = dF(l0) + value * F(l0 + ii) 1538 end do 1539 end do 1540 end do 1541 end do
The ONLY difference between these two versions is the order of the outermost two loops: Version 1 has a loop order of ii, k, j, i while Version 2 has a loop order of k, ii, j, i. The profiling results of these two versions are summarized as below:
CPU Time(s) Load Instructions L1 Cache Hits L2 Cache Hits L3 Cache Hits MainMemory Hits
Version 1 11.282 1.36E+10 75.86% 3.46% 20.69% 0.00%
Version 2 7.372 1.36E+10 94.76% 1.24% 4.00% 0.00%
The results really surprised me in two ways:
(1) I observed a non-trivial speedup 11.282/7.372 = 1.53 and a significant increase in L1 Cache Hits.
(2) The only change I made was rearranging the order of the two OUTER loops, i.e., do ii loop and do k loop.
I have checked the vectorization report and found the inner loop (do i loop) in both of the two versions have been vectorized. So now I really have no idea what is going on. I compiled the code using ifort 13.1.0 with -O2 -xHost. The loop bound (length) for each loop level is:
do ii loop: 5
do k loop: 48
do j loop: 40
do i loop: 36
I truly appreciate your time and help.
Best regards,
Wentao
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, Wentao
In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small
While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.
Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.
Hope this explains.
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi, Wentao
In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small
While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.
Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.
Hope this explains.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Yolanda Chen (Intel) wrote:
Hi, Wentao
In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small
While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.
Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.
Hope this explains.
Hi Yolanda,
You explanations REALLY helps. Many thanks!
Best regards,
Wentao
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I think another factor is at play as well
(k-1)*N2 is loop invariant after k is established, and thus can be moved to immediately following do k=. The second version will compute (k-1)*N2 fewer times than the first. The L1 cache influence is likely greater contributor.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
jimdempseyatthecove wrote:
I think another factor is at play as well
(k-1)*N2 is loop invariant after k is established, and thus can be moved to immediately following do k=. The second version will compute (k-1)*N2 fewer times than the first. The L1 cache influence is likely greater contributor.
Jim Dempsey
Hi Jim,
Thanks for pointing out this :-)
Best regards,
Wentao

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page