Solved: Puzzle: changing the order of outer loops leads to significant performance increase

Wentao_Z_ · ‎03-26-2014

Hi,

I have a puzzling finding that changing the order of the outer loops led to significant performance increase. I am playing with the following two versions of a small code piece:

Version 1: ii, k, j, i

1529         do ii = iis, iie
1530           value = vals(ii)
1531           do k = ks, ke
1532             do j = js, je
1533               ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
1534 !DIR$ SIMD
1535               do i = is, ie
1536                 l0 = ind_offset + i
1537                 dF(l0) = dF(l0) + value * F(l0 + ii)
1538               end do
1539             end do
1540           end do
1541         end do

Version 2: k, ii, j, i

1529         do k = ks, ke
1530           do ii = iis, iie
1531             value = vals(ii)
1532             do j = js, je
1533               ind_offset = ( (k-1)*N2 + (j-1) ) * N1g
1534 !DIR$ SIMD
1535               do i = is, ie
1536                 l0 = ind_offset + i
1537                 dF(l0) = dF(l0) + value * F(l0 + ii)
1538               end do
1539             end do
1540           end do
1541         end do

The ONLY difference between these two versions is the order of the outermost two loops: Version 1 has a loop order of ii, k, j, i while Version 2 has a loop order of k, ii, j, i. The profiling results of these two versions are summarized as below:

CPU Time(s) Load Instructions L1 Cache Hits L2 Cache Hits L3 Cache Hits MainMemory Hits
Version 1 11.282 1.36E+10 75.86% 3.46% 20.69% 0.00%
Version 2 7.372 1.36E+10 94.76% 1.24% 4.00% 0.00%

The results really surprised me in two ways:
(1) I observed a non-trivial speedup 11.282/7.372 = 1.53 and a significant increase in L1 Cache Hits.
(2) The only change I made was rearranging the order of the two OUTER loops, i.e., do ii loop and do k loop.

I have checked the vectorization report and found the inner loop (do i loop) in both of the two versions have been vectorized. So now I really have no idea what is going on. I compiled the code using ifort 13.1.0 with -O2 -xHost. The loop bound (length) for each loop level is:

do ii loop: 5
do k loop: 48
do j loop: 40
do i loop: 36

I truly appreciate your time and help.

Best regards,
Wentao

Yuan_C_Intel · ‎03-27-2014

Hi, Wentao

In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small

While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.

Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.

Hope this explains.

View solution in original post

Yuan_C_Intel · ‎03-27-2014

Hi, Wentao

In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small

While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.

Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.

Hope this explains.

Wentao_Z_ · ‎03-27-2014

Yolanda Chen (Intel) wrote:

Hi, Wentao

In your second version, for the same k, within loop ii, ind_offset was unchanged for Dimension k. Thus l0 is only counted by Dimension j and i. The corresponding memory block (size 40*36) of dF can be re-used in loop ii for 5 times. Since ii is quite small

While in your first version, for each the same ii, l0 is counted by Dimension k, j and i which refers to much larger data blocks (size 48*40*36) and cannot fit L1 cashe. Most of them will lie on L2 or L3. That's why you will get lower L1 cashe hits and bigger L and L3 cashe hits.

Access of Array F is counted by all loop variables: ii, k, j, and i, so does not impact much.

Hope this explains.

Hi Yolanda,

You explanations REALLY helps. Many thanks!

Best regards,
Wentao

jimdempseyatthecove · ‎03-28-2014

I think another factor is at play as well

(k-1)*N2 is loop invariant after k is established, and thus can be moved to immediately following do k=. The second version will compute (k-1)*N2 fewer times than the first. The L1 cache influence is likely greater contributor.

Jim Dempsey

Wentao_Z_ · ‎03-28-2014

jimdempseyatthecove wrote:

I think another factor is at play as well

(k-1)*N2 is loop invariant after k is established, and thus can be moved to immediately following do k=. The second version will compute (k-1)*N2 fewer times than the first. The L1 cache influence is likely greater contributor.

Jim Dempsey

Hi Jim,

Thanks for pointing out this :-)

Best regards,
Wentao