Array index order in loops not behaving as expected (slow/fast)

jkwi · ‎06-28-2013

I realize that F90 gives us some array operations, but I'm just trying to figure this out. Old-school thinking has us looping over the last array index in the outermost loop to address memory consecutively.
The results I'm getting are not what I expect. With default optimization I used -opt-report, and for the "slow" code the compiler is optimizing and switching the order of the loops. For the "fast" code (where the last index is in the outermost loop) it does not, and that runs *slower*. What is going on? If I set -O0 then I get the expected result: the code below runs faster with j in the outer loop.

Source codes attached.

What do I take away from this? Should we not try and be smart about the index order in loops? Thanks for any insight.

      integer ndimi,ndimj,ntimes
      parameter (ndimi=2000, ndimj=3000, ntimes=1000)
      integer x(ndimi,ndimj),y(ndimi,ndimj), i,j,k
      integer timesec1, timesec2

      call system_clock(timesec1)
      print *, 'time: ', timesec1

      do k = 1,ntimes
       do j = 1,ndimj
        do i = 1,ndimi
         x(i,j) = 5
         y(i,j) = 6
         x(i,j) = x(i,j) * y(i,j)
        end do
       end do
      end do

      call system_clock(timesec2)

      print *, 'time: ',timesec2
      print *, 'diff: ' ,timesec2 - timesec1

      end program

ifort (IFORT) 12.1.6 20130222
ifort -mcmodel=medium -shared-intel -opt-report loopindex_slow.f >& report_slow.txt
./a.out
time:   2033097649
time:   2033115630
diff:        17981
ifort -mcmodel=medium -shared-intel -opt-report loopindex.f >& report.txt
./a.out
time:   2033245879
time:   2033338024
diff:        92145

report_slow.txt has:
<loopindex_slow.f;10:10;hlo_linear_trans;MAIN__;0>
LOOP INTERCHANGE in loops at line: 10 12 13
Loopnest permutation ( 1 2 3 ) --> ( 3 1 2 )

Casey · ‎06-28-2013

My first thought is that the behavior you observe with -O2 or greater has less to do with your loop iteration order and more to do with the operations inside the loop. Your statements do not depend on k, and although you loop over i and j, the values you assign do not depend on i or j either. A better test of the effects you are trying to explore would be a statement that depends on i, j, and k and incorporates references to things like x(i+1,j-1), so that the compiler cannot as easily optimize away your entire loop.
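As a rough illustration of that suggestion, here is a minimal sketch of a loop body that depends on i, j, and k and reads a neighboring element. The program name, the smaller `ntimes`, the `mod` feedback step, and the checksum are all assumptions made for this example; none of them come from the attached sources.

```fortran
      program loop_order_sketch
      implicit none
      ! Hypothetical sizes; ntimes is much smaller than in the post.
      integer, parameter :: ni = 2000, nj = 3000, ntimes = 20
      ! Wide integer kind (an assumption) so the checksum cannot overflow.
      integer, parameter :: i8 = selected_int_kind(18)
      integer, allocatable :: x(:,:), y(:,:)
      integer :: i, j, k

      allocate(x(ni,nj), y(ni,nj))
      x = 1
      y = 0

      do k = 1, ntimes
        do j = 2, nj          ! last index in the outer loop
          do i = 1, ni - 1    ! first index innermost: unit stride
            ! The stored value depends on i, j, k and on a neighbor.
            y(i,j) = x(i+1,j-1) + i + j + k
          end do
        end do
        ! Feed the result back so each k iteration does new work.
        x = mod(y, 7)
      end do

      ! Printing a result keeps the compiler from discarding the loops.
      print *, 'checksum: ', sum(int(y, i8))
      end program loop_order_sketch
```

To compare index orders, swap the `do j` and `do i` lines (and their matching `end do` lines) and check the optimization report again, since the compiler may still choose to interchange them.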

TimP · ‎06-29-2013

Apparently, the outer loop is short-circuited, and the inner loops are interchanged, in the case you intended to be slow. As Casey hinted, you should construct a benchmark which focuses on the point you are trying to make.

You wouldn't need to repeat your benchmark so many times if you declared the system_clock arguments as integer(8). All currently maintained compilers support this much of Fortran 2003 (although it doesn't help on ifort Windows).
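A minimal sketch of that timing approach might look like the following. It assumes `selected_int_kind(18)` maps to the same kind as integer(8), which holds for ifort and gfortran but is not guaranteed everywhere. The program name and the placeholder work loop are made up for illustration.

```fortran
      program clock_sketch
      implicit none
      ! Assumed to be equivalent to integer(8) on ifort and gfortran.
      integer, parameter :: i8 = selected_int_kind(18)
      integer(i8) :: t1, t2, rate
      integer :: i
      real :: s

      ! The second argument (count_rate) is the number of ticks per second.
      call system_clock(t1, rate)

      s = 0.0
      do i = 1, 1000000
        s = s + sqrt(real(i))     ! placeholder work to be timed
      end do

      call system_clock(t2)

      print *, 'ticks:   ', t2 - t1
      ! Assumes rate is nonzero, i.e. the system provides a clock.
      print *, 'seconds: ', real(t2 - t1) / real(rate)
      ! Printing s keeps the timed loop from being optimized away.
      print *, 'result:  ', s
      end program clock_sketch
```

Dividing by the count rate gives elapsed seconds rather than raw ticks, which makes runs easier to compare across compilers whose clocks tick at different rates.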