Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

determining optimum loop nesting and array indexing

gregfi04
Beginner
I just ran into a situation where my intuition about how to nest loops led me astray fairly dramatically, from a performance perspective.

The first block of code below shows how the code was originally structured. To my mind, this makes sense, with the most-nested loop indexing over the first subscript of the arrays (insofar as possible). Each successive loop indexes the next subscript to the right.

The second block of code performs much better. There, the second and third most-nested loops index over the last subscripts of the psi array (which is quite large), and the outermost loops index over the subscripts that are farther to the left in the psi array.

Should I have been able to guess that code block #2 would perform better than code block #1? Is there any better method for finding the optimal ordering, other than just trying all of the different possibilities?

Thanks,
Greg


---code blocks follow---

! Code block #1: original ordering (nlo and ndl outermost)
do nlo=1,nloct
  do ndl=1,nldir
    do zz=1,zm
      do yy=1,ym
        do xx=1,xm
          do phi=0,phim
            mat(phi,xx,yy,zz)=mat(phi,xx,yy,zz)+psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo


! Code block #2: faster ordering (zz, yy, xx outermost)
do zz=1,zm
  do yy=1,ym
    do xx=1,xm
      do nlo=1,nloct
        do ndl=1,nldir
          do phi=0,phim
            mat(phi,xx,yy,zz)=mat(phi,xx,yy,zz)+psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
3 Replies
jimdempseyatthecove
Honored Contributor III
The representative values of phim, xm, ym, zm, nldir, and nloct (phim in particular) are not stated, nor is whether they are parameters (constants visible to the compiler), so it is difficult to say what is going on. The psi array, unless it is dimensioned (7:7, xm, ym, zm), will be accessed with a stride, which is not particularly suitable for vectorization regardless of which loop is used. IMHO what you are observing is that the second loop nest gets a higher cache hit rate for the read/modify/write of mat (i.e. the same row of mat is referenced nldir*nloct times before you advance to the next row).
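
One way to check the cache explanation is to time both orderings on the same data. The following is only a minimal sketch: the array sizes are scaled down from a real problem, and the 8-byte real kind, the extent of 7 for the first subscript of psi, and all program and variable names other than those in the original loops are assumptions made for illustration. The difference should only show up clearly when mat is larger than the last-level cache, so the sizes may need adjusting for a given machine.

```fortran
program loop_order_timing
  ! Sketch only: sizes and the real kind are assumptions, not values from the thread.
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  integer, parameter :: n1 = 7   ! assumed extent of psi's first subscript
  integer :: phim, xm, ym, zm, nldir, nloct
  integer :: phi, xx, yy, zz, ndl, nlo
  integer(int64) :: t0, t1, t2, rate
  real(dp), allocatable :: mat1(:,:,:,:), mat2(:,:,:,:)
  real(dp), allocatable :: psi(:,:,:,:,:,:), fac(:,:,:)

  ! Scaled-down sizes (about 200 MB for psi with 8-byte reals).
  phim = 15; xm = 48; ym = 48; zm = 48; nldir = 8; nloct = 4

  allocate(mat1(0:phim,xm,ym,zm), mat2(0:phim,xm,ym,zm))
  allocate(psi(n1,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct))
  call random_number(psi)
  call random_number(fac)
  mat1 = 0.0_dp
  mat2 = 0.0_dp

  call system_clock(t0, rate)

  ! Ordering 1: all of mat is swept once for every (ndl,nlo) pair.
  do nlo = 1, nloct
    do ndl = 1, nldir
      do zz = 1, zm
        do yy = 1, ym
          do xx = 1, xm
            do phi = 0, phim
              mat1(phi,xx,yy,zz) = mat1(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            end do
          end do
        end do
      end do
    end do
  end do

  call system_clock(t1)

  ! Ordering 2: each row mat(:,xx,yy,zz) is finished before moving to the next.
  do zz = 1, zm
    do yy = 1, ym
      do xx = 1, xm
        do nlo = 1, nloct
          do ndl = 1, nldir
            do phi = 0, phim
              mat2(phi,xx,yy,zz) = mat2(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            end do
          end do
        end do
      end do
    end do
  end do

  call system_clock(t2)

  print '(a,f10.4,a)', 'ordering 1: ', real(t1 - t0, dp)/real(rate, dp), ' s'
  print '(a,f10.4,a)', 'ordering 2: ', real(t2 - t1, dp)/real(rate, dp), ' s'
  ! Both orderings add the terms for a given element in the same sequence,
  ! so the results should agree to rounding.
  print *, 'max abs difference: ', maxval(abs(mat1 - mat2))
end program loop_order_timing
```

Timings from a run like this depend heavily on the compiler options and the cache sizes of the machine, so they are best treated as a rough comparison rather than a measurement of the real application.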

Jim Dempsey
gregfi04
Beginner
Some representative values are given below. They are not constants visible to the compiler; they depend on the specifics of the problem. The first subscript of the psi array is dimensioned (1:7) or (1:9).

phim=15
xm=67
ym=139
zm=102
nldir=36
nloct=4
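
With these values, mat holds 16*67*139*102 = 15,198,816 elements, and code block #1 sweeps all of them once for each of the nldir*nloct = 144 (ndl,nlo) pairs. The sketch below just does that arithmetic. The 8-byte element size and the extent of 7 for the first subscript of psi are assumptions (the thread does not state the real kind), and the program and variable names other than the loop bounds are made up for illustration.

```fortran
program working_set
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  ! Representative sizes quoted in the post above.
  integer(int64), parameter :: phim = 15, xm = 67, ym = 139, zm = 102
  integer(int64), parameter :: nldir = 36, nloct = 4
  ! Assumptions: 8-byte reals, first extent of psi equal to 7.
  integer(int64), parameter :: bytes_per_real = 8, n1 = 7
  integer(int64) :: cells, row_bytes, mat_bytes, psi_bytes, sweeps

  cells     = xm*ym*zm                              ! number of (xx,yy,zz) points
  row_bytes = (phim + 1)*bytes_per_real             ! one row mat(0:phim,xx,yy,zz)
  mat_bytes = row_bytes*cells                       ! all of mat
  psi_bytes = n1*cells*nldir*nloct*bytes_per_real   ! all of psi
  sweeps    = nldir*nloct                           ! updates applied to each row of mat

  print '(a,i0)', 'bytes in one row of mat:      ', row_bytes
  print '(a,i0)', 'bytes in mat:                 ', mat_bytes
  print '(a,i0)', 'bytes in psi:                 ', psi_bytes
  print '(a,i0)', 'updates per row of mat:       ', sweeps
  ! Block #1 streams the whole of mat through memory on every sweep;
  ! block #2 keeps one row in cache while all of its updates are applied.
  print '(a,i0)', 'mat bytes read by block #1:   ', mat_bytes*sweeps
  print '(a,i0)', 'mat bytes read by block #2:   ', mat_bytes
end program working_set
```

The last two figures count only reads of mat and ignore cache-line effects and the strided psi loads, so they give a rough picture of why the orderings differ rather than a prediction of run time.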
TimP
Honored Contributor III
The 2nd loop organization has the advantage of much better locality for the updates to mat. You should be looking at -opt-report for any differences between the two versions, and to see if the reduction loop over ndl and nlo (which might be linearizable) has been vectorized (more likely at -xSSE4.x).
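
Since ndl and nlo are adjacent trailing subscripts of both psi and fac, the two loops over them can be fused into a single loop over k = 1..nldir*nloct. One way to sketch that, assuming whole contiguous arrays are passed, is to hand them to a routine with explicit-shape dummy arguments that declare a single trailing extent. The module and routine names, the small test sizes, and the extent of 7 for the first subscript of psi are all made up for illustration, and whether the compiler then vectorizes the loop still has to be checked in the optimization report.

```fortran
module linearized_update
  implicit none
contains
  ! psi and fac are received with ndl and nlo fused into one extent nk = nldir*nloct.
  subroutine accumulate(mat, psi, fac, phim, n1, xm, ym, zm, nk)
    integer, intent(in) :: phim, n1, xm, ym, zm, nk
    real, intent(inout) :: mat(0:phim, xm, ym, zm)
    real, intent(in)    :: psi(n1, xm, ym, zm, nk)
    real, intent(in)    :: fac(0:phim, nk)
    integer :: phi, xx, yy, zz, k

    do zz = 1, zm
      do yy = 1, ym
        do xx = 1, xm
          do k = 1, nk          ! single loop replacing the nlo and ndl loops
            do phi = 0, phim
              mat(phi,xx,yy,zz) = mat(phi,xx,yy,zz) + psi(7,xx,yy,zz,k)*fac(phi,k)
            end do
          end do
        end do
      end do
    end do
  end subroutine accumulate
end module linearized_update

program check_linearized
  use linearized_update
  implicit none
  ! Small made-up sizes, just to compare against the original nesting.
  integer, parameter :: phim = 15, n1 = 7, xm = 5, ym = 4, zm = 3, nldir = 6, nloct = 2
  real, allocatable :: mat_ref(:,:,:,:), mat_lin(:,:,:,:), psi(:,:,:,:,:,:), fac(:,:,:)
  integer :: phi, xx, yy, zz, ndl, nlo

  allocate(mat_ref(0:phim,xm,ym,zm), mat_lin(0:phim,xm,ym,zm))
  allocate(psi(n1,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct))
  call random_number(psi)
  call random_number(fac)
  mat_ref = 0.0
  mat_lin = 0.0

  ! Reference: the second loop organization from the original post.
  do zz = 1, zm
    do yy = 1, ym
      do xx = 1, xm
        do nlo = 1, nloct
          do ndl = 1, nldir
            do phi = 0, phim
              mat_ref(phi,xx,yy,zz) = mat_ref(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            end do
          end do
        end do
      end do
    end do
  end do

  ! Rank-6 psi and rank-3 fac are associated with the rank-5 and rank-2
  ! explicit-shape dummies through ordinary sequence association.
  call accumulate(mat_lin, psi, fac, phim, n1, xm, ym, zm, nldir*nloct)

  print *, 'max abs difference: ', maxval(abs(mat_ref - mat_lin))
  if (maxval(abs(mat_ref - mat_lin)) > 1.0e-4) error stop 'linearized result differs'
end program check_linearized
```

The fused index runs as k = ndl + (nlo-1)*nldir, so the terms are added in the same order as in the reference nesting. Passing an array section instead of a whole array could force the compiler to make a temporary copy, which is why the sketch assumes whole arrays.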