determining optimum loop nesting and array indexing

gregfi04 · ‎09-01-2011

I just ran into a situation where my intuition about how to nest loops led me astray fairly dramatically, from a performance perspective.

The first block of code below shows how the code was originally structured. To my mind, this makes sense, with the most-nested loop indexing over the first subscript of the arrays (insofar as possible). Each successive loop indexes the next subscript to the right.

The second block of code is showing much better performance. The second and third most-nested loops index over the last subscripts of the psi array (which is quite large). The outer-most nested loops index over the subscripts that are farther to the left in the psi array.

Should I have been able to guess that code block #2 would perform better than code block #1? Is there any better method for finding the best-performing ordering, other than just trying all of the different possible orderings?

Thanks,
Greg

---code blocks follow---

! Code block #1: original ordering (nlo and ndl outermost)
do nlo=1,nloct
  do ndl=1,nldir
    do zz=1,zm
      do yy=1,ym
        do xx=1,xm
          do phi=0,phim
            mat(phi,xx,yy,zz)=mat(phi,xx,yy,zz)+psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo

! Code block #2: faster ordering (zz, yy, xx outermost)
do zz=1,zm
  do yy=1,ym
    do xx=1,xm
      do nlo=1,nloct
        do ndl=1,nldir
          do phi=0,phim
            mat(phi,xx,yy,zz)=mat(phi,xx,yy,zz)+psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo

jimdempseyatthecove · ‎09-01-2011

You have not stated the (representative) values of phim, xm, ym, zm, nldir, and nloct (phim in particular), nor whether these are parameters (constants visible to the compiler), so it is difficult to say what is going on. The psi array, unless it is dimensioned (7:7, xm, ym, zm), will be accessed with a stride, which is not particularly suitable for vectorization regardless of which loop is used. IMHO, what you are observing is that the second loop nest gets a higher cache hit count for the read/modify/write of mat (i.e., the same row of mat is referenced nldir*nloct times before you advance to the next row).

Jim Dempsey

gregfi04 · ‎09-01-2011

Some representative values are given below. They are not constants visible to the compiler; they depend on the specifics of the problem. The first subscript of the psi array is dimensioned (1:7) or (1:9).

phim=15
xm=67
ym=139
zm=102
nldir=36
nloct=4
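
With these values, the cache argument made earlier in the thread can be put into rough numbers. The sketch below only does arithmetic on the array shapes, relying on Fortran's column-major storage (the stride of a subscript is the product of the extents to its left). The program and variable names are made up for illustration, np = 7 is just one of the two extents mentioned for the first subscript of psi, and everything is counted in array elements because the element size is not stated in the thread.

```fortran
program stride_sketch
  use iso_fortran_env, only: int64
  implicit none
  ! Representative sizes quoted in the thread. np = 7 is an assumption:
  ! the first subscript of psi is said to be (1:7) or (1:9).
  integer(int64), parameter :: np = 7, phim = 15
  integer(int64), parameter :: xm = 67, ym = 139, zm = 102
  integer(int64), parameter :: nldir = 36, nloct = 4
  integer(int64) :: stride_xx, stride_ndl, mat_elems, revisits

  ! Column-major storage: stride of a subscript = product of extents to its left.
  stride_xx  = np                 ! psi(7,xx,...) when xx advances by 1
  stride_ndl = np*xm*ym*zm        ! psi(7,xx,yy,zz,ndl,...) when ndl advances by 1

  ! Size of mat, and how often each mat(:,xx,yy,zz) is updated in total.
  mat_elems = (phim + 1)*xm*ym*zm
  revisits  = nldir*nloct

  print '(a,i0)', 'psi stride in xx  (elements): ', stride_xx
  print '(a,i0)', 'psi stride in ndl (elements): ', stride_ndl
  print '(a,i0)', 'elements in mat:              ', mat_elems
  print '(a,i0)', 'updates per mat(:,xx,yy,zz):  ', revisits
end program stride_sketch
```

For the sizes above this gives a psi stride of 7 elements in xx, 6649482 elements in ndl, 15198816 elements in mat, and 144 updates per mat(:,xx,yy,zz). In code block #1 the whole of mat is swept once for each of the 144 (ndl, nlo) pairs, whereas in code block #2 all 144 updates to a given 16-element run of mat happen back to back, which is consistent with the locality explanation even though psi is then read with the much larger stride.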

TimP · ‎09-01-2011

The 2nd loop organization has the advantage of much better locality for the updates to mat. Check the -opt-report output for any differences between the two versions, and to see whether the reduction loop over ndl and nlo (which might be linearizable) has been vectorized (more likely at -xSSE4.x).
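
On the question of whether there is anything better than trying every ordering: reasoning about which array gets revisited narrows the candidates, but a small timing harness is still a cheap way to confirm the guess. The following is a minimal sketch, not the original application code. The program name, the scaled-down extents (32 x 32 x 32 instead of 67 x 139 x 102), np = 7, and the choice of 64-bit reals are all assumptions made so that it runs in a modest amount of memory (psi takes roughly 264 MB at these sizes).

```fortran
program loop_order_timing
  use iso_fortran_env, only: real64, int64
  implicit none
  ! Assumed, scaled-down sizes (not the values from the thread).
  integer, parameter :: np = 7, phim = 15, xm = 32, ym = 32, zm = 32
  integer, parameter :: nldir = 36, nloct = 4
  real(real64), allocatable :: mat1(:,:,:,:), mat2(:,:,:,:)
  real(real64), allocatable :: psi(:,:,:,:,:,:), fac(:,:,:)
  integer :: phi, xx, yy, zz, ndl, nlo
  integer(int64) :: t0, t1, t2, rate

  allocate(mat1(0:phim,xm,ym,zm), mat2(0:phim,xm,ym,zm))
  allocate(psi(np,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct))
  call random_number(psi)
  call random_number(fac)
  mat1 = 0.0_real64
  mat2 = 0.0_real64

  call system_clock(t0, rate)
  ! Ordering #1: nlo and ndl outermost, all of mat swept for each pair.
  do nlo = 1, nloct
    do ndl = 1, nldir
      do zz = 1, zm
        do yy = 1, ym
          do xx = 1, xm
            do phi = 0, phim
              mat1(phi,xx,yy,zz) = mat1(phi,xx,yy,zz) &
                                 + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
  call system_clock(t1)
  ! Ordering #2: zz, yy, xx outermost, each mat(:,xx,yy,zz) finished in one go.
  do zz = 1, zm
    do yy = 1, ym
      do xx = 1, xm
        do nlo = 1, nloct
          do ndl = 1, nldir
            do phi = 0, phim
              mat2(phi,xx,yy,zz) = mat2(phi,xx,yy,zz) &
                                 + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
  call system_clock(t2)

  print '(a,f8.3,a)', 'ordering #1: ', real(t1 - t0, real64)/real(rate, real64), ' s'
  print '(a,f8.3,a)', 'ordering #2: ', real(t2 - t1, real64)/real(rate, real64), ' s'
  print '(a,es10.3)', 'max abs difference: ', maxval(abs(mat1 - mat2))
  ! Both orderings must produce the same sums (up to rounding).
  if (maxval(abs(mat1 - mat2)) > 1.0e-9_real64) error stop 'orderings disagree'
end program loop_order_timing
```

The size of the gap, and even its direction, will depend on the machine, the array sizes relative to the caches, and the optimization level. At higher optimization levels a compiler may interchange loops on its own, so it is worth reading the optimization report alongside the timings.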