I just ran into a situation where my intuition about how to nest loops led me astray fairly dramatically, from a performance perspective.
The first block of code below shows how the code was originally structured. To my mind this makes sense: the innermost loop indexes the first subscript of the arrays (insofar as possible), and each successive loop outward indexes the next subscript to the right.
The second block of code performs much better. Its second and third innermost loops index the last two subscripts of the psi array (which is quite large), while its outermost loops index subscripts that are farther to the left in the psi array.
Should I have been able to guess that code block #2 would perform better than code block #1? Is there a better way to find the best-performing loop order than simply trying all of the possible orderings?
Thanks,
Greg
---code blocks follow---
! Code block #1 (original ordering)
do nlo=1,nloct
  do ndl=1,nldir
    do zz=1,zm
      do yy=1,ym
        do xx=1,xm
          do phi=0,phim
            mat(phi,xx,yy,zz)=mat(phi,xx,yy,zz)+psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo

! Code block #2 (reordered, faster)
do zz=1,zm
  do yy=1,ym
    do xx=1,xm
      do nlo=1,nloct
        do ndl=1,nldir
          do phi=0,phim
            mat(phi,xx,yy,zz)=mat(phi,xx,yy,zz)+psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
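
For anyone who wants to measure the two orderings rather than guess, here is a minimal timing sketch. It is not the original program: the array bounds (mat(0:phim,xm,ym,zm), psi(7,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct)), the double-precision kind, and the reduced spatial extents are all assumptions chosen so that it runs in modest memory. With extents this small mat may fit in cache, so the gap between the orderings will likely be smaller than on the real problem.

```fortran
program loop_order_timing
  implicit none
  integer, parameter :: dp = kind(1.0d0)            ! ASSUMPTION: 8-byte reals
  integer, parameter :: i8 = selected_int_kind(18)
  ! ASSUMPTION: hypothetical, reduced extents; scale up to taste.
  integer, parameter :: phim = 15, xm = 24, ym = 24, zm = 24
  integer, parameter :: nldir = 36, nloct = 4
  real(dp), allocatable :: mat1(:,:,:,:), mat2(:,:,:,:)
  real(dp), allocatable :: psi(:,:,:,:,:,:), fac(:,:,:)
  integer :: phi, xx, yy, zz, ndl, nlo
  integer(i8) :: t0, t1, t2, rate

  allocate(mat1(0:phim,xm,ym,zm), mat2(0:phim,xm,ym,zm))
  allocate(psi(7,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct))
  call random_number(psi)
  call random_number(fac)
  mat1 = 0.0_dp
  mat2 = 0.0_dp

  call system_clock(t0, rate)
  ! Ordering 1: nlo and ndl outermost, so all of mat is swept once per (ndl,nlo) pair.
  do nlo = 1, nloct
    do ndl = 1, nldir
      do zz = 1, zm
        do yy = 1, ym
          do xx = 1, xm
            do phi = 0, phim
              mat1(phi,xx,yy,zz) = mat1(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
  call system_clock(t1)
  ! Ordering 2: spatial loops outermost, so each mat(:,xx,yy,zz) is finished before moving on.
  do zz = 1, zm
    do yy = 1, ym
      do xx = 1, xm
        do nlo = 1, nloct
          do ndl = 1, nldir
            do phi = 0, phim
              mat2(phi,xx,yy,zz) = mat2(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo
  call system_clock(t2)

  print '(a,f8.3,a)', 'ordering 1: ', real(t1 - t0, dp)/real(rate, dp), ' s'
  print '(a,f8.3,a)', 'ordering 2: ', real(t2 - t1, dp)/real(rate, dp), ' s'

  ! Both orderings must produce the same mat; stop loudly if they do not.
  if (maxval(abs(mat1 - mat2)) > 1.0e-10_dp*maxval(abs(mat1))) error stop 'orderings disagree'
  print '(a)', 'results agree'
end program loop_order_timing
```

Timings from a harness like this depend heavily on compiler flags and on whether the optimizer interchanges the loops itself, so treat the numbers as indicative only.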
3 Replies
The (representative) values of phim, xm, ym, zm, nldir and nloct (phim in particular) are not stated, nor is it stated whether they are parameters (constants visible to the compiler), so it is difficult to say what is going on. Unless the psi array is dimensioned (7:7, xm, ym, zm), it will be accessed with a stride, which is not particularly suitable for vectorization regardless of which loop order is used. IMHO what you are observing is that the second loop nest gets a higher cache hit rate for the read/modify/write of mat (i.e. the same row of mat is referenced nldir*nloct times before you advance to the next row).
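
The reuse of one mat slice described here can be made explicit with a small local accumulator. The following is only a sketch under assumed array bounds and an assumed double-precision kind; update_mat, accum_sketch and the tiny test extents are hypothetical names and values for illustration, not anything from the original program. A compiler may already keep the slice in registers or cache on its own, so this is not claimed to be faster.

```fortran
module accum_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! ASSUMPTION: 8-byte reals
contains
  ! Hypothetical helper: same arithmetic as the second loop nest, with the
  ! reused slice mat(:,xx,yy,zz) held in a small local array.
  subroutine update_mat(mat, psi, fac, phim, xm, ym, zm, nldir, nloct)
    integer, intent(in) :: phim, xm, ym, zm, nldir, nloct
    real(dp), intent(inout) :: mat(0:phim,xm,ym,zm)
    real(dp), intent(in) :: psi(7,xm,ym,zm,nldir,nloct)   ! ASSUMPTION: first extent 7
    real(dp), intent(in) :: fac(0:phim,nldir,nloct)
    real(dp) :: acc(0:phim)
    integer :: xx, yy, zz, ndl, nlo
    do zz = 1, zm
      do yy = 1, ym
        do xx = 1, xm
          acc = mat(:,xx,yy,zz)              ! read the slice once
          do nlo = 1, nloct
            do ndl = 1, nldir
              acc = acc + psi(7,xx,yy,zz,ndl,nlo)*fac(:,ndl,nlo)
            enddo
          enddo
          mat(:,xx,yy,zz) = acc              ! write it back once
        enddo
      enddo
    enddo
  end subroutine update_mat
end module accum_sketch

program accum_demo
  use accum_sketch
  implicit none
  ! Tiny hypothetical extents, only to check the sketch against the plain loop nest.
  integer, parameter :: phim = 3, xm = 4, ym = 3, zm = 2, nldir = 5, nloct = 2
  real(dp) :: mat(0:phim,xm,ym,zm), ref(0:phim,xm,ym,zm)
  real(dp) :: psi(7,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct)
  integer :: phi, xx, yy, zz, ndl, nlo

  call random_number(psi)
  call random_number(fac)
  mat = 0.0_dp
  ref = 0.0_dp

  ! Reference: the second loop nest from the question, unchanged.
  do zz = 1, zm
    do yy = 1, ym
      do xx = 1, xm
        do nlo = 1, nloct
          do ndl = 1, nldir
            do phi = 0, phim
              ref(phi,xx,yy,zz) = ref(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo

  call update_mat(mat, psi, fac, phim, xm, ym, zm, nldir, nloct)

  if (maxval(abs(mat - ref)) > 1.0e-12_dp*maxval(abs(ref))) error stop 'accumulator version disagrees'
  print '(a)', 'accumulator version matches the plain loop nest'
end program accum_demo
```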
Jim Dempsey
Some representative values are given below. They are not constants visible to the compiler; they depend on the specifics of the problem. The first subscript of the psi array is dimensioned (1:7) or (1:9).
phim=15
xm=67
ym=139
zm=102
nldir=36
nloct=4
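
To put these values in perspective, the short program below works out the array sizes and access strides they imply. The element size is not stated in the thread, so the 8 bytes used here is an assumption, as is taking 7 (rather than 9) for the first extent of psi. Under those assumptions mat comes to roughly 15.2 million elements (about 122 MB) and psi to roughly 958 million elements (about 7.7 GB), while fac is only 2304 elements. In the first loop ordering mat is swept nldir*nloct = 144 times; in the second, consecutive psi reads along ndl are about 6.6 million elements apart.

```fortran
program footprint
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Representative values quoted in the thread.
  integer(int64), parameter :: phim = 15, xm = 67, ym = 139, zm = 102
  integer(int64), parameter :: nldir = 36, nloct = 4
  integer(int64), parameter :: npsi1 = 7   ! ASSUMPTION: first extent of psi is 7 (9 is the other case)
  integer(int64), parameter :: nbytes = 8  ! ASSUMPTION: 8-byte reals; not stated in the thread
  integer(int64) :: cells, n_mat, n_psi, n_fac

  cells = xm*ym*zm                      ! number of (xx,yy,zz) points
  n_mat = (phim + 1)*cells              ! mat(0:phim,xm,ym,zm)
  n_psi = npsi1*cells*nldir*nloct       ! psi(npsi1,xm,ym,zm,nldir,nloct)
  n_fac = (phim + 1)*nldir*nloct        ! fac(0:phim,nldir,nloct)

  print '(a,i0,a,f0.1,a)', 'mat: ', n_mat, ' elements, ', real(n_mat*nbytes, dp)/1.0e6_dp, ' MB'
  print '(a,i0,a,f0.1,a)', 'psi: ', n_psi, ' elements, ', real(n_psi*nbytes, dp)/1.0e6_dp, ' MB'
  print '(a,i0,a,f0.1,a)', 'fac: ', n_fac, ' elements, ', real(n_fac*nbytes, dp)/1.0e3_dp, ' kB'

  ! Distance, in elements, between consecutive psi(7,...) reads (column-major storage).
  print '(a,i0)', 'psi stride when xx advances (ordering 1):  ', npsi1
  print '(a,i0)', 'psi stride when ndl advances (ordering 2): ', npsi1*cells

  ! Ordering 1 revisits every element of mat once per (ndl,nlo) pair.
  print '(a,i0)', 'full sweeps of mat in ordering 1: ', nldir*nloct
end program footprint
```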
The 2nd loop organization has the advantage of much better locality for the updates to mat. You should look at -opt-report for any differences, and to see whether the reduction loop over ndl and nlo (which might be linearizable) has been vectorized (more likely at -xSSE4.x).
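
One way to read "linearizable" is that ndl and nlo are the last two subscripts of both psi and fac, so the pair can be treated as a single index k running from 1 to nldir*nloct. The sketch below does that by passing the arrays to a routine with explicit-shape dummy arguments of lower rank (sequence association). The routine name update_collapsed, the module name, the array bounds, the double-precision kind and the tiny test extents are all assumptions for illustration; whether the compiler then vectorizes anything differently is something only the optimization report can show.

```fortran
module linearize_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! ASSUMPTION: 8-byte reals
contains
  ! Hypothetical helper: (ndl,nlo) collapsed into one index k = ndl + (nlo-1)*nldir.
  subroutine update_collapsed(mat, psi, fac, phim, xm, ym, zm, nk)
    integer, intent(in) :: phim, xm, ym, zm, nk
    real(dp), intent(inout) :: mat(0:phim,xm,ym,zm)
    real(dp), intent(in) :: psi(7,xm,ym,zm,nk)   ! rank-6 actual viewed as rank 5
    real(dp), intent(in) :: fac(0:phim,nk)       ! rank-3 actual viewed as rank 2
    integer :: phi, xx, yy, zz, k
    do zz = 1, zm
      do yy = 1, ym
        do xx = 1, xm
          do k = 1, nk
            do phi = 0, phim
              mat(phi,xx,yy,zz) = mat(phi,xx,yy,zz) + psi(7,xx,yy,zz,k)*fac(phi,k)
            enddo
          enddo
        enddo
      enddo
    enddo
  end subroutine update_collapsed
end module linearize_sketch

program linearize_demo
  use linearize_sketch
  implicit none
  ! Tiny hypothetical extents, only to check the collapsed loop against the original.
  integer, parameter :: phim = 3, xm = 4, ym = 3, zm = 2, nldir = 5, nloct = 2
  real(dp) :: mat(0:phim,xm,ym,zm), ref(0:phim,xm,ym,zm)
  real(dp) :: psi(7,xm,ym,zm,nldir,nloct), fac(0:phim,nldir,nloct)
  integer :: phi, xx, yy, zz, ndl, nlo

  call random_number(psi)
  call random_number(fac)
  mat = 0.0_dp
  ref = 0.0_dp

  ! Reference: the second loop nest from the question, unchanged.
  do zz = 1, zm
    do yy = 1, ym
      do xx = 1, xm
        do nlo = 1, nloct
          do ndl = 1, nldir
            do phi = 0, phim
              ref(phi,xx,yy,zz) = ref(phi,xx,yy,zz) + psi(7,xx,yy,zz,ndl,nlo)*fac(phi,ndl,nlo)
            enddo
          enddo
        enddo
      enddo
    enddo
  enddo

  ! Whole contiguous arrays are passed; the dummies reinterpret the trailing two extents as one.
  call update_collapsed(mat, psi, fac, phim, xm, ym, zm, nldir*nloct)

  if (maxval(abs(mat - ref)) > 1.0e-12_dp*maxval(abs(ref))) error stop 'collapsed version disagrees'
  print '(a)', 'collapsed version matches the plain loop nest'
end program linearize_demo
```

The collapsed index visits the (ndl,nlo) pairs in the same order as the original nlo-outer, ndl-inner loops, so the additions into each element of mat happen in the same sequence.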