Intel® Fortran Compiler

difficult optimization problems

happyIntelCamper
Beginner
Consider the loop:

do i = 1, n
  out(i) = out(i) + in( index(i) )
enddo

You can get it to vectorize using -xSSE4.1, but it is still very slow because of the
indirect memory reference in( index(i) ).
By comparison, the loop

do i = 1, n
  out(i) = out(i) + in( i )
enddo

would run 4 or 5 times faster.

What's the most efficient way to perform loops with indirect memory references?
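
To make the comparison reproducible, the two loops can be wrapped as subroutines and timed. This is only a sketch: the names `add_indirect`, `add_direct`, `now_seconds`, `a_in`, `a_out` and `idx` are made up for illustration (`idx` stands in for `index`, which is also the name of a Fortran intrinsic), and single-precision `real` data is an assumption because the post does not state the type.

```fortran
module indirect_loops
  use, intrinsic :: iso_fortran_env, only: int64, real64
  implicit none
contains

  ! Gather form: every read of a_in goes through idx, so the reads
  ! are not unit-stride.
  subroutine add_indirect(n, idx, a_in, a_out)
    integer, intent(in)    :: n
    integer, intent(in)    :: idx(n)
    real,    intent(in)    :: a_in(:)
    real,    intent(inout) :: a_out(n)
    integer :: i
    do i = 1, n
      a_out(i) = a_out(i) + a_in(idx(i))
    enddo
  end subroutine add_indirect

  ! Unit-stride form used as the baseline in the question.
  subroutine add_direct(n, a_in, a_out)
    integer, intent(in)    :: n
    real,    intent(in)    :: a_in(n)
    real,    intent(inout) :: a_out(n)
    integer :: i
    do i = 1, n
      a_out(i) = a_out(i) + a_in(i)
    enddo
  end subroutine add_direct

  ! Wall-clock time in seconds, based on the system_clock intrinsic.
  function now_seconds() result(t)
    real(real64)   :: t
    integer(int64) :: ticks, rate
    call system_clock(ticks, rate)
    t = real(ticks, real64) / real(rate, real64)
  end function now_seconds

end module indirect_loops
```

Calling `now_seconds()` before and after a few hundred repetitions of each subroutine gives a rough ratio. The result will depend heavily on how scattered the values in `idx` are, so a random permutation and a nearly sorted index array are both worth trying.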
jimdempseyatthecove
Honored Contributor III

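Try a localized gather that keeps the indirect reads out of the read/modify/write loop: gather one block of in( index(...) ) into a small temporary array (inTemp, at least 128 elements), then update out from that temporary with unit stride.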

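[cpp]do i = 1, n, 128                  ! step through the arrays in blocks of 128
  jmax = min(128, n-i+1)            ! length of this block (the last one may be short)
  ! Gather pass: j is the offset within the block, so it runs from 1 to jmax.
  do j = 1, jmax
    inTemp(j) = in( index(i+j-1) )
  enddo
  ! Update pass: unit-stride read/modify/write of out.
  do j = 1, jmax
    out(i+j-1) = out(i+j-1) + inTemp(j)
  enddo
enddo
[/cpp]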

Jim Dempsey
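
For anyone who wants to compile the suggestion directly, here is one way it might look as a complete subroutine. It is a sketch under assumptions, not the poster's exact code: the module and routine names (`blocked_gather`, `add_gathered`), the renamed arrays (`a_in`, `a_out`, `idx`, `tmp`) and the `real` element type are all illustrative. The block length of 128 comes from the post.

```fortran
module blocked_gather
  implicit none
  integer, parameter :: blk = 128   ! block length, as in the post
contains

  subroutine add_gathered(n, idx, a_in, a_out)
    integer, intent(in)    :: n
    integer, intent(in)    :: idx(n)
    real,    intent(in)    :: a_in(:)
    real,    intent(inout) :: a_out(n)
    real    :: tmp(blk)             ! contiguous staging buffer for one block
    integer :: i, j, jmax

    do i = 1, n, blk
      jmax = min(blk, n - i + 1)    ! the last block may be shorter than blk
      ! Pass 1: gather only, no store to a_out.
      do j = 1, jmax
        tmp(j) = a_in(idx(i + j - 1))
      enddo
      ! Pass 2: unit-stride update from the staging buffer.
      do j = 1, jmax
        a_out(i + j - 1) = a_out(i + j - 1) + tmp(j)
      enddo
    enddo
  end subroutine add_gathered

end module blocked_gather
```

The staging buffer is small enough to stay in L1 cache, which is presumably why a block length around 128 was chosen. Whether the split actually beats the single loop depends on the compiler and on the access pattern in `idx`, so it is worth timing both.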

TimP
Honored Contributor III
SSE4 vectorization speedup for a gather depends a great deal on cache locality. It could do as well as double the speed with good locality, or show no gain with poor locality. If the loop length is on the order of 1000, and there is a fair amount of cache locality so that no thread has to read all cache lines, OpenMP parallel should show a significant gain.
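
A minimal sketch of the OpenMP variant mentioned above, assuming the same illustrative names as elsewhere in this thread (`idx` for `index`, `a_in`, `a_out`) and `real` data; the routine name `add_indirect_omp` is hypothetical. Each iteration writes a different element of `a_out`, so the loop can be split across threads without synchronization.

```fortran
module omp_gather
  implicit none
contains

  subroutine add_indirect_omp(n, idx, a_in, a_out)
    integer, intent(in)    :: n
    integer, intent(in)    :: idx(n)
    real,    intent(in)    :: a_in(:)
    real,    intent(inout) :: a_out(n)
    integer :: i

    ! Static schedule: each thread gets one contiguous range of i,
    ! so its writes to a_out stay in its own set of cache lines.
    !$omp parallel do default(none) shared(n, idx, a_in, a_out) private(i) schedule(static)
    do i = 1, n
      a_out(i) = a_out(i) + a_in(idx(i))
    enddo
    !$omp end parallel do
  end subroutine add_indirect_omp

end module omp_gather
```

The directives take effect only when OpenMP is enabled at compile time (for example `-qopenmp` with Intel Fortran or `-fopenmp` with gfortran); otherwise they are treated as comments and the loop runs serially. For short loops the thread start-up cost can outweigh the gain, so measure before adopting it.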