Solved: difficult optimization problems

happyIntelCamper · 08-25-2009

Consider the loop:

do i = 1, n
  out(i) = out(i) + in( index(i) )
enddo

You can get it to vectorize using -xSSE4.1, but it is still very slow because of the indirect memory reference. By comparison, the direct loop

do i = 1, n
  out(i) = out(i) + in( i )
enddo

would run 4 or 5 times faster.

What's the most efficient way to perform loops with indirect memory references?

jimdempseyatthecove · 08-26-2009

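Try a localized gather without read/modify/write:

[cpp]! inTemp is a scratch array of at least 128 elements
do i = 1, n, 128
  jmax = min(128,n-i+1)   ! elements in this block (the last block may be short)
  ! Pass 1: gather only, no read/modify/write on out
  do j=1, jmax
    inTemp(j) = in( index(i+j-1) )
  enddo
  ! Pass 2: unit-stride accumulate from the contiguous scratch array
  do j=1, jmax
    out(i+j-1) = out(i+j-1) + inTemp( j )
  enddo
enddo
[/cpp]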
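
For anyone who wants to try the two-pass idea in isolation, here is a minimal self-contained sketch. The names (blocked_gather_mod, gather_add, src, dst, idx, tmp, blk) and the double precision type are assumptions for illustration, and the block size of 128 is simply the value used above, so treat it as a starting point to tune rather than a recommendation.

```fortran
module blocked_gather_mod
  implicit none
  integer, parameter :: dp  = kind(1.0d0)  ! assumed element type: double precision
  integer, parameter :: blk = 128          ! block size taken from the loop above
contains
  ! Computes dst(i) = dst(i) + src(idx(i)) for i = 1..n, one block at a time.
  subroutine gather_add(n, idx, src, dst)
    integer,  intent(in)    :: n
    integer,  intent(in)    :: idx(n)
    real(dp), intent(in)    :: src(:)
    real(dp), intent(inout) :: dst(n)
    real(dp) :: tmp(blk)                   ! small contiguous scratch buffer
    integer  :: i, j, jmax

    do i = 1, n, blk
      jmax = min(blk, n - i + 1)           ! the last block may be shorter
      ! Pass 1: pure gather into contiguous scratch storage
      do j = 1, jmax
        tmp(j) = src(idx(i + j - 1))
      end do
      ! Pass 2: unit-stride update with no indirect reference
      do j = 1, jmax
        dst(i + j - 1) = dst(i + j - 1) + tmp(j)
      end do
    end do
  end subroutine gather_add
end module blocked_gather_mod
```

Calling it looks like `call gather_add(n, index, in, out)`. Whether the split actually beats the single loop depends on the compiler and on how scattered the indices are, so it is worth timing both versions on your own data.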

Jim Dempsey

TimP · 08-25-2009

The speedup from SSE4 vectorization of a gather depends a great deal on cache locality. It could as much as double the speed with good locality, or show no gain with poor locality. If the loop length is on the order of 1000, and there is enough cache locality that no thread has to read all of the cache lines, an OpenMP parallel loop should show a significant gain.
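
As a rough sketch of the OpenMP suggestion, the loop from the question can be wrapped as shown below. The module and subroutine names, the argument names, and the double precision type are assumptions for illustration. The static schedule is one reasonable choice, picked here so that each thread works on a contiguous chunk of the output and index arrays; it is not the only option.

```fortran
module omp_gather_mod
  implicit none
  integer, parameter :: dp = kind(1.0d0)   ! assumed element type: double precision
contains
  ! Computes dst(i) = dst(i) + src(idx(i)) for i = 1..n with an OpenMP loop.
  subroutine gather_add_omp(n, idx, src, dst)
    integer,  intent(in)    :: n
    integer,  intent(in)    :: idx(n)
    real(dp), intent(in)    :: src(:)
    real(dp), intent(inout) :: dst(n)
    integer :: i

    ! Each iteration writes only dst(i), so the iterations are independent
    ! and can be divided among threads without any synchronization.
    !$omp parallel do schedule(static)
    do i = 1, n
      dst(i) = dst(i) + src(idx(i))
    end do
    !$omp end parallel do
  end subroutine gather_add_omp
end module omp_gather_mod
```

Build with the compiler's OpenMP option (for example -fopenmp with gfortran or -qopenmp with current Intel compilers). Without it, the directives are treated as comments and the loop runs serially with the same result.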
