Optimization of scatter/gather loops

Keith_R_ · ‎11-09-2004

I'm trying to get the best performance out of some simple scatter and gather
loops, particularly on Pentium (IA-32) and Itanium (IA-64).

do i = 1, m
   b(i) = a(c(i))
enddo

There are a number of optimization issues here; the main ones I see
are (a) loop dependency and (b) cache performance. As long as I, as the
programmer, can ensure that the entries of c are unique, there is no
dependency problem. Certainly adding the directive !DIR$ ivdep appears to
allow the compiler to perform better software pipelining on the Itanium.
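
For concreteness, here is a minimal sketch of the pattern under discussion, showing both the gather and the corresponding scatter with the directive applied. The array sizes, the values, and the name s are made up for illustration. The uniqueness of c matters most for the scatter form, where duplicate indices would make the order of the writes significant. Compilers that do not recognise !DIR$ treat it as an ordinary comment.

```fortran
program gather_scatter_ivdep
  implicit none
  integer, parameter :: m = 8          ! illustrative size only
  integer :: c(m), i
  real :: a(m), b(m), s(m)

  ! c is a permutation of 1..m, so every index is unique.
  ! This is the property the programmer is vouching for with IVDEP.
  c = [3, 7, 1, 8, 2, 6, 4, 5]
  a = [(real(10*i), i = 1, m)]

  ! Gather: indirect reads from a, sequential writes to b.
!DIR$ IVDEP
  do i = 1, m
     b(i) = a(c(i))
  enddo

  ! Scatter: indirect writes to s. With duplicate entries in c the
  ! iterations would no longer be independent.
  s = 0.0
!DIR$ IVDEP
  do i = 1, m
     s(c(i)) = b(i)
  enddo

  ! Scattering the gathered values back through the same permutation
  ! must reproduce a exactly.
  if (any(s /= a)) error stop "scatter did not invert gather"
  print *, "gather/scatter round trip OK"
end program gather_scatter_ivdep
```

This only checks correctness; it says nothing about how well the loops are pipelined.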

What about cache? I would reason that the access to a is going to be
fairly random, so prefetching array a is pointless, and I would have thought
that turning OFF prefetching of a would improve performance. But the
example in the user's guide (though it does not state the purpose) does
the opposite:

CDEC$ NOPREFETCH c
CDEC$ PREFETCH a
do i = 1, m
   b(i) = a(c(i)) + 1
enddo

and furthermore turns off prefetching for c, which *is* accessed
contiguously. Is this just a pointless example or am I missing something?

If anyone has experience of optimizing scatter/gather performance on
these platforms that they are willing to share, I'd like to hear about it.

One final puzzle. How does the performance compare with writing the loop
as a Fortran 90 vector subscript?

b(1:m) = a(c(1:m))

or even

b = a(c)

?
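
To be explicit about what I am comparing, here is a small sketch of the three spellings side by side. The sizes, values, and the names b_loop, b_section and b_whole are made up for illustration. It only confirms that the forms give the same result and says nothing about their relative speed.

```fortran
program vector_subscript_forms
  implicit none
  integer, parameter :: m = 6          ! illustrative size only
  integer :: c(m), i
  real :: a(m), b_loop(m), b_section(m), b_whole(m)

  c = [4, 1, 6, 2, 5, 3]               ! index vector (a permutation here)
  a = [(real(i)**2, i = 1, m)]

  ! Form 1: explicit DO loop.
  do i = 1, m
     b_loop(i) = a(c(i))
  enddo

  ! Form 2: array sections with a vector subscript.
  b_section(1:m) = a(c(1:m))

  ! Form 3: whole arrays; b and c must have the same shape.
  b_whole = a(c)

  if (any(b_loop /= b_section) .or. any(b_loop /= b_whole)) &
       error stop "the three forms disagree"
  print *, b_loop
end program vector_subscript_forms
```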

Keith Refson

TimP · ‎11-09-2004

My similar example with ifort 8.1 -O3 shows the loop unrolled by 4, with each of the 3 operands prefetched once per group of 4. The schedule is 7 clocks per group of 4, so the default prefetch distance of 100 looks a little high. My take is that prefetch won't kick in until 400 loop iterations have passed; you should be able to improve on that by directive. Apparently, there would still be misses on the indirect operand, unless all of its elements happened to be covered by prefetching every 4th one.

I don't see any difference between f90 array assignment and f77 syntax for this loop, but ifort ignores directives like IVDEP when you use array assignments.

I assume that the users' guide example you are looking at assumes no unrolling. It would be better to risk the cache miss every 16 or 32 loop iterations rather than issue so many redundant prefetches. Prefetching the indirect operand every time, by looking ahead a few hundred iterations, might gain if the distribution in memory is irregular.

Significant changes in hardware and software prefetch have occurred over the various IA32 CPU families. With hardware prefetch available, you would probably consider prefetch intrinsics only for the indirect operand. If the issue is DTLB misses, it could take a huge prefetch distance to make any difference.

Message Edited by tim18 on 11-09-2004 09:06 AM

Message Edited by tim18 on 11-09-2004 11:00 AM

TimP · ‎11-09-2004

I just checked this with ifort 8.1.021 on IPF. The f77 version is running 15% faster than the f90 array assignment, but the CDEC$ PREFETCH directive does not appear to influence prefetch distance. It does influence the prefetch hints, for both the array assignment and DO loop versions.