Optimization for power-of-2 loop counts

piet_de_weer · ‎06-18-2010

I'm working on an audio processing application which operates on blocks of data which' size is a power of 2 - more specifially 256, 512, 1024, 2048 or 4096.

Now it seems to me that for loop unrolling (removing unnecessary checks) and vectorizing (removing unnecessary non-vectorized code for the final iterations), it could be very helpful for the compiler to know that the loop count is always a power of 2 (possibly with a minimum vallue) - or that it's a multiple of 2/4/8/....

But I cannot find how to tell the compiler this. Am I missing something?

TimP · ‎06-18-2010

If all your array references for the first iteration of the loop are 16-byte aligned, you can assert that by
#pragma vector aligned
immediately before each inner for().
If you use the AVX code option, this requires 32-byte alignment.
For the 9.1 compiler, you could also set a #pragma loop count listing the specific loop lengths you prefer. I doubt this will have an effect for current compilers.

kfsone · ‎06-19-2010

Uhm is there a list of these pragmas someplace?

piet_de_weer · ‎06-19-2010

I've looked up #pragma loop count, but as I thought I remembered that just gives an indication of the order to the compiler. So the compiler cannot assume that there's always a multiple of x. (I didn't know that it doesn't work anymore for the latest compiler versions...)

Some time ago in compiler version 10.1 I noticed big performance changes in loops with a hard-coded loop count between 1024 and 1023 or 1025 iterations (1024 performed a lot better, unfortunately at the time I haven't looked at the assembly code that came out).

Take for example the following code (x and y are arrays of 16 bit values):

[bash]for (i=0; i{
  x = y
}
[/bash]

If this is compiled, you would probably get something like this:

[bash]If j < 4, jump to 05
SSE2 instruction to copy 4 values
i += 4
i < j - 3 --> jump to 02
i >= j -> jump to 09
Non-SSE2 instruction to copy 1 value
i ++
i < j --> jump to 06
END[/bash]

If the compiler knows that j is always a multiple of 4, step 5-8 could be removed. That probably doesn't even make that much difference for the performance, but it does reduce the code size.

While typing this I'm starting to think that it might be easy to tell the compiler that j is a multiple of 4 by using j & -4 or (j / 4) * 4... I'll do some tests to see if this has any effect.

TimP · ‎06-19-2010

A simple data copy such as you quote is normally treated the same as a memcpy() or memmove(). Then there are run-time checks for alignment and overlap, and use of parallel move instructions when feasible.
If you write a loop with in-line vector code, where the compiler doesn't know the alignments, there will be a scalar loop up to the first aligned destination, followed (in the case of SSE2) by a main loop body which probably takes 4 iterations at a time, followed by a scalar loop to take care of remainders.
#pragma vector aligned tells the compiler to generate code only for the case where data are aligned, as well as skipping any consideration about whether vectorization will come out ahead or produce exceptions by replacing conditionals by unconditional operations and masks. There are sections in the icc doc about the pragma vector and pragma loop count.
#pragma vector nontemporal is like #pragma vector always with the additional assertion that your loop is long enough to benefit from cache bypassed stores. Your longest loops might be long enough, if you don't want to have the result immediately available for further use. This tends to suppress unrolling, as it is an assertion that you expect performance to be limited by memory bandwidth.
You could try #pragma unroll(n) in an effort to persuade the compiler to take 2n loop iterations at a time. n==4 is often a good value for Core I7 architecture; perhaps 8 for Core 2, dependent on whether fitting inside the Loop Stream Detector limit is effective. gcc will easily go to more aggressive unrolling.