Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Unrolling huge loops

mike_31415
Beginner
769 Views
11.0.066 -O3
In the performance report I can see that the compiler did a good job unrolling the tiny loops.

I need to unroll the main, very large loop to increase SSE2 pipeline utilization.
I had it unrolled manually via macros, and it was fine and very fast.

Now I want to get rid of the ugly macros and use the Intel C++ unroll pragma instead.

Unfortunately, the compiler does not try to unroll the loop even when the number of iterations is known and
#pragma unroll(3) is set. This loop is not included in the optimization report.

The body of the loop is around 450 SSE2 instructions, with multiple output values (i.e. it is definitely beyond cache-line limits).

Any thoughts?
0 Kudos
6 Replies
jimdempseyatthecove
Honored Contributor III
769 Views

Mike,

I will guess you are referring to the body of the loop exceeding the L1 instruction cache size, and thus that unrolling might not suffer too much of a degradation. While I cannot answer your question about C++ unrolling of large loops, I can suggest...

Have you considered splitting the body of the loop into multiple parts? The trick would be to keep the code size of each part small enough to fit in the L1 instruction cache, then split the run through your full data set into sub-sets of shorter run lengths. Have the first run of the first part NOT exceed the L2 cache limit; then the first run of the second (and third...) part can re-use whatever is left over in the L2 cache from the first run of the first part. Once all parts of the loop have processed the L2-sized batch, advance to the next L2-sized batch of your full data set.
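
In rough pseudo-C++, the shape would be something like the sketch below (part_one, part_two, BATCH, and the argument lists are placeholders, not your code):

#include <algorithm>

// Placeholders standing in for the two halves of the real loop body;
// each is assumed small enough to stay resident in the L1 instruction cache.
void part_one(int i, float* data);
void part_two(int i, float* data);

void process_all(float* data, int total)
{
    const int BATCH = 4096;                       // tune so one batch of data fits in L2
    for (int base = 0; base < total; base += BATCH) {
        const int end = std::min(base + BATCH, total);
        for (int i = base; i < end; ++i)          // first part, over the L2-sized batch
            part_one(i, data);
        for (int i = base; i < end; ++i)          // second part re-uses the batch still warm in L2
            part_two(i, data);
    }
}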

Note: dividing the work in this manner will parallelize well, too.

Jim Dempsey
0 Kudos
mike_31415
Beginner
769 Views


I doubt 460 SSE2 instructions exceed the L1 cache size.
My goal is to keep my code as simple as possible, so adding artificial stages would harm the other implementations relying on the same code (CUDA & BROOK).

0 Kudos
jimdempseyatthecove
Honored Contributor III
769 Views


Depending on what code runs outside of your loop, the L1 cache might be fully depleted of any instructions left from the loop's prior execution. Under that circumstance the unrolled loop will pull instructions from the slower L2 (or the slower yet L3, if an L3 is present), or from the slowest level, RAM. L2 is typically 4x slower than L1, L3 is another several multiples slower, and RAM much more so.

Since you can currently hand-unroll this loop, I suggest you run an experiment. Run a faithful representation of the loop test within your application, not just the rolled loop vs. the unrolled loop in a stand-alone test. You should be interested in the effect on the application and not merely the effect on an artificial test program.

I have limited experience with Brook+, but from my limited understanding of it, it performs the C++ pseudo-code loop unroll automatically. For source-compatibility reasons it might be advantageous to keep the loop.

For programming Brook+ you would likely make the body of your loop a kernel. Your loop would then call the kernel over the iteration space, and the Brook+ preprocessor would convert the code to an unrolled loop with the kernel code inlined (to the extent of the unrolling). On the C++ side, when compiling for C++ only, the function holding the body of the loop can be declared inline to avoid the call overhead, and you may have better luck getting the compiler to unroll the loop over the inlined function than over the raw code itself. Often it is the use of compiler temps, and the appearance of usibility of those temps in code falling outside the loop, that interferes with the unrolling of the loop.
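
A minimal sketch of the shape I mean (kernel_body, run_kernel, and the arguments are placeholders, not your code):

// The loop body lives in one small inline function; the loop itself stays
// trivial, so the compiler (or the Brook+ preprocessor) has an easier time
// unrolling it and inlining the body at each unrolled copy.
inline void kernel_body(int i, const float* in, float* out)
{
    out[i] = in[i] * in[i];           // stand-in for the real SSE2 work
}

void run_kernel(const float* in, float* out, int n)
{
    #pragma unroll(3)
    for (int i = 0; i < n; ++i)
        kernel_body(i, in, out);      // inlined, so unrolling interleaves three copies
}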

Jim Dempsey

0 Kudos
jimdempseyatthecove
Honored Contributor III
769 Views

usibility - ah for the want of a spell checker.

Actually it is cold here (-17) and my fingers haven't warmed up. Hitting the wrong keys a lot.

Jim Dempsey
0 Kudos
mike_31415
Beginner
769 Views
Still no luck; I downgraded to the latest 10.x compiler.

#pragma unroll (3)
for (int key_pos = 0; key_pos < ... ; ... )   // loop bound and increment lost in forum formatting
{
    check_block_sse2_inline (key_pos,     data_d, pwd_len, charset_len, keys_per_thread, result);
    check_block_sse2_inline2(key_pos + 4, data_d, pwd_len, charset_len, keys_per_thread, result);
    check_block_sse2_inline (key_pos + 8, data_d, pwd_len, charset_len, keys_per_thread, result);
}

check_block_sse2_inline2 is just a copy-and-paste __forceinline function, added to help the compiler understand that the calls can be executed at the same time... The compiler still does all these things one after another, without increasing the utilization of the CPU's SSE2 unit. (And I am getting around 50% of the top speed, compared to the old version where I manually copy each step 3 times to help load the SSE2 unit, which is able to issue 3 instructions per cycle.)

What I want is for the compiler to interleave the code of all 3 functions on top of one another, so it can easily issue 3 SSE2 instructions per cycle. Manual copying shows that this gives very nice performance. (I've tried different hand-unroll factors; 3 gives the best speed.)
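
Roughly, the hand-interleaved shape that does perform well looks like this toy sketch (made-up arithmetic, not my real kernel):

#include <emmintrin.h>

// Three independent SSE2 streams advanced in lock-step, so at every step the
// scheduler has three independent instructions it can issue in the same cycle.
void interleaved3(const __m128i* in, __m128i* out, int n)   // n assumed divisible by 3
{
    for (int i = 0; i < n; i += 3) {
        __m128i a0 = _mm_load_si128(in + i);
        __m128i a1 = _mm_load_si128(in + i + 1);
        __m128i a2 = _mm_load_si128(in + i + 2);

        a0 = _mm_add_epi32(a0, a0);        // same step for stream 0
        a1 = _mm_add_epi32(a1, a1);        // ... stream 1
        a2 = _mm_add_epi32(a2, a2);        // ... stream 2

        _mm_store_si128(out + i,     a0);
        _mm_store_si128(out + i + 1, a1);
        _mm_store_si128(out + i + 2, a2);
    }
}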

I am not worried about CUDA & Brook here, as long as I am not adding artificial changes to the code (like adding "stages"); they are compiled with a different set of macros, and their speed is around the theoretical limit.
0 Kudos
jimdempseyatthecove
Honored Contributor III
769 Views

Mike,

Verify that -O3 is enabled, and verify that this unrolled loop is not itself located within another unrolled loop.

Also, if appropriate, consider adding one or more of

#pragma vector nontemporal
#pragma vector always
#pragma vector aligned
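
For placement, something like this sketch (a made-up scalar loop; check the documentation for which pragma combinations apply to a single loop):

void scale(float* a, const float* b, int n)   // a and b assumed 16-byte aligned
{
    #pragma vector aligned        // promise the vectorizer that the accesses are aligned
    #pragma vector always         // vectorize even if the heuristics say otherwise
    #pragma unroll(3)
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 2.0f;
}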

Jim Dempsey
0 Kudos
Reply