Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

Loop Peeling for Vectorization

Royi
Novice
994 Views

Hello,

I'm optimizing a very simple loop that adds two arrays element-wise and stores the result in a third array.
I used:

__assume_aligned(arrayA, 32);
__assume_aligned(arrayB, 32);
__assume_aligned(arrayC, 32);
 
Yet I can still see that the compiler generates a peel loop for them.
Why doesn't it recognize that the arrays are aligned?
 
Thank You.
0 Kudos
8 Replies
TimP
Honored Contributor III
994 Views
__assume_aligned is more likely to be honored when it is placed immediately before the loop. There is also #pragma vector unaligned, which prevents peeling.
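
A minimal sketch of both suggestions, assuming a plain float loop like the one described in the question. The function names and the ASSUME_ALIGNED_32 macro are hypothetical; the macro only exists so the example also builds with GCC or Clang, where __builtin_assume_aligned plays the role of Intel's __assume_aligned.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability wrapper (not part of any compiler):
// the Intel classic compiler uses __assume_aligned(p, n); GCC and Clang
// provide __builtin_assume_aligned, which returns the annotated pointer.
#if defined(__INTEL_COMPILER)
  #define ASSUME_ALIGNED_32(p) __assume_aligned((p), 32)
#elif defined(__GNUC__)
  #define ASSUME_ALIGNED_32(p) \
      ((p) = static_cast<decltype(p)>(__builtin_assume_aligned((p), 32)))
#else
  #define ASSUME_ALIGNED_32(p) ((void)0)
#endif

// Variant 1: alignment hints placed directly in front of the loop.
void add_arrays_hinted(const float* a, const float* b, float* c, std::size_t n) {
    ASSUME_ALIGNED_32(a);
    ASSUME_ALIGNED_32(b);
    ASSUME_ALIGNED_32(c);
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Variant 2: no alignment promise; ask the Intel compiler for unaligned
// vector accesses instead. The pragma is guarded so other compilers skip it.
void add_arrays_no_peel(const float* a, const float* b, float* c, std::size_t n) {
#if defined(__INTEL_COMPILER)
    #pragma vector unaligned
#endif
    for (std::size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Whether a peel loop is actually emitted still has to be checked in the vectorization report or the disassembly; the sketch only shows where the hints go. Variant 1 is only valid when the caller really passes 32-byte-aligned pointers.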
0 Kudos
Royi
Novice
994 Views

It is honored: when I comment it out, I can see that the compiler chooses an unaligned access pattern for the optimized code.
Using #pragma vector aligned does force an aligned access pattern, but only with OpenMP turned off.

According to:

https://software.intel.com/en-us/articles/data-alignment-to-assist-vectorization
https://software.intel.com/en-us/node/532796

According to these, it should work without peeling.
I just wonder whether this is a sign that the compiler is missing something or behaving unexpectedly.

0 Kudos
TimP
Honored Contributor III
994 Views
If you want aligned data with OpenMP, you must choose chunk sizes so that every chunk starts on an aligned boundary. If your chunks are small, it may be worth testing the no-peeling version regardless of alignment.
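
One way to keep every chunk aligned can be sketched as follows, assuming 32-byte (AVX) alignment, arrays whose base addresses are already 32-byte aligned, and a static schedule. The helper aligned_chunk and the function add_arrays_omp are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Round a requested chunk size (in elements) up to a whole number of
// 32-byte vectors, so each chunk starts on a 32-byte boundary when the
// array base itself is 32-byte aligned.
inline std::size_t aligned_chunk(std::size_t requested, std::size_t elem_size) {
    const std::size_t per_vector = 32 / elem_size;  // e.g. 8 floats, 4 doubles
    if (requested == 0) {
        requested = 1;  // schedule(static, chunk) needs a positive chunk size
    }
    return ((requested + per_vector - 1) / per_vector) * per_vector;
}

void add_arrays_omp(const float* a, const float* b, float* c,
                    std::size_t n, std::size_t requested_chunk) {
    const long chunk = static_cast<long>(aligned_chunk(requested_chunk, sizeof(float)));
    const long count = static_cast<long>(n);
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static, chunk)
    for (long i = 0; i < count; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

For example, a requested chunk of 100 floats becomes 104, which is 13 full 32-byte vectors.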
0 Kudos
Royi
Novice
994 Views

Hi Tim,

Yes, I understand that part about OpenMP.
Again, my problem is that the compiler peels the loop even though it knows the arrays are aligned.

I wonder why, and whether this is actually a bug (or is just always done as a safety net).

Thank You.

0 Kudos
jimdempseyatthecove
Honored Contributor III
994 Views

Also look up the OpenMP simd clause. It can be used to force the loop partitioning (re: Tim P's reply #4) to be split at vector boundaries.

Jim Dempsey
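
A minimal sketch of that suggestion, reading it as the combined parallel for simd construct with the simd schedule modifier from OpenMP 4.5 (that reading is an assumption, as is the function name add_arrays_simd). The aligned clause is a promise made by the programmer, so the pointers passed in must really be 32-byte aligned.

```cpp
#include <cassert>

// schedule(simd:static) asks the runtime to size each thread's chunk as a
// multiple of the SIMD width; aligned(...) states the pointer alignment.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
void add_arrays_simd(const float* a, const float* b, float* c, long n) {
    #pragma omp parallel for simd schedule(simd:static) aligned(a, b, c : 32)
    for (long i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```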

0 Kudos
Royi
Novice
994 Views

Guys,

I don't care about the interaction with OpenMP.
I just wanted to raise the issue of peeling being done when it isn't needed.

It might indicate that the compiler makes the wrong decision in this case, and I'd like to know why.
Maybe Intel will even find that it is a bug or an opportunity for further optimization.

So, my questions are:

1. Did I miss something about how to tell the compiler that the arrays are aligned for AVX / SSE (besides using the pragma, which Intel itself doesn't advise)?
2. If I did not, why does the compiler make this choice? Is it on purpose (a safety net against user mistakes), or is this a case that isn't fully optimized? Past Intel blog posts seem to show that peeling wasn't done in this situation.

Thank You.

0 Kudos
jimdempseyatthecove
Honored Contributor III
994 Views

Maybe it would help if you show:

a) the source code, including information about the sources, the targets, and the loop counts discoverable by the compiler
b) the disassembly, preferably a screenshot of the disassembly view from VTune (including all alternate code paths if possible)

Jim Dempsey

0 Kudos
TimP
Honored Contributor III
994 Views

You may still be misunderstanding the under-documented #pragma vector unaligned. It doesn't require the data to be misaligned. To my knowledge, it simply removes cost analysis and exception coverage (like the other vector pragmas) and also prevents peeling, in case you are serious about that. If your hardware is AVX or newer, there is no performance penalty for executing unaligned instructions on aligned data, with the possible exception of the splitting of unaligned loads and stores for Sandy Bridge and Ivy Bridge targets.

0 Kudos