Question about auto-vectorization on a for loop

susangao · ‎05-09-2012

I am trying to get familiar with the cases in which icc can auto-vectorize.

I have the following two loops. The only difference is the lower bound of the inner loop, and I cannot find any data dependence in them. On an E5-2680 with icc 12.1.3 20120212, the first one is not auto-vectorized and the second one is. Both are auto-vectorized on another platform with a Xeon 5660 (-msse4.2, icc 12.1.2 20111128).

for (int j = 1; j < LEN2; j++) {
    for (int i = 0; i < LEN2; i++) {
        bb[j][i] = bb[j-1][i];
    }
}

for (int j = 1; j < LEN2; j++) {
    for (int i = 1; i < LEN2; i++) {
        bb[j][i] = bb[j-1][i];
    }
}

The platform info is as follows:

icc: icc (ICC) 12.1.3 20120212

OS: ubuntu

cpu: Xeon E5-2680

I would like to know whether this problem is caused by using the wrong icc version for the E5-2680. Thank you for reading.

Best Regards,

Susan

jimdempseyatthecove · ‎05-09-2012

What is the type for bb?
What is the value for LEN2 (and is it constant known to the compiler or variable)?

In the second case (index i offset by 1) data may not be aligned and thus vectorization not used.

Jim Dempsey

jimdempseyatthecove · ‎05-09-2012

From your tar file I see bb is float and LEN2 is 256

In the case where the loop uses i=0;... the cells are aligned
In the case where the loop uses i=1;... the first three elements could have been copied one at a time, then the remaining 252 cells could have been copied 4 at a time.

Are you sure you looked at enough code to verify that you were not looking at the preamble that copies the first elements at the start of the inner loop (with the remaining 252 cells copied using 4-up vectors)?
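
The split described above can be written out by hand. This is only a sketch of the idea, not what icc actually emits: it assumes the inner loop copies row j-1 into row j one float at a time, and the helper name copy_row_peeled is made up for illustration.

```c
#include <stdint.h>

#define LEN2 256

/* Same declaration as posted in this thread. */
__attribute__((aligned(16))) float bb[LEN2][LEN2];

/* Hypothetical helper: copy row j-1 into row j for i = 1..LEN2-1,
   split the way a vectorizer might split it. Returns the number of
   elements handled one at a time in the preamble. */
static int copy_row_peeled(int j)
{
    int i = 1;
    int peeled = 0;

    /* Preamble (peel): scalar copies until &bb[j][i] is 16-byte aligned. */
    while (i < LEN2 && ((uintptr_t)&bb[j][i] % 16) != 0) {
        bb[j][i] = bb[j-1][i];
        i++;
        peeled++;
    }

    /* Main body: groups of 4 floats, i.e. one 16-byte vector per step.
       The inner k loop stands in for a single vector load/store. */
    for (; i + 4 <= LEN2; i += 4) {
        for (int k = 0; k < 4; k++)
            bb[j][i + k] = bb[j-1][i + k];
    }

    /* Remainder: whatever does not fill a whole group of 4. */
    for (; i < LEN2; i++)
        bb[j][i] = bb[j-1][i];

    return peeled;
}
```

With LEN2 = 256 each row is 1024 bytes, so every row starts on a 16-byte boundary and the preamble handles i = 1, 2, 3 before the aligned body takes over at i = 4. Scalar instructions at the top of the loop in the .s file may therefore be just this preamble.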

Jim Dempsey

susangao · ‎05-09-2012

LEN2 is 256

__attribute__((aligned(16))) float bb[LEN2][LEN2];

Sorry I didn't mention it here, but it is included in the code I attached, which can be compiled and run directly.

Om_S_Intel · ‎05-09-2012

In the case where the loop uses i=1;... there will be cache line splits and the compiler will not vectorize. On Intel Core 2 machines a cache line is 64 bytes long, and data is read from memory one cache line at a time. Cache line splits carry a high penalty for memory reads.
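
To make the cache line split concrete, here is a small sketch that checks whether an access crosses a 64-byte boundary. The helper name crosses_cache_line is hypothetical, and offsets are measured from the start of a 64-byte-aligned row, which is an assumption about how the array is placed.

```c
#include <stdint.h>

#define CACHE_LINE 64  /* bytes, the line size quoted above */

/* Hypothetical helper: returns 1 if an access of `size` bytes starting at
   byte offset `offset` touches two different 64-byte cache lines. */
static int crosses_cache_line(uintptr_t offset, unsigned size)
{
    uintptr_t first_line = offset / CACHE_LINE;
    uintptr_t last_line  = (offset + size - 1) / CACHE_LINE;
    return first_line != last_line;
}
```

Starting at i=1 puts 16-byte loads at offsets 4, 20, 36, 52, and so on. The load at offset 52 covers bytes 52 to 67, so under this assumption one unaligned load in every four straddles two lines.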

susangao · ‎05-10-2012

Thank you very much. I will try to allocate the buffer with 64-byte alignment.
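
A minimal sketch of what that change could look like, using the same GCC-style attribute as the original declaration. The names bb64 and is_aligned are made up for this example.

```c
#include <stdint.h>

#define LEN2 256

/* Same array shape as in the thread, with alignment raised from 16 to 64. */
__attribute__((aligned(64))) float bb64[LEN2][LEN2];

/* Hypothetical helper: returns 1 if p is aligned to n bytes. */
static int is_aligned(const void *p, uintptr_t n)
{
    return ((uintptr_t)p % n) == 0;
}
```

Since each row is 256 * 4 = 1024 bytes, every row start bb64[j][0] is then 64-byte aligned. Note that bb64[j][1] is still 4 bytes past that boundary, so raising the base alignment alone does not make the i=1 loop start on an aligned address.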

TimP · ‎05-10-2012

This case is set up specifically to illustrate difficulties in parallelization or vectorization, and to trip up compilers.
Best optimization for AVX is obtained with parallelized outer loop and AVX intrinsics in the inner loop, but the Xeon 5660 has enough cores, and Intel compilers have improved enough, that most of the ultimate performance is obtained by achieving parallelization without vectorization or intrinsics. Simply vectorizing it, without parallelization or optimization of the sequential dependency, doesn't approach full performance. So this is one of those cases where management goals in the past have conflicted with performance, but you can satisfy either a goal of maximum threaded performance scaling or of vectorization without caring about performance, as you choose.
It's a rare enough situation in practice that skewing a compiler to get good vector performance on a small number of cores without specific hand coding isn't necessarily a valid goal.

susangao · ‎05-10-2012

Thank you for the reply. I am not quite clear about your question, so I will try to explain what I saw:

I found out whether each loop is vectorized from the auto-vectorization report given by icc (--vec-report=2). After I saw your advice above, I checked the .s file. It seems that the part for the first function's for loop (around lines 163 and 165) contains almost no vector instructions; for the second function, I can see obvious vector instructions, such as unaligned moves (vmovups), working on array bb.

susangao · ‎05-10-2012

Thank you very much for your helpful reply. I think I am starting to understand it.