susangao
Beginner
73 Views

Question about auto-vectorization on a for loop

I am trying to get familiar with the cases in which icc can auto-vectorize.
I have the following two loops. The only difference between them is the lower bound of the inner loop, and I cannot find a data dependence in either. On an E5-2680 with icc 12.1.3 20120212, the first one is not auto-vectorized and the second one is. Both are auto-vectorized on another platform with a Xeon 5660 (-msse4.2, icc 12.1.2 20111128).
for (int j = 1; j < LEN2; j++) {
    for (int i = 0; i < LEN2; i++) {
        bb = bb[j-1];
    }
}
for (int j = 1; j < LEN2; j++) {
    for (int i = 1; i < LEN2; i++) {
        bb = bb[j-1];
    }
}
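The loop bodies as posted appear to have lost their array subscripts, most likely to forum markup. Below is a minimal sketch of the two variants, assuming the intended statement is `bb[j][i] = bb[j-1][i]`, which would be consistent with the two-dimensional `float bb[LEN2][LEN2]` declaration given later in the thread. The function names `copy_rows_from0`, `copy_rows_from1` and `rows_equal_from` are hypothetical and not taken from the attached code.

```c
#include <assert.h>

#define LEN2 256

/* Declaration as given later in the thread. */
__attribute__((aligned(16))) float bb[LEN2][LEN2];

/* ASSUMPTION: the loop body is bb[j][i] = bb[j-1][i]; the posted
   statement has lost its subscripts. All function names are hypothetical. */
void copy_rows_from0(void) {
    for (int j = 1; j < LEN2; j++) {
        for (int i = 0; i < LEN2; i++) {   /* inner loop starts at 0 */
            bb[j][i] = bb[j - 1][i];
        }
    }
}

void copy_rows_from1(void) {
    for (int j = 1; j < LEN2; j++) {
        for (int i = 1; i < LEN2; i++) {   /* inner loop starts at 1 */
            bb[j][i] = bb[j - 1][i];
        }
    }
}

/* Returns 1 if every row equals row 0 in columns first_col..LEN2-1. */
int rows_equal_from(int first_col) {
    assert(first_col >= 0 && first_col < LEN2);
    for (int j = 1; j < LEN2; j++)
        for (int i = first_col; i < LEN2; i++)
            if (bb[j][i] != bb[0][i])
                return 0;
    return 1;
}
```

Under this assumption each column is copied down from row 0 independently of the others, so the inner loop over `i` carries no dependence; the only dependence is on the previous row `j-1`.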
The platform info is as follows:
icc: icc (ICC) 12.1.3 20120212
OS: ubuntu
cpu: Xeon E5-2680
I would like to know whether this problem is caused by using the wrong icc version for the E5-2680. Thank you for reading.
Best Regards,
Susan
8 Replies
jimdempseyatthecove
Black Belt
73 Views

What is the type of bb?
What is the value of LEN2 (and is it a constant known to the compiler, or a variable)?

In the second case (index i offset by 1), the data may not be aligned, and so vectorization may not be used.

Jim Dempsey
jimdempseyatthecove
Black Belt
73 Views

From your tar file I see that bb is float and LEN2 is 256.

In the case where the loop uses i=0;... the cells are aligned.
In the case where the loop uses i=1;... the first three elements could have been copied one at a time, and then the remaining 252 cells could have been copied 4 at a time.

Are you sure you looked at enough code to verify that you were not looking at the preamble that copies the first elements at the start of the inner loop (with the remaining 252 cells copied using 4-up vectors)?

Jim Dempsey
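The peel-then-vectorize pattern described in this reply can be sketched in plain C. This only illustrates the arithmetic and is not what icc actually emits; the helper names `peel_count` and `copy_row_peeled` are hypothetical, and a 16-byte (4-float) vector width is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16                          /* ASSUMPTION: SSE-width vectors */
#define VEC_FLOATS (VEC_BYTES / sizeof(float))

/* Number of leading floats to handle one at a time before p reaches a
   16-byte boundary. Hypothetical helper, for illustration only. */
size_t peel_count(const float *p) {
    size_t misalign = (uintptr_t)p % VEC_BYTES;
    return misalign ? (VEC_BYTES - misalign) / sizeof(float) : 0;
}

/* Copy src[first..n) to dst[first..n): scalar peel, then blocks of four
   (which a compiler could turn into aligned vector stores), then remainder. */
void copy_row_peeled(float *dst, const float *src, size_t first, size_t n) {
    assert(first <= n);
    size_t i = first;
    size_t peel = peel_count(dst + first);
    for (size_t k = 0; k < peel && i < n; k++, i++)   /* peel loop */
        dst[i] = src[i];
    for (; i + VEC_FLOATS <= n; i += VEC_FLOATS)      /* vector body */
        for (size_t v = 0; v < VEC_FLOATS; v++)
            dst[i + v] = src[i + v];
    for (; i < n; i++)                                /* remainder */
        dst[i] = src[i];
}
```

For a 16-byte-aligned row and `first = 1`, `peel_count` returns 3, which leaves 252 elements for 63 four-wide blocks, matching the numbers in the reply.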
susangao
Beginner
73 Views

LEN2 is 256
__attribute__((aligned(16))) float bb[LEN2][LEN2];
Sorry, I didn't mention it here, but it is included in the code I attached, which can be compiled and run directly.
Om_S_Intel
Employee
73 Views

In the case where the loop uses i=1;... there will be cache-line splits and the compiler will not vectorize. On Intel Core 2 machines a cache line is 64 bytes long. Data is read from memory one cache line at a time, and cache-line splits carry a high penalty for memory reads.
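To make the notion of a cache-line split concrete, here is a small sketch that counts how many vector loads straddle a 64-byte boundary. The function names `splits_cache_line` and `count_splits` are hypothetical, and the sketch assumes that byte offset 0 lies on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64   /* bytes, as stated in the reply above */

/* Returns 1 if a load of `width` bytes starting `offset` bytes past a
   cache-line boundary crosses into the next line. */
int splits_cache_line(size_t offset, size_t width) {
    assert(width > 0 && width <= CACHE_LINE);
    return (offset % CACHE_LINE) + width > CACHE_LINE;
}

/* Count split loads when reading `count` consecutive vectors of `width`
   bytes, the first starting at byte `start`.
   ASSUMPTION: byte 0 lies on a cache-line boundary. */
size_t count_splits(size_t start, size_t width, size_t count) {
    size_t splits = 0;
    for (size_t k = 0; k < count; k++)
        splits += (size_t)splits_cache_line(start + k * width, width);
    return splits;
}
```

With 16-byte loads beginning at byte 4 (element 1 of a float row), the loads fall at offsets 4, 20, 36 and 52 within each line, so one load in four straddles two lines. Loads of the same width starting at byte 0 never do.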
susangao
Beginner
73 Views

Thank you very much. I will try allocating the buffer with 64-byte alignment.
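A minimal sketch of what a 64-byte-aligned declaration could look like, reusing the `__attribute__((aligned(...)))` syntax already used in the thread. The array name `bb64` and the helper `is_aligned` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define LEN2 256

/* Same shape as the declaration in the thread, with alignment raised to 64. */
__attribute__((aligned(64))) float bb64[LEN2][LEN2];

/* Returns 1 if p sits on a boundary of `align` bytes (a power of two). */
int is_aligned(const void *p, uintptr_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return ((uintptr_t)p & (align - 1)) == 0;
}
```

Each row is 256 * 4 = 1024 bytes, a multiple of 64, so every row then starts on a cache-line boundary. Note that `&bb64[j][1]` is still 4 bytes past that boundary, so raising the buffer alignment does not by itself change the alignment of an inner loop that starts at i = 1.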
TimP
Black Belt
73 Views

This case is set up specifically to illustrate difficulties in parallelization or vectorization, and to trip up compilers.
Best optimization for AVX is obtained with parallelized outer loop and AVX intrinsics in the inner loop, but the Xeon 5660 has enough cores, and Intel compilers have improved enough, that most of the ultimate performance is obtained by achieving parallelization without vectorization or intrinsics. Simply vectorizing it, without parallelization or optimization of the sequential dependency, doesn't approach full performance. So this is one of those cases where management goals in the past have conflicted with performance, but you can satisfy either a goal of maximum threaded performance scaling or of vectorization without caring about performance, as you choose.
It's a rare enough situation in practice that skewing a compiler to get good vector performance on a small number of cores without specific hand coding isn't necessarily a valid goal.
susangao
Beginner
73 Views

Thank you for the reply. I am not quite clear about your question, so I will try to explain what I saw:
I learned whether each loop is vectorized from the auto-vectorization report given by icc (--vec-report=2). After I saw your advice above, I checked the .s file. It seems that the part for the first function's for loop (around lines 163 and 165) contains almost no vector instructions; for the second function, I can see obvious vector instructions such as unaligned moves (vmovups) operating on array bb.
susangao
Beginner
73 Views

Thank you very much for your helpful reply. I think I am starting to understand it.