Community
cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Beginner
6 Views

Performance degradation due to Auto Vectorization

Architecture:          x86_64 (Haswell with 6 cores)
Compiler Version: icc 15.0
Performance degradation while compiling with autovectorization(-O2) on the code snippet below:

#define N 200000
void foo()
{
	__declspec(align(64)) int a;
	int i,cnt=0;
	for(cnt=0;cnt<1000000;cnt++)
        {
		for(i = 2; i < N; i++)
		{
			a = a[i-2] + 1; 
		}
	}
}

Compilation method 1 with vectorization: icc -O2 <filename> -opt-report5
Result: Time taken (3m 24 sec)
Report says for loop above is getting vectorized with estimated potential speed up of about 1.2

Compilation method 2 without vectorization: icc <filename> -opt-report5  -O2 -no-vec
Result: Time taken (1m 08 sec)

Why is autovectorization degrading the performance even though estimated potential speedup is 1.2?

0 Kudos
4 Replies
Highlighted
6 Views

The body of your loop (one statement) has temporal dependencies that lies within the (or suitable) vector widths. This would be difficult to vectorize efficiently. Your best bet at optimization would be something like this:

int t1, t2;
a[2] = t1 = a[0] + 1;
a[3] = t2 = a[1] + 1;
int loopLimit = N&(~1);
for(i=4; i<loopLimit; i+=2) {
  a = ++t1; a[i+1] = ++t2; }
if(N&1) a[N-1] = ++t1;

In looking at the above loop, there should be a way to extend the number of temps to 8 and then vectorize that. The optimization gurus at Intel are very good, and I expect they could do this. However, as to if it would be as efficient as the scalar code I cannot make that guess.

Jim Dempsey

 

0 Kudos
Highlighted
Black Belt
6 Views

It's difficult to draw conclusions from an example involving dead code.

0 Kudos
Highlighted
Employee
6 Views

Hi Sharath,
As Tim pointed out, you have dead code and also wondering how did you vectorize with the code you have since the compiler will not? Changing the code to below should do so:

#define N 200000
int __declspec(align(64)) a;
void foo()
{
    int i, cnt;
 
  for (cnt = 0; cnt < 1000000; cnt++) {
       
 for (i = 2; i < N; i++) {
          
a = a[i - 2] + 1;
        
}
    
}
}

The scalar code is 3x faster with what I tried and the reason for this is due to scalar replacement. If you generate the asm code with -S option you'll notice that the vector code does "load, op and then store" while scalar code does only op and then store without any load.  Note that the vectorizer (or optimizers) in general cannot really predict what downstream optimizations would happen. It's based on heuristics and the compiler tries to build those heuristics well enough for a lot of generic programs but there could be some special ones such as this which may not. Hope this helps.

_Kittur 

 

0 Kudos
Highlighted
Employee
6 Views

Hi Sharath,
Here's a summary of your code investigation:

1) Your original code doesn't consume a[] and may perform dead code elimination so you should be aware of this
2) Vector parallelism is limited to 2-way, due to nature of the data dependency which the compiler knows about and uses it in it's cost modeling analysis
3) Of course, as I pointed out we can't reproduce what you see as is? In terms of the compiler message on speedup as well. How did you compiler and see that? Reason, the cost model shouldn't do half vector vectorization without arm twisting (like using simd etc)
4) Vectorizer (optimizers as well) in general can't really predict what downstream optimizations would happen like I said earlier. The heuristics generally work well enough for all general programs but there can certainly be outliners.
5) What we observed:  Vector code (w/o scalar replacement) is ~3x slower than scalar code w/scaler replacement optimization (vector code by SIMD directive or changing the array type to double).
6) Optimization heuristics may be different between 32bit and 64bit windows or Linux, fyi.

Hope the above helps.
_Kittur 

0 Kudos