Intel® C++ Compiler
Community support and assistance for creating C++ code that runs on platforms based on Intel® processors.

How to avoid "partial loop vectorization"?

Andreas_Klaedtke
Beginner
588 Views
Hi,

I am currently struggling to get the compiler to do what I want it to do on the following code (on a 32 bit system):
[cpp]void reduce (size_t const N,
             float const * RESTRICT const x, 
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
   float sum1 = 0;
   float sum2 = 0;
   float sum3 = 0;
   float sum4 = 0;
   float sum5 = 0;
   float sum6 = 0;
   float sum7 = 0;
   float sum8 = 0;
   float sum9 = 0;
   float sum10 = 0;
  
   for (size_t i = 0; i < N; ++i) {
      sum1 += x[i];
      sum2 += x[i] * x[i];
      sum3 += y[i];
      sum4 += y[i] * y[i];
      sum5 += z[i];
      sum6 += z[i] * z[i];
      sum7 += x[i] * y[i];
      sum8 += x[i] * z[i];
      sum9 += y[i] * z[i];
      sum10 += x[i] * y[i] * z[i];
   }
   A[0] = sum1;
   A[1] = sum2;
   A[2] = sum3;
   A[3] = sum4;
   A[4] = sum5;
   A[5] = sum6;
   A[6] = sum7;
   A[7] = sum8;
   A[8] = sum9;
   A[9] = sum10;
}[/cpp]

Now, the problem is that this vectorized nicely with the icpc version 11.1.056 (11.1 20091012) and the performance was about twice as good as without vectorization. Btw: I use -xSSE2 in this case as a minimum.
reductions.cc(56): (col. 4) remark: LOOP WAS VECTORIZED.

With version 12.0.0 20101006, it suddenly tries to partially vectorize:
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
This would not be a problem for me if the performance were on par with the old vectorized result. But it is even slower than the old unvectorized version, and so about four times slower than the old vectorized result.

How can you avoid this partial vectorization and get a result which is as good as with version 11.1?

Regards
Andreas


10 Replies
TimP
Honored Contributor III
The usual way to control automatic loop splitting is with the pragma, e.g.
[cpp]for (size_t i = 0; i < N; ++i) {
   sum1 += x[i];
   sum2 += x[i] * x[i];
   sum3 += y[i];
   sum4 += y[i] * y[i];
   sum5 += z[i];
#pragma distribute point
   sum6 += z[i] * z[i];
   sum7 += x[i] * y[i];
   sum8 += x[i] * z[i];
   sum9 += y[i] * z[i];
   sum10 += x[i] * y[i] * z[i];
}[/cpp]
would require the compiler to split into just 2 loops, rather than 10. This might work better if you could sort the reductions so that some of the variables are used on only one side of the split. There aren't enough registers in 32-bit mode to optimize this loop without splitting. I would consider what you found a serious enough performance regression to submit as a problem report.
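For illustration, here is a minimal sketch of what distributing the loop by hand could look like, with the reductions sorted so that z is read only in the second loop. The function name `reduce_split` is hypothetical, the unused `v` argument is dropped, and `RESTRICT` is left out to keep the sketch portable. Whether this beats the compiler's own split would have to be measured.

```cpp
#include <cstddef>

// Hypothetical hand-split version of reduce(): same ten sums, same layout in A.
// Loop 1 reads only x and y; loop 2 is the only loop that reads z.
void reduce_split(std::size_t N, const float* x, const float* y,
                  const float* z, float* A)
{
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum7 = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sum1 += x[i];
        sum2 += x[i] * x[i];
        sum3 += y[i];
        sum4 += y[i] * y[i];
        sum7 += x[i] * y[i];
    }

    float sum5 = 0, sum6 = 0, sum8 = 0, sum9 = 0, sum10 = 0;
    for (std::size_t i = 0; i < N; ++i) {
        sum5 += z[i];
        sum6 += z[i] * z[i];
        sum8 += x[i] * z[i];
        sum9 += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }

    A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
    A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}
```

Each loop now carries five accumulators instead of ten, at the cost of reading x and y twice.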
Andreas_Klaedtke
Beginner
Tim18,

Is there a way to avoid partial loop vectorization at all?

What I do not understand then, is why version 11.1 of the compiler seems to vectorize the entire loop (at least it says so, I have not looked at the assembler code yet), and 12.0 does not.

Aren't there at least 8 vector registers? This should be sufficient, should it not?
Load x[] into 0, y[] into 1, z[] into 2, and so on... ???

Regards
Andreas
TimP
Honored Contributor III
The 10 PARTIAL indications you showed add up to vectorization of the entire loop. It's certainly likely this doesn't give best performance, compared to splitting into a near optimum number of divisions.
You have requested accumulation of 10 sums, which would require at least 11 named registers to be available. Thus, vectorization isn't possible without either splitting the loop or spilling sums to the stack. As you hinted, it may be necessary to look at the asm code in order to get an idea of how your options are working.
Andreas_Klaedtke
Beginner
I tried the #pragma distribute point in the middle. This works. But it does not help, runtime wise.

I then thought I could be extra clever and put the distribute point right at the start of the loop, right after the for statement. Funnily enough: this works. The vector report states:
reductions.cc(69) (col. 4): remark: LOOP WAS VECTORIZED.
And the runtime is similar to the icpc 11.1.056 results.

The next thing I will be looking at is the assembler code, but this might take some time.
I will keep you posted.

Regards
Andreas
TimP
Honored Contributor III
Yes, distribute point at the top of the loop is intended to prevent splitting (beginning with icc 10.0), even at the expense of other optimizations. There are situations where this is the right thing to do. In your case, it is good that it works better than the other options, but it may indicate that the compiler isn't doing the right thing when the loop is split into groups of 3, 4, or 5. The compiler should handle groups of 2 easily, so the fact that it doesn't try that voluntarily is a bad sign.
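A minimal sketch of the placement discussed here, assuming the pragma spelling used in this thread. The function name `reduce_nosplit` is hypothetical, and the `__INTEL_COMPILER` guard is only there so that other compilers do not warn about an unknown pragma; on those compilers the function is just the plain loop.

```cpp
#include <cstddef>

// Hypothetical variant of reduce() with the pragma as the first thing in the
// loop body, which per this thread asks icc not to distribute (split) the loop.
void reduce_nosplit(std::size_t N, const float* x, const float* y,
                    const float* z, float* A)
{
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;
    float sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0, sum10 = 0;

    for (std::size_t i = 0; i < N; ++i) {
        // Intel-specific pragma; skipped entirely on other compilers.
#if defined(__INTEL_COMPILER)
#pragma distribute point
#endif
        sum1 += x[i];
        sum2 += x[i] * x[i];
        sum3 += y[i];
        sum4 += y[i] * y[i];
        sum5 += z[i];
        sum6 += z[i] * z[i];
        sum7 += x[i] * y[i];
        sum8 += x[i] * z[i];
        sum9 += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }

    A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
    A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}
```

The vectorization report is the place to check whether the compiler then reports a single "LOOP WAS VECTORIZED" remark for this loop, as Andreas observed.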
levicki
Valued Contributor I
You should file a bug report with a test case for this on Intel Premier Support because it is obviously a regression in optimization.
jimdempseyatthecove
Honored Contributor III
If N is sufficiently large, can you parallelize the code?

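[cpp]#pragma omp parallel sections num_threads(2)
{
#pragma omp section
   {
      for (size_t i = 0; i < N; ++i) {
         sum1 += x[i];
         sum2 += x[i] * x[i];
         sum7 += x[i] * y[i];
         sum3 += y[i];
         sum4 += y[i] * y[i];
      }
   }
#pragma omp section
   {
      for (size_t i = 0; i < N; ++i) {
         sum5 += z[i];
         sum6 += z[i] * z[i];
         sum8 += x[i] * z[i];
         sum9 += y[i] * z[i];
         sum10 += x[i] * y[i] * z[i];
      }
   }
}[/cpp]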

Jim Dempsey

Andreas_Klaedtke
Beginner
Jim,

Thanks for the hint, but parallelisation at this level is not advisable in the application.
The parallelisation happens at a more global level.
jimdempseyatthecove
Honored Contributor III
Andreas,

The following is some untested code (it compiles OK).

[cpp]// sum.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h>
using namespace std;

#define RESTRICT restrict

__declspec(noinline)
void reduce (size_t const N,
             float const * RESTRICT const x, 
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
	{
		__declspec(align(16))
		float temp_xyzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
		float sum_product_xyz = 0.0f;
		__m128 SSE_temp_xyzv;
		__m128 SSE_sum_xyzv = _mm_setzero_ps();
		__m128 SSE_sum_square_xyzv = _mm_setzero_ps();
	   for (size_t i = 0; i < N; ++i)
	   {
			temp_xyzv[0] = x[i];
			temp_xyzv[1] = y[i];
			temp_xyzv[2] = z[i];
			sum_product_xyz += x[i] * y[i] * z[i];
			SSE_temp_xyzv = _mm_load_ps(temp_xyzv);
			SSE_sum_xyzv = _mm_add_ps(SSE_sum_xyzv, SSE_temp_xyzv);
			SSE_sum_square_xyzv = _mm_add_ps(SSE_sum_square_xyzv, _mm_mul_ps(SSE_temp_xyzv, SSE_temp_xyzv));
	   }
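		// Note: this puts the plain sums in A[0..2] and the sums of squares in A[3..5],
		// not the interleaved layout used by the original reduce().
		_mm_storeu_ps(&A[0], SSE_sum_xyzv);
		_mm_storeu_ps(&A[3], SSE_sum_square_xyzv);
		A[9] = sum_product_xyz;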
	}
	{
		__m128 SSE_temp_xxyv;
		__m128 SSE_temp_yzzv;
		__m128 SSE_sum_xy_xz_yz_vv = _mm_setzero_ps();
	   for (size_t i = 0; i < N; ++i)
	   {
		   {
				__declspec(align(16))
				float temp_xxyv[4];
				temp_xxyv[0] = x[i];
				temp_xxyv[1] = x[i];
				temp_xxyv[2] = y[i];
				SSE_temp_xxyv = _mm_load_ps(temp_xxyv);
		   }
		   {
				__declspec(align(16))
				float temp_yzzv[4];
			   temp_yzzv[0] = y[i];
			   temp_yzzv[1] = z[i];
			   temp_yzzv[2] = z[i];
			   SSE_temp_yzzv = _mm_load_ps(temp_yzzv);
		   }
		SSE_sum_xy_xz_yz_vv = _mm_add_ps(SSE_sum_xy_xz_yz_vv, _mm_mul_ps(SSE_temp_xxyv, SSE_temp_yzzv));
	   }
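		// Lane 3 is unused; copy lanes 0..2 only so that A[9] (sum of x*y*z) is not overwritten.
		float sums[4];
		_mm_storeu_ps(sums, SSE_sum_xy_xz_yz_vv);
		A[6] = sums[0];
		A[7] = sums[1];
		A[8] = sums[2];
	}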
}

const size_t N = 1000;
float x[N];
float y[N];
float z[N];
float v[N];
float A[N*10]; 

int _tmain(int argc, _TCHAR* argv[])
{
	// reference variables so optimization doesn't eliminate code
	for(int i=0; i < N; ++i)
	{
		x[i] = i; y[i] = i; z[i] = i; v[i] = i;
	} //
	reduce (N, x, y, z, v, A);
	cout << A[0] << endl;
	return 0;
}

[/cpp]

Jim Dempsey
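For comparison, the code above packs x, y and z of a single element into the lanes of one register. A different approach, and roughly what an auto-vectorizer would be expected to do for this loop, is to process four consecutive elements per iteration and keep one vector accumulator per sum. The sketch below is only an illustration under that assumption: the names `reduce_sse` and `hsum` are hypothetical, unaligned loads are used so no alignment is assumed, and the unused `v` argument is dropped.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Add up the four lanes of an SSE register.
static float hsum(__m128 v)
{
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}

// Hypothetical hand-vectorized reduce(): four elements per iteration,
// results in the same A[0..9] layout as the original function.
void reduce_sse(std::size_t N, const float* x, const float* y,
                const float* z, float* A)
{
    __m128 acc[10];
    for (int k = 0; k < 10; ++k) acc[k] = _mm_setzero_ps();

    std::size_t i = 0;
    for (; i + 4 <= N; i += 4) {
        const __m128 vx = _mm_loadu_ps(x + i);
        const __m128 vy = _mm_loadu_ps(y + i);
        const __m128 vz = _mm_loadu_ps(z + i);
        const __m128 vxy = _mm_mul_ps(vx, vy);
        acc[0] = _mm_add_ps(acc[0], vx);
        acc[1] = _mm_add_ps(acc[1], _mm_mul_ps(vx, vx));
        acc[2] = _mm_add_ps(acc[2], vy);
        acc[3] = _mm_add_ps(acc[3], _mm_mul_ps(vy, vy));
        acc[4] = _mm_add_ps(acc[4], vz);
        acc[5] = _mm_add_ps(acc[5], _mm_mul_ps(vz, vz));
        acc[6] = _mm_add_ps(acc[6], vxy);
        acc[7] = _mm_add_ps(acc[7], _mm_mul_ps(vx, vz));
        acc[8] = _mm_add_ps(acc[8], _mm_mul_ps(vy, vz));
        acc[9] = _mm_add_ps(acc[9], _mm_mul_ps(vxy, vz));
    }

    // Collapse each vector accumulator to a scalar.
    for (int k = 0; k < 10; ++k) A[k] = hsum(acc[k]);

    // Scalar tail for the remaining N % 4 elements.
    for (; i < N; ++i) {
        A[0] += x[i];
        A[1] += x[i] * x[i];
        A[2] += y[i];
        A[3] += y[i] * y[i];
        A[4] += z[i];
        A[5] += z[i] * z[i];
        A[6] += x[i] * y[i];
        A[7] += x[i] * z[i];
        A[8] += y[i] * z[i];
        A[9] += x[i] * y[i] * z[i];
    }
}
```

Ten accumulators plus the loaded vectors need more than the eight XMM registers available in 32-bit mode, so some spilling to the stack is to be expected there, which matches the register-pressure point made earlier in the thread. The summation order also differs from the scalar loop, so results can differ in the last bits for general floating-point data.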
Dale_S_Intel
Employee
I'll go ahead and submit an issue on this apparent performance regression. I'll let you know what I find.

Dale