How to avoid "partial loop vectorization"?

Andreas_Klaedtke · ‎11-13-2010

Hi,

I am currently struggling to get the compiler to do what I want with the following code (on a 32-bit system):

[cpp]void reduce (size_t const N,
             float const * RESTRICT const x, 
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
   float sum1 = 0;
   float sum2 = 0;
   float sum3 = 0;
   float sum4 = 0;
   float sum5 = 0;
   float sum6 = 0;
   float sum7 = 0;
   float sum8 = 0;
   float sum9 = 0;
   float sum10 = 0;
  
   for (size_t i = 0; i < N; ++i) {
      sum1 += x[i];
      sum2 += x[i] * x[i];
      sum3 += y[i];
      sum4 += y[i] * y[i];
      sum5 += z[i];
      sum6 += z[i] * z[i];
      sum7 += x[i] * y[i];
      sum8 += x[i] * z[i];
      sum9 += y[i] * z[i];
      sum10 += x[i] * y[i] * z[i];
   }
   A[0] = sum1;
   A[1] = sum2;
   A[2] = sum3;
   A[3] = sum4;
   A[4] = sum5;
   A[5] = sum6;
   A[6] = sum7;
   A[7] = sum8;
   A[8] = sum9;
   A[9] = sum10;
}[/cpp]

With icpc version 11.1.056 (11.1 20091012) this loop vectorized nicely, and performance was about twice as good as without vectorization. (I compile with -xSSE2 as a minimum in this case.)
reductions.cc(56): (col. 4) remark: LOOP WAS VECTORIZED.

With version 12.0.0 20101006, it suddenly tries to partially vectorize:
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
This would not be a problem for me if the performance were on par with the old vectorized result. But it is even slower than the old unvectorized version, so four times slower than the old vectorized result.

How can I avoid this partial vectorization and get a result that is as good as with version 11.1?

Regards
Andreas

TimP · ‎11-15-2010

The usual way to control automatic loop splitting is with the pragma, e.g.

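for (size_t i = 0; i < N; ++i) {
   sum1 += x[i];
   sum2 += x[i] * x[i];
   sum3 += y[i];
   sum4 += y[i] * y[i];
   sum5 += z[i];
#pragma distribute_point
   sum6 += z[i] * z[i];
   sum7 += x[i] * y[i];
   sum8 += x[i] * z[i];
   sum9 += y[i] * z[i];
   sum10 += x[i] * y[i] * z[i];
}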

would require the compiler to split into just 2 loops, rather than 10. This might work better if you could sort the reductions so that some of the variables are used on only one side of the split. There aren't enough registers in 32-bit mode to optimize this loop without splitting. What you found I would consider a serious enough performance regression to submit as a problem report.
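As a rough illustration of the two-way split described above, here is the same reduction distributed by hand into two loops. This is only a sketch: the name `reduce_split` is hypothetical, the unused `v` argument is dropped, and the grouping follows the suggestion to sort the reductions so that `z` appears only on one side of the split. Whether this matches what the compiler generates from the pragma is not something this thread confirms.

```cpp
#include <cstddef>

// Hypothetical helper (not from the thread): the ten reductions split by
// hand into two loops. The first loop touches only x and y, the second
// loop carries everything that involves z.
void reduce_split(std::size_t n,
                  const float* x, const float* y, const float* z,
                  float* A)
{
    float sx = 0, sxx = 0, sy = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx  += x[i];
        sxx += x[i] * x[i];
        sy  += y[i];
        syy += y[i] * y[i];
        sxy += x[i] * y[i];
    }

    float sz = 0, szz = 0, sxz = 0, syz = 0, sxyz = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sz   += z[i];
        szz  += z[i] * z[i];
        sxz  += x[i] * z[i];
        syz  += y[i] * z[i];
        sxyz += x[i] * y[i] * z[i];
    }

    // Same output layout as the original reduce().
    A[0] = sx;  A[1] = sxx; A[2] = sy;  A[3] = syy; A[4] = sz;
    A[5] = szz; A[6] = sxy; A[7] = sxz; A[8] = syz; A[9] = sxyz;
}
```

Each loop now carries five accumulators instead of ten, at the cost of reading `x` and `y` twice.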

Andreas_Klaedtke · ‎11-15-2010

Tim18,

Is there a way to avoid partial loop vectorization at all?

What I do not understand, then, is why version 11.1 of the compiler seems to vectorize the entire loop (at least it says so; I have not looked at the assembler code yet) while 12.0 does not.

Aren't there at least 8 vector registers? This should be sufficient, should it not?
Load x[] into 0, y[] into 1, z[] into 2, and so on... ???

Regards
Andreas

TimP · ‎11-15-2010

The 10 PARTIAL indications you showed add up to vectorization of the entire loop. It is quite likely that this doesn't give the best performance compared to splitting into a near-optimum number of loops.
You have requested accumulation of 10 sums, which would require at least 11 named registers to be available. Thus, vectorization isn't possible without either splitting the loop or spilling sums to the stack. As you hinted, it may be necessary to look at the asm code to get an idea of how your options are working.
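To make the register argument concrete, here is a portable sketch of the shape a 4-wide vectorized reduction takes, emulated with a plain array instead of an SSE register. The names `Lanes4` and `sum_vectorized_shape` are made up for illustration. Each running sum becomes one 4-lane accumulator that must stay live for the whole loop, so ten sums want ten such accumulators plus registers for the loaded data, while 32-bit SSE code has only xmm0 through xmm7.

```cpp
#include <cstddef>

// Stand-in for one SSE register holding four floats (illustration only).
struct Lanes4 { float v[4]; };

// Hypothetical helper: sum(x) computed the way a 4-wide vectorized loop
// would do it, with four partial sums that are combined after the loop.
float sum_vectorized_shape(std::size_t n, const float* x)
{
    Lanes4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};  // one "register" per running sum
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (int lane = 0; lane < 4; ++lane)
            acc.v[lane] += x[i + lane];        // one packed add per iteration
    }
    // Horizontal add of the four lanes after the main loop.
    float total = (acc.v[0] + acc.v[1]) + (acc.v[2] + acc.v[3]);
    // Scalar remainder for the last n % 4 elements.
    for (; i < n; ++i)
        total += x[i];
    return total;
}
```

Note that the lane-wise summation order differs from the scalar loop, so results can differ in the last bits for general floating-point input.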

Andreas_Klaedtke · ‎11-15-2010

I tried #pragma distribute_point in the middle of the loop. This works, but it does not help runtime-wise.

I then thought I could be extra clever and put the distribute point right at the start of the loop, right after the for statement. Funnily enough: this works. The vector report states:
reductions.cc(69) (col. 4): remark: LOOP WAS VECTORIZED.
And the runtime is similar to the icpc 11.1.056 results.

The next thing I will be looking at is the assembler code, but this might take some time.
I will keep you posted.

Regards
Andreas

TimP · ‎11-15-2010

Yes, distribute point at the top of the loop is intended to prevent splitting (beginning with icc 10.0), even at the expense of other optimizations. There are situations where this is the right thing to do. In your case, it is good that it works better than the other options, but it may indicate that the compiler isn't doing the right thing when the loop is split into groups of 3, 4, or 5. The compiler should handle groups of 2 easily, so the fact that it doesn't try that voluntarily is a bad sign.
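For reference, the placement discussed here might look like the sketch below: the pragma is the first thing inside the loop body. The function name `reduce_nosplit` is hypothetical, and the `__INTEL_COMPILER` guard is an addition so that other compilers do not see an unknown pragma. That this placement suppresses loop distribution is what this thread reports for icpc 12.0; other versions may behave differently.

```cpp
#include <cstddef>

// Hypothetical variant of reduce() with the distribute point placed at the
// very top of the loop body, as described in the post above.
void reduce_nosplit(std::size_t n,
                    const float* x, const float* y, const float* z,
                    float* A)
{
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;
    float sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0, sum10 = 0;

    for (std::size_t i = 0; i < n; ++i) {
#if defined(__INTEL_COMPILER)
#pragma distribute_point   // Intel-specific; skipped for other compilers
#endif
        sum1 += x[i];
        sum2 += x[i] * x[i];
        sum3 += y[i];
        sum4 += y[i] * y[i];
        sum5 += z[i];
        sum6 += z[i] * z[i];
        sum7 += x[i] * y[i];
        sum8 += x[i] * z[i];
        sum9 += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }

    A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
    A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}
```

Checking the vectorization report after such a change is the only way to see whether the compiler reports LOOP WAS VECTORIZED or still distributes the loop.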

levicki · ‎12-16-2010

You should file a bug report with a test case for this on Intel Premier Support because it is obviously a regression in optimization.

jimdempseyatthecove · ‎12-16-2010

If N is sufficiently large, can you parallelize the code?

#pragma omp parallel sections num_threads(2)
{
#pragma omp section
   {
      for (size_t i = 0; i < N; ++i) {
         sum1 += x[i];
         sum2 += x[i] * x[i];
         sum7 += x[i] * y[i];
         sum3 += y[i];
         sum4 += y[i] * y[i];
      }
   }
#pragma omp section
   {
      for (size_t i = 0; i < N; ++i) {
         sum5 += z[i];
         sum6 += z[i] * z[i];
         sum8 += x[i] * z[i];
         sum9 += y[i] * z[i];
         sum10 += x[i] * y[i] * z[i];
      }
   }
}

Jim Dempsey

Andreas_Klaedtke · ‎12-29-2010

Jim,

Thanks for the hint, but parallelisation at this level is not advisable in the application.
The parallelisation happens at a more global level.

jimdempseyatthecove · ‎12-30-2010

Andreas,

The following is some untested code (it compiles OK).

[cpp]// sum.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h>
using namespace std;

#define RESTRICT restrict

__declspec(noinline)
void reduce (size_t const N,
             float const * RESTRICT const x, 
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
	{
		__declspec(align(16))
		float temp_xyzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
		float sum_product_xyz = 0.0f;
		__m128 SSE_temp_xyzv;
		__m128 SSE_sum_xyzv = _mm_setzero_ps();
		__m128 SSE_sum_square_xyzv = _mm_setzero_ps();
	   for (size_t i = 0; i < N; ++i)
	   {
			temp_xyzv[0] = x[i];
			temp_xyzv[1] = y[i];
			temp_xyzv[2] = z[i];
			sum_product_xyz += x[i] * y[i] * z[i];
			SSE_temp_xyzv = _mm_load_ps(temp_xyzv);
			SSE_sum_xyzv = _mm_add_ps(SSE_sum_xyzv, SSE_temp_xyzv);
			SSE_sum_square_xyzv = _mm_add_ps(SSE_sum_square_xyzv, _mm_mul_ps(SSE_temp_xyzv, SSE_temp_xyzv));
	   }
		_mm_storeu_ps(&A[0], SSE_sum_xyzv);
		_mm_storeu_ps(&A[3], SSE_sum_square_xyzv);
		A[9] = sum_product_xyz;
	}
	{
		__m128 SSE_temp_xxyv;
		__m128 SSE_temp_yzzv;
		__m128 SSE_sum_xy_xz_yz_vv = _mm_setzero_ps();
	   for (size_t i = 0; i < N; ++i)
	   {
		   {
				__declspec(align(16))
				float temp_xxyv[4];
				temp_xxyv[0] = x[i];
				temp_xxyv[1] = x[i];
				temp_xxyv[2] = y[i];
				SSE_temp_xxyv = _mm_load_ps(temp_xxyv);
		   }
		   {
				__declspec(align(16))
				float temp_yzzv[4];
				temp_yzzv[0] = y[i];
				temp_yzzv[1] = z[i];
				temp_yzzv[2] = z[i];
			   SSE_temp_yzzv = _mm_load_ps(temp_yzzv);
		   }
		SSE_sum_xy_xz_yz_vv = _mm_add_ps(SSE_sum_xy_xz_yz_vv, _mm_mul_ps(SSE_temp_xxyv, SSE_temp_yzzv));
	   }
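		// Copy only lanes 0..2; a 4-float store to &A[6] would overwrite A[9].
		float sum_xy_xz_yz_vv[4];
		_mm_storeu_ps(sum_xy_xz_yz_vv, SSE_sum_xy_xz_yz_vv);
		A[6] = sum_xy_xz_yz_vv[0]; A[7] = sum_xy_xz_yz_vv[1]; A[8] = sum_xy_xz_yz_vv[2];
	}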
}

const size_t N = 1000;
float x[N];
float y[N];
float z[N];
float v[N];
float A[N*10];

int _tmain(int argc, _TCHAR* argv[])
{
	// reference variables so optimization doesn't eliminate code
	for(int i=0; i < N; ++i)
	{
		x[i] = i; y[i] = i; z[i] = i; v[i] = i;
	} //
	reduce (N, x, y, z, v, A);
	cout << A[0] << endl;
	return 0;
}

[/cpp]

Jim Dempsey
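The packing idea in the code above can also be written without the temporary arrays, using `_mm_set_ps` to build each vector directly. The following is a sketch only, not a tuned implementation: `reduce_packed` is a hypothetical name, it assumes an x86 target with SSE, it folds the triple product into the fourth lane of the cross-product vector, and it writes the results back in the `A[0..9]` order of the original `reduce()`. No claim is made here about how it performs relative to the auto-vectorized loop.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Hypothetical helper: three packed accumulators cover all ten sums.
void reduce_packed(std::size_t n,
                   const float* x, const float* y, const float* z,
                   float* A)
{
    __m128 sum   = _mm_setzero_ps();  // lanes: x,   y,   z,   unused
    __m128 sq    = _mm_setzero_ps();  // lanes: x*x, y*y, z*z, unused
    __m128 cross = _mm_setzero_ps();  // lanes: x*y, x*z, y*z, x*y*z

    for (std::size_t i = 0; i < n; ++i) {
        // _mm_set_ps takes its arguments from lane 3 down to lane 0.
        const __m128 xyz = _mm_set_ps(0.0f, z[i], y[i], x[i]);
        const __m128 lhs = _mm_set_ps(x[i] * y[i], y[i], x[i], x[i]);
        const __m128 rhs = _mm_set_ps(z[i], z[i], z[i], y[i]);

        sum   = _mm_add_ps(sum, xyz);
        sq    = _mm_add_ps(sq, _mm_mul_ps(xyz, xyz));
        cross = _mm_add_ps(cross, _mm_mul_ps(lhs, rhs));
    }

    float s[4], q[4], c[4];
    _mm_storeu_ps(s, sum);
    _mm_storeu_ps(q, sq);
    _mm_storeu_ps(c, cross);

    // Scatter the lanes into the layout used by the original reduce().
    A[0] = s[0]; A[1] = q[0];
    A[2] = s[1]; A[3] = q[1];
    A[4] = s[2]; A[5] = q[2];
    A[6] = c[0]; A[7] = c[1]; A[8] = c[2]; A[9] = c[3];
}
```

This variant packs across the different sums rather than across loop iterations, so it processes one element of `x`, `y`, and `z` per iteration.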

Dale_S_Intel · ‎12-30-2010

I'll go ahead and submit an issue on this apparent performance regression. I'll let you know what I find.

Dale