Hi,
I am currently struggling to get the compiler to do what I want it to do on the following code (on a 32-bit system):
[cpp]void reduce (size_t const N,
             float const * RESTRICT const x,
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
   float sum1 = 0;
   float sum2 = 0;
   float sum3 = 0;
   float sum4 = 0;
   float sum5 = 0;
   float sum6 = 0;
   float sum7 = 0;
   float sum8 = 0;
   float sum9 = 0;
   float sum10 = 0;
   for (size_t i = 0; i < N; ++i)
   {
      sum1  += x[i];
      sum2  += x[i] * x[i];
      sum3  += y[i];
      sum4  += y[i] * y[i];
      sum5  += z[i];
      sum6  += z[i] * z[i];
      sum7  += x[i] * y[i];
      sum8  += x[i] * z[i];
      sum9  += y[i] * z[i];
      sum10 += x[i] * y[i] * z[i];
   }
   A[0] = sum1;
   A[1] = sum2;
   A[2] = sum3;
   A[3] = sum4;
   A[4] = sum5;
   A[5] = sum6;
   A[6] = sum7;
   A[7] = sum8;
   A[8] = sum9;
   A[9] = sum10;
}[/cpp]
Now, the problem is that this vectorized nicely with icpc version 11.1.056 (11.1 20091012), and the performance was about twice as good as without vectorization. By the way, I use -xSSE2 in this case as a minimum.
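(A compile line along these lines should reproduce the remark below; the -vec-report2 option is an assumption on the editor's part, only -xSSE2 is stated in the post:)
icpc -xSSE2 -vec-report2 -c reductions.cc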
reductions.cc(56): (col. 4) remark: LOOP WAS VECTORIZED.
With version 12.0.0 20101006, it suddenly tries to partially vectorize:
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
This would not be a problem if the performance were on par with the old vectorized result. But it is even slower than the old unvectorized version, i.e. four times slower than the old vectorized result.
How can I avoid this partial vectorization and get a result that is as good as with version 11.1?
Regards
Andreas
The usual way to control automatic loop splitting is with the distribute point pragma, e.g.:
[cpp]for (size_t i = 0; i < N; ++i) {
   sum1 += x[i];
   sum2 += x[i] * x[i];
   sum3 += y[i];
   sum4 += y[i] * y[i];
   sum5 += z[i];
#pragma distribute point
   sum6  += z[i] * z[i];
   sum7  += x[i] * y[i];
   sum8  += x[i] * z[i];
   sum9  += y[i] * z[i];
   sum10 += x[i] * y[i] * z[i];
}[/cpp]
This asks the compiler to distribute (split) the loop at the marked point, here into two loops of five sums each.
Tim18,
Is there a way to avoid partial loop vectorization at all?
What I do not understand, then, is why version 11.1 of the compiler seems to vectorize the entire loop (at least it says so; I have not looked at the assembler code yet), while 12.0 does not.
Aren't there at least 8 vector registers? That should be sufficient, shouldn't it?
Load x[] into register 0, y[] into 1, z[] into 2, and so on?
Regards
Andreas
The 10 PARTIAL remarks you showed add up to vectorization of the entire loop. It is quite likely that this doesn't give the best performance, compared with splitting into a near-optimum number of divisions.
You have requested accumulation of 10 sums, which would require at least 11 named registers to be available. Thus, vectorization isn't possible without either splitting the loop or spilling sums to the stack. As you hinted, it may be necessary to look at the asm code in order to get an idea of how your options are working.
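To illustrate the splitting option, here is a manual two-way split (an untested sketch, not code from this thread) that keeps each loop at five accumulators; five accumulators plus the three input vectors fit exactly in the eight XMM registers of 32-bit SSE2:
[cpp]// First loop: the three plain sums and two of the squares
for (size_t i = 0; i < N; ++i) {
   sum1 += x[i];
   sum2 += x[i] * x[i];
   sum3 += y[i];
   sum4 += y[i] * y[i];
   sum5 += z[i];
}
// Second loop: the remaining square and the four products
for (size_t i = 0; i < N; ++i) {
   sum6  += z[i] * z[i];
   sum7  += x[i] * y[i];
   sum8  += x[i] * z[i];
   sum9  += y[i] * z[i];
   sum10 += x[i] * y[i] * z[i];
}[/cpp]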
I tried the #pragma distribute point in the middle. This works, but it does not help runtime-wise.
I then thought I could be extra clever and put the distribute point right at the start of the loop, right after the for statement. Funnily enough, this works. The vector report states:
reductions.cc(69) (col. 4): remark: LOOP WAS VECTORIZED.
And the runtime is similar to the icpc 11.1.056 results.
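To make the placement concrete, this is a sketch of the loop from the first post with the pragma directly after the for statement:
[cpp]for (size_t i = 0; i < N; ++i) {
#pragma distribute point   // at the top of the loop body: suppresses loop splitting
   sum1 += x[i];
   sum2 += x[i] * x[i];
   // ... sums 3 through 9 as in the original loop ...
   sum10 += x[i] * y[i] * z[i];
}[/cpp]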
The next thing I will be looking at is the assembler code, but this might take some time.
I will keep you posted.
Regards
Andreas
Yes, a distribute point at the top of the loop is intended to prevent splitting (beginning with icc 10.0), even at the expense of other optimizations. There are situations where this is the right thing to do. In your case, it is good that it works better than the other options, but it may indicate that the compiler isn't doing the right thing when the loop is split into groups of 3, 4, or 5. The compiler should handle groups of 2 easily, so the fact that it doesn't try that voluntarily is a bad sign.
You should file a bug report with a test case for this on Intel Premier Support because it is obviously a regression in optimization.
If N is sufficiently large, can you parallelize the code?
[cpp]#pragma omp parallel sections num_threads(2)
{
   #pragma omp section
   {
      for (size_t i = 0; i < N; ++i) {
         sum1 += x[i];
         sum2 += x[i] * x[i];
         sum7 += x[i] * y[i];
         sum3 += y[i];
         sum4 += y[i] * y[i];
      }
   }
   #pragma omp section
   {
      for (size_t i = 0; i < N; ++i) {
         sum5  += z[i];
         sum6  += z[i] * z[i];
         sum8  += x[i] * z[i];
         sum9  += y[i] * z[i];
         sum10 += x[i] * y[i] * z[i];
      }
   }
}[/cpp]
Jim Dempsey
Jim,
Thanks for the hint, but parallelisation at this level is not advisable in the application.
The parallelisation happens at a more global level.
Andreas,
The following is some untested code (it compiles OK).
[cpp]// sum.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h>

using namespace std;

#define RESTRICT restrict

__declspec(noinline)
void reduce (size_t const N,
             float const * RESTRICT const x,
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
   // Written to A[9] at the very end, because the overlapping
   // vector stores below would otherwise clobber it.
   float sum_product_xyz = 0.0f;
   {
      __declspec(align(16)) float temp_xyzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
      __m128 SSE_temp_xyzv;
      __m128 SSE_sum_xyzv = _mm_setzero_ps();
      __m128 SSE_sum_square_xyzv = _mm_setzero_ps();
      for (size_t i = 0; i < N; ++i)
      {
         temp_xyzv[0] = x[i];
         temp_xyzv[1] = y[i];
         temp_xyzv[2] = z[i];
         sum_product_xyz += x[i] * y[i] * z[i];
         SSE_temp_xyzv = _mm_load_ps(temp_xyzv);
         SSE_sum_xyzv = _mm_add_ps(SSE_sum_xyzv, SSE_temp_xyzv);
         SSE_sum_square_xyzv = _mm_add_ps(SSE_sum_square_xyzv,
                                          _mm_mul_ps(SSE_temp_xyzv, SSE_temp_xyzv));
      }
      // Deliberately overlapping stores: the unused fourth lane of each
      // vector is overwritten by the next store.
      _mm_storeu_ps(&A[0], SSE_sum_xyzv);        // A[0..2] = sum x, y, z
      _mm_storeu_ps(&A[3], SSE_sum_square_xyzv); // A[3..5] = sum x^2, y^2, z^2
   }
   {
      __m128 SSE_temp_xxyv;
      __m128 SSE_temp_yzzv;
      __m128 SSE_sum_xy_xz_yz_vv = _mm_setzero_ps();
      for (size_t i = 0; i < N; ++i)
      {
         {
            __declspec(align(16)) float temp_xxyv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
            temp_xxyv[0] = x[i];
            temp_xxyv[1] = x[i];
            temp_xxyv[2] = y[i];
            SSE_temp_xxyv = _mm_load_ps(temp_xxyv);
         }
         {
            __declspec(align(16)) float temp_yzzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
            temp_yzzv[0] = y[i];
            temp_yzzv[1] = z[i];
            temp_yzzv[2] = z[i];
            SSE_temp_yzzv = _mm_load_ps(temp_yzzv);
         }
         SSE_sum_xy_xz_yz_vv = _mm_add_ps(SSE_sum_xy_xz_yz_vv,
                                          _mm_mul_ps(SSE_temp_xxyv, SSE_temp_yzzv));
      } // for (size_t i = 0; i < N; ++i)
      _mm_storeu_ps(&A[6], SSE_sum_xy_xz_yz_vv); // A[6..8] = sum xy, xz, yz
   }
   A[9] = sum_product_xyz;
}

const size_t N = 1000;
float x[N];
float y[N];
float z[N];
float v[N];
float A[N*10];

int _tmain(int argc, _TCHAR* argv[])
{
   // reference variables so optimization doesn't eliminate code
   for (int i = 0; i < N; ++i)
   {
      x[i] = i;
      y[i] = i;
      z[i] = i;
      v[i] = i;
   }
   //
   reduce (N, x, y, z, v, A);
   cout << A[0] << endl;
   return 0;
}[/cpp]
Jim Dempsey
I'll go ahead and submit an issue on this apparent performance regression. I'll let you know what I find.
Dale
