
Andreas_Klaedtke

Beginner


11-13-2010
03:40 PM

94 Views

How to avoid "partial loop vectorization"?

I am currently struggling to get the compiler to do what I want with the following code (on a 32-bit system):

[cpp]
void reduce (size_t const N,
             float const * RESTRICT const x,
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
    float sum1 = 0;
    float sum2 = 0;
    float sum3 = 0;
    float sum4 = 0;
    float sum5 = 0;
    float sum6 = 0;
    float sum7 = 0;
    float sum8 = 0;
    float sum9 = 0;
    float sum10 = 0;
    for (size_t i = 0; i < N; ++i) {
        sum1 += x[i];
        sum2 += x[i] * x[i];
        sum3 += y[i];
        sum4 += y[i] * y[i];
        sum5 += z[i];
        sum6 += z[i] * z[i];
        sum7 += x[i] * y[i];
        sum8 += x[i] * z[i];
        sum9 += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }
    A[0] = sum1;
    A[1] = sum2;
    A[2] = sum3;
    A[3] = sum4;
    A[4] = sum5;
    A[5] = sum6;
    A[6] = sum7;
    A[7] = sum8;
    A[8] = sum9;
    A[9] = sum10;
}
[/cpp]

Now, the problem is that this vectorized nicely with icpc version 11.1.056 (11.1 20091012), and the performance was about twice as good as without vectorization. By the way, I use -xSSE2 as a minimum in this case.

reductions.cc(56): (col. 4) remark: LOOP WAS VECTORIZED.

With version 12.0.0 20101006, it suddenly tries to partially vectorize:

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

This would not be a problem for me if the performance were on par with the old vectorized result. But it is even slower than the old unvectorized version, so four times slower than the old vectorized result.

How can you avoid this partial vectorization and get a result which is as good as with version 11.1?

Regards

Andreas


10 Replies

TimP

Black Belt


11-15-2010
06:08 AM


    for (size_t i = 0; i < N; ++i) {
        sum1 += x[i];
        sum2 += x[i] * x[i];
        sum3 += y[i];
        sum4 += y[i] * y[i];
        sum5 += z[i];
    #pragma distribute point
        sum6 += z[i] * z[i];
        sum7 += x[i] * y[i];
        sum8 += x[i] * z[i];
        sum9 += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }

Andreas_Klaedtke

Beginner


11-15-2010
07:09 AM


Is there a way to avoid partial loop vectorization at all?

What I do not understand, then, is why version 11.1 of the compiler seems to vectorize the entire loop (at least it says so; I have not looked at the assembler code yet) and 12.0 does not.

Aren't there at least 8 vector registers? This should be sufficient, should it not?

Load x[] into 0, y[] into 1, z[] into 2, and so on... ???

Regards

Andreas
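To make the register question concrete, here is a minimal hand-written sketch of what vectorizing such a reduction over `i` can look like with SSE intrinsics, restricted to three of the ten sums. The function name `reduce3_sse`, the helper `hsum`, and the three-element output layout are assumptions made for this illustration and do not appear in the thread; the compiler's own code generation may well differ. Each running sum occupies one vector register for the whole loop, in addition to the registers holding the loaded `x` and `y` values.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Add up the four lanes of an SSE register.
static float hsum(__m128 v) {
    float lanes[4];
    _mm_storeu_ps(lanes, v);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}

// Hypothetical helper: computes sum(x), sum(x*x) and sum(x*y) only.
// out[0] = sum x, out[1] = sum x*x, out[2] = sum x*y (layout chosen for this sketch).
void reduce3_sse(std::size_t n, const float* x, const float* y, float* out) {
    assert(x && y && out);
    __m128 acc_x  = _mm_setzero_ps();  // one vector accumulator per sum
    __m128 acc_xx = _mm_setzero_ps();
    __m128 acc_xy = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);  // four consecutive x values
        __m128 vy = _mm_loadu_ps(y + i);  // four consecutive y values
        acc_x  = _mm_add_ps(acc_x, vx);
        acc_xx = _mm_add_ps(acc_xx, _mm_mul_ps(vx, vx));
        acc_xy = _mm_add_ps(acc_xy, _mm_mul_ps(vx, vy));
    }
    float sx = hsum(acc_x), sxx = hsum(acc_xx), sxy = hsum(acc_xy);
    for (; i < n; ++i) {  // scalar remainder when n is not a multiple of 4
        sx  += x[i];
        sxx += x[i] * x[i];
        sxy += x[i] * y[i];
    }
    out[0] = sx;
    out[1] = sxx;
    out[2] = sxy;
}
```

Extending this pattern to all ten sums would need ten accumulators plus the registers for the loaded inputs and intermediate products, which is the count that matters when only eight XMM registers are available.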

TimP

Black Belt


11-15-2010
08:25 AM


You have requested accumulation of 10 sums, which would require at least 11 named registers to be available. Thus, vectorization isn't possible without either splitting the loop or spilling sums to the stack. As you hinted, it may be necessary to look at the asm code to get an idea of how your options are working.
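Splitting the loop can also be done by hand instead of leaving it to the compiler. The following is a minimal sketch of that idea, assuming the same output layout as the original `reduce`; the name `reduce_split` is made up for this example, the unused `v` parameter is dropped, and the `RESTRICT` qualifiers are omitted for brevity. Each loop carries only five accumulators. Whether that is enough for a particular compiler version to vectorize both loops without spills is not something this sketch can guarantee; the vectorization report and the generated assembly are the places to check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical hand-split variant: the ten sums are computed in two loops
// with five accumulators each, instead of one loop with ten.
void reduce_split(std::size_t n, const float* x, const float* y,
                  const float* z, float* A) {
    assert(x && y && z && A);

    // First loop: sums 1 to 5.
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum1 += x[i];
        sum2 += x[i] * x[i];
        sum3 += y[i];
        sum4 += y[i] * y[i];
        sum5 += z[i];
    }

    // Second loop: sums 6 to 10. The inputs are read a second time.
    float sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0, sum10 = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sum6  += z[i] * z[i];
        sum7  += x[i] * y[i];
        sum8  += x[i] * z[i];
        sum9  += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }

    A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
    A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}
```

The trade-off is that `x`, `y` and `z` are streamed through twice, so for arrays that do not fit in cache the split version pays extra memory traffic in exchange for lower register pressure.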

Andreas_Klaedtke

Beginner


11-15-2010
11:09 AM


I then thought I could be extra clever and put the distribute point right at the start of the loop, right after the for statement. Funnily enough: this works. The vector report states:

reductions.cc(69) (col. 4): remark: LOOP WAS VECTORIZED.

And the runtime is similar to the icpc 11.1.056 results.

The next thing I will be looking at is the assembler code, but this might take some time.

I will keep you posted.

Regards

Andreas
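For reference, the placement described in this post can be sketched as follows. The pragma is spelled as it appears in this thread, the function name `reduce_pragma_at_top` is invented for the example, and the unused `v` parameter and the `RESTRICT` qualifiers are left out. Compilers that do not recognise the pragma should ignore it (possibly with a warning), in which case this is just the plain loop; the effect on icpc 12.0 is the one reported above, not something the code itself can demonstrate.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant of reduce() with the distribute point placed as the
// first line of the loop body, directly after the for statement.
void reduce_pragma_at_top(std::size_t n, const float* x, const float* y,
                          const float* z, float* A) {
    assert(x && y && z && A);
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;
    float sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0, sum10 = 0;
    for (std::size_t i = 0; i < n; ++i) {
#pragma distribute point
        sum1  += x[i];
        sum2  += x[i] * x[i];
        sum3  += y[i];
        sum4  += y[i] * y[i];
        sum5  += z[i];
        sum6  += z[i] * z[i];
        sum7  += x[i] * y[i];
        sum8  += x[i] * z[i];
        sum9  += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }
    A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
    A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}
```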

TimP

Black Belt


11-15-2010
02:15 PM


ILevi1

Valued Contributor I


12-16-2010
07:27 AM


jimdempseyatthecove

Black Belt


12-16-2010
11:39 AM


    #pragma omp parallel sections num_threads(2)
    {
    #pragma omp section
        {
            for (size_t i = 0; i < N; ++i) {
                sum1 += x[i];
                sum2 += x[i] * x[i];
                sum7 += x[i] * y[i];
                sum3 += y[i];
                sum4 += y[i] * y[i];
            }
        }
    #pragma omp section
        {
            for (size_t i = 0; i < N; ++i) {
                …

Andreas_Klaedtke

Beginner


12-29-2010
03:01 PM


Thanks for the hint, but parallelisation at this level is not advisable in the application.

The parallelisation happens at a more global level.

jimdempseyatthecove

Black Belt


12-30-2010
09:15 AM


The following is some untested code (it compiles OK).

[cpp]
// sum.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h> // SSE intrinsics (__m128, _mm_*_ps)
using namespace std;

#define RESTRICT restrict

// Note: the output layout differs from the original reduce():
//   A[0..2] = sum x,   sum y,   sum z
//   A[3..5] = sum x*x, sum y*y, sum z*z
//   A[6..8] = sum x*y, sum x*z, sum y*z
//   A[9]    = sum x*y*z
__declspec(noinline)
void reduce (size_t const N,
             float const * RESTRICT const x,
             float const * RESTRICT const y,
             float const * RESTRICT const z,
             float const * RESTRICT const v,
             float * RESTRICT const A)
{
    float sum_product_xyz = 0.0f;
    {
        // Lane 3 stays zero; only x, y and z are packed into the register.
        __declspec(align(16)) float temp_xyzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        __m128 SSE_temp_xyzv;
        __m128 SSE_sum_xyzv = _mm_setzero_ps();
        __m128 SSE_sum_square_xyzv = _mm_setzero_ps();
        for (size_t i = 0; i < N; ++i) {
            temp_xyzv[0] = x[i];
            temp_xyzv[1] = y[i];
            temp_xyzv[2] = z[i];
            sum_product_xyz += x[i] * y[i] * z[i];
            SSE_temp_xyzv = _mm_load_ps(temp_xyzv);
            SSE_sum_xyzv = _mm_add_ps(SSE_sum_xyzv, SSE_temp_xyzv);
            SSE_sum_square_xyzv = _mm_add_ps(SSE_sum_square_xyzv,
                                             _mm_mul_ps(SSE_temp_xyzv, SSE_temp_xyzv));
        }
        _mm_storeu_ps(&A[0], SSE_sum_xyzv);        // writes A[0..3]
        _mm_storeu_ps(&A[3], SSE_sum_square_xyzv); // writes A[3..6]
    }
    {
        __m128 SSE_temp_xxyv;
        __m128 SSE_temp_yzzv;
        __m128 SSE_sum_xy_xz_yz_vv = _mm_setzero_ps();
        for (size_t i = 0; i < N; ++i) {
            {
                // Zero-initialized so the unused lane 3 is never read uninitialized.
                __declspec(align(16)) float temp_xxyv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
                temp_xxyv[0] = x[i];
                temp_xxyv[1] = x[i];
                temp_xxyv[2] = y[i];
                SSE_temp_xxyv = _mm_load_ps(temp_xxyv);
            }
            {
                __declspec(align(16)) float temp_yzzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
                temp_yzzv[0] = y[i];
                temp_yzzv[1] = z[i];
                temp_yzzv[2] = z[i];
                SSE_temp_yzzv = _mm_load_ps(temp_yzzv);
            }
            // Lanes 0..2 accumulate x*y, x*z and y*z.
            SSE_sum_xy_xz_yz_vv = _mm_add_ps(SSE_sum_xy_xz_yz_vv,
                                             _mm_mul_ps(SSE_temp_xxyv, SSE_temp_yzzv));
        }
        _mm_storeu_ps(&A[6], SSE_sum_xy_xz_yz_vv); // writes A[6..9]
    }
    // Assigned last: the four-wide store to &A[6] above also writes A[9].
    A[9] = sum_product_xyz;
}

const size_t N = 1000;
float x[N];
float y[N];
float z[N];
float v[N];
float A[N*10];

int _tmain(int argc, _TCHAR* argv[])
{
    // reference variables so optimization doesn't eliminate code
    for (int i = 0; i < N; ++i) {
        x[i] = i;
        y[i] = i;
        z[i] = i;
        v[i] = i;
    }
    //
    reduce (N, x, y, z, v, A);
    cout << A[0] << endl;
    return 0;
}
[/cpp]

Jim Dempsey

Dale_S_Intel

Employee


12-30-2010
03:36 PM


Dale
