
I am currently struggling to get the compiler to do what I want with the following code (on a 32-bit system):

[cpp]void reduce (size_t const N,
             float const * RESTRICT const x, float const * RESTRICT const y,
             float const * RESTRICT const z, float const * RESTRICT const v,
             float * RESTRICT const A)
{
   float sum1 = 0; float sum2 = 0; float sum3 = 0; float sum4 = 0; float sum5 = 0;
   float sum6 = 0; float sum7 = 0; float sum8 = 0; float sum9 = 0; float sum10 = 0;
   for (size_t i = 0; i < N; ++i)
   {
      sum1  += x[i];
      sum2  += x[i] * x[i];
      sum3  += y[i];
      sum4  += y[i] * y[i];
      sum5  += z[i];
      sum6  += z[i] * z[i];
      sum7  += x[i] * y[i];
      sum8  += x[i] * z[i];
      sum9  += y[i] * z[i];
      sum10 += x[i] * y[i] * z[i];
   }
   A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
   A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}[/cpp]

Now, the problem is that this loop vectorized nicely with icpc version 11.1.056 (11.1 20091012), and the performance was about twice as good as without vectorization. By the way, I use -xSSE2 as a minimum in this case.

reductions.cc(56): (col. 4) remark: LOOP WAS VECTORIZED.

With version 12.0.0 20101006, it suddenly tries to partially vectorize:

reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.
reductions.cc(57) (col. 4): remark: PARTIAL LOOP WAS VECTORIZED.

This would not be a problem if the performance were on par with the old vectorized result. But it is even slower than the old unvectorized version, i.e. four times slower than the old vectorized result.

How can I avoid this partial vectorization and get performance as good as with version 11.1?

Regards

Andreas


10 Replies


You could split the loop with a distribute point:

[cpp]for (size_t i = 0; i < N; ++i)
{
   sum1  += x[i];
   sum2  += x[i] * x[i];
   sum3  += y[i];
   sum4  += y[i] * y[i];
   sum5  += z[i];
#pragma distribute point
   sum6  += z[i] * z[i];
   sum7  += x[i] * y[i];
   sum8  += x[i] * z[i];
   sum9  += y[i] * z[i];
   sum10 += x[i] * y[i] * z[i];
}[/cpp]


Is there a way to avoid partial loop vectorization at all?

What I do not understand, then, is why version 11.1 of the compiler seems to vectorize the entire loop (at least it says so; I have not looked at the assembler code yet), while 12.0 does not.

Aren't there at least 8 vector registers? That should be sufficient, shouldn't it?

Load x[] into register 0, y[] into register 1, z[] into register 2, and so on?

Regards

Andreas


You have requested accumulation of 10 sums, which would require at least 11 named registers to be available. Thus, vectorization isn't possible without either splitting the loop or spilling sums to the stack. As you hinted, it may be necessary to look at the asm code to get an idea of how your options are working.
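For illustration, here is a minimal sketch (not from the original post; names are made up) of splitting the reduction into two loops of five accumulators each, so that each loop's sums fit comfortably in the 8 XMM registers available in 32-bit mode:

```cpp
#include <cstddef>
#include <cassert>

// Sketch: split the 10-sum reduction into two loops of 5 accumulators
// each, so each loop needs fewer named registers and can be vectorized
// as a whole.  Output layout matches the original reduce():
// A[0..9] = sum x, sum x^2, sum y, sum y^2, sum z,
//           sum z^2, sum x*y, sum x*z, sum y*z, sum x*y*z.
void reduce_split(std::size_t N,
                  const float* x, const float* y, const float* z,
                  float* A)
{
    float s1 = 0, s2 = 0, s3 = 0, s4 = 0, s5 = 0;
    for (std::size_t i = 0; i < N; ++i) {
        s1 += x[i];
        s2 += x[i] * x[i];
        s3 += y[i];
        s4 += y[i] * y[i];
        s5 += z[i];
    }
    A[0] = s1; A[1] = s2; A[2] = s3; A[3] = s4; A[4] = s5;

    float s6 = 0, s7 = 0, s8 = 0, s9 = 0, s10 = 0;
    for (std::size_t i = 0; i < N; ++i) {
        s6  += z[i] * z[i];
        s7  += x[i] * y[i];
        s8  += x[i] * z[i];
        s9  += y[i] * z[i];
        s10 += x[i] * y[i] * z[i];
    }
    A[5] = s6; A[6] = s7; A[7] = s8; A[8] = s9; A[9] = s10;
}
```

The trade-off is that the inputs are streamed through twice, but each pass now has a realistic chance of being fully vectorized.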


I then thought I could be extra clever and put the distribute point right at the start of the loop, immediately after the for statement. Funnily enough, this works. The vectorization report states:

reductions.cc(69) (col. 4): remark: LOOP WAS VECTORIZED.

And the runtime is similar to the icpc 11.1.056 result.

The next thing I will look at is the assembler code, but this might take some time.
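For reference, a sketch of the placement described above (the pragma sits immediately after the for statement; RESTRICT qualifiers omitted here for brevity, and other compilers will simply warn about the unknown Intel-specific pragma):

```cpp
#include <cstddef>
#include <cassert>

// Sketch of the "distribute point at the top of the loop body" placement:
// the Intel-specific hint lets icpc choose where to split the loop itself,
// instead of partially vectorizing.  Semantics are identical to the
// original reduce(); only the pragma placement is the point here.
void reduce_hinted(std::size_t N,
                   const float* x, const float* y, const float* z,
                   float* A)
{
    float sum1 = 0, sum2 = 0, sum3 = 0, sum4 = 0, sum5 = 0;
    float sum6 = 0, sum7 = 0, sum8 = 0, sum9 = 0, sum10 = 0;
    for (std::size_t i = 0; i < N; ++i) {
#pragma distribute point   // placed immediately after the for statement
        sum1  += x[i];
        sum2  += x[i] * x[i];
        sum3  += y[i];
        sum4  += y[i] * y[i];
        sum5  += z[i];
        sum6  += z[i] * z[i];
        sum7  += x[i] * y[i];
        sum8  += x[i] * z[i];
        sum9  += y[i] * z[i];
        sum10 += x[i] * y[i] * z[i];
    }
    A[0] = sum1; A[1] = sum2; A[2] = sum3; A[3] = sum4; A[4] = sum5;
    A[5] = sum6; A[6] = sum7; A[7] = sum8; A[8] = sum9; A[9] = sum10;
}
```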

I will keep you posted.

Regards

Andreas




[cpp]#pragma omp parallel sections num_threads(2)
{
   #pragma omp section
   {
      for (size_t i = 0; i < N; ++i)
      {
         sum1 += x[i];
         sum2 += x[i] * x[i];
         sum7 += x[i] * y[i];
         sum3 += y[i];
         sum4 += y[i] * y[i];
      }
   }
   #pragma omp section
   {
      for (size_t i = 0; i < N; ++i)
      {
         // remaining sums (this section was truncated in the original post)
         sum5  += z[i];
         sum6  += z[i] * z[i];
         sum8  += x[i] * z[i];
         sum9  += y[i] * z[i];
         sum10 += x[i] * y[i] * z[i];
      }
   }
}[/cpp]


Thanks for the hint, but parallelisation at this level is not advisable in this application.

The parallelisation happens at a more global level.


The following is some untested code (it compiles OK).

[cpp]// sum.cpp : Defines the entry point for the console application.
//
#include "stdafx.h"
#include <iostream>
#include <xmmintrin.h>
using namespace std;

#define RESTRICT restrict

__declspec(noinline)
void reduce (size_t const N,
             float const * RESTRICT const x, float const * RESTRICT const y,
             float const * RESTRICT const z, float const * RESTRICT const v,
             float * RESTRICT const A)
{
   {
      __declspec(align(16)) float temp_xyzv[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
      float sum_product_xyz = 0.0f;
      __m128 SSE_temp_xyzv;
      __m128 SSE_sum_xyzv = _mm_setzero_ps();
      __m128 SSE_sum_square_xyzv = _mm_setzero_ps();
      for (size_t i = 0; i < N; ++i)
      {
         temp_xyzv[0] = x[i];
         temp_xyzv[1] = y[i];
         temp_xyzv[2] = z[i];
         sum_product_xyz += x[i] * y[i] * z[i];
         SSE_temp_xyzv = _mm_load_ps(temp_xyzv);
         SSE_sum_xyzv = _mm_add_ps(SSE_sum_xyzv, SSE_temp_xyzv);
         SSE_sum_square_xyzv = _mm_add_ps(SSE_sum_square_xyzv,
                                          _mm_mul_ps(SSE_temp_xyzv, SSE_temp_xyzv));
      }
      _mm_storeu_ps(&A[0], SSE_sum_xyzv);
      _mm_storeu_ps(&A[3], SSE_sum_square_xyzv);
      A[9] = sum_product_xyz;
   }
   {
      __m128 SSE_temp_xxyv;
      __m128 SSE_temp_yzzv;
      __m128 SSE_sum_xy_xz_yz_vv = _mm_setzero_ps();
      for (size_t i = 0; i < N; ++i)
      {
         {
            __declspec(align(16)) float temp_xxyv[4];
            temp_xxyv[0] = x[i];
            temp_xxyv[1] = x[i];
            temp_xxyv[2] = y[i];
            SSE_temp_xxyv = _mm_load_ps(temp_xxyv);
         }
         {
            __declspec(align(16)) float temp_yzzv[4];
            temp_yzzv[0] = y[i];
            temp_yzzv[1] = z[i];
            temp_yzzv[2] = z[i];
            SSE_temp_yzzv = _mm_load_ps(temp_yzzv);
         }
         SSE_sum_xy_xz_yz_vv = _mm_add_ps(SSE_sum_xy_xz_yz_vv,
                                          _mm_mul_ps(SSE_temp_xxyv, SSE_temp_yzzv));
      }
      _mm_storeu_ps(&A[6], SSE_sum_xy_xz_yz_vv);
   }
}

const size_t N = 1000;
float x[N];
float y[N];
float z[N];
float v[N];
float A[N*10];

int _tmain(int argc, _TCHAR* argv[])
{
   // reference variables so optimization doesn't eliminate code
   for (int i = 0; i < N; ++i)
   {
      x[i] = i; y[i] = i; z[i] = i; v[i] = i;
   }
   reduce (N, x, y, z, v, A);
   cout << A[0] << endl;
   return 0;
}[/cpp]

Jim Dempsey


Dale
