Solved: is it possible to auto-vectorize and auto-parallelize this code

heavenbird · ‎07-25-2011

Hi,

I have the following function that checks whether a matrix is symmetric:

bool chk_symm(double* data, int row)

{

double summation __attribute__((aligned(16)));

summation =0;

__assume_aligned(data,64);

#pragma parallel

for(int ii=0;ii<row;ii++)

{

for(int jj=ii+1;jj<row;jj++)

summation +=abs(data[ii*row+jj]-data[jj*row+ii]);

}

return(!(summation>0));

};

The XE2011 compiler could not auto-parallelize or auto-vectorize this code. Can someone give me some advice?

The compiler complains:

Existence of parallel dependence

Existence of vector dependence

Unsupported loop structure

Or do I have to use OpenMP to divide the workload explicitly?

Thanks in advance !

Haining

TimP · ‎07-26-2011

With several compilers I tried, I get complaints about the overloading of abs() or about the attributes extensions.
We found it quite difficult to make a compilable copy from your prettified display copy, so it's possible your copy is different.
With icpc 12.0.4,
icpc -par-report -vec-report -parallel -c -par-threshold50 haining.cpp

haining.cpp(11): (col. 7) remark: LOOP WAS AUTO-PARALLELIZED.
haining.cpp(11): (col. 7) remark: LOOP WAS VECTORIZED.
haining.cpp(11): (col. 7) remark: LOOP WAS VECTORIZED.

Evidently, the auto-parallelizer isn't confident of its ability to speed up your code, so, if your goal is simply to see the
AUTO-PARALLELIZED remark without regard to performance, here it is.

Depending on your platform, and whether it stands to benefit from schedule(guided) or the like, you may find that
you need to balance work among threads explicitly. You certainly can't expect the compiler to determine the best
platform-specific choice about parallelization.

Om_S_Intel · ‎07-25-2011

Yes, icc can do the auto-vectorization. To check what has been vectorized, you can use the -vec-report compiler option.

// tstcase.cpp

#include

bool chk_symm(double* data, int row)

{

double summation __attribute__((aligned(16)));

summation =0;

__assume_aligned(data,64);

#pragma parallel

for(int ii=0;ii<row;ii++)

{

for(int jj=ii+1;jj<row;jj++)

summation += abs(data[ii*row+jj]-data[jj*row+ii]);

}

return(!(summation>0));

};

$ icc -c -xSSE3 -vec-report3 tstcase.cpp

tstcase.cpp(15): (col. 10) remark: LOOP WAS VECTORIZED.

tstcase.cpp(13): (col. 7) remark: loop was not vectorized: not inner loop.

heavenbird · ‎07-26-2011

Hello TimP, Om Sachan,

Thank you for your help. I tried different compiler flags and finally found the reasons why I could not get auto-vectorization and auto-parallelization to work:

(1) The -fp-model double flag disables all auto-parallelization and auto-vectorization.

(2) The -xSSE4.2, -xSSE4.1, and -xSSE3 flags enable auto-vectorization but disable auto-parallelization; the compiler considers there to be insufficient computational work.

(3) The -xSSE2 flag enables both auto-vectorization and auto-parallelization.

(4) The -axSSE4.2 flag enables both auto-vectorization and auto-parallelization.

(5) The -fp-model fast flag (which is the default) enables both auto-vectorization and auto-parallelization.

As I understand it, -axSSE4.2 generates code paths for several instruction sets, so auto-parallelization is enabled for some of them, such as SSE2.

In my case I had the -fp-model double flag set because I wasn't sure how accurate the fast mode is.

I wish the next version of the compiler guide had a chapter on how compiler flags affect the auto-vectorizer and auto-parallelizer.

Appreciate your help !

Haining

heavenbird · ‎07-26-2011

Also:

-nolib-inline will create a vector dependence and disable auto-vectorization.

Thanks,

Haining

TimP · ‎07-27-2011

Icpc will parallelize at -par-threshold0 when -xSSE4.1 is set. Presumably, the compiler sees even less chance of benefit in parallelization due to the more efficient vectorization.
Most of the -fp-model options disable vectorization of sum reductions, due to the likelihood of different (usually more accurate)
numerical results with the batching of additions. This can't affect the results in this case, but the compiler doesn't make such a
distinction. As your data types are all double, -fp-model double would not have a different effect from -fp-model source.
I'd like to see an option to allow vectorization of cases such as this while still observing parentheses, such as current gcc has,
and ifort has.
If you have a policy of using -fp-model source or equivalent options, I can understand why, but it won't make any difference
in this example aside from the effect on vectorization.
You could use the pragma simd to over-ride fp-model (in effect setting -fp-model fast -ansi_alias one loop at a time, also including the effects of #pragma vector always and #pragma ivdep):

    #pragma simd reduction(+: summation)
    for (int jj = ii + 1; jj < row; jj++)
      summation += abs (data[ii * row + jj] - data[jj * row + ii]);

allowing vectorization here, but this apparently doesn't over-ride the effect of fp-model on auto-parallelization.
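
To make the reordering concrete: a vectorized sum reduction keeps one partial sum per vector lane and combines them at the end, so the additions happen in a different order than the source spells out. The sketch below imitates that by hand with two lanes. It is an illustration only, not what the compiler literally emits, and the function names are made up.

```cpp
#include <cstddef>

// Plain left-to-right sum: the order the source code spells out.
double sum_in_order(const double* v, std::size_t n)
{
    double s = 0.0;
    for (std::size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

// Hand-written stand-in for a 2-lane vectorized reduction (n assumed even):
// even and odd elements go to separate partial sums, combined at the end.
double sum_two_lanes(const double* v, std::size_t n)
{
    double lane0 = 0.0, lane1 = 0.0;
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        lane0 += v[i];
        lane1 += v[i + 1];
    }
    return lane0 + lane1;
}
```

With strict IEEE double arithmetic (no fast-math style reassociation), the input {1e16, 1.0, -1e16, 1.0} gives 1 in order but 2 with two lanes, because 1e16 + 1 rounds back to 1e16. That kind of difference is what the stricter -fp-model settings guard against, even though a sum of absolute values tested only against zero, as in chk_symm, is not affected.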

We have had fairly lengthy discussions about the advice on use of pragmas for optimization, which I alluded to in previous posts,
but the question seems to have enough marketing involved that I don't expect this thread to have an influence,
even though g++ achieves many of the optimizations without pragmas (but of course often needs the __restrict pointer
extension).
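
As a rough illustration of the __restrict point, here is a minimal sketch. The function scale_copy is hypothetical and not from this thread. __restrict is a compiler extension accepted by g++, icpc, and clang++, not standard C++.

```cpp
// Hypothetical example: __restrict promises the compiler that dst and src
// never overlap, so the loop carries no dependence through memory and the
// vectorizer does not have to assume one (or emit a runtime overlap check).
void scale_copy(double* __restrict dst, const double* __restrict src,
                int n, double factor)
{
    for (int i = 0; i < n; i++)
        dst[i] = factor * src[i];
}
```

The promise is the caller's responsibility: passing overlapping arrays to such a function is undefined behavior. In chk_symm itself there is only one pointer and a scalar reduction, so aliasing is not the obstacle there.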
