Solved: OpenMP breaks auto-vectorization

hpcmango · ‎06-26-2009

Hi,

For quite some time I have regularly encountered the effect that loops are no longer vectorized when they are inside an outer OpenMP-parallel loop. Vectorization works fine, though, if I remove the '#pragma omp parallel for'.

Example code:

#pragma omp parallel for
for(int i=1;i<sizeY-1;i++) {
  int curIndex=1+i*sizeX;
  for(int j=1;j<sizeX-1;j++) {
    dataB[curIndex]=0.1*(dataA[curIndex-1]+dataA[curIndex+1]+dataA[curIndex-sizeX]+dataA[curIndex+sizeX])+0.6*dataA[curIndex];
    curIndex++;
  }
  curIndex+=2;
}

The vectorization report says:

***.cxx(38): (col. 5) remark: loop was not vectorized: not inner loop.
***.cxx(40): (col. 7) remark: loop was not vectorized: existence of vector dependence.
***.cxx(41): (col. 2) remark: vector dependence: assumed FLOW dependence between dataB line 41 and dataB line 41.
***.cxx(41): (col. 2) remark: vector dependence: assumed ANTI dependence between dataB line 41 and dataB line 41.

If I use a '#pragma ivdep' before the inner loop, I get:

***.cxx(38): (col. 5) remark: loop was not vectorized: not inner loop.
***.cxx(41): (col. 7) remark: loop was not vectorized: unsupported data type.

If I additionally use a '#pragma vector always' before the inner loop, I still get the same result.

I did this with compiler version 11.0 on x86_64 Linux, but I remember the result being much the same for 10.0 and 10.1.

Can anyone explain this to me? I don't see a reason why vectorization should not work here. Is there a way to fix the problem, e.g. by pragmas?

Best

Oliver

TimP · ‎06-26-2009

We had a case where the poster gave a complete working example on the forum. In that case, 2 steps were required to fix it:
1) upgrade to 11.1
2) set -inline-max-size=50 (this value was low enough to stop in-lining of a function with omp parallel)

Even though 11.1 is intended to be less aggressive on in-lining than 10.1 and 11.0, in that case it still needed the option to help out.

Do you still get vectorization without -openmp but no vectorization with -openmp, if you turn off in-lining?

A complete case would be required to see how you have set it up so that the compiler doesn't have to be concerned about aliasing between dataA and dataB when you don't set -openmp, but is concerned when -openmp is set. If those were function parameters, appropriate restrict qualifiers would be needed. It's possible that the analysis might be affected by a change from default static allocation without -openmp to stack allocation with -openmp, or by the compiler correctly stopping in-lining when you set -openmp.

hpcmango · ‎06-26-2009

Thanks for the answer.

Unfortunately upgrading to 11.1 will take a week or so, because I'm not the admin on the machine with icc. I will try it of course, when the upgrade is done.

I tried -inline-max-size=50 with version 11.0, but it didn't help. I also wonder how it could have an effect, since it seems to concern function inlining and my code only has a main.

While playing around, I found another way to get it to vectorize. I am using a timer C++ class (which simply wraps the POSIX high-resolution timers for convenience) to measure time. If I remove the usage of this class (and inline the timer code into main instead), vectorization also works with OpenMP.

Really strange. The timer code is completely outside the region of interest. Is it possible that OpenMP or vectorization doesn't like object-oriented programming and shuts down completely when it is used?

--- Main.cxx ---
// header names were lost in the forum post; these are the ones the code below requires
#include <stdio.h>
#include <stdlib.h>

#include <omp.h>

#include "Timer.hxx"

int main()
{
  #pragma omp parallel
  {
    printf("OpenMP thread = %i/%i.\n",omp_get_thread_num(),omp_get_num_threads());
  }

  const int sizeX = 8192;
  const int sizeY = 8192;
  const int loops = 100;

  float* __restrict dataA;
  float* __restrict dataB;

  int dataSize=sizeof(float)*sizeX*sizeY;

  dataA=(float*)malloc(dataSize);
  dataB=(float*)malloc(dataSize);

  for(int i=0;i<sizeY;i++) {
    for(int j=0;j<sizeX;j++) {
      dataA[i*sizeX+j]=0;
    }
  }
  dataA[(sizeY/2)*sizeX+(sizeX/2)]=1;

  Timer timer;
  for(int iLoop=0;iLoop<loops;iLoop++) {
    #pragma omp parallel for
    for(int i=1;i<sizeY-1;i++) {
      int curIndex=1+i*sizeX;
      for(int j=1;j<sizeX-1;j++) {
        dataB[curIndex]=0.1*(dataA[curIndex-1]+dataA[curIndex+1]+dataA[curIndex-sizeX]+dataA[curIndex+sizeX])+0.6*dataA[curIndex];
        curIndex++;
      }
      curIndex+=2;
    }

    #pragma omp parallel for
    for(int i=1;i<sizeY-1;i++) {
      int curIndex=1+i*sizeX;
      for(int j=1;j<sizeX-1;j++) {
        dataA[curIndex]=0.1*(dataB[curIndex-1]+dataB[curIndex+1]+dataB[curIndex-sizeX]+dataB[curIndex+sizeX])+0.6*dataB[curIndex];
        curIndex++;
      }
      curIndex+=2;
    }
  }
  double duration=timer.get();
  fprintf(stderr,"Time = %g s, Performance = %g FLOPS\n",duration,6.*(sizeX-1)*(sizeY-1)*2*loops/duration);

  fprintf(stderr,"\n");
  for(int i=sizeY/2-5;i<=sizeY/2+5;i++) {
    for(int j=sizeX/2-5;j<=sizeX/2+5;j++) {
      fprintf(stderr,"%f ",dataA[i*sizeX+j]);
    }
    fprintf(stderr,"\n");
  }

  free(dataA);
  free(dataB);

  return 0;
}

--- Timer.hxx ---
#ifndef om_timer_hxx_
#define om_timer_hxx_

// header name was lost in the forum post; clock_gettime is declared in <time.h>
#include <time.h>

class Timer {
public:
  Timer() {
    reset();
  }
  void reset() {
    clock_gettime(CLOCK_MONOTONIC,&m_Timespec);
  }
  double get() {
    struct timespec endTimespec;
    clock_gettime(CLOCK_MONOTONIC,&endTimespec);
    return (endTimespec.tv_sec-m_Timespec.tv_sec)+
           (endTimespec.tv_nsec-m_Timespec.tv_nsec)*1e-9;
  }
  double getAndReset() {
    struct timespec endTimespec;
    clock_gettime(CLOCK_MONOTONIC,&endTimespec);
    double result=(endTimespec.tv_sec-m_Timespec.tv_sec)+
                  (endTimespec.tv_nsec-m_Timespec.tv_nsec)*1e-9;
    m_Timespec=endTimespec;
    return result;
  }
private:
  struct timespec m_Timespec;
};

#endif

jimdempseyatthecove · ‎06-26-2009

What happens when you place each parallel for into a separate function, then compile with and without IPO?

Jim Dempsey

jimdempseyatthecove · ‎06-26-2009

I forgot to mention, have you tried #pragma vector always on the inner loop?

Jim

TimP · ‎06-26-2009

My copies of icpc 11.0/083 and 11.1 for intel64 vectorize both parallel loops, when -ansi-alias is set. If you don't set that flag, you are telling the compiler that you may have violated the standards on data type aliasing.
You may argue that there is nothing here in the line of aliasing (such as the possibility of your float data updates over-writing curIndex) which the compiler should be concerned about, but I'll leave it to you if you wish to submit a report on premier.intel.com to make that case.

As near as I can find out, some BSD variants use the keyword __restrict, but it would be ignored by icpc. It's mentioned in Microsoft documentation, but I haven't found a Microsoft or Intel compiler which observes it. It doesn't appear to make the difference here; the compiler apparently can see that you have malloc'd 2 distinct regions.

hpcmango · ‎06-29-2009

Thanks, Tim.

Using '-ansi-alias' seems to be the simplest solution. However, since it makes the optimization work by assuming less type aliasing, I fear it could easily break again, e.g. by using ints and floats inside the timer class.

Altogether I also get the impression that this should be considered a compiler bug.

@Jim: no, '#pragma vector always' doesn't help. What does help, though, is moving the 'parallel for' section to a separate function and putting this function into a '#pragma auto_inline off' section.

Best,

Oliver

jimdempseyatthecove · ‎06-29-2009

Oliver,

Good workaround. I've found that pushing code out of line works around other OpenMP problems as well. If your inner loop has a significant iteration count, the call overhead shouldn't be too bad. I would consider this a problem in the optimization code. If you can submit a simple code sample to premier support, they should be able to identify the problem and fix it.

Jim

TimP · ‎06-29-2009

Situations are common where it's not possible to optimize without -ansi-alias. It would be poor practice to write code which violates the standard which has been in effect for 20 years, and has been the default requirement in all common compilers except Microsoft's for 10.
I don't think you're clear on which aspect of this you wish to consider a bug, but you're welcome to file a bug report. I don't think Intel will adopt consistency with gcc or g++ when it conflicts with Microsoft.
A possible feature request might be to fix the vec-report2 so it says "this loop is not vectorizable on account of -no-ansi-alias."
I've never seen anyone propose a treatment more like HP's C, so I don't think that would be popular enough to be considered.

hpcmango · ‎06-30-2009

Quoting - tim18

Situations are common where it's not possible to optimize without -ansi-alias. It would be poor practice to write code which violates the standard which has been in effect for 20 years, and has been the default requirement in all common compilers except Microsoft's for 10.
I don't think you're clear on which aspect of this you wish to consider a bug, but you're welcome to file a bug report. I don't think Intel will adopt consistency with gcc or g++ when it conflicts with Microsoft.
A possible feature request might be to fix the vec-report2 so it says "this loop is not vectorizable on account of -no-ansi-alias."
I've never seen anyone propose a treatment more like HP's C, so I don't think that would be popular enough to be considered.

Hi Tim,

of course it is debatable what exactly the bug is.

What disturbs me is that the behaviour of the compiler is quite unpredictable. Apparently unrelated pieces of code (the timer object, OpenMP pragmas) break the vectorization, which works fine in the simple case.

Why should '-ansi-alias' be required with OpenMP but not without?

Ideally, I would like the compiler to figure out that these things do not affect whether the loop can be assumed vectorizable.

Best,

Oliver

TimP · ‎06-30-2009

Quoting - hpcmango

Why should '-ansi-alias' be required with OpenMP but not without?

In general, violations of -ansi-alias could create race conditions which break OpenMP as well as -parallel. I don't have high expectations for compilation without -ansi-alias.
I do agree that the compiler should be less obscure about which optimizations are disabled by default, as well as which options are needed for consistency with other compilers. I put -ansi-alias in icc.cfg and icpc.cfg, so as not to have to remember to set it on command line.
The first criterion often seems to be not to miss optimizations which MSVC performs, and vectorization is not one of those.

jimdempseyatthecove · ‎06-30-2009

What happens if you remove the __restrict keyword?
The two data array pointers are malloc'd within scope, and malloc returns a pointer with the no-alias property. Therefore the two pointers will inherit no-alias (and the loop should be vectorizable).

Jim

TimP · ‎06-30-2009

Quoting - jimdempseyatthecove

What happens if you remove the __restrict keyword?
The two data array pointers are malloc'd within scope, and malloc returns a pointer with the no-alias property. Therefore the two pointers will inherit no-alias (and the loop should be vectorizable).

Jim

I was surprised that icc accepted __restrict, but I checked that it made no difference in this example. I already pointed out that the compiler sees that the malloc'd regions aren't aliased, at least when -ansi-alias is set.
According to tests I just made, the current icc does accept restrict, __restrict, and __restrict__ as interchangeable (except that no command line option is needed to accept the latter two spellings), although that is an undocumented feature.

hpcmango · ‎07-01-2009

Quoting - jimdempseyatthecove

What happens if you remove the __restrict keyword?
The two data array pointers are malloc'd within scope, and malloc returns a pointer with the no-alias property. Therefore the two pointers will inherit no-alias (and the loop should be vectorizable).

Jim

Removing restrict doesn't make a difference here.

Interesting to know though that the compiler understands what malloc means.

I still wonder why the ANSI aliasing rules help in my example. If this is all it is about

http://publib.boulder.ibm.com/infocenter/zos/v1r9/index.jsp?topic=/com.ibm.zos.r9.cbcpx01/optalias.htm

it's just type-based aliasing which can be resolved more easily. And in my example the only floats I use are for the arrays 'dataA' and 'dataB' (which are additionally declared restrict).

Maybe the reason is that the compiler writers expect '-ansi-alias' to be present in the typical use case, so the optimization algorithms for '-no-ansi-alias' mode are somewhat poorly maintained?

jimdempseyatthecove · ‎07-01-2009

experiment with

__restrict float* dataA;
__restrict float* dataB;

Note, the above juxtapositioning is the same as when you use __declspec(restrict) on a function returning a non-aliased pointer. Although the above may not follow the C++ specification, it may identify that the compiler writer followed the __declspec(restrict) format.

I imagine someone on the ANSI committee argued (or should have argued) for __noalias.

Jim

TimP · ‎07-01-2009

The -no-ansi-alias default tells the compiler not to rely on your code complying with the typed aliasing rules (unless it can "prove" you didn't violate them). Several issues may be involved, such as how much compile time might potentially be taken (on some much larger case) looking for optimizations which could have been done easily under -ansi-alias.
restrict applies to global pointers, so it's not too surprising it doesn't make a difference here.

jimdempseyatthecove · ‎07-01-2009

>> restrict applies to global pointers, so it's not too surprising it doesn't make a difference here.

Correct, these are local pointers, assigned once each via malloc and thus should have no-alias attribute (without __restrict).

What is likely happening is that the copies of the dataA and dataB pointers made on the stack for each OpenMP thread lose the no-alias attribute.

The reason for this (my guess) is that the code sample given has two parallel regions (parallel for)

one of which is reading from dataA and writing to dataB
and
one of which is reading from dataB and writing to dataA

Although the parallel for's are missing NOWAIT and therefore each region runs separately, I believe the compiler may mistakenly think that they can run concurrently and thus require write-through (never mind the fact that temporal issues would occur if these sections were to run concurrently).

Jim Dempsey

hpcmango · ‎07-01-2009

The reason the compiler actually gave, was:

***.cxx(41): (col. 2) remark: vector dependence: assumed FLOW dependence between dataB line 41 and dataB line 41.

When adding '#pragma ivdep' it changed to:

***.cxx(41): (col. 7) remark: loop was not vectorized: unsupported data type.

Especially the second one sounds strange. No clue what kind of 'unsupported data type' could be meant.

@Jim: removing one of the parallel for sections doesn't help

jimdempseyatthecove · ‎07-01-2009

Try replacing curIndex with the loop control variable j

#pragma omp parallel for
for(int i=1;i<sizeY-1;i++){
  // was int curIndex=1+i*sizeX;
  // was for(int j=1;j<sizeX-1;j++) {
  int jBegin = 1+i*sizeX;
  int jEnd = jBegin + sizeX - 2; // check ranges
  for(int j=jBegin;j<jEnd;j++){
    dataB[j]=0.1*(dataA[j-1]+dataA[j+1]+dataA[j-sizeX]+dataA[j+sizeX])+0.6*dataA[j];
  }
}

Verify jEnd is correct.
Then change second loop.

The vectorization may work better when the index is relative to the loop control variable.

Jim

hpcmango · ‎07-02-2009

Quoting - jimdempseyatthecove

The vectorization may work better when the index is relative to the loop control variable.

Jim

Hi Jim,

I tried, but it doesn't seem to make a difference either.

I found more OpenMP+vectorization fun though:

- in the original code if I remove the 'parallel for' pragma vectorization works
- but if I move the OpenMP section to a separate function inside a 'auto_inline off' pragma, as shown below, it's exactly the other way round: with the 'parallel for' vectorization works, when I remove it, it doesn't

#pragma auto_inline off
void calc(float* dataIn, float* dataOut,
          int sizeX,int sizeY) {
  #pragma omp parallel for
  for(int i=1;i<sizeY-1;i++) {
    int curIndex=1+i*sizeX;
    for(int j=1;j<sizeX-1;j++) {
      dataOut[curIndex]=0.1*(dataIn[curIndex-1]+dataIn[curIndex+1]+dataIn[curIndex-sizeX]+dataIn[curIndex+sizeX])+0.6*dataIn[curIndex];
      curIndex++;
    }
  }
}
#pragma auto_inline on

jimdempseyatthecove · ‎07-02-2009

Quoting - hpcmango

Hi Jim,

I tried, but it doesn't seem to make a difference either.

I found more OpenMP+vectorization fun though:

- in the original code if I remove the 'parallel for' pragma vectorization works
- but if I move the OpenMP section to a separate function inside a 'auto_inline off' pragma, as shown below, it's exactly the other way round: with the 'parallel for' vectorization works, when I remove it, it doesn't

#pragma auto_inline off
void calc(float* dataIn, float* dataOut,
          int sizeX,int sizeY) {
  #pragma omp parallel for
  for(int i=1;i<sizeY-1;i++) {
    int curIndex=1+i*sizeX;
    for(int j=1;j<sizeX-1;j++) {
      dataOut[curIndex]=0.1*(dataIn[curIndex-1]+dataIn[curIndex+1]+dataIn[curIndex-sizeX]+dataIn[curIndex+sizeX])+0.6*dataIn[curIndex];
      curIndex++;
    }
  }
}
#pragma auto_inline on

Ha ha ha ha :)

It would seem that a rotated RPN is in effect - eh? (evaluate right/bottom to left/top) ;)

Jim