Intel® Moderncode for Parallel Architectures
Support for developing parallel programming applications on Intel® Architecture.

Compiler optimization conflicts with OpenMP parallelization

christian_volz
Beginner
Hello,

I have a severe problem with the Intel C++ compiler when compiling my OpenMP-parallelized program with optimization flags enabled (/O1 or similar). In this case some OpenMP worksharing constructs (#pragma omp for) no longer work! I have already noticed that I am not the first to encounter this problem, but I could not find any solution for this apparent bug so far.

My code looks like this:

#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 1000; i++)
    {
        // do anything...
    }
}

Without optimization enabled, this worksharing for construct works as expected: the loop iterations are divided among the threads. But as soon as I activate the optimization flag /O1, this no longer works correctly. Now every thread executes ALL loop iterations, i.e. the for worksharing construct no longer seems to take effect.

Does anyone know how to solve this problem? Is there a bug fix? I am using Intel C++ Compiler 10.1 on a Windows quad-core machine.

Best wishes,
Christian
Dmitry_Vyukov
Valued Contributor I
I think you had better ask this question on the Intel C++ Compiler Forum:
http://software.intel.com/en-us/forums/


jimdempseyatthecove
Honored Contributor III

Christian,

Place the declaration of "i" inside the scope of the parallel construct:

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < 1000; i++)

or

#pragma omp parallel
{
    int i;
    #pragma omp for
    for (i = 0; i < 1000; i++)

or

#pragma omp parallel private(i)
{
    #pragma omp for
    for (i = 0; i < 1000; i++)

Jim Dempsey

christian_volz
Beginner
Hi,

Yes, you are right: the loop variable "i" must be declared private. But this does not help with my problem regarding the compiler bug.

One solution I am trying is to program the for worksharing construct by hand. But surely this cannot be the best answer, and it is only feasible as long as I just need static loop scheduling.

Thanks for all the replies, especially for the hint about the Intel compiler forum!

Christian
TimP
Honored Contributor III
Nothing in what you show differs from countless examples that work well. We would need a complete example.
christian_volz
Beginner
In the compiler forum I got confirmation that this is a bug that remains unfixed. For now I will have to replace the for worksharing constructs by hand until Intel is hopefully able to fix the problem. This will produce rather nasty-looking code in my application, but at least I am happy to know the source of my problem.

You can find a more useful and reproducible example of the problem here:
http://software.intel.com/en-us/forums//topic/59895

Best regards,
Christian
jimdempseyatthecove
Honored Contributor III

Christian

Try:

#pragma omp parallel for private(i)
for (i = 0; i < 1000; i++)
{
    // do anything...
}

That is, remove the "#pragma omp parallel" and run the loop as a parallel for.

BTW, MS VC++ does not show this problem.

Do not use a "#pragma omp parallel for" inside a "#pragma omp parallel" unless you have nested parallelism enabled .AND. you want to spawn additional threads.
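
For illustration, a minimal sketch of the nested case (the thread counts and the printf are just an example, not from Christian's code):

#include <omp.h>
#include <stdio.h>

int main()
{
    omp_set_nested(1);                  // allow nested regions to spawn new teams
    #pragma omp parallel num_threads(2) // outer team of 2 threads
    {
        int outer = omp_get_thread_num();
        // nested parallel for: each outer thread starts its own inner team;
        // with omp_set_nested(0) the inner regions run with one thread each
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < 4; ++i)
            printf("outer %d, inner %d, i = %d\n", outer, omp_get_thread_num(), i);
    }
    return 0;
}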

If you do have a compiler bug then you will have to experiment to work around the problem.

When this problem showed up for Intel Visual Fortran (several years ago), you could work around it (parallel for inside a parallel region) by placing the parallel for into a function and calling that function:

#pragma omp parallel
{
    // ... do work
    hack(a, b, c); // performs the worksharing for
    // ... other work
}

...

void hack(int a, int b, int c)
{
    int i;
    #pragma omp for private(i)
    for (i = 0; i < 1000; ++i)
    {
        // do work using a, b, c
    }
}

Jim Dempsey

christian_volz
Beginner
Hi Jim,

thanks for your reply.

Your suggestion would probably solve the problem. But such an approach seems not very efficient for my code (it is a numerical simulation), since I have plenty of loops in it. If I created a parallel region for each loop, this would produce a lot of parallel overhead and prevent good, scalable performance. The parallel performance is much better if I create only one large parallel region and then use worksharing constructs within it.
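
For illustration, here is a minimal sketch of the structure I mean (the function and array names are placeholders): the parallel region is entered once, and each loop shares its work via "#pragma omp for", so the per-loop cost is only the implicit barrier at the end of each worksharing construct rather than a full region entry and exit.

void step(double* a, const double* b, const double* c, double* d, int n)
{
    #pragma omp parallel            // thread team created/woken once
    {
        #pragma omp for
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + c[i];     // implicit barrier at the end of this for

        #pragma omp for
        for (int i = 0; i < n; ++i)
            d[i] = 2.0 * a[i];      // same team reused; no second region entry
    }
}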

I can also confirm that MS VC++ does not show this bug. An alternative would therefore be to use that compiler, but it seems much slower, especially regarding the parallel speedup.

Regarding this bug in the Intel compiler, it is rather strange that not all for worksharing constructs show this behaviour. I could not figure out why some loops are parallelized correctly whereas others are not. If someone (from Intel?) could trace this bug and explain the cause of the faulty behaviour, it would probably help me solve the problem.

At the moment the only workaround suitable for me seems to be programming the static for worksharing by hand. With the following code it seems to work, but it is rather nasty looking and must be inserted many times throughout the code.

#pragma omp parallel
{
    ...

    // determine the lower and upper iteration bounds for this thread
    int total_nr_threads = omp_get_num_threads();
    int thread_nr = omp_get_thread_num();
    int chunk = (size + total_nr_threads - 1) / total_nr_threads;
    int istart = thread_nr * chunk;
    int iend = ((thread_nr + 1) * chunk) < size ? ((thread_nr + 1) * chunk) : size;

    // execute just these iterations in this thread
    for (int i = istart; i < iend; i++)
    {
        // do anything....
    }
    ....
} // end parallel

Christian
TimP
Honored Contributor III
Intel OpenMP threads are persistent, so that it doesn't normally matter that you end a parallel region and start another. In my tests, the claim that Intel OpenMP libraries improve the performance of VC9 OpenMP has proven valid; possibly it is partly on account of this feature.
I note from the other copy you started on the C++ forum that Pat Kennedy has verified an example of the problem you report. You make it confusing by posting the same issue on multiple forums; if you change your mind about which forum is appropriate, why not remove the duplicates?
christian_volz
Beginner
tim18:
Intel OpenMP threads are persistent, so that it doesn't normally matter that you end a parallel region and start another.

I am not absolutely sure about this, but I think Microsoft's OpenMP threads are also persistent. However, if you start the parallel regions very often (my parallelized loops are inside a larger outer time loop, which causes them to be executed hundreds of times), there still seems to be some parallel overhead. Some sort of activation and deactivation of the threads must always take place; this probably causes the overhead I observe.


I note from the other copy you started on the C++ forum that Pat Kennedy has verified an example of the problem you report. You make it confusing by posting the same issue on multiple forums; if you change your mind about which forum is appropriate, why not remove the duplicates?

First of all, I did not change my mind. I simply got a recommendation here to post my problem in an Intel compiler forum, and obviously that was a good hint. That is the reason why two copies of my issue now exist. True, this thread could be deleted, but I do not have a problem with two existing posts, because it increases the chance that someone with an easy and smart workaround will be found.

Christian
jimdempseyatthecove
Honored Contributor III

Christian,

While I do not suggest you go to the effort of converting your application to Intel Threading Building Blocks (TBB), you might want to lift the concept they use, that of an iterator (or a set of iterators). Create something like:

iterator i(size);             // (i = 0; i < size; ++i)
iterator i(from, to);         // (i = from; i <= to; i += direction)
iterator i(from, to, stride); // (i = from; i <= to; i += stride)
etc...

Then use it as:

#pragma omp parallel
{
    ...

    // iterate this thread's portion
    iterator i(size);
    while (i.next())
    {
        // do anything....
    }
    ....
} // end parallel

You should be capable of writing the iterator class and member functions. Inlining the functions/ctor/dtor would produce code as efficient as directly using #pragma omp for (assuming it works), and it would give you further control, depending on how well you write your iterator class. Take the time to examine the TBB iterator classes to glean some ideas, and look at the STL for how it does its iterators. You should be able to produce a nice lightweight iterator that meets your needs.
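
For illustration, a minimal sketch of such an iterator (the class name and the next()/value() interface are illustrative, not TBB or STL types). It statically partitions [0, size) among the threads of the enclosing parallel region, much like schedule(static):

#include <omp.h>

class iterator
{
    int cur;    // current index; next() pre-increments
    int end;    // one past this thread's last index
public:
    explicit iterator(int size)
    {
        int nthreads = omp_get_num_threads();
        int tid      = omp_get_thread_num();
        int chunk    = (size + nthreads - 1) / nthreads;   // ceiling division
        int begin    = tid * chunk;
        cur = begin - 1;                                   // next() advances first
        end = (begin + chunk < size) ? begin + chunk : size;
    }
    bool next() { return ++cur < end; }  // advance; true while inside this thread's chunk
    int  value() const { return cur; }   // the current loop index
};

Inside the parallel region you would then write:

#pragma omp parallel
{
    iterator i(1000);
    while (i.next())
    {
        // do work with i.value()
    }
}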

Jim Dempsey
