Intel® Fortran Compiler

Different behaviour depending on which thread computes which loop iteration (OpenMP PARALLEL DO)

aurora
Beginner
Hi,
I have a parallel loop like this one:
!$OMP PARALLEL DO NUM_THREADS(2) DEFAULT(SHARED)
DO J = 1, X
!$OMP CRITICAL
   aux = aux + f(...some parameters...)
!$OMP END CRITICAL
END DO
!$OMP END PARALLEL DO
In f(), there are some "double precision" and "integer" declarations, plus the parameter variables, and it computes a number.
The thing is that "aux" depends on which thread computes each f() call. That is, if thread 0 computes iterations J=1,2,3,4 and thread 1 computes iterations J=5,6,7,8, aux is always the same. But when the iterations are split differently between the threads, the results differ.
So, to get the same "aux" result between two executions of the program, thread 0 has to compute exactly the same iterations. What could produce this behaviour (THREADPRIVATE declarations, for example)?
Thanks in advance
8 Replies
TimP
Honored Contributor III
Variations in roundoff behavior are inherent in reduction operations. In the context you posted, it looks like you would need FIRSTPRIVATE(aux) LASTPRIVATE(aux) to accomplish it with a critical section.
Assuming your function is correctly parallelized and has no side effects (and is compiled with compatible options or a RECURSIVE declaration), numerical variations are to be expected due to the varying order of addition.
The OpenMP REDUCTION clause might be more efficient and might have better numerical properties, as well as being simpler to write, than the critical section; it would also avoid the need for FIRSTPRIVATE/LASTPRIVATE.
If the code is correctly parallelized but still shows excessive roundoff variation from the varying order of addition, the simplest remedy would be to declare aux as double precision, if it isn't already.
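
For example, the reduction form of your loop might look like this (a sketch only; I am assuming aux is a scalar accumulator initialized before the loop, and the actual arguments to f are elided as in your post):

aux = 0.0d0
!$OMP PARALLEL DO NUM_THREADS(2) DEFAULT(SHARED) REDUCTION(+:aux)
DO J = 1, X
   aux = aux + f(...some parameters...)   ! each thread accumulates privately; partial sums are combined at the end
END DO
!$OMP END PARALLEL DO

Note that a reduction still does not guarantee bitwise-identical results between runs, since the partial sums may be combined in a different order.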
aurora
Beginner
Hi,
This is a simplification of the problem. In my real code, "aux" is a matrix, and each position in the matrix is accessed only once in the whole loop, so there is no reduction here, only a shared matrix.
I've also tested the code sequentially with a random order of iterations, and it works fine.
PS: The CRITICAL is only there to show that there is no concurrency problem. I think the problem is in function f() and the variables it declares on the stack (maybe garbage values that persist between iterations, or something like that).
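Roughly, the real loop looks more like this (a sketch; idx(J) is a hypothetical stand-in for the real index mapping, and the arguments to f are elided):

!$OMP PARALLEL DO NUM_THREADS(2) DEFAULT(SHARED)
DO J = 1, X
   aux(idx(J)) = f(...some parameters...)   ! each element is written by exactly one iteration
END DO
!$OMP END PARALLEL DO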
Any ideas?
TimP
Honored Contributor III
I did suggest that you check the function f() to ensure that it is compiled with options that guarantee a private stack. You may recall that Steve Lionel suggested that such functions be declared RECURSIVE so as to avoid those dependencies on compile options. When that is done, the order of threaded completion should have no more effect than a change in the order of sequential iterations.
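
For example (a sketch only; the real signature of f is not shown in this thread, so the argument here is made up):

RECURSIVE FUNCTION f(p) RESULT(val)
   DOUBLE PRECISION, INTENT(IN) :: p
   DOUBLE PRECISION :: val
   DOUBLE PRECISION :: t   ! a local: with RECURSIVE it lives on the stack, private to each call
   t = p * p               ! assign locals before use; otherwise they hold whatever garbage is on the stack
   val = t
END FUNCTION f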
jimdempseyatthecove
Honored Contributor III
Consider using the ORDERED directive:

!$OMP PARALLEL DO ORDERED NUM_THREADS(2) DEFAULT(SHARED)
DO J = 1, X
   ... ! possibly other work here
!$OMP ORDERED
   aux = aux + f(...some parameters...)
!$OMP END ORDERED
   ... ! possibly other work here
END DO
!$OMP END PARALLEL DO

Jim Dempsey
aurora
Beginner
Hi,
I thought that /recursive was implied by /Qopenmp. Anyway, compiling with /recursive doesn't solve the problem :(
aurora
Beginner
Which compilation flags should I use to ensure that all variables (static/stack/heap) are zero-initialized?
I have no SAVE statements or COMMON blocks.
Thanks in advance!
Les_Neilson
Valued Contributor II
Whilst it is not a good idea to rely on such settings (it is always better to change the code, even if it is a long slog to do so, or you could write a script to help you), /Qsave and /Qzero are the ones you want.

Note that /Qzero only initialises saved scalar variables; arrays you will have to do yourself, as in the sketch below.
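
For example (a sketch with made-up names), the explicit initialisation you would add by hand looks like:

DOUBLE PRECISION :: work(100)
INTEGER :: counts(10)
work = 0.0d0   ! /Qzero will not do this for arrays
counts = 0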

See the help for more details.

Les
TimP
Honored Contributor III
The combination /Qsave /Qzero works by removing the affected variables from the stack, so it does nothing to initialize the stack. It won't help you with parallelization; if anything, SAVEd locals become shared between threads, which makes things worse.