Erroneous results in non-parallel sections when building parallel code (OpenMP)

mdobrica · ‎01-08-2007

Hi, I have encountered this issue several times with IVF 9.1 on a Core 2 Duo processor, Win32. I get wrong results from calculations performed in loops outside parallel regions. This only happens when parallel code is generated (I have other parallel OpenMP regions). The loop in question is a kind of matrix-matrix multiplication:

DO ja=1,N; DO ia=1,M;
   DO jb=1,N; DO ib=1,M;
     A(ia,ja) = A(ia,ja) + K(M-ia+ib, N-ja+jb)*B(ib,jb);
   ENDDO;ENDDO;
ENDDO;ENDDO;

When this code is executed, the resulting matrix A may differ by several orders of magnitude from the correct results. This happens even if the matrices are small (M,N<100) and autoparallelization is turned off (but OpenMP directives are processed). As I said before, this only happens in non-parallel regions of the code.

I have found different workarounds to this problem, but I'm still not sure all my code is running as it should. Possible solutions were: 1) to change the looping order from (ja, ia, jb, ib) to (ja, jb, ia, ib); 2) to parallelize the code with !$OMP PARALLEL DO REDUCTION(+:A); 3) to set the "Improve FP consistency" compiler option (/Op).

I was wondering if anyone else has encountered this issue and if there is any known workaround that would ensure that such errors do not occur (since it seems to me this is quite a particular error). Thnx for having read all this :)

jimdempseyatthecove · ‎01-08-2007

If the above nested loop is experiencing problems when you believe it is not executed in parallel, but you have other OpenMP parallel sections in your program, then it is likely that your assumption is incorrect. For example, a preceding parallel section may have been terminated with a NOWAIT.

Before the 1st DO insert the diagnostic code

if(OMP_IN_PARALLEL()) then
   STOP ! place break point here
endif

The above does not catch all such problems, as the master thread may have exited a parallel region (via NOWAIT) while a different thread is still processing not only array A but K and B as well, i.e. you get to the summation loop on A before the processing on K and B is complete.
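A slightly fuller form of that check, as a rough sketch only. The program name and the variables in_par and nthr are made up for illustration; omp_in_parallel and omp_get_num_threads are the standard OpenMP library routines. The lines starting with the !$ sentinel are compiled only when OpenMP processing is enabled, so the same source also builds as plain serial code. As noted above, this cannot detect another thread that is still working on K or B.

```fortran
program check_serial_context
  !$ use omp_lib
  implicit none
  logical :: in_par   ! hypothetical name: result of omp_in_parallel()
  integer :: nthr     ! hypothetical name: size of the current thread team

  ! Defaults used when the file is compiled without OpenMP processing.
  in_par = .false.
  nthr   = 1

  ! Lines with the !$ sentinel are compiled only when OpenMP is enabled.
  !$ in_par = omp_in_parallel()
  !$ nthr   = omp_get_num_threads()

  ! Place this test just before the first DO of the suspect loop nest.
  if (in_par .or. nthr /= 1) then
    print *, 'suspect loop reached inside an active parallel region, threads =', nthr
    stop 1   ! place break point here
  end if

  print *, 'suspect loop reached in a serial context'
end program check_serial_context
```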

Jim Dempsey

mdobrica · ‎01-09-2007

Hi Jim, thnx for your answer.

I have used the diagnostic code you suggested and I can confirm that there's no parallel execution in that loop. The loop is in a subroutine which is called after an OMP PARALLEL / OMP Sections / ... / OMP END PARALLEL block. I have played around with it a bit more, and I found no logical explanation for the behavior described above. I'll post more details, maybe some of you can help (or test this to see if you get the same behavior):

DO ja=1,N; DO ia=1,M; 
   DO jb=1,N; DO ib=1,M;
     A(ia,ja) = A(ia,ja) + K(M-ia+ib, N-ja+jb)*B(ib,jb);
   ENDDO;ENDDO;
ENDDO;ENDDO;

Loop order        Exec. time   Result
ja, ia, jb, ib    1.27s        ERR
ia, ja, jb, ib    1.27s        ERR
ja, jb, ia, ib    0.52s        ERR
ja, jb, ib, ia    2.23s        OK
jb, ja, ib, ia    2.23s        OK

Also, performing additional calculations between loops (like computing k=M-ia; l = N-ja) seems to change the loop order required for correct computations, but it doesn't solve the problem for all possible loop orders (when it does, execution time is 6.4s).

Now, if I introduce a temporary summation variable between loops 2 and 3, I do get correct results for all loop orders I have tested, execution time being 1.62s:

DO ja=1,N; DO ia=1,M;
   t_sum = 0.d0;
   DO jb=1,N; DO ib=1,M;
     t_sum = t_sum + K(M-ia+ib, N-ja+jb)*B(ib,jb);
   ENDDO;ENDDO;
   A(ia,ja) = A(ia,ja) + t_sum;
ENDDO;ENDDO;

However, all these tweaks do not guarantee correct execution for the rest of the code. At this point I'm evaluating the performance loss from using /Op (improving FP consistency), but it seems quite harsh; loop execution time goes up to 6.2s in single processing and 3.2s in OMP. So, I'm still in search of a better solution, if anyone can help. Thnx.

P.S. Forgot to say that if the loop is parallelized (OMP PARALLEL), I get correct results for all loop orders, best execution time being 0.97s for (ja, ia, jb, ib) order and 0.52s for (ja, jb, ia, ib) order.
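For anyone who wants to try the parallelized variant, here is a minimal sketch of one way to write it. This is not my production code: the sizes, the random test data and the reference array Aref are assumptions for illustration. It parallelizes over the outermost ja loop, so each thread writes only its own columns of A and no REDUCTION clause is needed in this form. The directive is an ordinary comment when OpenMP processing is off, so the same file also runs serially.

```fortran
program omp_outer_loop
  implicit none
  integer, parameter :: M = 20, N = 20      ! assumed small test sizes
  real(8) :: A(M,N), Aref(M,N), B(M,N), K(2*M-1,2*N-1)
  real(8) :: t_sum
  integer :: ia, ja, ib, jb

  call random_number(B)                     ! assumed test data
  call random_number(K)
  A    = 0.d0
  Aref = 0.d0

  ! Each thread owns whole columns A(:,ja), so no two threads
  ! ever update the same element of A.
  !$omp parallel do private(ia, jb, ib, t_sum)
  do ja = 1, N
    do ia = 1, M
      t_sum = 0.d0
      do jb = 1, N
        do ib = 1, M
          t_sum = t_sum + K(M-ia+ib, N-ja+jb)*B(ib,jb)
        end do
      end do
      A(ia,ja) = A(ia,ja) + t_sum
    end do
  end do
  !$omp end parallel do

  ! Serial reference with the same summation order.
  do ja = 1, N
    do ia = 1, M
      t_sum = 0.d0
      do jb = 1, N
        do ib = 1, M
          t_sum = t_sum + K(M-ia+ib, N-ja+jb)*B(ib,jb)
        end do
      end do
      Aref(ia,ja) = Aref(ia,ja) + t_sum
    end do
  end do

  print *, 'max abs difference:', maxval(abs(A - Aref))
end program omp_outer_loop
```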

jimdempseyatthecove · ‎01-09-2007

Please note that

OMP PARALLEL / OMP Sections / ... / OMP END PARALLEL

may be initiated within a parallel section, i.e. when using nested parallel sections.

The OMP_IN_PARALLEL() should have caught that though

If the problem is not due to OpenMP threading, then it could potentially be a compiler bug in loop unrolling or in autoparallelization.

The use of the temporary should have caused the summation to run faster.

The summation loops are a good candidate for explicit parallelization and vectorization.

Your nested loop looks like it is a good candidate for OpenMP with vectorization.

Rework to use

!dec$ attributes align : 16 :: t_sum
real(8), automatic :: t_sum(2)

Then change the inner loop to run ib in two steps at a time.
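Roughly what I have in mind, as an untested sketch. The sizes, the random test data, the helper m2 and the reference array Aref are assumptions added so the example can be checked on its own. The !dec$ alignment directive and the AUTOMATIC attribute from the declaration above are compiler extensions and are left out here so the sketch stays portable. An odd M needs one remainder step per jb.

```fortran
program two_step_sum
  implicit none
  integer, parameter :: M = 5, N = 4        ! odd M exercises the remainder step
  real(8) :: A(M,N), Aref(M,N), B(M,N), K(2*M-1,2*N-1)
  real(8) :: t_sum(2)                        ! two independent partial sums
  integer :: ia, ja, ib, jb, m2

  call random_number(B)                      ! assumed test data
  call random_number(K)
  A    = 0.d0
  Aref = 0.d0
  m2   = M - mod(M,2)                        ! largest even count <= M

  do ja = 1, N; do ia = 1, M
    t_sum = 0.d0
    do jb = 1, N
      do ib = 1, m2, 2                       ! two ib values per trip
        t_sum(1) = t_sum(1) + K(M-ia+ib,   N-ja+jb)*B(ib,  jb)
        t_sum(2) = t_sum(2) + K(M-ia+ib+1, N-ja+jb)*B(ib+1,jb)
      end do
      ! Remainder step for odd M: the last ib value is M.
      if (m2 < M) t_sum(1) = t_sum(1) + K(2*M-ia, N-ja+jb)*B(M,jb)
    end do
    A(ia,ja) = A(ia,ja) + t_sum(1) + t_sum(2)
  end do; end do

  ! Straightforward reference: accumulate directly into the array.
  do ja = 1, N; do ia = 1, M
    do jb = 1, N; do ib = 1, M
      Aref(ia,ja) = Aref(ia,ja) + K(M-ia+ib, N-ja+jb)*B(ib,jb)
    end do; end do
  end do; end do

  ! The two versions sum in a different order, so allow for rounding.
  if (maxval(abs(A - Aref)) > 1.d-12) error stop 'two-step sum mismatch'
  print *, 'max abs difference:', maxval(abs(A - Aref))
end program two_step_sum
```

Whether the compiler actually turns the two partial sums into packed SSE operations depends on the options used, so timings need to be measured rather than assumed.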

What is in K? A selector (0/1) or a scale factor?

Jim

mdobrica · ‎01-09-2007

Hi again and thnx for your answer. I'm not using nested parallel sections, and I guess you're right in suspecting an autoparallelization or loop unrolling bug.

The use of the temporary did cause the summation to run faster, and I found it to be even faster (1.02s) and also correct if it is applied only to the innermost loop (thus allowing the use of the ja, jb, ia, ib looping order). It is interesting that the optimal correct single-thread execution takes exactly twice the time of the fastest incorrect single-thread execution (which, in turn, equals the fastest execution in OMP with 2 threads). This gives the idea that the loop gets autoparallelized, and that the compiler probably doesn't use a reduction clause. This is strange, however, since I turned off the autoparallelization option of the compiler.

Running in two steps at a time worsens computing time in both parallel and single-thread execution (3.0s in single-thread). Maybe I wrote something wrong, although I think it's related to misprediction by the CPU. K is a scale factor.

The issue, however, is not making this particular loop run faster (although it's an interesting exercise). I'm concerned since I have lots of loops in my code that don't normally need parallelization (since they are only called once in a while), and now I find myself forced to check the correct execution of each loop to make sure I don't get wrong results.

jimdempseyatthecove · ‎01-11-2007

I would suggest configuring the code where it looks correct but produces incorrect results. Then compile with optimizations off and on. Also experiment with disabling SSE instructions. Assuming a temporal dependency is not at issue, if you can identify a failure mode between options then this would indicate a compiler bug. A simple test app could be created and submitted to the Premier site.

The double speed can be due to vectorization (use of SSE to compute 2 REAL(8) or 4 REAL(4) operations in one instruction). By turning on/off the SSE instructions you can affect the vectorization code.
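A test app along these lines might be enough, as a sketch only. The sizes, the fill values and the names A_direct, A_temp and rel_diff are assumptions for illustration, not taken from the original code. It runs the suspect loop order with direct accumulation and the temporary-sum variant that was reported to give correct results, then prints the largest relative difference. Build it once per option set discussed in this thread (optimizations off and on, /Op, OpenMP processing on and off) and compare the printed value.

```fortran
program loop_repro
  implicit none
  integer, parameter :: M = 40, N = 40      ! assumed sizes, within the M,N<100 range reported
  real(8) :: A_direct(M,N), A_temp(M,N), B(M,N), K(2*M-1,2*N-1)
  real(8) :: t_sum, rel_diff
  integer :: ia, ja, ib, jb, i, j

  ! Deterministic, strictly positive fill values (assumed) so runs are repeatable.
  do j = 1, 2*N-1; do i = 1, 2*M-1
    K(i,j) = 1.d0/real(i+j, 8)
  end do; end do
  do j = 1, N; do i = 1, M
    B(i,j) = 1.d0 + 0.01d0*real(i+j, 8)
  end do; end do

  A_direct = 0.d0
  A_temp   = 0.d0

  ! Suspect form: accumulate directly into the array element.
  do ja = 1, N; do ia = 1, M
    do jb = 1, N; do ib = 1, M
      A_direct(ia,ja) = A_direct(ia,ja) + K(M-ia+ib, N-ja+jb)*B(ib,jb)
    end do; end do
  end do; end do

  ! Variant with a scalar temporary, reported in this thread to give correct results.
  do ja = 1, N; do ia = 1, M
    t_sum = 0.d0
    do jb = 1, N; do ib = 1, M
      t_sum = t_sum + K(M-ia+ib, N-ja+jb)*B(ib,jb)
    end do; end do
    A_temp(ia,ja) = A_temp(ia,ja) + t_sum
  end do; end do

  ! A value near rounding level means both forms agree; a large value flags the failure.
  rel_diff = maxval(abs(A_direct - A_temp))/maxval(abs(A_temp))
  print *, 'max relative difference:', rel_diff
end program loop_repro
```

If the difference is at rounding level with one option set and far larger with another, that pair of option sets is the thing to report.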

Jim