I am currently testing OpenMP in a big loop in my FORTRAN code. The code is part of a simulation module which is called from a VB.NET user interface; this interface also does the timing measurements. (I mention this only to show that I don't use wtime or cpu_time for the timing measurements.)
Now when I repeatedly start a simulation with my parallelized loop, I always get different simulation times, ranging, in one example, from 1 min 30 sec to almost 3 min! The results are always correct.
I tried different schedules for the loop (static, guided, dynamic), I tried to calculate the chunks assigned to each thread manually (do i=1,N -> do i=i_start,i_end), and I tried changing the number of threads taking part in the calculation of the loop - none of this changed the situation. When I remove the OpenMP directives from the code, the effect does not occur, so they must be the reason for this behavior.
My machine is a quad-core Intel Xeon X3470 @ 2.93 GHz running Win7. I tried to run the program with multithreading both enabled and disabled in the BIOS; however, this also didn't change anything.
Do you have any ideas what could go wrong? Thanks in advance for your thoughts.
Are you running on a dedicated machine, or is it a multi-user environment? If the latter, you can try setting some OpenMP environment variables that affect thread scheduling:
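For example, settings of roughly this kind (the exact variable list from the original reply was not preserved here; the variables below are real Intel-runtime/OpenMP variables, given purely as an illustration of the kind of settings meant):

```shell
# Illustrative only: commonly suggested settings for a dedicated machine
export OMP_NUM_THREADS=4        # one thread per physical core
export KMP_BLOCKTIME=infinite   # threads spin instead of sleeping between regions
export KMP_AFFINITY=compact     # bind threads to cores
```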
These are not the default values, and such settings will bias the OpenMP runtime to run your code as quickly as possible (at the expense of other running processes).
Intel Developer Support
Like I said, my results are always fine, so I can't really believe that there's an error (race condition, ...) in my code. When I remove the OpenMP directives from the loop I want to parallelize, runtimes are constant again. But there are other loops in my code which are already parallelized with OpenMP, which tells me that my OpenMP environment generally works.
A rather innocuous call is the random number generator. And there are others.
Should you have a high frequency of these calls, then whenever two threads attempt the call concurrently, one thread will block at the library's critical section (delaying execution). When the threads do not collide at the call, each passes through with little delay. Since you have an almost 2x difference, you should be able to locate the problem using a profiler (or do a Break-All and continue until you narrow in on the blockage point).
It's always that in about 2 of 3 runs the parallelized loop is faster than the sequential one, but the third run is way slower than both the other two runs and the sequential one. Can the reason for the fluctuation be that I have a couple of Atomic statements in my loop, some of which are only executed if a certain if-condition is fulfilled? But if that was the reason, why are there still fluctuations when I set the OpenMP schedule to dynamic or guided?
Are you performing memory allocation in your parallelized loop?
Memory allocation, either explicitly with ALLOCATE or implicitly with -heap-arrays, requires calling a library function which passes through a critical section.
Can you show the !$OMP ATOMIC statement?
Please indicate if the operation is integer or floating point.
Floating point requires a small loop containing a tentative operation and a compare with conditional swap. Sometimes high contention in !$OMP ATOMIC on floating point will cause excessive retries by one or the other thread(s). Consider using a variable local to the thread, then perform a reduction just prior to exit of the parallel region.
If you run a profiler you should be able to see if the atomic is the root cause of the slowdown.
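A minimal sketch of that last suggestion (the names `acc` and `contribution` are illustrative, not from the actual code):

```fortran
! Sketch: accumulate into a thread-private copy via REDUCTION instead of
! hitting a shared variable with !$OMP ATOMIC on every iteration.
! "contribution(n)" stands for whatever the loop body computes.
acc = 0.0D0
!$OMP PARALLEL DO REDUCTION(+:acc)
DO n = 1, NumEl
   acc = acc + contribution(n)   ! private partial sum; no atomic needed
END DO
!$OMP END PARALLEL DO
! the private partial sums are combined once, at the end of the region
```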
[fortran]
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP PRIVATE(n,k,nk,i,j,l,List,Vx,Vz,cS,AE1,RootCh,Ec1,Ec2,Ec3,FcE,GcE,VxE,VzE,SMuL1,SMuL2) &
!$OMP PRIVATE(W1,W2,W3,Wx,Wz,S,i1,j1,AcE,j2,ic,iB,kk,iBound,i2) &
!$OMP FIRSTPRIVATE(NumSEL) REDUCTION(-:Cum0,Cum1) REDUCTION(+:CumR)
DO n=1, NumEl ! Loop on subelements
  DO k=1, Elements(n)%nCorners-2
    nk=(k-1)*3
    !
    i=Elements(n)%KX(1)
    j=Elements(n)%KX(k+1)
    l=Elements(n)%KX(k+2)
    List(1)=i
    List(2)=j
    List(3)=l
    !
    IF (Level == NLevel) THEN
      Vx(1)=Nodes(i)%VxO
      Vx(2)=Nodes(j)%VxO
      Vx(3)=Nodes(l)%VxO
      Vz(1)=Nodes(i)%VzO
      Vz(2)=Nodes(j)%VzO
      Vz(3)=Nodes(l)%VzO
    ELSE
      Vx(1)=Nodes(i)%VxN
      Vx(2)=Nodes(j)%VxN
      Vx(3)=Nodes(l)%VxN
      Vz(1)=Nodes(i)%VzN
      Vz(2)=Nodes(j)%VzN
      Vz(3)=Nodes(l)%VzN
    END IF
    !
    cS=cBound(sol,5)
    cS=(MIN(cS,Nodes(i)%Conc(sol))+MIN(cS,Nodes(j)%Conc(sol))+MIN(cS,Nodes(l)%Conc(sol)))/3.0D0
    !
    AE1=Elements(n)%xMul(k)*Elements(n)%Area(k)*dt*Eps
    RootCh=AE1*cS*(Nodes(i)%Sink+Nodes(j)%Sink+Nodes(l)%Sink)/3.0D0
    Cum0=Cum0-AE1*(Nodes(i)%Gc1+Nodes(j)%Gc1+Nodes(l)%Gc1)/3.0D0
    Cum1=Cum1-AE1*(Nodes(i)%Fc1+Nodes(j)%Fc1+Nodes(l)%Fc1)/3.0D0
    CumR=CumR+RootCh
    Ec1=(Nodes(i)%Dispxx+Nodes(j)%Dispxx+Nodes(l)%Dispxx)/3.0D0
    Ec2=(Nodes(i)%Dispxz+Nodes(j)%Dispxz+Nodes(l)%Dispxz)/3.0D0
    Ec3=(Nodes(i)%Dispzz+Nodes(j)%Dispzz+Nodes(l)%Dispzz)/3.0D0
    !
    IF (Level == NLevel) AcE=(Nodes(i)%Ac+Nodes(j)%Ac+Nodes(l)%Ac)/3.0D0
    !
    FcE=(Nodes(i)%Fc+Nodes(j)%Fc+Nodes(l)%Fc)/3.0D0
    GcE=(Nodes(i)%Gc+Nodes(j)%Gc+Nodes(l)%Gc)/3.0D0
    VxE=(Vx(1)+Vx(2)+Vx(3))/3.0D0
    VzE=(Vz(1)+Vz(2)+Vz(3))/3.0D0
    SMul1=-Elements(n)%AMul(k)
    SMul2=Elements(n)%Area(k)/20.0D0*Elements(n)%XMul(k)
    !
    IF (lUpw) THEN
      W1=WeTab(1,(n-1)*(Elements(n)%nCorners-2)+k)
      W2=WeTab(2,(n-1)*(Elements(n)%nCorners-2)+k)
      W3=WeTab(3,(n-1)*(Elements(n)%nCorners-2)+k)
      Wx(1)=2.0D0*Vx(1)*(W2-W3)+Vx(2)*(W2-2.0D0*W3)+Vx(3)*(2.0D0*W2-W3)
      Wx(2)=Vx(1)*(2.0D0*W3-W1)+2.0D0*Vx(2)*(W3-W1)+Vx(3)*(W3-2.0D0*W1)
      Wx(3)=Vx(1)*(W1-2.0D0*W2)+Vx(2)*(2.0D0*W1-W2)+2.0D0*Vx(3)*(W1-W2)
      Wz(1)=2.0D0*Vz(1)*(W2-W3)+Vz(2)*(W2-2.0D0*W3)+Vz(3)*(2.0D0*W2-W3)
      Wz(2)=Vz(1)*(2.0D0*W3-W1)+2.0D0*Vz(2)*(W3-W1)+Vz(3)*(W3-2.0D0*W1)
      Wz(3)=Vz(1)*(W1-2.0D0*W2)+Vz(2)*(2.0D0*W1-W2)+2.0D0*Vz(3)*(W1-W2)
    END IF
    !
    DO j1=1, 3
      i1=List(j1)
      !$OMP ATOMIC
      Nodes(i1)%F=Nodes(i1)%F+Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
      IF (Level==NLevel) THEN
        !$OMP ATOMIC
        Nodes(i1)%DS=Nodes(i1)%DS+Elements(n)%GMul(k)*(AcE+Nodes(i1)%Ac/3.0D0)
      END IF
      iBound=0
      IF (Nodes(i)%Kode/=0) THEN
        BP_Loop : DO id=1, NumBP
          IF ((KXB(id)==i1) .AND. (KodCB(id) > 0)) THEN
            iBound=1
            EXIT BP_Loop
          END IF
        END DO BP_Loop
      END IF
      !
      DO j2=1, 3
        i2=List(j2)
        S(j1,j2)=SMul1*(Ec1*Elements(n)%dz(nk+j1)*Elements(n)%dz(nk+j2)+ &
                        Ec3*Elements(n)%dx(nk+j1)*Elements(n)%dx(nk+j2)+ &
                        Ec2*(Elements(n)%dz(nk+j1)*Elements(n)%dx(nk+j2)+ &
                             Elements(n)%dx(nk+j1)*Elements(n)%dz(nk+j2)))
        S(j1,j2)=S(j1,j2)-(Elements(n)%dz(nk+j2)/8.0D0*(VxE+Vx(j1)/3.0D0)+ &
                           Elements(n)%dx(nk+j2)/8.0D0*(VzE+Vz(j1)/3.0D0))*Elements(n)%xMul(k)
        IF (lUpw) S(j1,j2)=S(j1,j2)-Elements(n)%xMul(k)* &
                           (Elements(n)%dz(nk+j2)/40.0D0*Wx(j1)+ &
                            Elements(n)%dx(nk+j2)/40.0D0*Wz(j1))
        ic=1
        IF (i1==i2) ic=2
        S(j1,j2)=S(j1,j2)+SMul2*ic*(FcE+(Nodes(i1)%Fc+Nodes(i2)%Fc)/3.0D0)
        IF (iBound==1) THEN
          IF (j2.eq.1) THEN
            !$OMP ATOMIC
            Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)- &
                              Eps*Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
          ELSE
            !$OMP ATOMIC
            Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)
          END IF
        END IF
        IF (Level/=NLevel) THEN
          !$OMP ATOMIC
          B(i1)=B(i1)-alf*S(j1,j2)*Nodes(i2)%Conc(sol)
        ELSE
          IF (lOrt) THEN
            CALL rFIND(i1,i2,kk,NumNP,MBandD,IAD,IADN)
            iB=kk
          ELSE
            iB=MBand+i2-i1
          END IF
          !$OMP ATOMIC
          A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
        END IF
      END DO
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
[/fortran]
What I suggest you do is determine whether only a subset of each thread's iteration space can actually collide with another thread's. Suppose, for example, the loop runs over n = 1:100 and the number of threads = 2; then (possibly) thread 0 iterates n through 1:50 and thread 1 iterates n through 51:100. In that case, is the possibility of collision only at n=50 for thread 0 while n=51 for thread 1?
I understand sometimes this question is not easily answered, but you could add code to make this determination. Often you will find that an algorithm has data usage that will not collide amongst different threads (e.g. iteration space divided by volume, with the possibility of interaction at the walls of a volume but not in its interior). The potential collision points may vary depending on the number of threads and/or loop schedule method, as well as on how you divide up the iteration space. Should a good portion of a thread's iteration space operate without the possibility of interaction, then I suggest you add a flag to indicate whether atomic should be used or not.
Although it may seem redundant to code this way, atomic operations are quite expensive in terms of execution time.
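Since !$OMP ATOMIC cannot be switched on and off at run time, the flag would select between two copies of the update; a sketch (the array `needsAtomic` and the increment `dF` are hypothetical names, not from the actual code):

```fortran
! Sketch: use the atomic only for iterations flagged as collision-prone.
IF (needsAtomic(i1)) THEN
   !$OMP ATOMIC
   Nodes(i1)%F = Nodes(i1)%F + dF
ELSE
   ! this node is provably touched by only one thread: plain update
   Nodes(i1)%F = Nodes(i1)%F + dF
END IF
```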
I guess in that case it would be easier to create a kind of "lock" array with a length equal to the number of nodes in the mesh; when one thread is about to update a nodal value, it changes the "lock" array's value from 0 to 1, and afterwards back to zero. Doing it that way, it should be possible to replace each Atomic statement by a test of whether the "lock" array's value for a specific node is 0 or 1. However, it would of course require additional memory.
I will try to implement this, although my initial question wouldn't be answered doing it that way of course.
Self-made locks like

1 do while(lock(iG))
2    continue
3 end do
4 lock(iG) = .true.
5 B(iG) = B(iG) + ...
6 lock(iG) = .false.

did not really work, I guess because two threads can still get to line 4 simultaneously (both having just seen lock(iG) as false) and then execute line 5 together as well.
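The indivisible test-and-set that the hand-rolled do-while loop lacks is exactly what the OpenMP lock API provides; a sketch, assuming one lock per node (`nodeLock` is an illustrative name):

```fortran
! Sketch: per-node OpenMP locks. omp_set_lock blocks until the lock is
! free, and acquiring it is indivisible (unlike the do-while test above).
USE omp_lib
INTEGER(omp_lock_kind), ALLOCATABLE :: nodeLock(:)

ALLOCATE(nodeLock(NumNP))
DO i = 1, NumNP
   CALL omp_init_lock(nodeLock(i))
END DO

! ... inside the parallel loop:
CALL omp_set_lock(nodeLock(iG))
B(iG) = B(iG) + ...            ! protected update
CALL omp_unset_lock(nodeLock(iG))
```

Note that a fine-grained lock per node costs memory (like the proposed "lock" array) and is typically slower per update than !$OMP ATOMIC, so this is a correctness fix rather than a performance one.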
So it seems that the fluctuations are indeed due to all the Atomic statements. Do you see any other way of parallelizing the loop without rewriting it completely (and without using local variables for each thread, since that would be way too much overhead if big matrices A are to be constructed)?
For a given number of threads and a specific data set you may have only one set of possibilities for collision, and this subset is the same for the 1,000's or 1,000,000's of iterations that follow. If so, then you could have some once-only code that initializes a bit mask array to zero (each entry as wide as the number of OpenMP threads), then runs the same configured parallel loop with the atomics on all elements, plus an atomic OR of the current OpenMP thread's bit into the mask entry. When this once-only run (with all atomics and the bit OR into the mask) completes, run through the bit mask array building a LOGICAL array with the same number of elements, where an entry contains .true. if more than one bit is set in the corresponding bitmask entry. Then on subsequent iterations use the logical array as your indicator of whether to use ATOMIC or not.
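A sketch of that once-only pass (all names are illustrative; it assumes no more ranks than bits in a default INTEGER):

```fortran
! Once-only pass: record, per node, which threads update it.
mask = 0                                   ! INTEGER array, one entry per node
!$OMP PARALLEL DO PRIVATE(n, i1, myBit)
DO n = 1, NumEl
   myBit = ISHFT(1, omp_get_thread_num())  ! this thread's bit position
   ! ... same traversal as the real loop; for each node i1 it updates:
   !$OMP ATOMIC
   mask(i1) = IOR(mask(i1), myBit)
END DO
!$OMP END PARALLEL DO
! a node needs ATOMIC on later passes only if more than one thread touched it
needsAtomic = (POPCNT(mask) > 1)           ! LOGICAL array of the same size
```

The mask must be rebuilt whenever the thread count, schedule, or mesh changes, since any of these changes the collision pattern.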
Now then, it may end up that all the elements always have potential for collision. Where this occurs, consider segmenting the iteration space into multiple domains, where subset collections of these domains can be operated on by separate threads without the possibility of collision (or with greatly reduced collision). An example of this might be a chess board: you loop through all the black/dark squares first, then all the white/light squares second, and on each pass the collision points can only occur at the corner points (then consider other patterns where you have no possibility of collision at all).
What you are trading off here is the increased overhead of running a number of sub-sets of the data against the reduction in, or elimination of, the atomics.
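A sketch of the chess-board idea applied to an element loop (the colouring arrays are hypothetical and would come from a setup pass over the mesh):

```fortran
! Sweep the mesh colour by colour; elements of one colour share no nodes,
! so their nodal updates cannot collide and need no atomics.
DO ic = 1, nColours
   !$OMP PARALLEL DO PRIVATE(idx, n)
   DO idx = 1, nInColour(ic)
      n = colourList(idx, ic)
      ! ... assemble element n without any !$OMP ATOMIC ...
   END DO
   !$OMP END PARALLEL DO
END DO
```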
My loop goes through the elements as follows:
elem. node1 node2 node3
1 1 3 4
2 3 5 6
3 5 7 8
4 7 9 10
5 9 11 12
6 11 13 14
7 13 15 16
So you see that collisions can only occur if two threads work with elements whose element numbers are "close" (although the actual pattern can vary from mesh to mesh, depending on the mesh geometry; the above case is a very simple mesh). My first idea was to divide the element numbers into 4 (= number of threads) static blocks and apply an Atomic directive only for the first and last 20 elements of each block (an empirical estimate, taking into account some characteristics of my mesh generator). However, the fluctuations still occurred.
The second thing I did was to divide my big loop into 20 smaller, parallelized ones, in the following way:
!$OMP PARALLEL DO ...
!$OMP END PARALLEL DO
In this case fluctuations did not occur, which is expected since I removed all the Atomics from the parallelized inner loop. However, for large meshes this version showed very poor performance, probably because of the overhead of repeatedly creating parallel regions.
EDIT: Changing the value of 20 improves the performance a lot; there seems to be a value where maximum performance is achieved - small enough that the number of created parallel regions is not too big, and big enough to avoid trouble from concurrent write operations by the threads. I guess this optimum value depends on the structure of the generated meshes; I'll take a look into that.
Rather than creating 20 separate parallel regions, enclose the smaller loops in a single parallel region and make each one a worksharing loop:

!$OMP PARALLEL
! ... for each of the smaller loops:
!$OMP DO ...
!$OMP END DO
! note, there is an implicit BARRIER at the above statement
!$OMP END PARALLEL
The above will form one team once and then wait at the barriers (as opposed to repeated team build-up/tear-down).
There are potentially other non-barrier progress monitoring techniques you can use that will permit a skew in thread progress (i.e. reduce some wait time at barriers).
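One such technique: if consecutive worksharing loops are independent, the NOWAIT clause removes the barrier between them so faster threads can move on (a sketch):

```fortran
!$OMP PARALLEL
!$OMP DO
DO n = 1, NumEl
   ! ... first sub-loop ...
END DO
!$OMP END DO NOWAIT   ! no barrier: safe only if the next loop does not
                      ! read what this one wrote
!$OMP DO
DO n = 1, NumEl
   ! ... next sub-loop ...
END DO
!$OMP END DO          ! implicit barrier retained here
!$OMP END PARALLEL
```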
What portion of your application total run time is occupied within this loop?
If a small percentage then it might be best to look elsewhere.
You also might look at the overall requirements of the loop to see if a different organization might be more efficient in terms of multi-threading. It is not unusual for the best core technique for the serial method not to be the best for the multi-threaded method.