OpenMP runtime fluctuation

mamey4 · ‎06-29-2011

Hi,

I am currently testing OpenMP in a big loop in my FORTRAN code. The code is part of a simulation module which is called from a VB.NET user interface; this interface also does the timing measurements. So I start a simulation, and at the end the software shows me how long it took (I only write this to show that for timing measurements I don't use wtime or cpu_time).
Now when I repeatedly start a simulation with my parallelized loop, I always get different simulation times, reaching, in one example, from 1min30sec to almost 3min! The results are always correct.
I tried different schedules for the loop (static, guided, dynamic), I tried to calculate the chunks that are assigned to each thread manually (do i=1,N -> do i=i_start,i_end), I tried to change the number of threads taking part in the calculation of the loop - with no change of the situation. When I remove the OpenMP directives from the code this does not occur, so they must be the reason for this behavior.
My machine is a quadcore Intel Xeon CPU X3470 @2.93GHz with Win7 installed. I tried to run the program with both enabled and disabled multithreading (in the bios), however, this also didn't change anything.

Do you have any ideas what could go wrong? Thanks in advance for your thoughts.

Martin

pbkenned1 · ‎06-29-2011

Are you running on a dedicated machine, or is it a multi-user environment? If the latter, you can try setting some OpenMP environment variables that effect threading scheduling:

set KMP_LIBRARY=turnaround
set KMP_BLOCKTIME=0

Neither of the above are the default values, and these settings will bias the OMP runtime to run your code as quickly as possible (at the expense of other running processes).

Patrick Kennedy
Intel Developer Support

mamey4 · ‎06-30-2011

I tried different values for all the variables listed on http://www.slac.stanford.edu/comp/unix/package/intel_tools/ifort/mergedProjects/optaps_for/common/optaps_par_var.htm (I hope that's all of them). My runtimes differed slightly from the ones I had before, but big differences were still there (1min42sec vs. 2min43sec).
Like I said, my results are always fine, so I can't really believe that there's an error (race condition , ...) in my code. When I remove the OpenMP directives from the loop I want to parallelize, runtimes are constant again. But in my code there are other loops whcih are already parallelized with OpenMP, which tells me that my OpenMP environment generally works.

jimdempseyatthecove · ‎06-30-2011

Are you issuing library function calls that may have critical sections in them?

A rather innocuous call is the random number generator. And there are others.
Should you have a high frequency of these calls then depending on if there is an attempt at concurrent call, then one thread will block at the critical section (delaying execution). When the threads do not collide at the call, then each will pass through with little delay. Since you have an almost 2x difference, you should be able to locate the problem using a profiler. (or do a Break-All and continue until you narrow in on the blockage point.)

Jim Dempsey

mamey4 · ‎07-01-2011

I removed all external function calls from my parallelized do loop (by manually inserting their code into the loop; there were only two small functions and I think it doesn't mae a difference), however it did not change anything.
It's always that in about 2 of 3 runs, the parallelized loop is faster than the sequential one, but the third run is way slower than the both the other two runs and the sequential one. Can the reason for the fluctuation be that I have a couple of Atomic-statements in my loop, some of which are only executed if a certain if-condition is fulfilled? But if that was the reason, why are there still fluctuations when I set the OpenMP schedule to dynamic or guided?

jimdempseyatthecove · ‎07-01-2011

What are the two small functions? (yours or some library - if library name please)
Are you performing memory allocation in your parallelizedloop?
Memory allocation, either explicitly with ALLOCATE or implicitly with -heaparrays, requires callinga library function whichpasses through a critical section.

Jim

mamey4 · ‎07-01-2011

There were no allocate statements in the small functions; these were custom functions, not from any public libraries. In the whole do-loop there are no allocate statements, it basically consists of a number of variable assignments, some of which are equipped with an OpenMP Atomic directive.

jimdempseyatthecove · ‎07-01-2011

Can you show the !$OMP ATOMIC statement?
Please indicate if the operation is integer or floating point.

Floating point requires a small loop containing a tentative operation and a compare with conditional swap. Sometimes hight contention in !$OMP ATOMIC on floating piont will cause excessive retries by one or the other thread(s). Consider using a local variable (to thread) then perform a reduction just prior to exit of the parallel region.

If you run a profiler you shoiuld be able to see if the atomic is the root cause for the slow down.

Jim Dempsey

mamey4 · ‎07-03-2011

Here's the whole loop. The Atomic-statements are always protecting the operations on the global Nodes-array, which is accessible for all threads. They do involve floating point operations, however, I could of course replace all of the Atomic-statements by using local variables, but these would have to be arrays of size size_of_Nodes, which can be a very large number.

[bash]    !$OMP PARALLEL DO DEFAULT(SHARED) &
    !$OMP PRIVATE(n,k,nk,i,j,l,List,Vx,Vz,cS,AE1,RootCh,Ec1,Ec2,Ec3,FcE,GcE,VxE,VzE,SMuL1,SMuL2) &
    !$OMP PRIVATE(W1,W2,W3,Wx,Wz,S,i1,j1,AcE,j2,ic,iB,kk,iBound,i2) &
    !$OMP FIRSTPRIVATE(NumSEL) REDUCTION(-:Cum0,Cum1) REDUCTION(+:CumR)
        DO n=1, NumEl
    !       Loop on subelements   
            DO k=1, Elements(n)%nCorners-2
                nk = (k-1) * 3
    !           
                i=Elements(n)%KX(1)
                j=Elements(n)%KX(k+1)
                l=Elements(n)%KX(k+2)
                List(1)=i
                List(2)=j
                List(3)=l
    !           
                IF(Level == NLevel) THEN
                    Vx(1)=Nodes(i)%VxO
                    Vx(2)=Nodes(j)%VxO
                    Vx(3)=Nodes(l)%VxO
                    Vz(1)=Nodes(i)%VzO
                    Vz(2)=Nodes(j)%VzO
                    Vz(3)=Nodes(l)%VzO
                ELSE
                    Vx(1)=Nodes(i)%VxN
                    Vx(2)=Nodes(j)%VxN
                    Vx(3)=Nodes(l)%VxN
                    Vz(1)=Nodes(i)%VzN
                    Vz(2)=Nodes(j)%VzN
                    Vz(3)=Nodes(l)%VzN
                END IF
    !           
                cS=cBound(sol,5)
                cS=(MIN(cS,Nodes(i)%Conc(sol))+MIN(cS,Nodes(j)%Conc(sol))+MIN(cS,Nodes(l)%Conc(sol)))/3.0D0
    !           
                AE1=Elements(n)%xMul(k)*Elements(n)%Area(k)*dt*Eps
                RootCh=AE1*cS*(Nodes(i)%Sink+Nodes(j)%Sink+Nodes(l)%Sink)/3.0D0
                Cum0=Cum0-AE1*(Nodes(i)%Gc1+Nodes(j)%Gc1+Nodes(l)%Gc1)/3.0D0
                Cum1=Cum1-AE1*(Nodes(i)%Fc1+Nodes(j)%Fc1+Nodes(l)%Fc1)/3.0D0
                CumR=CumR+RootCh
                Ec1=(Nodes(i)%Dispxx+Nodes(j)%Dispxx+Nodes(l)%Dispxx)/3.0D0
                Ec2=(Nodes(i)%Dispxz+Nodes(j)%Dispxz+Nodes(l)%Dispxz)/3.0D0
                Ec3=(Nodes(i)%Dispzz+Nodes(j)%Dispzz+Nodes(l)%Dispzz)/3.0D0
    !
                IF (Level == NLevel) AcE=(Nodes(i)%Ac+Nodes(j)%Ac+Nodes(l)%Ac)/3.0D0
    !
                FcE=(Nodes(i)%Fc+Nodes(j)%Fc+Nodes(l)%Fc)/3.0D0
                GcE=(Nodes(i)%Gc+Nodes(j)%Gc+Nodes(l)%Gc)/3.0D0
                VxE=(Vx(1)+Vx(2)+Vx(3))/3.0D0
                VzE=(Vz(1)+Vz(2)+Vz(3))/3.0D0
                SMul1=-Elements(n)%AMul(k)
                SMul2=Elements(n)%Area(k)/20.0D0*Elements(n)%XMul(k)
    !
                If (lUpw) THEN
                    W1=WeTab(1,(n-1)*(Elements(n)%nCorners-2)+k)
                    W2=WeTab(2,(n-1)*(Elements(n)%nCorners-2)+k)
                    W3=WeTab(3,(n-1)*(Elements(n)%nCorners-2)+k)
                    Wx(1)=2.0D0*Vx(1)*(W2-W3)+Vx(2)*(W2-2.0D0*W3)+Vx(3)*(2.0D0*W2-W3)
                    Wx(2)=Vx(1)*(2.0D0*W3-W1)+2.0D0*Vx(2)*(W3-W1)+Vx(3)*(W3-2.0D0*W1)
                    Wx(3)=Vx(1)*(W1-2.0D0*W2)+Vx(2)*(2.0D0*W1-W2)+2.0D0*Vx(3)*(W1-W2)
                    Wz(1)=2.0D0*Vz(1)*(W2-W3)+Vz(2)*(W2-2.0D0*W3)+Vz(3)*(2.0D0*W2-W3)
                    Wz(2)=Vz(1)*(2.0D0*W3-W1)+2.0D0*Vz(2)*(W3-W1)+Vz(3)*(W3-2.0D0*W1)
                    Wz(3)=Vz(1)*(W1-2.0D0*W2)+Vz(2)*(2.0D0*W1-W2)+2.0D0*Vz(3)*(W1-W2)
                END IF
    !
                DO j1=1, 3
                    i1=List(j1)
    !$OMP           ATOMIC
                    Nodes(i1)%F=Nodes(i1)%F+Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
                    IF (Level==NLevel) then
    !$OMP               ATOMIC
                        Nodes(i1)%DS=Nodes(i1)%DS+Elements(n)%GMul(k)*(Ace+Nodes(i1)%Ac/3.0D0)
                    end if
                    iBound=0
                    IF (Nodes(i)%Kode/=0) THEN
                        BP_Loop : DO id=1, NumBP
                            IF((KXB(id)==i1) .AND. (KodCB(id) > 0)) THEN
                                iBound=1
                                EXIT BP_Loop
                            END IF
                        END DO BP_Loop
                    END IF
    !
                    DO j2=1, 3
                        i2=List(j2)
                        S(j1,j2)=SMul1*(Ec1*Elements(n)%dz(nk+j1)*Elements(n)%dz(nk+j2)+ &
                                        Ec3*Elements(n)%dx(nk+j1)*Elements(n)%dx(nk+j2)+ &
                                       Ec2*(Elements(n)%dz(nk+j1)*Elements(n)%dx(nk+j2)+ &
                                            Elements(n)%dx(nk+j1)*Elements(n)%dz(nk+j2)))
                                           
                        S(j1,j2)=S(j1,j2)-(Elements(n)%dz(nk+j2)/8.0D0*(VxE+Vx(j1)/3.0D0)+ &
                                           Elements(n)%dx(nk+j2)/8.0D0*(VzE+Vz(j1)/3.0D0))*Elements(n)%xMul(k)
                                           
                        IF(lUpw) S(j1,j2)=S(j1,j2)-Elements(n)%xMul(k)* &
                                                  (Elements(n)%dz(nk+j2)/40.0D0*Wx(j1)+ &
                                                   Elements(n)%dx(nk+j2)/40.0D0*Wz(j1))
                                                   
                        ic=1
                        IF (i1==i2) ic=2
                        S(j1,j2)=S(j1,j2)+SMul2*ic*(FcE+(Nodes(i1)%Fc+Nodes(i2)%Fc)/3.0D0)
                        IF (iBound==1) then
                            if(j2.eq.1) then
    !$OMP                       ATOMIC
                                Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)-Eps*Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
                            else
    !$OMP                       ATOMIC
                                Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)
                            end if
                        end if
                        IF (Level/=NLevel) THEN
    !$OMP                   ATOMIC
                            B(i1)=B(i1)-alf*S(j1,j2)*Nodes(i2)%Conc(sol)
                        ELSE
                            IF (lOrt) THEN
                                CALL rFIND(i1,i2,kk,NumNP,MBandD,IAD,IADN)
                                iB=kk
                            ELSE
                                iB=MBand+i2-i1
                            END IF
    !$OMP                   ATOMIC
                            A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
                        END IF
                    END DO
                END DO
            END DO
        END DO
    !$OMP END PARALLEL DO[/bash]

jimdempseyatthecove · ‎07-04-2011

The number of atomics in your loop is neither small nor infrequently used.
What I suggest you do is to determine if only a subset of each thread's

i=Elements(n)%KX(1)
j=Elements(n)%KX(k+1)
l=Elements(n)%KX(k+2)
List(1)=i
List(2)=j
List(3)=l

Elements collide

For example:

should n=1:100
and number of threads = 2
then (possibly) thread 0 iterates n through 1:50 and thread 1 iterates n through51:100
... then is the possibility of collision only at n=50 for thread 0 while n=51 for thread 1?

I understand sometimes this question is not easily answered but you could add code to make this determination.But often you will find that an algorithm will have data usagethat will not collide amongst different threads (e.g. iteration space divided by volume with possibility of interaction at walls of volume but not in interior of volume). The potential collision points may vary depending on the number of threadsand/or loop schedule method as well as how you divide up the iteration space.Should a good portion of a threads iteration space operate without possibility for interaction, then I suggest you add a flag to indicate if atomic should be used or not

if(CollisionPossible(n)) then

!$OMPATOMIC
A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
else
A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
endif

Although it may seem redundant to code this way, atomic operations are quite expensive in terms of execution time.

Jim Dempsey

mamey4 · ‎07-04-2011

That is a good idea, and I have already given this some thought since, as you said, the number of possible collision points is indeed quite small. This number depends on how the finite element mesh on which my simulation is based is designed - this means that if one thread is including the information from element1 into the global matrix, then all the other threads must not do the same for elements that border to element1. In order to check if such a situation occurs I think I would have to run through the whole adjacency graph, for each element iteration.
I guess in that case it would be easier to create a kind of "lock"-array that has a length equal to the number of nodes in the mesh; then when one thread tries to update a nodal value, it changes the "lock"-array's value from 0 to 1, and afterwards back to zero. Doing it that way it should be possible to replace each Atomic-statement by a test if the "lock"-array's value for a specific value is 0 or 1. However, it would of course require additional memory.
I will try to implement this, although my initial question wouldn't be answered doing it that way of course.

mamey4 · ‎07-05-2011

I implemented the method described in my previous post by surrounding each Atomic-protected statement with a CALL OMP_SET_LOCK(lock(i1)) - CALL OMP_UNSET_LOCK(lock(i1)) pair. The variable i1 corresponds to the number of the node whose arrays are currently modified (so there are as many locks as nodes). Doing it that way the fluctuations seemed to vanish, however, the simulation runtime was very long, even longer then the longest time when runtimes still oscillated.
Self-made locks like

1 do while(lock(iG))
2 iG=iG
3 end do
4 lock(iG)=.true.
5 B(iG) = B(iG) + ...
6 block(iG)=.false.

did not really work, I guess because threads can still get to line 4 simultaneously and then execute line 5 together as well.
So it seems that the fluctuations are indeed due to all the Atomic statements. Do you see any other way of parallelizing the loop without rewriting it completely (and without using local variables for each thread; since that would be way too much overhead if big matrices A are to be constructed)?

jimdempseyatthecove · ‎07-05-2011

The element by element lock will take as long as, if not longer then, atomic. What you want to do is to reduce the number of locks (or atomics) to those elements that may have a collision.

For a given number of threads and specific data set you may have only one set of possibilities for collision. And this subset is the same for the 1,000's or 1,000,000's of iterations that follow. If so, then you could have some once-only code that initializes a bit mask to(as wide as the number of OpenMP threads), runs the same configured parallel loop with theatomics on all elements plus an atomic on a bit OR into the mask of the OpenMP thread bit position. When this once only run (with all atomics and bit OR into mask) completes then run through the bit mask array building a LOGICAL array with same number ofelements where the array contains a .true. if more than one bit is set in the bitmask array. Then on subsequent iterations use the logical array as you indicator as to use ATOMIC or not.

Now then, it may end up that all the elements always have potential for collision. Under situations where this occurs, then consider segmenting the iteration space up into multiple domains and wheresubset collections of thesedomains can be operated on by seperate threads without possibility for collision (or greatly reduced collision). An example of this might be that of a chess board and you loop through all the black/dark squares first, then all the white/light squares second and where on each pass the collision points can only occur at the corner points (then consider other patterns where you have no possibility for collision).

What you are trading off here is the increase in overhead of running anumber of sub-sets of the data against the reduction in, or elimination of, the atomics.

Jim Dempsey

mamey4 · ‎07-07-2011

I believe your first solution would require a lot of extra memory, especially for Nodes-Arrays of 100.000+ entries, so I tried your second suggestion.
My loop goes through elements, each of which changes the values of three nodes. The following shows this:

elem. node1 node2 node3
1 1 3 4
2 3 5 6
3 5 7 8
4 7 9 10
5 9 11 12
6 11 13 14
7 13 15 16
...

So you see that concurrency can only occur if two threads work with elements whose element numbers are "close" (although the actual pattern could vary from mesh to mesh, depending on the mesh geometry; in the above case it's a very simple mesh). My first idea was to divide the element numbers into 4 (number of threads) static blocks, and apply an Atomic directive for the first and last 20 (an empiric estimate, taking into account some characteristics from my mesh generator) elements of each block. However, fluctuations still occured.
The second thing I did was to divide my big loop into 20 smaller, parallelized ones, in the following way:

do ivar=1,20
!$OMP PARALLEL DO ...
do jvar=ivar,NumEL,20

end do
!$OMP END PARALLEL DO
end do

In this case fluctuations did not occur, which is expected since I removed all the Atomics from the parallelized inner loop. However, for large meshes this version showed very poor performance, probably because of the overhead of repeatedly creating parallel regions.

EDIT: Changing the value of 20 does improve the performance a lot, there seems to be a value where maximum performance can be achieved - small enough so that the number of created parallel regions is not too big, and big enough to not get into trouble due to concurrent writing operations of the threads. I guess this optimum value depends on the structure of the generated meshes, I'll take a look into that.

jimdempseyatthecove · ‎07-07-2011

Take timings of your current scheme then make a minor change to your !$OMP statements, and make timing runs again:

!$OMP PARALLEL...
do ivar=1,20
!$OMP DO ...
do jvar=ivar,NumEL,20

end do
!$OMP END DO
! note, there is an implicit BARRIER at above statement
end do
!$OMP END PARALLEL

The above will form one team once then wait at barriers (as opposed to team build-up/tear-down).

There are potentially other non-barrier progress monitoring techniques you can use that will permit a skew in threads progress (i.e. reduce some wait time at barriers).

Jim Dempsey

mamey4 · ‎07-07-2011

The two timings are quite similar, I guess this means that creating and destroying parallel regions is not an important time factor in the loop. I am confused however why I don't get much longer runtimes with your suggested code, since I would expect every thread to create a "do ivar=1,20" loop for its own in the upper situation.

jimdempseyatthecove · ‎07-08-2011

The time to create and destroy (suspend) thread teams at !$OMP PARALLEL... !$OMP END PARALLEL seems to vary depending on the O/S and number of processors (sockets). Once inside the parallel region the barriers (implicit at !$OMP END DO) tend to be less of an issue.

What portion of your application total run time is occupied within this loop?
If a small percentage then it might be best to look elsewhere.

You also might look at the overall requirements of the loop to see if a different organization might be more efficient in terms of multi-threading. It is not unusual for the best core techniquefor the serial method to not be the best for multi-threaded method.

Jim Dempsey