openmp parallelization problem

ffgarcia · ‎04-25-2010

Dear users,

I recently got introduced to openMP and I would like to parallelize a serial f77 subroutine using openMP. All the examples that I've seen online focus on a single DO loop and that did not help much. I have multiple DO loops and several temporary variables in my code and I was wondering if experienced users can help me figure out what is wrong with my openmp implementation and help me define the commands (private, shared, reduction etc.) correctly. Please note that the variables FK, POS and the variables in the common block are used by other subroutines.

SUBROUTINE FFULL(FCOMP, POS)

C ACCEPTS POS AS INPUT AND RETURNS FCOMP

IMPLICIT NONE

INTEGER MAXK, NMAX, NAT

INTEGER TOTK

INTEGER KMAX, KX, KY, KZ, I, KSQMAX, KSQ, J

PARAMETER ( MAXK = 50000, NMAX = 1000 )

DOUBLE PRECISION KVEC(MAXK), POS(3,NMAX),Z(NMAX)

DOUBLE PRECISION KAPPA, FCOMP(3, NMAX), INV_VOL

DOUBLE PRECISION PI, TWOPI, S1, S2, L

DOUBLE PRECISION RKX, RKY, RKZ, DP, ZI, ZJ

DOUBLE PRECISION FX, FY, FZ, RX, RY, RZ

DOUBLE PRECISION SUM1(MAXK), SUM2(MAXK)

CHARACTER*2 SYMB(NMAX)

COMMON/KPARAM/KAPPA,KMAX,KSQMAX

COMMON/BLOCK1/L, Z, NAT, SYMB

COMMON/BLOCK2/KVEC, SUM1, SUM2

C *******************************************************************

C DEFINE SOME VARIABLES THAT WOULD BE NEEDED LATER

PI=2.D0*DASIN(1.D0)

TWOPI = 2.D0*PI

INV_VOL = 1.D0/(L**3)

C INITIALIZE FCOMP TO ZERO

DO I = 1, NAT

DO J = 1, 3

FCOMP(J, I) = 0.D0

END DO

C BEGIN OPENMP HERE

c$omp parallel

c$omp& shared(FK,POS)

C$omp& private(I,RX,RY,RZ,ZI,KX,KY,KZ,RKX,RKY,RKZ,DP,S1,S2)

c$omp& reduction(+:TOTK,FX,FY,FZ)

C$omp do

DO I =1, NAT

FX = 0.D0

FY = 0.D0

FZ = 0.D0

RX = POS(1,I)

RY = POS(2,I)

RZ = POS(3,I)

ZI = Z(I)

TOTK = 0

DO 24 KX = -KMAX , KMAX

RKX = TWOPI*DFLOAT(KX)/L

DO 23 KY = -KMAX, KMAX

RKY = TWOPI*DFLOAT(KY)/L

DO 22 KZ = -KMAX,KMAX

RKZ = TWOPI*DFLOAT(KZ)/L

KSQ = KX*KX + KY*KY + KZ*KZ

IF((KSQ.GE.KSQMAX).OR.(KSQ.EQ.0))GOTO 22

TOTK = TOTK + 1

DP=RX*RKX+RY*RKY+RZ*RKZ

S1 = SUM1(TOTK)*DSIN(DP)

S2 = SUM2(TOTK)*DCOS(DP)

FX = FX + KVEC(TOTK)*RKX*(S1 - S2)

FY = FY + KVEC(TOTK)*RKY*(S1 - S2)

FZ = FZ + KVEC(TOTK)*RKZ*(S1 - S2)
22 CONTINUE

23CONTINUE

24CONTINUE

FCOMP(1,I) = FX*2.D0*INV_VOL*ZI

FCOMP(2,I) = FY*2.D0*INV_VOL*ZI

FCOMP(3,I) = FZ*2.D0*INV_VOL*ZI

END DO

C$omp end do

C$omp end parallel

RETURN

TimP · ‎04-25-2010

If you would simply use
C$omp parallel do
then you wouldn't have to worry about which clauses belong to "parallel" and which to "do." I would have thought compiler diagnostics might give a clue, if the example were compilable.
Assuming you are looking for efficiency, a little work to optimize your inner loop should make a large difference.

Martyn_C_Intel · ‎04-26-2010

At a quick glance, you need to make KSQ private also.
Because you are threading the outermost loop, I see no need to put TOTK,FX,FY,FZ in aREDUCTION clause,the PRIVATE clausewould be sufficient. REDUCTION is only needed if the reduction is in the outermost loop, such that different parts of the reduction are carried out by different threads, which would need to be synchronized.

A minor point: In some situations, it can be advantageous to initializate arraysin an OpenMP loop that is similar to the loop in which they will be used. But in your code as written, FCOMP does not need to be initialized at all.

Intel sells a tool, thread checker, that can help to find race conditions in complex codes, such as may arise if you forget to declare a variable as private. But that's not reallyneeded for a simple loop without function calls such as this one.

jimdempseyatthecove · ‎04-27-2010

How large is NAT?
How many cores are available?
How large is KMAX (MAXK is a parameter, KMAX is in COMMON)

If NAT is relatively small and KMAX relatively large, you mightconsider moving the paralleliztion tothe DO 24 loop

ffgarcia · ‎04-27-2010

NAT = 3000
Number of cores = 4
KMAX = 10
MAXK eventually becomes 9980

jimdempseyatthecove · ‎04-28-2010

Ask the compiler to generate a vectorization report.
Examine the report to if the expressions are suitably vectorized. Perhapse they are not.

Suggestion 1:

To improve (the probability of) vectorization try changing your ?X, ?Y, ?Z variables into arrays

DOUBLE PRECISION FXYZ(3), RXYZ(3)

And the appropriate changes to thesource. e.g.

FXYZ= FXYZ *(KVEC(TOTK) * (SUM1(TOTK)*DSIN(DP) - SUM2(TOTK)*DCOS(DP)))

Suggestion 2:

Calculate the product terms ofDP=RX*RKX+RY*RKY+RZ*RKZ in the loop where the term varies. Then do the summation in the inner most loop.

Suggestion 3:

Add constant TWOPIOL = TWOPI / L

When using SSE you will not carry temporary calculations to 80-bits. So precalculate,

Suggestion 4:

Replace the DFLOAT(KZ) in the inner loop with:

FKZ = DFLOAT(-KMAX-1)
DO 22 KZ = -KMAX,KMAX
FKZ = FKZ + 1.0D
RKZ = FKZ * TWOPIOL ! was TWOPI*DFLOAT(KZ)/L

Jim Dempsey

TimP · ‎04-28-2010

Vectorization may depend on changing the limits on KZ so as to eliminate the IF() conditional inside the loop.