- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Dear users,
I recently got introduced to openMP and I would like to parallelize a serial f77 subroutine using openMP. All the examples that I've seen online focus on a single DO loop and that did not help much. I have multiple DO loops and several temporary variables in my code and I was wondering if experienced users can help me figure out what is wrong with my openmp implementation and help me define the commands (private, shared, reduction etc.) correctly. Please note that the variables FK, POS and the variables in the common block are used by other subroutines.
SUBROUTINE FFULL(FCOMP, POS)
C ACCEPTS POS AS INPUT AND RETURNS FCOMP
IMPLICIT NONE
INTEGER MAXK, NMAX, NAT
INTEGER TOTK
INTEGER KMAX, KX, KY, KZ, I, KSQMAX, KSQ, J
PARAMETER ( MAXK = 50000, NMAX = 1000 )
DOUBLE PRECISION KVEC(MAXK), POS(3,NMAX),Z(NMAX)
DOUBLE PRECISION KAPPA, FCOMP(3, NMAX), INV_VOL
DOUBLE PRECISION PI, TWOPI, S1, S2, L
DOUBLE PRECISION RKX, RKY, RKZ, DP, ZI, ZJ
DOUBLE PRECISION FX, FY, FZ, RX, RY, RZ
DOUBLE PRECISION SUM1(MAXK), SUM2(MAXK)
CHARACTER*2 SYMB(NMAX)
COMMON/KPARAM/KAPPA,KMAX,KSQMAX
COMMON/BLOCK1/L, Z, NAT, SYMB
COMMON/BLOCK2/KVEC, SUM1, SUM2
C *******************************************************************
C DEFINE SOME VARIABLES THAT WOULD BE NEEDED LATER
PI=2.D0*DASIN(1.D0)
TWOPI = 2.D0*PI
INV_VOL = 1.D0/(L**3)
C INITIALIZE FCOMP TO ZERO
DO I = 1, NAT
DO J = 1, 3
FCOMP(J, I) = 0.D0
END DO
END DO
C BEGIN OPENMP HERE
c$omp parallel
c$omp& shared(FK,POS)
C$omp& private(I,RX,RY,RZ,ZI,KX,KY,KZ,RKX,RKY,RKZ,DP,S1,S2)
c$omp& reduction(+:TOTK,FX,FY,FZ)
C$omp do
DO I =1, NAT
FX = 0.D0
FY = 0.D0
FZ = 0.D0
RX = POS(1,I)
RY = POS(2,I)
RZ = POS(3,I)
ZI = Z(I)
TOTK = 0
DO 24 KX = -KMAX , KMAX
RKX = TWOPI*DFLOAT(KX)/L
DO 23 KY = -KMAX, KMAX
RKY = TWOPI*DFLOAT(KY)/L
DO 22 KZ = -KMAX,KMAX
RKZ = TWOPI*DFLOAT(KZ)/L
KSQ = KX*KX + KY*KY + KZ*KZ
IF((KSQ.GE.KSQMAX).OR.(KSQ.EQ.0))GOTO 22
TOTK = TOTK + 1
DP=RX*RKX+RY*RKY+RZ*RKZ
S1 = SUM1(TOTK)*DSIN(DP)
S2 = SUM2(TOTK)*DCOS(DP)
FX = FX + KVEC(TOTK)*RKX*(S1 - S2)
FY = FY + KVEC(TOTK)*RKY*(S1 - S2)
FZ = FZ + KVEC(TOTK)*RKZ*(S1 - S2)
22 CONTINUE
23CONTINUE
24CONTINUE
FCOMP(1,I) = FX*2.D0*INV_VOL*ZI
FCOMP(2,I) = FY*2.D0*INV_VOL*ZI
FCOMP(3,I) = FZ*2.D0*INV_VOL*ZI
END DO
C$omp end do
C$omp end parallel
RETURN
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
C$omp parallel do
then you wouldn't have to worry about which clauses belong to "parallel" and which to "do." I would have thought compiler diagnostics might give a clue, if the example were compilable.
Assuming you are looking for efficiency, a little work to optimize your inner loop should make a large difference.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Because you are threading the outermost loop, I see no need to put TOTK,FX,FY,FZ in aREDUCTION clause,the PRIVATE clausewould be sufficient. REDUCTION is only needed if the reduction is in the outermost loop, such that different parts of the reduction are carried out by different threads, which would need to be synchronized.
A minor point: In some situations, it can be advantageous to initializate arraysin an OpenMP loop that is similar to the loop in which they will be used. But in your code as written, FCOMP does not need to be initialized at all.
Intel sells a tool, thread checker, that can help to find race conditions in complex codes, such as may arise if you forget to declare a variable as private. But that's not reallyneeded for a simple loop without function calls such as this one.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
How many cores are available?
How large is KMAX (MAXK is a parameter, KMAX is in COMMON)
If NAT is relatively small and KMAX relatively large, you mightconsider moving the paralleliztion tothe DO 24 loop
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Number of cores = 4
KMAX = 10
MAXK eventually becomes 9980
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Examine the report to if the expressions are suitably vectorized. Perhapse they are not.
Suggestion 1:
To improve (the probability of) vectorization try changing your ?X, ?Y, ?Z variables into arrays
DOUBLE PRECISION FXYZ(3), RXYZ(3)
And the appropriate changes to thesource. e.g.
FXYZ= FXYZ *(KVEC(TOTK) * (SUM1(TOTK)*DSIN(DP) - SUM2(TOTK)*DCOS(DP)))
Suggestion 2:
Calculate the product terms ofDP=RX*RKX+RY*RKY+RZ*RKZ in the loop where the term varies. Then do the summation in the inner most loop.
Suggestion 3:
Add constant TWOPIOL = TWOPI / L
When using SSE you will not carry temporary calculations to 80-bits. So precalculate,
Suggestion 4:
Replace the DFLOAT(KZ) in the inner loop with:
FKZ = DFLOAT(-KMAX-1)
DO 22 KZ = -KMAX,KMAX
FKZ = FKZ + 1.0D
RKZ = FKZ * TWOPIOL ! was TWOPI*DFLOAT(KZ)/L
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page