I have some number-crunching code which contains the following type of nested pair of do-loops:
DO I=1,NPHASE
   CP=CPHASE(I)
   SP=SPHASE(I)
   SUMR=0.0D+00
   SUMI=0.0D+00
   DO J=I,NMAX,NPHASE
      SUMR=SUMR+QUANTITY(J)*CP
      SUMI=SUMI+QUANTITY(J)*SP
   ENDDO
   TOTAL(I,1)=SUMR
   TOTAL(I,2)=SUMI
ENDDO
As you can see, the inner loop steps through the array QUANTITY with stride NPHASE,
starting each time at the array location set by the outer loop index, so the inner loop accesses a completely
different set of array elements each time the outer loop index changes. NPHASE may be 24 or 48, for example, and
NMAX will be several tens of thousands and not exactly divisible by NPHASE.
Will this code be parallelised if the appropriate additional parallel directives are added?
If so, please can you suggest what directives are required, as I have never written parallel code before.
The code fragment has some inefficiencies that the optimizer may or may not take care of. I suggest that you try
DO I=1,NPHASE
   SUMX=0.0D+00
   DO J=I,NMAX,NPHASE
      SUMX=SUMX+QUANTITY(J)
   ENDDO
   TOTAL(I,1)=SUMX*CPHASE(I)
   TOTAL(I,2)=SUMX*SPHASE(I)
ENDDO
before attempting to parallelize the code.
!$omp parallel do private(cp,sp,sumr,sumi) schedule(auto)
DO I=1,NPHASE
   CP=CPHASE(I)
   SP=SPHASE(I)
   SUMR=0.0D+00
   SUMI=0.0D+00
   DO J=I,NMAX,NPHASE
      SUMR=SUMR+QUANTITY(J)*CP
      SUMI=SUMI+QUANTITY(J)*SP
   ENDDO
   TOTAL(I,1)=SUMR
   TOTAL(I,2)=SUMI
ENDDO
The scheduling complication comes from the varying length of the inner loop. Omitting the schedule clause should cost no more than a factor of two in run time, and it might even run faster if it avoids NUMA non-locality problems. There are more complicated ways to deal with the issue on a multi-CPU platform.
As mecej4 pointed out, writing it as a single sum reduction may well improve performance. The Fortran 90 SUM intrinsic would look simpler and give the same performance.
ifort (with /QxHost and the default option /fp:fast) probably performs a SIMD reduction for the inner loop already, so there is no point in tinkering with it by adding omp simd reduction directives.
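Purely as an untested sketch, the two suggestions could be combined along these lines (this assumes SUMX is a double-precision scalar and that QUANTITY, CPHASE, SPHASE and TOTAL are declared as in the original post):
!$omp parallel do private(sumx)
DO I=1,NPHASE
   ! the strided section I:NMAX:NPHASE covers the same elements as the original inner loop
   SUMX=SUM(QUANTITY(I:NMAX:NPHASE))
   TOTAL(I,1)=SUMX*CPHASE(I)
   TOTAL(I,2)=SUMX*SPHASE(I)
ENDDO
!$omp end parallel do
Each thread handles whole values of I here, so no two threads ever write to the same element of TOTAL.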
The calculation can be reorganized as follows (please test the code before using it!):
total = 0.0d+00                     ! the sums accumulate in total, so it must start at zero
DO J=1,NMAX
   i = mod(j-1,nphase)+1
   total(i,1) = total(i,1) + quantity(j)*cphase(i)
   total(i,2) = total(i,2) + quantity(j)*sphase(i)
END DO
If you parallelize this version, note that the small array "total" may fit entirely in cache, but it is updated by all of the threads, whereas each thread works on its own section of the large array "quantity".
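One untested way to deal with that shared update is an OpenMP array reduction on "total" (this sketch assumes total is an ordinary non-pointer array dimensioned (NPHASE,2), which Fortran allows as a reduction variable):
total = 0.0d+00
!$omp parallel do private(i) reduction(+:total)
do j=1,nmax
   i = mod(j-1,nphase)+1            ! map each element of quantity back to its phase
   total(i,1) = total(i,1) + quantity(j)*cphase(i)
   total(i,2) = total(i,2) + quantity(j)*sphase(i)
end do
!$omp end parallel do
With the reduction clause each thread accumulates into its own private copy of total, and the copies are added together once at the end, so the threads never race on the shared array.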
Many thanks for all your suggestions, which I will try out on Monday when I return to work.
I am somewhat mortified not to have spotted the simple re-arrangement suggested by Mecej4 in his first post,
which removes thousands of multiplications!
