Need advice re: parallelisable code

Anthony_Richards · ‎06-20-2015

I have some number-crunching code which contains the following type of nested pair of do-loops:

DO I=1,NPHASE
   CP=CPHASE(I)
   SP=SPHASE(I)
   SUMR=0.0D+00
   SUMI=0.0D+00
   DO J= I,NMAX, NPHASE
      SUMR=SUMR+QUANTITY(J)*CP
      SUMI=SUMI+QUANTITY(J)*SP
   ENDDO
   TOTAL(I,1)=SUMR
   TOTAL(I,2)=SUMI
ENDDO

As you can see, the inner loop steps through the array QUANTITY with stride NPHASE,
each time starting at the array location given by the index of the outer loop, so the inner loop accesses a completely
different set of array elements for each value of the outer loop index. NPHASE may be 24 or 48, for example, and
NMAX will be several tens of thousands and not exactly divisible by NPHASE.

Will this code be parallelised if the appropriate additional parallel directives are added?
If so, please can you suggest which directives are required, as I have never written parallel code before.

mecej4 · ‎06-20-2015

The code fragment has some inefficiencies that may or may not be overlooked by the optimizer. I suggest that you try

DO I=1,NPHASE
   SUMX=0.0D+00
   DO J= I,NMAX, NPHASE
      SUMX=SUMX+QUANTITY(J)
   ENDDO
   TOTAL(I,1)=SUMX*CPHASE(I)
   TOTAL(I,2)=SUMX*SPHASE(I)
ENDDO

before attempting to parallelize the code.
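To see that the two forms agree, here is a minimal sketch that computes both in one pass and prints the largest relative difference. The program name, the array sizes and the random test data are assumptions made only for illustration. The results should match to within rounding, because the factored form multiplies once per outer iteration instead of once per element.

```fortran
program hoist_check
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Illustrative sizes only: nmax is deliberately not a multiple of nphase
   integer, parameter :: nphase = 24, nmax = 50003
   real(dp) :: quantity(nmax), cphase(nphase), sphase(nphase)
   real(dp) :: a(nphase,2), b(nphase,2)
   real(dp) :: sumr, sumi, sumx
   integer :: i, j

   ! Random test data standing in for the real arrays
   call random_number(quantity)
   call random_number(cphase)
   call random_number(sphase)

   do i = 1, nphase
      sumr = 0.0_dp
      sumi = 0.0_dp
      sumx = 0.0_dp
      do j = i, nmax, nphase
         sumr = sumr + quantity(j)*cphase(i)   ! original: multiply inside the loop
         sumi = sumi + quantity(j)*sphase(i)
         sumx = sumx + quantity(j)             ! factored: plain sum only
      end do
      a(i,1) = sumr
      a(i,2) = sumi
      b(i,1) = sumx*cphase(i)                  ! factored: multiply once afterwards
      b(i,2) = sumx*sphase(i)
   end do

   print *, 'largest relative difference:', maxval(abs(a - b))/maxval(abs(a))
end program hoist_check
```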

TimP · ‎06-20-2015

!$omp parallel do private(cp,sp,sumr,sumi) schedule(auto)
DO I=1,NPHASE
   CP=CPHASE(I)
   SP=SPHASE(I)
   SUMR=0.0D+00
   SUMI=0.0D+00
   DO J= I,NMAX, NPHASE
      SUMR=SUMR+QUANTITY(J)*CP
      SUMI=SUMI+QUANTITY(J)*SP
   ENDDO
   TOTAL(I,1)=SUMR
   TOTAL(I,2)=SUMI
ENDDO

The scheduling complication is associated with the varying length of the inner loop. Omitting the schedule clause should no more than double the time taken, and the code might even run faster if that avoids NUMA non-locality problems. There are more complicated ways to deal with the issue on a multiple-CPU platform.

As mecej4 pointed out, writing it as a single sum reduction may well improve performance. The f90 sum intrinsic would be simpler looking with the same performance.
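As a rough sketch of what the SUM version could look like with the directive applied, something along these lines should work. The program name, the sizes and the random test data are assumptions for illustration, and I have left out the schedule clause for simplicity.

```fortran
program strided_sum_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Illustrative sizes only: nmax is deliberately not a multiple of nphase
   integer, parameter :: nphase = 24, nmax = 50003
   real(dp) :: quantity(nmax), cphase(nphase), sphase(nphase)
   real(dp) :: total(nphase,2), sumx
   integer :: i

   ! Random test data standing in for the real arrays
   call random_number(quantity)
   call random_number(cphase)
   call random_number(sphase)

   ! Each outer iteration writes only its own row of total, so the
   ! iterations are independent; sumx must be private to each thread.
   !$omp parallel do private(sumx)
   do i = 1, nphase
      sumx = sum(quantity(i:nmax:nphase))   ! strided array section replaces the inner loop
      total(i,1) = sumx*cphase(i)
      total(i,2) = sumx*sphase(i)
   end do
   !$omp end parallel do

   print *, total(1,1), total(1,2)
end program strided_sum_demo
```

The directives only take effect when OpenMP is enabled at compile time (for example /Qopenmp with ifort on Windows, or -fopenmp with gfortran). Otherwise they are treated as comments and the loop runs serially.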

ifort (with /QxHost and the default option /fp:fast) probably performs a SIMD reduction already for the inner loop, and there's no point in tinkering with it by adding omp simd reduction directives.

mecej4 · ‎06-20-2015

The calculation can be reorganized as follows (please test code before using!) :

total(1:nphase,1:2) = 0.0D+00   ! the accumulators must start at zero
DO J=1,NMAX
   i=mod(j-1,nphase)+1
   total(i,1) = total(i,1) + quantity(j)*cphase(i)
   total(i,2) = total(i,2) + quantity(j)*sphase(i)
END DO

If you parallelize this version, note that the small array "total" may fit entirely inside the cache, but is subject to update by all the threads, where each thread works on a section of the large array "quantity".
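One way to handle the shared updates, offered as a sketch rather than a tested recommendation, is an OpenMP array reduction. Each thread then accumulates into its own private copy of "total", and the copies are added together at the end of the loop. The program name, the sizes, the random test data and the reference array "ref" are assumptions for illustration.

```fortran
program modulo_reduction_demo
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   ! Illustrative sizes only: nmax is deliberately not a multiple of nphase
   integer, parameter :: nphase = 24, nmax = 50003
   real(dp) :: quantity(nmax), cphase(nphase), sphase(nphase)
   real(dp) :: total(nphase,2), ref(nphase,2)
   integer :: i, j

   ! Random test data standing in for the real arrays
   call random_number(quantity)
   call random_number(cphase)
   call random_number(sphase)

   ! Serial reference result using the strided formulation
   do i = 1, nphase
      ref(i,1) = sum(quantity(i:nmax:nphase))*cphase(i)
      ref(i,2) = sum(quantity(i:nmax:nphase))*sphase(i)
   end do

   total = 0.0_dp   ! the accumulators must start at zero

   ! reduction(+:total) gives every thread a zero-initialised private copy
   ! of total and sums the copies into the shared array after the loop.
   !$omp parallel do private(i) reduction(+:total)
   do j = 1, nmax
      i = mod(j-1, nphase) + 1
      total(i,1) = total(i,1) + quantity(j)*cphase(i)
      total(i,2) = total(i,2) + quantity(j)*sphase(i)
   end do
   !$omp end parallel do

   ! The summation order differs between the two versions, so compare
   ! with a tolerance rather than expecting bit-identical results.
   if (maxval(abs(total - ref)) > 1.0e-9_dp*maxval(abs(ref))) error stop 'mismatch'
   print *, 'max abs difference:', maxval(abs(total - ref))
end program modulo_reduction_demo
```

Whether this beats the outer-loop version would need to be measured. The reduction costs one extra copy of "total" per thread, which is small here, and in exchange each thread reads a contiguous section of "quantity".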

Anthony_Richards · ‎06-20-2015

Many thanks for all your suggestions, which I will try out on Monday when I return to work.

I am somewhat mortified not to have spotted the simple rearrangement, suggested by mecej4 in his first post,
which removes thousands of multiplications!