I have the following nested loop
DO iz=1,num_cells(3)
  DO iy=1,num_cells(2)
    DO ix=1,num_cells(1)
      ... stuff ...
    END DO
  END DO
END DO
I have no a priori knowledge of the number of cells along each axis until run time, so I don't think an $OMP PARALLEL DO would work. Would it be better to collapse the nest into a single loop from 1 to the total number of cells and compute the cell position (ix,iy,iz) from the cell number?
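The collapsed-loop idea in the question amounts to recovering (ix,iy,iz) from a linear cell number. A minimal sketch of that index arithmetic (Python for illustration; the num_cells values are hypothetical, and ix is assumed to vary fastest, matching Fortran column-major order):

```python
def cell_position(n, num_cells):
    """Recover (ix, iy, iz) from a 0-based linear cell number n,
    assuming ix varies fastest (Fortran column-major order)."""
    nx, ny, nz = num_cells
    ix = n % nx
    iy = (n // nx) % ny
    iz = n // (nx * ny)
    return ix + 1, iy + 1, iz + 1  # back to 1-based Fortran indices

num_cells = (4, 3, 2)  # hypothetical counts
total = num_cells[0] * num_cells[1] * num_cells[2]
positions = [cell_position(n, num_cells) for n in range(total)]
```

A single $OMP PARALLEL DO over the collapsed range then sees total iterations regardless of how the counts are distributed across the three axes.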
The answer you seek will depend on your ... stuff ...
If your stuff has temporal issues, meaning it must be performed in a specified sequence, then you may or may not be able to express the nested loops as a single-level loop.
The first order of optimization is to look at your ... stuff ... to see if it is adaptable to vectorization (Single Instruction Multiple Data). Once that has been optimized, then look at how you can distribute the workload.
There are several factors to consider when distributing the workload. One is the overhead to start and stop threads. Another is data placement, such that cache access by one thread does not interfere (much) with cache access by other threads. Both are affected by the number of processors available.
If your ... stuff ... has non-temporal issues (can execute in any order), then a good starting point would be to select as the innermost loop the index (iz, iy, or ix) that benefits most from vectorization. The outermost and middle loop order could be swapped depending on the counts and the number of processors.
if (WhichOrder(num_cells)) then
  DO iz=1,num_cells(3)
    DO iy=1,num_cells(2)
      DO ix=1,num_cells(1)
        ... stuff ...
      END DO
    END DO
  END DO
else
  DO iy=1,num_cells(2)
    DO iz=1,num_cells(3)
      DO ix=1,num_cells(1)
        ... stuff ...
      END DO
    END DO
  END DO
endif
If you have large numbers of runs with varying num_cells, then you might have some success inserting heuristics into the function used to specify the loop order.
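One hypothetical shape such a heuristic could take (Python for illustration; the rule and the thread count are assumptions, not part of the post): keep the default z-outer order only when the z count alone gives every thread useful work.

```python
def which_order(num_cells, num_threads=4):
    """Hypothetical heuristic for Jim's WhichOrder: keep the default
    z-outer order only when the z count covers the thread count, or
    when z is at least as large as y (so swapping gains nothing)."""
    nz, ny = num_cells[2], num_cells[1]
    return nz >= num_threads or nz >= ny
```

For example, a 12x1234x3 grid on 4 cores would fail both conditions, selecting the swapped (y-outer) order.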
Jim Dempsey
So the $OMP PARALLEL DO is able to look at the relative sizes of the nested loops and partition the work accordingly--cool.
If you had 4 cores and 8x1x1 (X,Y,Z) cells it would (conceivably) spread the innermost loop over the 4 cores?
$omp parallel do if(num_cells(1)*num_cells(2)*num_cells(3).gt.1000)
>>So the $OMP PARALLEL DO is able to look at the relative sizes of the nested loops and partition the work accordingly--cool.<<
No, it does not. $OMP PARALLEL DO only looks at the iteration count of the loop to which it applies (the immediately following Fortran DO statement) and distributes iterations according to the SCHEDULE clause (default or explicitly stated).
If your ... stuff ... code is relatively small then you would not want to parallelize an 8x1x1 loop structure. However, if you have an 8x1x1 and if the ... stuff ... code is very long then consider using $OMP PARALLEL DO SCHEDULE(STATIC,1).
Then consider my prior post if you have 4 cores and something like 3x1234x12, where you might want to keep the 12 in the inner loop for vectorization, change the outer loop to process the 1234, and use the 3 in the middle loop. Also in this case (4 cores and 3x1234x12 processed as 1234x3x12), experiment with SCHEDULE using STATIC, GUIDED, and DYNAMIC to obtain satisfactory performance results.
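To see why SCHEDULE(STATIC,1) helps the 8-iteration case above, here is a small sketch of how a static schedule deals iterations to threads round-robin in chunks (Python for illustration; this models the OpenMP distribution rule, not an actual runtime):

```python
def static_schedule(n_iters, n_threads, chunk=1):
    """Deal 0-based iterations to threads round-robin in fixed-size
    chunks, the way OpenMP SCHEDULE(STATIC, chunk) distributes them."""
    assign = [[] for _ in range(n_threads)]
    for start in range(0, n_iters, chunk):
        tid = (start // chunk) % n_threads
        assign[tid].extend(range(start, min(start + chunk, n_iters)))
    return assign
```

With 8 iterations on 4 threads and chunk 1, every thread gets exactly two iterations, so an 8x1x1 grid still keeps all cores busy.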
Jim Dempsey
DO iz=1,num_cells(3)
  DO iy=1,num_cells(2)
    DO ix=1,num_cells(1)
      CALL gemm(A(:,:,region(ix,iy,iz)), Z(:,:,iz,iy,iz), G, &
                'N', 'N', 1.0_dp, 0.0_dp)
      ndx = compute_index(ix,iy,iz)
      values(ndx) = 1.0_dp
      values(ndx+1) = -G(1,2)
      values(ndx+2) = -G(1,1)
    END DO
  END DO
END DO
The Z submatrix can range in size from 16x16 to about 500x500. The "aspect ratio" of num_cells is arbitrary; e.g., an 8x1x1 might instead be a 400x50x50 (though the submatrix would then tend to be smaller, since you are trading spatial for angular resolution). For small matrix sizes MATMUL might be faster, but from what I understand gemm will be faster for larger matrices.
It does not appear to me that any ordering would make one preferable for SIMD vectorization. At face value, if I put an $OMP PARALLEL DO on the outermost loop I would be utilizing all the cores if num_cells(3) >= num of cores.
The basis for my question is that when I am running my code in "reduced geometry," i.e. 2D or 1D vice 3D, num_cells(3)=1 (2D and 1D) and num_cells(2)=1 (1D). Thus, putting the $OMP PARALLEL DO directive on the outermost loop would not, apparently, provide any benefit for reduced geometry problems.
It would appear that my three options are:
- Have three cases (1D, 2D, and 3D) with the relevant number of DO and $OMP PARALLEL DO on the outermost loop
- Put $OMP PARALLEL DO on the innermost loop (though that would be expensive in thread startup overhead)
- Collapse into one loop and compute the cell position (ix,iy,iz)
Option #1 would work well when the outermost loop is evenly divisible by the number of cores. Option #3 works well in all cases, but I pay a (small) price in code complexity by computing the cell position. Option #2 is not a good choice.
On a side note, how much of a performance hit are "A(:,:,region(ix,iy,iz))" and "Z(:,:,iz,iy,iz)"?
You have a 4th option (others can pipe in for 5th, 6th, ...)
I will give you the outline, you can do the exercise.
1) Create an integer flag array whose size is (at least) the product of the entries of num_cells
2) Initialize the integer flag array to 0's
3) Code with $OMP PARALLEL but not as DO (i.e. all threads execute all iterations of all three nested DO loops)
4) Immediately inside the innermost loop, issue a call to InterlockedCompareExchange to attempt to replace the 0 in the integer flag array with 1. If the exchange fails, issue a CYCLE
Coding in this manner has the following advantages:
a) Same code works for 1D, 2D and 3D
b) Works well if processing time for subroutine gemm varies from call to call
c) Automatically load balances if the system is performing other work
Disadvantages
a) Requires nulling out a flag array (should be minor overhead)
b) Requires redundant loop overhead (very minor code overhead)
c) Requires call to InterlockedCompareExchange (very minor overhead)
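The four steps above can be sketched as follows (Python for illustration; a Lock-guarded test-and-set stands in for InterlockedCompareExchange, and the cell counts, thread count, and `work` callback are hypothetical):

```python
import threading

def process_cells(num_cells, num_threads, work):
    """Flag-array scheme: every thread sweeps all cells, but only the
    thread that atomically claims a cell's flag performs the work."""
    nx, ny, nz = num_cells
    flags = [0] * (nx * ny * nz)       # step 1: flag array, step 2: zeroed
    lock = threading.Lock()

    def claim(idx):
        with lock:                     # atomic compare-exchange stand-in
            if flags[idx]:
                return False
            flags[idx] = 1
            return True

    def worker():                      # step 3: all threads run all loops
        for iz in range(nz):
            for iy in range(ny):
                for ix in range(nx):
                    idx = ix + nx * (iy + ny * iz)
                    if not claim(idx): # step 4: exchange failed ...
                        continue       # ... so CYCLE to the next cell
                    work(ix, iy, iz)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because claiming happens per cell rather than per loop level, the same code handles 1D, 2D, and 3D grids, and slow cells simply leave more of the sweep to the other threads.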
Jim Dempsey