I have the following nested loop
DO iz=1,num_cells(3)
  DO iy=1,num_cells(2)
    DO ix=1,num_cells(1)
      ... stuff ...
    END DO
  END DO
END DO
I have no a priori knowledge of the number of cells along each axis until run time, so I don't think an $OMP PARALLEL DO would work. Would it be better to collapse the nest into a single loop from 1 to the total number of cells and compute the cell position (ix,iy,iz) from the cell number?
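The collapsed-loop idea in the question amounts to recovering (ix,iy,iz) from a linear cell number. A minimal sketch of that index arithmetic (Python for illustration; the num_cells values are hypothetical, and ix is assumed to vary fastest, matching Fortran column-major order):

```python
def cell_position(n, num_cells):
    """Recover (ix, iy, iz) from a 0-based linear cell number n,
    assuming ix varies fastest (Fortran column-major order)."""
    nx, ny, nz = num_cells
    ix = n % nx
    iy = (n // nx) % ny
    iz = n // (nx * ny)
    return ix + 1, iy + 1, iz + 1  # back to 1-based Fortran indices

num_cells = (4, 3, 2)  # hypothetical counts
total = num_cells[0] * num_cells[1] * num_cells[2]
positions = [cell_position(n, num_cells) for n in range(total)]
```

A single $OMP PARALLEL DO over the collapsed range then sees total iterations regardless of how the counts are distributed across the three axes.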
The answer you seek will depend on your ... stuff ...
If your stuff has temporal issues, meaning it must be performed in a specified sequence, then you may or may not be able to express the nested loops as a single-level loop.
The first order of optimization is to look at your ... stuff ... to see if it is adaptable to vectorization (Single Instruction Multiple Data). Once that has been optimized, then look at how you can distribute the workload.
There are several factors to consider when distributing the workload. One is the overhead to start and stop threads. Another is data placement, such that cache access by one thread does not interfere (much) with cache access by other threads. Both are affected by the number of processors available.
If your ... stuff ... has non-temporal issues (can execute in any order), then a good starting point would be to select as the innermost loop the index (iz, iy, or ix) that benefits most from vectorization. The outermost and middle loop order could be swapped depending on the counts and the number of processors.
if (WhichOrder(num_cells)) then
  DO iz=1,num_cells(3)
    DO iy=1,num_cells(2)
      DO ix=1,num_cells(1)
        ... stuff ...
      END DO
    END DO
  END DO
else
  DO iy=1,num_cells(2)
    DO iz=1,num_cells(3)
      DO ix=1,num_cells(1)
        ... stuff ...
      END DO
    END DO
  END DO
endif
If you have large numbers of runs with varying num_cells, then you might have some success inserting heuristics into the function used to specify the loop order.
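One hypothetical shape such a heuristic could take (Python for illustration; the rule and the thread count are assumptions, not part of the post): keep the default z-outer order only when the z count alone gives every thread useful work.

```python
def which_order(num_cells, num_threads=4):
    """Hypothetical heuristic for Jim's WhichOrder: keep the default
    z-outer order only when the z count covers the thread count, or
    when z is at least as large as y (so swapping gains nothing)."""
    nz, ny = num_cells[2], num_cells[1]
    return nz >= num_threads or nz >= ny
```

For example, a 12x1234x3 grid on 4 cores would fail both conditions, selecting the swapped (y-outer) order.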
Jim Dempsey
So the $OMP PARALLEL DO is able to look at the relative sizes of the nested loops and partition the work accordingly--cool.
If you had 4 cores and 8x1x1 (X,Y,Z) cells it would (conceivably) spread the innermost loop over the 4 cores?
$omp parallel do if(num_cells(1)*num_cells(2)*num_cells(3).gt.1000)
>>So the $OMP PARALLEL DO is able to look at the relative sizes of the nested loops and partition the work accordingly--cool.<<
No, it does not. $OMP PARALLEL DO only looks at the iteration count of the loop to which it applies (the immediately following Fortran DO statement) and distributes iterations according to the SCHEDULE clause (default or explicitly stated).
If your ... stuff ... code is relatively small then you would not want to parallelize an 8x1x1 loop structure. However, if you have an 8x1x1 and if the ... stuff ... code is very long then consider using $OMP PARALLEL DO SCHEDULE(STATIC,1).
Then consider my prior post if you have 4 cores and something like 3x1234x12, where you might want to keep the 12 in the inner loop for vectorization, change the outer loop to process the 1234, and use the 3 in the middle loop. Also in this case (4 cores and 3x1234x12 processed as 1234x3x12), experiment with SCHEDULE using STATIC, GUIDED, and DYNAMIC to obtain satisfactory performance results.
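To see why SCHEDULE(STATIC,1) helps the 8-iteration case above, here is a small sketch of how a static schedule deals iterations to threads round-robin in chunks (Python for illustration; this models the OpenMP distribution rule, not an actual runtime):

```python
def static_schedule(n_iters, n_threads, chunk=1):
    """Deal 0-based iterations to threads round-robin in fixed-size
    chunks, the way OpenMP SCHEDULE(STATIC, chunk) distributes them."""
    assign = [[] for _ in range(n_threads)]
    for start in range(0, n_iters, chunk):
        tid = (start // chunk) % n_threads
        assign[tid].extend(range(start, min(start + chunk, n_iters)))
    return assign
```

With 8 iterations on 4 threads and chunk 1, every thread gets exactly two iterations, so an 8x1x1 grid still keeps all cores busy.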
Jim Dempsey
DO iz=1,num_cells(3)
  DO iy=1,num_cells(2)
    DO ix=1,num_cells(1)
      CALL gemm(A(:,:,region(ix,iy,iz)), Z(:,:,iz,iy,iz), G, &
                'N', 'N', 1.0_dp, 0.0_dp)
      ndx = compute_index(ix,iy,iz)
      values(ndx) = 1.0_dp
      values(ndx+1) = -G(1,2)
      values(ndx+2) = -G(1,1)
    END DO
  END DO
END DO
The Z submatrix can range in size from 16x16 to about 500x500. The "aspect ratio" of num_cells is arbitrary; e.g., an 8x1x1 might instead be a 400x50x50 (though the submatrix would then tend to be smaller, since you are trading spatial for angular resolution). For small matrix sizes MATMUL might be faster, but from what I understand gemm will be faster for larger matrices.
It does not appear to me that any ordering would make one preferable for SIMD vectorization. At face value, if I put an $OMP PARALLEL DO on the outermost loop I would be utilizing all the cores if num_cells(3) >= num of cores.
The basis for my question is that when I am running my code in "reduced geometry," i.e. 2D or 1D vice 3D, num_cells(3)=1 (2D and 1D) and num_cells(2)=1 (1D). Thus, putting the $OMP PARALLEL DO directive on the outermost loop would not, apparently, provide any benefit for reduced geometry problems.
It would appear that my three options are:
- Have three cases (1D, 2D, and 3D) with the relevant number of DO and $OMP PARALLEL DO on the outermost loop
- Put $OMP PARALLEL DO on the innermost loop (though that would be expensive in thread startup overhead)
- Collapse into one loop and compute the cell position (ix,iy,iz)
Option #1 would work well when the outermost loop is evenly divisible by the number of cores. Option #3 works well in all cases, but I pay a (small) price in code complexity by computing the cell position. Option #2 is not a good choice.
On a side note, how much of a performance hit are "A(:,:,region(ix,iy,iz))" and "Z(:,:,iz,iy,iz)"?
You have a 4th option (others can pipe in for 5th, 6th, ...)
I will give you the outline, you can do the exercise.
1) Create an integer flag array whose size is (at least) the product of the entries of num_cells
2) Initialize the integer flag array to 0's
3) Code with $OMP PARALLEL but not as DO (i.e. all threads execute all iterations of all three nested DO loops)
4) Immediately inside the innermost loop, issue a call to InterlockedCompareExchange to attempt to replace the 0 in the integer flag array with 1. If the exchange fails, issue a CYCLE
Coding in this manner has the following advantages:
a) Same code works for 1D, 2D and 3D
b) Works well if processing time for subroutine gemm varies from call to call
c) Automatically load balances if the system is performing other work
Disadvantages
a) Requires nulling out a flag array (should be minor overhead)
b) Requires redundant loop overhead (very minor code overhead)
c) Requires call to InterlockedCompareExchange (very minor overhead)
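The four steps above can be sketched as follows (Python for illustration; a Lock-guarded test-and-set stands in for InterlockedCompareExchange, and the cell counts, thread count, and `work` callback are hypothetical):

```python
import threading

def process_cells(num_cells, num_threads, work):
    """Flag-array scheme: every thread sweeps all cells, but only the
    thread that atomically claims a cell's flag performs the work."""
    nx, ny, nz = num_cells
    flags = [0] * (nx * ny * nz)       # step 1: flag array, step 2: zeroed
    lock = threading.Lock()

    def claim(idx):
        with lock:                     # atomic compare-exchange stand-in
            if flags[idx]:
                return False
            flags[idx] = 1
            return True

    def worker():                      # step 3: all threads run all loops
        for iz in range(nz):
            for iy in range(ny):
                for ix in range(nx):
                    idx = ix + nx * (iy + ny * iz)
                    if not claim(idx): # step 4: exchange failed ...
                        continue       # ... so CYCLE to the next cell
                    work(ix, iy, iz)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because claiming happens per cell rather than per loop level, the same code handles 1D, 2D, and 3D grids, and slow cells simply leave more of the sweep to the other threads.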
Jim Dempsey