Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

parallel nesting in IVF 18

tiho
Beginner
782 Views

Hello 

I am trying to dip my toes into nested parallelism. A sketch of my situation is as follows:

call omp_set_nested(.true.)

do c = 1, 2

    !$OMP PARALLEL num_threads(2)
    !$OMP DO
    do i = 1, 2
        call General(i)
    end do
    !$OMP END DO
    !$OMP END PARALLEL

end do

subroutine General(i)

    call NonParallelCode

    !$OMP PARALLEL num_threads(2)
    !$OMP DO
    do j = 1, 2
        call ParalellCode(j)
    end do
    !$OMP END DO
    !$OMP END PARALLEL

end subroutine General
 
The problem I run into is that, when I compile with my Release settings, the extra level of nesting raises the CPU time by 25%, and it appears to load all four cores even when the program is not in the nested parallel portion. In Debug, on the other hand, the extra level of nesting reduces the CPU time, and the Debug build ends up faster than the Release build.
 
Here are the Release compiler options:
 
/O2 /Qparallel /fpscomp:ioformat /real_size:64 /Qauto /module:"x64\Release/" /object:"x64\Release/" /Fd"x64\Release\vc120.pdb" /traceback /check:none /libs:static /threads /c /Qopenmp
 
Are any of these options incompatible with nested parallelism?
 
 
 
5 Replies
TimP
Honored Contributor III

I haven't worked with it, but current OpenMP should permit specifying the number of threads at each nesting level. I think it's easier to experiment using the environment variable, e.g. SET OMP_NUM_THREADS=2,2. You may need to experiment with OMP_PLACES as well.
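
One way to see what a setting such as OMP_NUM_THREADS=2,2 actually produces is to print the team size at each nesting level. The following is only a minimal sketch, not code from the original program: the program name is made up, and it assumes an OpenMP 3.0 or later runtime with the omp_lib module available.

```fortran
program show_nested_teams
    use omp_lib
    implicit none

    ! Enable nesting as in the original post. Newer OpenMP versions
    ! deprecate omp_set_nested in favor of omp_set_max_active_levels.
    call omp_set_nested(.true.)
    call omp_set_max_active_levels(2)

    ! No num_threads clauses here, so the team sizes come from
    ! OMP_NUM_THREADS (for example 2,2) or from the runtime defaults.
    !$omp parallel
    !$omp parallel
    !$omp critical
    print '(a,i0,a,i0,a,i0,a,i0,a,i0)', &
        'level ', omp_get_level(), &
        ': outer thread ', omp_get_ancestor_thread_num(1), &
        ', inner thread ', omp_get_thread_num(), &
        ', outer team size ', omp_get_team_size(1), &
        ', inner team size ', omp_get_team_size(2)
    !$omp end critical
    !$omp end parallel
    !$omp end parallel
end program show_nested_teams
```

Running it once with the variable unset and once with SET OMP_NUM_THREADS=2,2 should show whether the per-level setting is being honored.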

jimdempseyatthecove
Honored Contributor III

The call to omp_set_nested must be made prior to your application's first parallel region. Was this the case?

The other thing to consider is the block time. This is the time (or lack thereof) a thread remains in spinwait after a parallel region (in anticipation it will be re-used shortly later).

Environment variables:

KMP_BLOCKTIME=0

or

OMP_WAIT_POLICY=PASSIVE

Note: the above settings are for when you do not want a spin-wait after a parallel region. If you do want a spin-wait, use a time in ms for KMP_BLOCKTIME, or ACTIVE for OMP_WAIT_POLICY.

Jim Dempsey

TimP
Honored Contributor III

There's been no indication of a need to tinker with the BLOCKTIME. The purpose of the default setting is to shorten the startup time when the next parallel region is entered within the BLOCKTIME. If you are concerned about it, and your serial code doesn't have to wait for all parallel computations to finish, you could put your serial code inside the parallel region with an omp single construct around it, and put a nowait clause on the omp do. This allows the first available thread to work on the single region.
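
The single/nowait arrangement described above might look like the following sketch. The routines parallel_work and serial_work are placeholders rather than code from this thread, and the pattern is only valid when the serial part does not depend on the loop results.

```fortran
program single_nowait_sketch
    use iso_fortran_env, only: real64
    implicit none
    integer, parameter :: n = 8
    integer :: i
    real(real64) :: results(n), serial_result

    results = 0.0_real64
    serial_result = 0.0_real64

    !$omp parallel num_threads(2)

    !$omp do
    do i = 1, n
        results(i) = parallel_work(i)
    end do
    !$omp end do nowait      ! no barrier: threads leave the loop as they finish

    ! The first thread to arrive executes this block; the others skip it
    ! and wait at the implicit barrier that ends the single construct.
    !$omp single
    serial_result = serial_work()
    !$omp end single

    !$omp end parallel

    print *, 'sum of loop results:', sum(results)
    print *, 'serial result:      ', serial_result

contains

    ! Placeholder for the per-iteration parallel work.
    function parallel_work(k) result(val)
        integer, intent(in) :: k
        real(real64) :: val
        val = real(k, real64)**2
    end function parallel_work

    ! Placeholder for work that must be done by exactly one thread.
    function serial_work() result(val)
        real(real64) :: val
        val = 42.0_real64
    end function serial_work

end program single_nowait_sketch
```

In Fortran the nowait clause goes on the END DO directive, which is why it appears after the loop rather than on the opening directive.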

jimdempseyatthecove
Honored Contributor III

TimP,

I was addressing: "and it appears to load all four cores even when the program is not into the nested parallel portion"

When the OpenMP Debug build runs faster than the OpenMP Release build, one possible cause is that the code in the parallel region contains a convergence routine whose iteration count varies depending on which floating-point optimizations are used.

Try building the Release version with -O0, then -O1, then -O2, ...

Note, the issue may involve only one of your source files. In the Visual Studio Solution Explorer pane, you can right-click on the problematic source file, pick Properties, and then specify the optimization level that works best for that file (do this while the Release build is selected).

Jim Dempsey

tiho
Beginner

Much appreciated. These settings worked for me:

call omp_set_nested(.true.)
CALL OMP_SET_NUM_THREADS(2,2)
CALL kmp_set_blocktime(0)
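
As far as the OpenMP specification goes, omp_set_num_threads takes a single integer argument, so a per-level setting is usually expressed with OMP_NUM_THREADS=2,2 or with num_threads clauses. Below is a sketch of the same settings written that way. It assumes the Intel compiler's omp_lib module, since kmp_set_blocktime is an Intel runtime extension, and the program and variable names are illustrative only.

```fortran
program nested_settings_sketch
    use omp_lib
    implicit none
    integer :: i, j
    integer :: levels(2, 2)

    ! Set these before the first parallel region, as advised earlier
    ! in the thread.
    call omp_set_nested(.true.)      ! enable nested parallel regions
    call kmp_set_blocktime(0)        ! Intel extension: no spin-wait after a region

    levels = 0

    ! Two threads at each level, given by num_threads clauses instead of
    ! a two-argument call to omp_set_num_threads.
    !$omp parallel do num_threads(2) private(j)
    do i = 1, 2
        !$omp parallel do num_threads(2)
        do j = 1, 2
            ! Counts only regions that really run with more than one thread.
            levels(i, j) = omp_get_active_level()
        end do
        !$omp end parallel do
    end do
    !$omp end parallel do

    print *, 'active nesting level seen for each (i, j):', levels
end program nested_settings_sketch
```

If the printed levels are lower than expected, the inner region is probably being serialized, which would point at the nesting settings rather than at the block time.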

 
