parallel nesting in IVF 18

tiho · ‎10-10-2018

Hello

I am trying to dip my toes into nested parallelism. A sketch of my situation is as follows:

call omp_set_nested(.true.)

do c = 1, 2
  !$OMP PARALLEL num_threads(2)
  !$OMP DO
  do i = 1, 2
    call General(i)
  end do
  !$OMP END DO
  !$OMP END PARALLEL
end do

subroutine General(i)
  call NonParallelCode
  !$OMP PARALLEL num_threads(2)
  !$OMP DO
  do j = 1, 2
    call ParalellCode(j)
  end do
  !$OMP END DO
  !$OMP END PARALLEL
end subroutine General

The problem I run into is that, when I compile with my Release settings, the extra level of nesting raises the CPU time by 25%, and the program appears to load all four cores even when it is not in the nested parallel portion. In Debug, on the other hand, the extra level of nesting reduces the CPU time, and the Debug build ends up faster than the Release build.

Here are the Release compiler options:

/O2 /Qparallel /fpscomp:ioformat /real_size:64 /Qauto /module:"x64\Release/" /object:"x64\Release/" /Fd"x64\Release\vc120.pdb" /traceback /check:none /libs:static /threads /c /Qopenmp

Are any of these options incompatible with nested parallelism?

TimP · ‎10-11-2018

I haven't worked with it, but current OpenMP versions should let you specify the number of threads at each nesting level. I think it's easier to experiment with the environment variable, e.g. SET OMP_NUM_THREADS=2,2. You may need to experiment with OMP_PLACES as well.
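To see what a setting such as OMP_NUM_THREADS=2,2 actually produces, a small probe can print the team layout at each level. The following is only a sketch: the program name nested_report and the loop bounds are made up for illustration, and it makes no runtime calls to enable nesting, so the environment alone decides the result.

```fortran
program nested_report
  use omp_lib
  implicit none
  integer :: i, j

  ! Deliberately no omp_set_nested / omp_set_num_threads calls here,
  ! so that the environment variables alone decide the team sizes.
  !$omp parallel do private(j)
  do i = 1, 2
     !$omp parallel do
     do j = 1, 2
        ! Serialize the output so lines from different threads do not mix.
        !$omp critical (report)
        print '(4(a,i0))', 'i=', i, ' j=', j, &
              ' outer thread=', omp_get_ancestor_thread_num(1), &
              ' inner team size=', omp_get_num_threads()
        !$omp end critical (report)
     end do
     !$omp end parallel do
  end do
  !$omp end parallel do
end program nested_report
```

If every line reports an inner team size of 1, nesting is not active. On runtimes that predate OpenMP 5.0 this may mean that OMP_NESTED=TRUE has to be set in addition to the thread list.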

jimdempseyatthecove · ‎10-11-2018

The call to omp_set_nested must be made prior to your application's first parallel region. Was this the case?

The other thing to consider is the block time. This is how long a thread keeps spin-waiting after a parallel region ends, in anticipation of being reused shortly.

Environment variables:

KMP_BLOCKTIME=0

or

OMP_WAIT_POLICY=PASSIVE

Note, the above settings are for when you do not want a spin-wait after a parallel region. If you do want a spin-wait, give KMP_BLOCKTIME a time in ms, or set OMP_WAIT_POLICY=ACTIVE.
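One way to check whether spin-waiting is what loads the idle cores is to compare wall time and CPU time over a purely serial phase that follows a parallel region. The sketch below is hypothetical (the program name spin_probe and the loop sizes are arbitrary), and it assumes that cpu_time reports CPU time summed over all threads of the process, which is compiler dependent.

```fortran
program spin_probe
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer :: i, k
  real(dp) :: wall0, wall1, cpu0, cpu1, s

  s = 0.0_dp
  ! Start the worker threads once so they exist during the serial phase.
  !$omp parallel do reduction(+:s)
  do i = 1, 1000
     s = s + sqrt(real(i, dp))
  end do
  !$omp end parallel do

  wall0 = omp_get_wtime()
  call cpu_time(cpu0)
  ! Serial phase: only the initial thread has real work to do.
  do k = 1, 20000000
     s = s + sin(real(k, dp))
  end do
  call cpu_time(cpu1)
  wall1 = omp_get_wtime()

  print '(a,f8.3,a)', 'serial phase wall time: ', wall1 - wall0, ' s'
  print '(a,f8.3,a)', 'serial phase CPU time : ', cpu1 - cpu0, ' s'
  print '(a,es12.4)', 'checksum (keeps the loops alive): ', s
end program spin_probe
```

Run it once with the default settings and once with KMP_BLOCKTIME=0 or OMP_WAIT_POLICY=PASSIVE. If the CPU time is a multiple of the wall time only in the first run, the idle workers were spinning during the serial phase.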

Jim Dempsey

TimP · ‎10-11-2018

There's been no indication of a need to tinker with the BLOCKTIME. The purpose of the default setting is to shorten the startup of a subsequent parallel region when it is entered before the BLOCKTIME has expired. If you are concerned about it, and your serial code doesn't have to wait for all parallel computations to finish, you could put the serial code inside the parallel region, wrap it in an omp single construct, and put a nowait clause on the omp do. This allows the first available thread to work on the single region.
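The single/nowait arrangement described above could look roughly like this. It is a minimal sketch with placeholder work: the array squares and the variable serial_result are invented for the example, and it is only valid when the serial part does not read anything the loop produces.

```fortran
program single_nowait
  implicit none
  integer, parameter :: n = 8
  integer :: i, serial_result
  integer :: squares(n)

  serial_result = 0
  !$omp parallel
  !$omp do
  do i = 1, n
     squares(i) = i*i          ! stands in for the parallel work
  end do
  !$omp end do nowait          ! no barrier: finished threads move on
  !$omp single
  serial_result = 42           ! stands in for the independent serial code
  !$omp end single
  !$omp end parallel

  print '(a,i0)', 'sum of squares = ', sum(squares)
  print '(a,i0)', 'serial result  = ', serial_result
end program single_nowait
```

The implicit barrier at the end of the single construct, and again at the end of the parallel region, guarantees that both the loop and the serial part are complete before the results are printed.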

jimdempseyatthecove · ‎10-12-2018

TimP,

I was addressing: "and it appears to load all four cores even when the program is not into the nested parallel portion"

As for the OpenMP Debug build running faster than the OpenMP Release build: this can happen if the code in the parallel region contains a convergence routine whose iteration count varies depending on which floating-point optimizations are applied.

Try building the Release version with -O0 (/Od on Windows), then -O1, then -O2, and so on.

Note, the issue may involve only one of your source files. In the Visual Studio Solution Explorer pane, you can right-click the problematic source file, choose Properties, and set the optimization level for that file alone (do this while the Release configuration is selected).

Jim Dempsey

tiho · ‎10-12-2018

Much appreciated. These settings worked for me:

call omp_set_nested(.true.)
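CALL OMP_SET_NUM_THREADS(2)   ! takes a single integer; a per-level list such as 2,2 is only accepted by the OMP_NUM_THREADS environment variable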
CALL kmp_set_blocktime(0)