How to minimize thread creation overhead in Intel Fortran/OpenMP?

virtualmemory · ‎06-07-2012

I have a compute-bound Fortran program which I am attempting to parallelize using OpenMP. The outline of this program is below. The parallel form below takes almost 3 times longer than the serial form, and according to VTune, it is due to thread creation overhead.

I am wondering how I can bring the thread creation outside all the DO-loops, yet still execute all but the iSP loop (the parallel region in the code below) serially.

The outer loops cannot be parallelized because the values of array a at each time depend on values at the preceding time (this is a PDE with time and position as independent variables). Also, the 'iTry' loop has a conditional EXIT, which is usually taken.

DO iTime=1,nTimes
  ...
  DO iTry=1,nTries
    ...
    !$OMP PARALLEL
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO
    !$OMP END PARALLEL
    ...
  END DO ! iTry
  ...
END DO ! iTime

jimdempseyatthecove · ‎06-07-2012

If nTries is very large, give the following a try:

!$OMP PARALLEL PRIVATE(iTime, iTry, j)
DO iTime=1,nTimes
  !$OMP MASTER
  ...
  !$OMP END MASTER
  DO iTry=1,nTries
    !$OMP MASTER
    ...
    !$OMP END MASTER
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO
    !$OMP MASTER
    ...
    !$OMP END MASTER
  END DO ! iTry
  !$OMP MASTER
  ...
  !$OMP END MASTER
END DO ! iTime
!$OMP END PARALLEL

The above will move the parallel region to outside your nTimes and nTries loops.
*** each thread executes full range of nTimes and nTries...
*** however, only master thread executes the ...
*** and there is an implied barrier at !$OMP END DO

Jim Dempsey
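
One point worth checking when trying the outline above: MASTER has no implied barrier, so the other threads can reach the !$OMP DO before the master thread has finished its serial "..." code, and every thread has to take the conditional EXIT on the same iteration. A minimal sketch of one way to handle both, assuming SINGLE (which does have an implied barrier at its end) is acceptable in place of MASTER and using COPYPRIVATE to broadcast the exit decision. The names run_sweeps, nSweeps and done, the array update, and the always-true convergence test are hypothetical placeholders, not the original program's code:

```fortran
module sweep_mod
  implicit none
contains
  ! One parallel region around the whole time loop; serial parts run in SINGLE.
  subroutine run_sweeps(nTimes, nTries, a, nSweeps)
    integer, intent(in)    :: nTimes, nTries
    real,    intent(inout) :: a(:, :)
    integer, intent(out)   :: nSweeps      ! shared; only written inside SINGLE
    integer :: iTime, iTry, iSp, j
    logical :: done                        ! private; broadcast via COPYPRIVATE

    nSweeps = 0
    !$omp parallel default(shared) private(iTime, iTry, iSp, j, done)
    do iTime = 1, nTimes
      do iTry = 1, nTries
        !$omp single
        nSweeps = nSweeps + 1              ! stand-in for the serial setup ("...")
        !$omp end single
        ! implied barrier above: setup is complete before any thread starts the loop

        !$omp do
        do iSp = 1, size(a, 1)
          do j = 1, size(a, 2)
            a(iSp, j) = a(iSp, j) + 1.0    ! placeholder for the real update
          end do
        end do
        !$omp end do
        ! implied barrier above: all of a is updated before the test below

        !$omp single
        done = .true.                      ! placeholder test: first try succeeds
        !$omp end single copyprivate(done)
        if (done) exit                     ! every thread sees the same value
      end do
    end do
    !$omp end parallel
  end subroutine run_sweeps
end module sweep_mod

program sweep_demo
  use sweep_mod
  implicit none
  real    :: a(8, 4000)
  integer :: nSweeps

  a = 0.0
  call run_sweeps(5, 4, a, nSweeps)
  write(*,*) 'sweeps:', nSweeps, ' a(1,1):', a(1,1)
end program sweep_demo
```

Because done is private and copied to every thread at END SINGLE, no thread can overwrite it while another thread is still reading it. The extra barriers do cost something on each try, so whether this beats re-entering a parallel region each time would need to be measured.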

jimdempseyatthecove · ‎06-07-2012

In your original code, make j PRIVATE.

virtualmemory · ‎06-07-2012

Jim -

nTries is a user-set constant, generally 4 or less. Under normal circumstances, only a single try is necessary, so the loop exits after the first try.

There are many private variables, including j, but I did not show these in the interest of simplicity.

I did try something like your MASTER approach, but was unable to get it to compile due to the interleaving of Fortran and OMP blocks. I will have another look.

If I were not using OpenMP, I would just start nSp worker threads (this is Windows) and have them be idle until the iSp loop starts.

jimdempseyatthecove · ‎06-07-2012

>>If I were not using OpenMP, I would just start nSp worker threads (this is Windows) and have them be idle until the iSp loop starts.

In OpenMP, the first !$OMP PARALLEL region creates the thread pool. In an application, this first-time setup happens only once. For your timing, insert

!$OMP PARALLEL
write(*,*) omp_get_thread_num() ! or some code that does not optimize out
!$OMP END PARALLEL
....
Now run your timed session

Note: the initial thread startup is generally negligible, unless you have some initialization going on, such as a large thread-private area and/or a large stack that gets touched.

The above code will eliminate those variables from your test.

Jim Dempsey
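
A minimal sketch of the warm-up idea above, assuming the code is compiled with OpenMP enabled (for example /Qopenmp with Intel Fortran on Windows, or -fopenmp with gfortran). The names warm_up_threads and time_sweeps, the array size, the repetition count, and the loop body are hypothetical stand-ins for the real program; omp_get_wtime and omp_get_num_threads are standard OpenMP routines:

```fortran
module timing_mod
  use omp_lib
  implicit none
contains
  ! Enter one throwaway parallel region so the thread pool exists before timing.
  subroutine warm_up_threads(nThreads)
    integer, intent(out) :: nThreads
    !$omp parallel
    !$omp single
    nThreads = omp_get_num_threads()   ! result is returned, so it is not optimized out
    !$omp end single
    !$omp end parallel
  end subroutine warm_up_threads

  ! Time nReps entries into a parallel loop; returns wall-clock seconds.
  function time_sweeps(a, nReps) result(seconds)
    real,    intent(inout) :: a(:, :)
    integer, intent(in)    :: nReps
    double precision :: seconds, t0
    integer :: iRep, iSp, j

    t0 = omp_get_wtime()
    do iRep = 1, nReps
      !$omp parallel do private(j)
      do iSp = 1, size(a, 1)
        do j = 1, size(a, 2)
          a(iSp, j) = a(iSp, j) + 1.0  ! placeholder work
        end do
      end do
      !$omp end parallel do
    end do
    seconds = omp_get_wtime() - t0
  end function time_sweeps
end module timing_mod

program warmup_demo
  use timing_mod
  implicit none
  real             :: a(8, 4000)
  integer          :: nThreads
  double precision :: seconds

  a = 0.0
  call warm_up_threads(nThreads)       ! pool creation happens here, untimed
  seconds = time_sweeps(a, 1000)       ! only region re-entry cost is measured
  write(*,*) 'threads:', nThreads
  write(*,*) 'timed section (s):', seconds
  write(*,*) 'a(1,1):', a(1,1)
end program warmup_demo
```

Timing the same loop once with and once without the call to warm_up_threads gives a rough estimate of how much of the total is one-time pool creation.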

TimP · ‎06-08-2012

It should be sufficient to set KMP_BLOCKTIME long enough that the threads persist between entries to parallel regions (default 200 ms). Both environment variable and subroutine call alternatives are available.
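
A minimal sketch of the subroutine-call route, assuming the Intel OpenMP runtime: kmp_set_blocktime and kmp_get_blocktime are Intel extensions exported by Intel's omp_lib, so this is not expected to build with other compilers. The 10000 ms value is only an illustration. The environment-variable route is to set KMP_BLOCKTIME (for example to 10000) before starting the program:

```fortran
program blocktime_demo
  use omp_lib                      ! Intel's omp_lib also provides the kmp_* extensions
  implicit none
  integer :: nThreads

  ! Assumption: Intel OpenMP runtime. Ask idle worker threads to stay awake
  ! for 10000 ms after a parallel region before going to sleep.
  call kmp_set_blocktime(10000)
  write(*,*) 'block time (ms):', kmp_get_blocktime()

  ! Any later parallel region uses the new setting.
  !$omp parallel
  !$omp single
  nThreads = omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  write(*,*) 'threads:', nThreads
end program blocktime_demo
```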

virtualmemory · ‎06-21-2012

I tried increasing KMP_BLOCKTIME to 10,000 msec (10 sec), with no change in execution time.

I also tried decreasing the stack size from the default 2 MB to 1 MB, also with no effect.

I need some tools to help me understand what's happening. Parallel execution is taking about 1.5 times _longer_.

Steven_L_Intel1 · ‎06-21-2012

Intel VTune Amplifier XE is just what you need to analyze the thread performance.

TimP · ‎06-21-2012

Inspector should catch some threading errors, but, as Steve hinted, you may find some simply by Amplifier showing where all threads are contending for access to a variable.

virtualmemory · ‎06-21-2012

I have tried VTune, but can't get it to work for this program. See my post of 6/20, http://software.intel.com/en-us/forums/showthread.php?t=106106, for details.
