Intel® Fortran Compiler

How to minimize thread creation overhead in Intel Fortran/OpenMP?

virtualmemory
Beginner
2,163 Views
I have a compute-bound Fortran program that I am trying to parallelize with OpenMP. An outline of the program is below. The parallel version takes almost 3 times longer than the serial version, and according to VTune, this is due to thread creation overhead.

I am wondering how I can move thread creation outside all the DO loops, yet still execute everything except the iSp loop (the parallel region in the code below) serially.

The outer loops cannot be parallelized because the values of array a at each time depend on the values at the preceding time (this is a PDE with time and position as independent variables). Also, the 'iTry' loop has a conditional EXIT, which is usually taken.

DO iTime=1,nTimes
  ...
  DO iTry=1,nTries
    ...
    !$OMP PARALLEL
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO
    !$OMP END PARALLEL
    ...
  END DO ! iTry
  ...
END DO ! iTime
9 Replies
jimdempseyatthecove
Honored Contributor III
If nTries is very large, give the following a try:

!$OMP PARALLEL PRIVATE(iTime, iTry, j)
DO iTime=1,nTimes
  !$OMP MASTER
  ...
  !$OMP END MASTER
  DO iTry=1,nTries
    !$OMP MASTER
    ...
    !$OMP END MASTER
    ! MASTER has no implied barrier: hold the other threads until the serial work is done
    !$OMP BARRIER
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO
    !$OMP MASTER
    ...
    !$OMP END MASTER
  END DO ! iTry
  !$OMP MASTER
  ...
  !$OMP END MASTER
END DO ! iTime
!$OMP END PARALLEL

The above will move the parallel region outside your nTimes and nTries loops.
*** each thread executes full range of nTimes and nTries...
*** however, only master thread executes the ...
*** and there is an implied barrier at !$OMP END DO

Jim Dempsey
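
For readers who want something they can compile, here is a minimal sketch of the same idea: one parallel region around both outer loops, with the serial parts run by a single thread. It swaps MASTER for SINGLE, because END SINGLE carries an implied barrier, and it keeps the EXIT outside the SINGLE block, since a branch out of a MASTER or SINGLE block is not allowed. The array update, the triesNeeded constant, the stepsDone counter, and the convergence test are all hypothetical placeholders, not the original poster's code.

```fortran
program persistent_region
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nTimes = 5, nTries = 4, nSp = 8, nJ = 4000
  integer, parameter :: triesNeeded = 2   ! hypothetical: tries until "converged"
  real(dp) :: a(nSp, nJ)
  integer :: iTime, iTry, iSp, j, stepsDone
  logical :: done                         ! shared: set by one thread, read by all

  a = 0.0_dp
  stepsDone = 0
  done = .false.

  ! One parallel region for the whole run; every thread walks both outer loops.
  !$omp parallel default(none) shared(a, done, stepsDone) private(iTime, iTry, iSp, j)
  do iTime = 1, nTimes
    ! Serial per-time-step work: one thread only, implied barrier at END SINGLE.
    !$omp single
    stepsDone = stepsDone + 1
    !$omp end single
    do iTry = 1, nTries
      ! Work-shared loop; implied barrier at END DO.
      !$omp do
      do iSp = 1, nSp
        do j = 1, nJ
          a(iSp, j) = a(iSp, j) + real(iSp + j, dp)   ! placeholder update
        end do
      end do
      !$omp end do
      ! One thread decides whether to stop; the barrier publishes the result.
      !$omp single
      done = (iTry >= triesNeeded)                    ! placeholder test
      !$omp end single
      ! Every thread reads the same shared flag, so all of them exit together.
      if (done) exit
    end do
  end do
  !$omp end parallel

  ! Check against the value a serial run would produce.
  do iSp = 1, nSp
    do j = 1, nJ
      if (a(iSp, j) /= real(nTimes * triesNeeded * (iSp + j), dp)) error stop "mismatch"
    end do
  end do
  if (stepsDone /= nTimes) error stop "wrong step count"
  print '(a)', "OK"
end program persistent_region
```

The flag is written only after the barrier at END DO and read only after the barrier at END SINGLE, so no thread can see a value from a later try. Whether this layout is actually faster than re-entering the parallel region depends on how much serial work sits between the loops.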
jimdempseyatthecove
Honored Contributor III
In your original code make j PRIVATE
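
As an illustration of that advice, here is a small sketch of the original loop nest with its data-sharing spelled out. In Fortran, DO loop indices inside a parallel region are private by default, so listing j mostly serves as documentation; scratch scalars such as the hypothetical tmp below are shared by default and do need the clause. DEFAULT(NONE) is used so the compiler flags anything left unlisted. The array contents are made up for the example.

```fortran
program private_demo
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nSp = 8, nJ = 4000
  real(dp) :: a(nSp, nJ), tmp
  integer :: iSp, j

  ! DEFAULT(NONE) forces every variable used in the region to be classified.
  !$omp parallel do default(none) shared(a) private(iSp, j, tmp)
  do iSp = 1, nSp
    do j = 1, nJ
      tmp = real(iSp, dp) * real(j, dp)   ! scratch value: must be private
      a(iSp, j) = 2.0_dp * tmp
    end do
  end do
  !$omp end parallel do

  if (a(nSp, nJ) /= 2.0_dp * nSp * nJ) error stop "mismatch"
  print '(a)', "OK"
end program private_demo
```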
virtualmemory
Beginner
Jim -

nTries is a user-set constant, generally 4 or less. Under normal circumstances, only a single try is necessary, so the loop exits after the first try.

There are many private variables, including j, but I did not show these in the interest of simplicity.

I did try something like your MASTER approach, but was unable to get it to compile due to the interleaving of Fortran and OMP blocks. I will have another look.

If I were not using OpenMP, I would just start nSp worker threads (this is Windows) and have them be idle until the iSp loop starts.
jimdempseyatthecove
Honored Contributor III
>>If I were not using OpenMP, I would just start nSp worker threads (this is Windows) and have them be idle until the iSp loop starts.

In OpenMP, the first !$OMP PARALLEL region creates the thread pool. In an application, this startup cost is paid only once. To keep it out of your timing, insert the following before the timed section:

!$OMP PARALLEL
write(*,*) omp_get_thread_num() ! or some code that does not optimize out
!$OMP END PARALLEL
....
Now run your timed session

Note: the initial thread startup is generally negligible, unless you have some initialization going on, such as a large thread-private area and/or a large stack that gets touched.

The above code will eliminate those variables from your test.

Jim Dempsey
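
A self-contained sketch of that measurement, assuming the program is built with OpenMP enabled (for example /Qopenmp or -qopenmp with Intel Fortran, -fopenmp with gfortran). The warm-up region creates the thread pool, and omp_get_wtime then brackets only the repeated parallel loops. The loop body, nRep, and the nThreads counter are illustrative stand-ins for the real work.

```fortran
program warmup_timing
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nSp = 8, nJ = 4000, nRep = 100
  real(dp) :: a(nSp, nJ), t0, t1
  integer :: iSp, j, rep, nThreads

  ! Warm-up: the first parallel region creates the thread pool.
  ! Counting the threads gives the region a side effect that cannot be optimized out.
  nThreads = 0
  !$omp parallel
  !$omp atomic
  nThreads = nThreads + 1
  !$omp end parallel

  ! Timed section: re-enters a parallel region nRep times, as the original code does.
  t0 = omp_get_wtime()
  do rep = 1, nRep
    !$omp parallel do private(j)
    do iSp = 1, nSp
      do j = 1, nJ
        a(iSp, j) = real(rep + iSp + j, dp)   ! placeholder work
      end do
    end do
    !$omp end parallel do
  end do
  t1 = omp_get_wtime()

  if (nThreads < 1) error stop "no threads counted"
  if (a(nSp, nJ) /= real(nRep + nSp + nJ, dp)) error stop "mismatch"
  print '(a,i0)', "threads in warm-up region: ", nThreads
  print '(a,es10.3,a)', "timed section: ", t1 - t0, " s"
  print '(a)', "OK"
end program warmup_timing
```

Comparing the timed section with and without the warm-up region gives a rough idea of how much of the slowdown is really thread startup.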
TimP
Honored Contributor III
It should be sufficient to set KMP_BLOCKTIME long enough that the threads persist between entries to parallel regions (default 200 ms). Both environment variable and subroutine call alternatives are available.
virtualmemory
Beginner
I tried increasing KMP_BLOCKTIME to 10,000 msec (10 sec), with no change in execution time.

I also tried decreasing the stack size from the default 2 MB to 1 MB, also with no effect.

I need some tools to help me understand what's happening. Parallel execution is taking about 1.5 times _longer_.
Steven_L_Intel1
Employee
Intel VTune Amplifier XE is just what you need to analyze the thread performance.
TimP
Honored Contributor III
Inspector should catch some threading errors, but, as Steve hinted, you may find some simply by Amplifier showing where all threads are contending for access to a variable.
virtualmemory
Beginner
I have tried VTune, but can't get it to work for this program. See my post of 6/20, http://software.intel.com/en-us/forums/showthread.php?t=106106, for details.