I have a compute-bound Fortran program which I am attempting to parallelize using OpenMP. The outline of this program is below. The parallel form below takes almost 3 times longer than the serial form, and according to VTune, this is due to thread creation overhead.
I am wondering how I can bring the thread creation outside all the DO-loops, yet still execute all but the iSp loop (the parallel region in the code below) serially.
The outer loops cannot be parallelized because the values of array a at each time depend on values at the preceding time (this is a PDE with time and position as independent variables). Also, the 'iTry' loop has a conditional EXIT, which is usually taken.
DO iTime=1,nTimes
  ...
  DO iTry=1,nTries
    ...
    !$OMP PARALLEL
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO
    !$OMP END PARALLEL
    ...
  END DO ! iTry
  ...
END DO ! iTime
If nTries is very large, give the following a try:
!$OMP PARALLEL PRIVATE(iTime, iTry, j)
DO iTime=1,nTimes
  !$OMP MASTER
  ...
  !$OMP END MASTER
  DO iTry=1,nTries
    !$OMP MASTER
    ...
    !$OMP END MASTER
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO
    !$OMP MASTER
    ...
    !$OMP END MASTER
  END DO ! iTry
  !$OMP MASTER
  ...
  !$OMP END MASTER
END DO ! iTime
!$OMP END PARALLEL
The above will move the parallel region outside your nTimes and nTries loops.
*** each thread executes the full range of nTimes and nTries...
*** however, only the master thread executes the ... sections
*** and there is an implied barrier at !$OMP END DO
Jim Dempsey
In your original code, make j PRIVATE.
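For instance, a minimal sketch of that change applied to the outline from the original post:
!$OMP PARALLEL PRIVATE(j)
!$OMP DO
DO iSp=1,nSp
  DO j=1,4000            ! j is the inner, sequential loop index
    a(iSp,j)=...         ! each thread now works with its own copy of j
  END DO ! j
END DO ! iSp             ! iSp, as the worksharing loop index, is private automatically
!$OMP END DO
!$OMP END PARALLEL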
Jim -
nTries is a user-set constant, generally 4 or less. Under normal circumstances, only a single try is necessary, so the loop exits after the first try.
There are many private variables, including j, but I did not show these in the interest of simplicity.
I did try something like your MASTER approach, but was unable to get it to compile due to the interleaving of Fortran and OMP blocks. I will have another look.
If I were not using OpenMP, I would just start nSp worker threads (this is Windows) and have them be idle until the iSp loop starts.
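Perhaps something along these lines would handle the conditional EXIT with the parallel region hoisted outside the loops: evaluate the exit test once in a SINGLE block into a shared flag (the LOGICAL 'converged' below is just illustrative), so that after the implied barrier every thread takes the same branch. A rough, untested sketch:
!$OMP PARALLEL PRIVATE(iTime, iTry, j)
DO iTime=1,nTimes
  !$OMP MASTER
  ...                    ! serial per-time-step work
  !$OMP END MASTER
  !$OMP BARRIER          ! make the master's setup visible before the worksharing loop
  DO iTry=1,nTries
    !$OMP DO
    DO iSp=1,nSp
      DO j=1,4000
        a(iSp,j)=...
      END DO ! j
    END DO ! iSp
    !$OMP END DO         ! implied barrier
    !$OMP SINGLE
    converged = ...      ! shared LOGICAL; exit test evaluated by one thread
    !$OMP END SINGLE     ! implied barrier: all threads see the same value
    IF (converged) EXIT  ! every thread exits the iTry loop together
  END DO ! iTry
END DO ! iTime
!$OMP END PARALLEL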
>>If I were not using OpenMP, I would just start nSp worker threads (this is Windows) and have them be idle until the iSp loop starts.
In OpenMP, the first !$OMP PARALLEL region creates the thread pool. In an application, this first-time cost is paid only once. For your timing, insert
!$OMP PARALLEL
write(*,*) omp_get_thread_num() ! or some code that does not optimize out
!$OMP END PARALLEL
....
Now run your timed session. Note that the initial thread startup is generally negligible, unless you have some initialization going on, such as a large threadprivate area and/or a large stack that gets touched. The above code will eliminate those variables from your test.
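A minimal, self-contained sketch of that warm-up idea (the program name and timing variables here are just illustrative):
PROGRAM warmup_timing
  USE OMP_LIB                          ! omp_get_thread_num, omp_get_wtime
  IMPLICIT NONE
  DOUBLE PRECISION :: t0, t1
  ! Dummy parallel region: creates the OpenMP thread pool once, up front,
  ! so pool creation is not charged to the timed section below.
  !$OMP PARALLEL
  WRITE(*,*) 'warming up thread', OMP_GET_THREAD_NUM()
  !$OMP END PARALLEL
  t0 = OMP_GET_WTIME()
  ! ... the real work, including the !$OMP PARALLEL / !$OMP DO region, goes here ...
  t1 = OMP_GET_WTIME()
  WRITE(*,*) 'elapsed seconds:', t1 - t0
END PROGRAM warmup_timing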
Jim Dempsey
It should be sufficient to set KMP_BLOCKTIME long enough that the threads persist between entries to parallel regions (default 200 ms). Both environment variable and subroutine call alternatives are available.
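A small sketch of the subroutine-call route (KMP_SET_BLOCKTIME is the Intel-specific runtime extension; the environment-variable route is simply setting KMP_BLOCKTIME before launching the program; the 10000 ms value is just an example):
PROGRAM blocktime_example
  USE OMP_LIB                      ! Intel's omp_lib also provides the KMP_* extension routines
  IMPLICIT NONE
  ! Keep worker threads spinning for 10000 ms after a parallel region ends,
  ! instead of the 200 ms default, so they are still awake at the next region.
  CALL KMP_SET_BLOCKTIME(10000)
  !$OMP PARALLEL
  WRITE(*,*) 'thread', OMP_GET_THREAD_NUM(), 'ready'
  !$OMP END PARALLEL
END PROGRAM blocktime_example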
I tried increasing KMP_BLOCKTIME to 10,000 msec (10 sec), with no change in execution time.
I also tried decreasing the stack size from the default 2 MB to 1 MB, also with no effect.
I need some tools to help me understand what's happening. Parallel execution is taking about 1.5 times longer than serial.
Intel VTune Amplifier XE is just what you need to analyze the thread performance.
Inspector should catch some threading errors, but, as Steve hinted, you may find some simply by Amplifier showing where all threads are contending for access to a variable.
I have tried VTune, but can't get it to work for this program. See my post of 6/20, http://software.intel.com/en-us/forums/showthread.php?t=106106, for details.