Hello,
I have a program that uses loop-level parallelization; the general idea is as follows:
[fortran]PROGRAM Main
!(...initialize an array...)
IF (test1_TRUE) THEN
! (...do something...)
CALL Task1()
! (...do more things...)
END IF
IF (test2_TRUE) THEN
! (...do something...)
CALL Task2()
! (...do more things...)
END IF
END
SUBROUTINE Task1()
!$OMP PARALLEL DO
DO indx=1,N
! (...calculations...)
END DO
!$OMP END PARALLEL DO
END SUBROUTINE Task1
SUBROUTINE Task2()
!$OMP PARALLEL DO
DO indx=1,M
! (...calculations...)
END DO
!$OMP END PARALLEL DO
END SUBROUTINE Task2[/fortran]
Although I am no expert, I suspect that having two separate OpenMP PARALLEL DO constructs doubles the thread-management overhead. I would like to move the PARALLEL construct to the main program, create the thread team only once, and reuse those threads in the subroutines for the loop-level parallelization when needed. This way I can also parallelize the array initialization in the main program. So, in general, I am trying to get something like this:
[fortran]PROGRAM Main
!$OMP PARALLEL
!$OMP WORKSHARE
!(...initialize an array...)
!$OMP END WORKSHARE
!$OMP SINGLE
IF (test1_TRUE) THEN
! (...do something...)
CALL Task1()
! (...do more things...)
END IF
IF (test2_TRUE) THEN
! (...do something...)
CALL Task2()
! (...do more things...)
END IF
!$OMP END SINGLE
!$OMP END PARALLEL
END
SUBROUTINE Task1()
!$OMP DO
DO indx=1,N
! (...calculations...)
END DO
!$OMP END DO
END SUBROUTINE Task1
SUBROUTINE Task2()
!$OMP DO
DO indx=1,M
! (...calculations...)
END DO
!$OMP END DO
END SUBROUTINE Task2[/fortran]
But this approach uses only one thread after the SINGLE construct. If I leave the PARALLEL DO constructs in the subroutines (instead of changing them to !$OMP DO), I end up with nested parallelism, which means I now pay the thread-team creation overhead three times, which defeats the purpose.
Is my thinking correct on this? Can anyone give me some pointers on how to efficiently parallelize such a code with minimal overhead? Is efficient coarse-grain parallelization even possible in such a case?
Thanks for any help,
Jon
You haven't shown anything to indicate why there should be a problem with your first version (unlike the second). Most OpenMP implementations keep the thread pool active for a while after a parallel region finishes (default 200 milliseconds for Intel's, controlled by KMP_BLOCKTIME).
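For reference, both knobs mentioned above are ordinary environment variables set before the run; the values below are only illustrative examples, not recommendations:

```shell
# KMP_BLOCKTIME: how long (in milliseconds) Intel OpenMP worker threads
# keep spin-waiting after a parallel region ends before going to sleep.
# The Intel default is 200; larger values can help closely spaced regions.
export KMP_BLOCKTIME=200
# Fix the team size so successive PARALLEL DO regions reuse the same pool.
export OMP_NUM_THREADS=4
```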
As you may be hinting, there isn't currently any thread locality control (KMP_AFFINITY) for nested parallelism with any OpenMP you are likely to be using.
The point of initializing arrays with the same OpenMP scheduling as will be used for the main work is to get the advantage of automatic first-touch locality, which can make a significant difference on a multi-CPU (NUMA) platform. It doesn't need to be done in the same parallel region.
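As a rough sketch of that first-touch idea (the array name, size, and the STATIC schedule here are my own illustrative choices, not taken from your code): initialize the array in its own worksharing loop with the same schedule the compute loop will use, so each thread first touches exactly the pages it will later work on, even though the two loops sit in separate parallel regions:

[fortran]PROGRAM Main
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000000   ! illustrative size
  REAL, ALLOCATABLE :: a(:)
  INTEGER :: indx
  ALLOCATE(a(N))
  ! First touch: with STATIC scheduling each thread initializes the
  ! same index range it will later compute on, so those pages are
  ! placed on that thread's NUMA node.
!$OMP PARALLEL DO SCHEDULE(STATIC)
  DO indx = 1, N
    a(indx) = 0.0
  END DO
!$OMP END PARALLEL DO
  ! Later, in a separate parallel region, the matching STATIC schedule
  ! means each thread mostly reads pages placed on its own node.
!$OMP PARALLEL DO SCHEDULE(STATIC)
  DO indx = 1, N
    a(indx) = a(indx) + 1.0
  END DO
!$OMP END PARALLEL DO
END PROGRAM Main[/fortran]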
Extending Sergey's comment further: if Task1 and Task2 are both run .AND. the outputs are independent .AND. the inputs are somewhat the same, then consider something along the lines of
[fortran]!$omp parallel
if(test1_TRUE .AND. test2_TRUE) THEN
  call do_1_2(...)
else if(test1_TRUE) THEN
  call do_1(...)
else if(test2_TRUE) THEN
  call do_2(...)
endif
!$omp end parallel
...
subroutine do_1_2(...)
!$omp do
do outer=1,N,stride
  do inner=outer,min(outer+stride-1,N)
    ... ! compute stride part of 1
  end do
  do inner=outer,min(outer+stride-1,N)
    ... ! compute stride part of 2
  end do
end do ! next stride
!$omp end do
end subroutine do_1_2[/fortran]
Jim Dempsey
Thank you all for your responses. Task1 and Task2 process two independent and large data sets. Jim, I will try your suggestion. Thanks again.
Jon
Jon,
The suggestion I made will help when the input data for the two tasks are shared (either completely or partially). Essentially, the technique attempts to reuse data in cache in Task2 that was brought in by Task1 (adjust the stride to accomplish this).