Intel® Fortran Compiler

Coarse grain parallelization with OpenMP

Jon_D
New Contributor I

Hello,

I have a program that utilizes loop-level parallelization where the general idea is as follows:

[fortran]PROGRAM Main
    
  !(...initialize an array...)
 
  IF (test1_TRUE) THEN      
      ! (...do something...)
      
      CALL Task1()
      
      ! (...do more things...)
  END IF
    
  IF (test2_TRUE) THEN      
      ! (...do something...)
      
      CALL Task2()
      
     ! (...do more things...)
  END IF
END
    
    
SUBROUTINE Task1()
  !$OMP PARALLEL DO
  DO indx=1,N
     ! (...calculations...)
  END DO
  !$OMP END PARALLEL DO
END SUBROUTINE Task1
    
    
SUBROUTINE Task2()
  !$OMP PARALLEL DO
  DO indx=1,M
     ! (...calculations...)
  END DO
  !$OMP END PARALLEL DO
END SUBROUTINE Task2[/fortran]

Although I am no expert, I am thinking that having two separate OpenMP PARALLEL DO constructs incurs the parallel-region startup overhead twice. I would like to move the PARALLEL directive to the main program, create the thread team only once, and reuse those threads in the subroutines for the loop-level parallelization when needed. This way I can also parallelize the array initialization in the main program. In general, I am trying to get something like this:

[fortran]PROGRAM Main
    !$OMP PARALLEL

    !$OMP WORKSHARE
  !(...initialize an array...)
    !$OMP END WORKSHARE

    !$OMP SINGLE
  IF (test1_TRUE) THEN      
      ! (...do something...)
      
      CALL Task1()
      
      ! (...do more things...)
  END IF
    
  IF (test2_TRUE) THEN      
      ! (...do something...)
      
      CALL Task2()
      
     ! (...do more things...)
  END IF

  !$OMP END SINGLE

  !$OMP END PARALLEL
END
    
    
SUBROUTINE Task1()
  !$OMP DO
  DO indx=1,N
     ! (...calculations...)
  END DO
  !$OMP END DO
END SUBROUTINE Task1
    
    
SUBROUTINE Task2()
  !$OMP DO
  DO indx=1,M
     ! (...calculations...)
  END DO
  !$OMP END DO
END SUBROUTINE Task2[/fortran]

But this approach uses only one thread after the SINGLE construct: the orphaned !$OMP DO directives in the subroutines are reached only by the one thread executing the SINGLE region, so the loops are never shared across the team. If I instead leave the PARALLEL DO constructs in the subroutines (rather than changing them to !$OMP DO), I end up with nested parallelism, which means I now pay the parallel-team creation overhead three times, and that defeats the purpose.
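If I understand orphaned worksharing correctly, the fix would be to drop the SINGLE around the calls, let every thread execute the IF blocks and the calls (the tests would have to evaluate identically on all threads), and wrap only the genuinely serial parts in SINGLE. A sketch of my understanding, not tested:

[fortran]PROGRAM Main
  !$OMP PARALLEL

  !$OMP WORKSHARE
  ! (...initialize the array, shared by the team...)
  !$OMP END WORKSHARE

  ! test1_TRUE must evaluate identically on every thread;
  ! all threads then call Task1, so the orphaned !$OMP DO inside it
  ! binds to this team and shares the loop iterations
  IF (test1_TRUE) THEN
      !$OMP SINGLE
      ! (...do the inherently serial things on one thread...)
      !$OMP END SINGLE
      CALL Task1()
  END IF

  ! (...same pattern for test2_TRUE / Task2...)

  !$OMP END PARALLEL
END[/fortran]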

Is my thinking correct on this? Can anyone give me some pointers on how to efficiently parallelize such code with minimal overhead? Is efficient coarse-grain parallelization even possible in a case like this?

Thanks for any help,

Jon

TimP
Honored Contributor III

You haven't shown anything to indicate why there should be a problem with your first version (unlike the second).  Most OpenMP implementations keep the thread pool active for a while after a parallel region finishes (200 milliseconds by default for Intel's runtime, controlled by KMP_BLOCKTIME), so back-to-back parallel regions normally reuse the existing threads rather than creating new ones.
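For example, you can lengthen the spin-wait so the pool stays warm across the serial work between your two calls. A sketch, assuming Intel's omp_lib module (which exposes the kmp_* extensions); setting KMP_BLOCKTIME=500 in the environment achieves the same thing without code changes:

[fortran]PROGRAM KeepPoolWarm
  USE omp_lib
  IMPLICIT NONE
  CALL kmp_set_blocktime(500)  ! Intel extension: spin-wait 500 ms before sleeping

  !$OMP PARALLEL
  ! (...first region: the team is created here...)
  !$OMP END PARALLEL

  ! (...serial work shorter than 500 ms: the pool keeps spinning...)

  !$OMP PARALLEL
  ! (...second region: reuses the warm threads, minimal overhead...)
  !$OMP END PARALLEL
END PROGRAM KeepPoolWarm[/fortran]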

As you may be hinting, there isn't currently any thread locality control (KMP_AFFINITY) for nested parallelism with any OpenMP you are likely to be using.

The point of initializing arrays with the same OpenMP scheduling as will be used for the main work is to get the benefit of automatic first-touch locality, which can make a significant difference on a multi-socket (NUMA) platform.  The initialization does not need to be in the same parallel region as the computation.
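A minimal sketch of that idea, assuming a shared REAL array a(N) (hypothetical name) and SCHEDULE(STATIC) on both loops so each thread first touches the pages it later computes on:

[fortran]! First-touch sketch: with SCHEDULE(STATIC), thread t initializes the
! same index range it later computes, so the pages of a(:) end up on
! the NUMA node of the thread that uses them.
!$OMP PARALLEL DO SCHEDULE(STATIC)
DO indx = 1, N
   a(indx) = 0.0
END DO
!$OMP END PARALLEL DO

! (...later, possibly in a different parallel region...)
!$OMP PARALLEL DO SCHEDULE(STATIC)
DO indx = 1, N
   ! (...calculations on a(indx)...)
END DO
!$OMP END PARALLEL DO[/fortran]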

SergeyKostrov
Valued Contributor II
My comment is not related to your questions; it is rather about possible performance issues. You could have a problem with the indx=1,N and indx=1,M loops if your processing is done on different large data sets. It is also not clear whether your calculations modify the same data set.
jimdempseyatthecove
Honored Contributor III

Extending Sergey's comment further: if Task1 and Task2 are both run .AND. their outputs are independent .AND. their inputs are largely the same, then consider something along the lines of

[fortran]!$omp parallel
if (test1_TRUE .AND. test2_TRUE) then
    call do_1_2(...)
else if (test1_TRUE) then
    call do_1(...)
else if (test2_TRUE) then
    call do_2(...)
end if
!$omp end parallel
...
subroutine do_1_2(...)
    !$omp do
    do outer = 1, N, stride
        ! this stride's slice of Task1's work
        do inner = outer, min(outer+stride-1, N)
            ... ! compute stride part of 1
        end do
        ! Task2's slice, while the shared input is still in cache
        do inner = outer, min(outer+stride-1, N)
            ... ! compute stride part of 2
        end do
    end do ! next stride
    !$omp end do
end subroutine do_1_2[/fortran]

Jim Dempsey

Jon_D
New Contributor I

Thank you all for your responses.  Task1 and Task2 process two large, independent data sets.  Jim, I will try your suggestion. Thanks again.

Jon

jimdempseyatthecove
Honored Contributor III

Jon,

The suggestion I made will help when the input data for the two tasks is shared, either completely or partially. Essentially, the technique attempts to reuse in Task2 the data that Task1 brought into cache; adjust the stride to accomplish this.
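As a rough illustration of sizing the stride (the 256 KB L2 and the two REAL(8) input arrays below are assumed numbers for the sketch, not a recommendation; measure on the actual machine and tune):

[fortran]! Sketch: pick stride so one slice of each input fits in L2 together.
INTEGER, PARAMETER :: L2_BYTES   = 256 * 1024  ! assumed per-core L2
INTEGER, PARAMETER :: ELEM_BYTES = 8           ! REAL(8) elements
INTEGER, PARAMETER :: N_INPUTS   = 2           ! arrays touched per stride
INTEGER, PARAMETER :: stride     = L2_BYTES / (ELEM_BYTES * N_INPUTS)
! stride = 16384 elements; reduce it if other data competes for the cache[/fortran]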
