Intel® Fortran Compiler

OMP Parallel do

rafadix08
Beginner
I have two questions for which I didn't find answers in the manuals.

Suppose I have this piece of code:

!omp parallel do
do i = 1, N
   do j = 1, M
      function(i,j)
   end do
end do

I checked, by printing the thread number for each iteration, that only the outer loop is parallelized. How can I parallelize every single iteration? That is, if I have N*M processors available, I would like to be able to run this on N*M processors.


Another question:

Suppose I have two functions: function1(s1) and function2(s1,s2).

How can I parallelize this without two consecutive omp directives, that is, in the most efficient way possible:

do s1 = 1, M
   function1(s1)
   do s2 = 1, N
      function2(s1,s2)
   end do
end do

Many thanks,
Rafael

9 Replies
jimdempseyatthecove
Honored Contributor III

Use "!$omp" with "$"

Newer releases of the compiler provide a collapse(n) clause, where n specifies the number of loops to collapse.

!$omp parallel do collapse(2)
do i = 1, N
   do j = 1, M
      function(i,j)
   end do
end do


The second case would be harder to do.
With the statements you provided, and with a newer release of the compiler, the following should work:

!$omp parallel do collapse(2)
do s1 = 1, M
   do s2 = 1, N
      if (s1*s2 == 1) function1(s1)
      function2(s1,s2)
   end do
end do


or

!$omp parallel do collapse(2)
do s1 = 1, M
   do s2 = 1, N
      if (s1 == 1) then
         if (s2 == 1) function1(s1)
      endif
      function2(s1,s2)
   end do
end do


I do not know how collapse would work with the call to function1 in between the two do statements.

Jim Dempsey
TimP
Honored Contributor III
collapse requires that the loops be directly nested. I've started testing it more often myself; there are cases even with directly nested loops where the compiler will reject collapse. When requested at the top of the loop nest, it should spread all the iterations of the combined loops among the threads, which should be quite useful when the outer loop count is small and not an even multiple of the number of threads.
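A minimal, self-contained sketch of that behavior (the program name and loop bounds are arbitrary, and it assumes the code is compiled with OpenMP enabled, e.g. -openmp or /Qopenmp): printing the thread number for each (i, j) pair shows the N*M combined iterations spread across the team, rather than only the N outer iterations.

program collapse_check
   use omp_lib
   implicit none
   integer, parameter :: N = 4, M = 3
   integer :: i, j

   ! With collapse(2), the N*M iterations of the combined loop nest are
   ! divided among the threads, even when N alone is smaller than the
   ! thread count.
   !$omp parallel do collapse(2)
   do i = 1, N
      do j = 1, M
         print '(a,i0,a,i0,a,i0)', 'i=', i, ' j=', j, ' thread=', omp_get_thread_num()
      end do
   end do
   !$omp end parallel do
end program collapse_check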
rafadix08
Beginner

Hi Jim and Tim,

Many thanks for the replies. The collapse clause is exactly what I needed.

However, for my second question, your suggestion didn't quite work, since OpenMP does everything inside the do body in a single thread.

Is there another way to do it? Maybe with OMP PARALLEL SECTIONS?

I noticed I cannot insert a do inside OMP PARALLEL SECTIONS... However, there should be an automatic way of doing it without having to write every single iteration in full.

TimP
Honored Contributor III
In this case:
do s1 = 1, M
   function1(s1)
   do s2 = 1, N
      function2(s1,s2)
   end do
end do
If the inner do loop takes most of the time, and parallelizing the outer loop doesn't do the entire job, you could also parallelize the inner loop and set OMP_NESTED.
I find it hard to think of a situation where PARALLEL SECTIONS would help this case, though that's not to say there isn't one. I don't know of any reason you couldn't have a do loop inside a SECTION.
http://static.msi.umn.edu/tutorial/scicomp/general/openMP/content_openMP.html
appears to show how you could even have a parallel loop inside a section, with OMP_NESTED. I wonder about their example, as it has apparently redundant !$OMP END DO lines.
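A rough sketch of the nested approach, assuming function1 and function2 are user subroutines (hence the CALL statements) and that nested parallelism has been enabled, for example by setting OMP_NESTED=true in the environment; the wrapper name run_nested is hypothetical:

subroutine run_nested(M, N)
   use omp_lib
   implicit none
   integer, intent(in) :: M, N
   integer :: s1, s2

   ! Nested parallelism must be enabled, either via OMP_NESTED=true in the
   ! environment or programmatically as below; otherwise the inner parallel
   ! do runs with a team of one thread.
   call omp_set_nested(.true.)

   !$omp parallel do private(s2)
   do s1 = 1, M
      call function1(s1)          ! assumed user subroutine
      !$omp parallel do
      do s2 = 1, N
         call function2(s1, s2)   ! assumed user subroutine
      end do
      !$omp end parallel do
   end do
   !$omp end parallel do
end subroutine run_nested

Whether the inner region actually gets extra threads depends on the runtime's nesting support and thread limits, so it is worth checking omp_get_num_threads() inside the inner loop.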
jimdempseyatthecove
Honored Contributor III

>>However, for my second question, your suggestion didn't quite work, since OpenMP does everything inside the do body in a single thread.

Do you mean to say that the first loop was parallelized and the second was not?

If so, did you forget the $ on "!$omp parallel do collapse(2)"?
If not, did you forget to move the call to function1 to a conditional inside the second loop?

Jim Dempsey
rafadix08
Beginner
Hi Jim,

Here is what I did:

!$omp parallel do collapse(2)
do s1 = 1, M
   do s2 = 1, N
      if (s2 == 1) then
         function1(s1)
      endif
      function2(s1,s2)
   end do
end do

When I said it didn't quite work, I actually meant that the parallelization was not done in the most efficient way. For example, when s2 == 1, function1(s1) and function2(s1,s2) will be called serially in the same thread.

What seems to be working well is the nested parallelization suggested in the link posted a couple of replies above; it's not working exactly the way I wanted, but it's good enough.

Many thanks for your help,
Rafael
jimdempseyatthecove
Honored Contributor III

Rafael,

Does function1 (at a given s1) produce changes required by function2 (at the same s1, s2)?
From your statement, it would seem the answer is no.

!$omp parallel
!$omp do
do s1 = 1, M
   function1(s1)
end do
!$omp end do nowait
!$omp do collapse(2)
do s1 = 1, M
   do s2 = 1, N
      function2(s1,s2)
   end do
end do
!$omp end do nowait
!$omp end parallel

You might need to experiment with schedule clause and/or num_threads on the do loops to get maximum utilization out of the cores.

What the above construct does is:

- form a thread team (once)
- slice the iteration space 1:M amongst the threads
- each thread runs function1 across its slice of the iteration space
- there is no implicit barrier (due to nowait) at the end of the 1st do, so as threads complete the 1st do they begin their slice of the second do
- as each thread completes the 2nd do, again without an implicit barrier, it runs to the end parallel and exits the team (to become available for other work)

The trick will be to balance the load such that you do not have an idle thread.

Jim Dempsey
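A sketch of where such schedule and num_threads clauses could go in the construct above; the wrapper name, the schedule kinds, and the thread count are placeholders to experiment with, not tuned recommendations, and function1/function2 are assumed to be user subroutines.

subroutine run_two_loops(M, N)
   implicit none
   integer, intent(in) :: M, N
   integer :: s1, s2

   !$omp parallel num_threads(8)

   ! Dynamic scheduling can help if the cost of function1 varies with s1.
   !$omp do schedule(dynamic)
   do s1 = 1, M
      call function1(s1)
   end do
   !$omp end do nowait

   ! The collapsed loop often balances well with a static schedule.
   !$omp do collapse(2) schedule(static)
   do s1 = 1, M
      do s2 = 1, N
         call function2(s1, s2)
      end do
   end do
   !$omp end do nowait

   !$omp end parallel
end subroutine run_two_loops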

TimP
Honored Contributor III

!$omp parallel
!$omp do
do s1 = 1, M
   function1(s1)
end do
!$omp end do nowait
!$omp do collapse(2)
do s1 = 1, M
   do s2 = 1, N
      function2(s1,s2)
   end do
end do
!$omp end do nowait
!$omp end parallel

Jim,
What is the effect of the final nowait? Does end parallel effectively implement a barrier?
In cases where I've tried nowait, the best gain was with gfortran combined with libiomp5, and was about 2%, while it tended to break with g++. So I'm concerned about the cost/benefit of nowait. I used to always get complaints about nowait from thread checker, but that situation has improved.
The good side: even supposing that the work is balanced among the threads in the first do loop, if M isn't evenly divisible by the number of threads (the problem we're trying to solve with collapse), there will be remainder threads available early to start working on the 2nd loop.
I'm not criticizing the general idea you offer here; in fact, it looks good. nowait optimizations will increase in importance as the number of cores increases.
jimdempseyatthecove
Honored Contributor III

Tim,

Looking narrowly at the code presented, the second nowait would seem to have no practical purpose.
From a wider view, this code, either now or some time in the future, may be run within a nested parallel region. In that case, having the nowait on the second loop removes the implicit barrier at that loop, permitting the thread(s) finishing the second loop to run immediately to the end parallel. At that point, the threads (cores/HW threads) reaching the end parallel are immediately available for use in other nest levels and/or other branches, sections, taskq, etc.

Even when nesting is not in effect, then depending on your KMP_BLOCKTIME setting, these hardware threads may immediately suspend, making computational resources available for other apps or for other non-OpenMP threads within the same application. Consider your Fortran program running with an OpenGL thread driving the display, or, in my case, using the Array Visualizer to view a 3D visualization of the simulation as it runs. In this situation, I do not want the application's threads burning up the block-time interval in synchronization only to immediately terminate the thread team.

If you program this right in the beginning, then as you extend your parallelization outwards, you won't have to hunt down these small-ish, and unwarranted, overhead sections in your code.

Jim Dempsey