Re: Perils of Barriers in OpenMP

ClayB · ‎01-09-2004

We recently ran into a case of improper OpenMP barrier usage within a hybrid MPI-OpenMP code. Essentially, the problem came down to code something like this:

#pragma omp parallel
{
    int myid = omp_get_thread_num();

    if (myid % 2) {
        // do some odd work
        #pragma omp barrier   // reached only by odd-numbered threads
        // do more work
    }
    else {
        // do some even work
        #pragma omp barrier   // reached only by even-numbered threads
        // do more work
    }
}

It turns out that this is not legal OpenMP and the results will be undefined (implementation dependent). The OpenMP specification states that barrier regions must be encountered in the same order and by each thread in the team. The above code has two separate barrier regions and, for teams of more than a single thread, not all threads will be able to reach the same barrier region.

Since the incorrect barrier usage was embedded into MPI code, we suspect that the programmers were more familiar with MPI barriers, which act across the processes in a communicator. Thus, an MPI code may contain several different points with calls to MPI_Barrier, but as long as each call has the same communicator and all processes within that communicator eventually make the call, the processes will all check in and be released.

How do you fix the code above? My first advice is to not code this way. Avoid the problem completely.
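As a minimal sketch of what "avoiding the problem" could look like: let the branch select only the work, and place a single barrier after it so that every thread in the team encounters it. The names odd_work, even_work, count_mismatches and MAX_THREADS are hypothetical and stand in for the real application code; the serial fallbacks are only there so the sketch builds when OpenMP is not enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define MAX_THREADS 4   /* hypothetical upper bound on the team size */

/* Hypothetical stand-ins for the "odd" and "even" work in the post. */
static int odd_work(int id)  { return 2 * id + 1; }
static int even_work(int id) { return 2 * id; }

/* Returns the number of threads that read a wrong neighbour value
   after the barrier; 0 means the two phases were properly separated. */
int count_mismatches(void)
{
    int phase1[MAX_THREADS] = {0};
    int mismatches = 0;

    #pragma omp parallel num_threads(MAX_THREADS) shared(phase1) reduction(+:mismatches)
    {
        int myid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        assert(nthreads <= MAX_THREADS);

        /* Phase 1: the branch only selects which work to do. */
        if (myid % 2)
            phase1[myid] = odd_work(myid);
        else
            phase1[myid] = even_work(myid);

        /* One barrier, outside the branch, encountered by every thread. */
        #pragma omp barrier

        /* Phase 2: it is now safe to read another thread's result. */
        int peer = (myid + 1) % nthreads;
        int expected = (peer % 2) ? odd_work(peer) : even_work(peer);
        if (phase1[peer] != expected)
            mismatches += 1;
    }
    return mismatches;
}
```

Calling count_mismatches() should return 0 for any team size up to MAX_THREADS, because no thread starts the second phase until all threads have finished the first.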

If you can't get around coding in this fashion, set up a single barrier region that all threads will encounter. For example, if the barrier position can't be moved to a single point where all threads will execute it, something like this should work...

void my_barrier()
{
    #pragma omp barrier   // single barrier region
}

#pragma omp parallel
{
    int myid = omp_get_thread_num();

    if (myid % 2) {
        // do some odd work
        my_barrier();
        // do more work
    }
    else {
        // do some even work
        my_barrier();
        // do more work
    }
}

-- clay

jim_dempsey · ‎06-07-2005

I have a related barrier problem in FORTRAN.

What happens in a parallel DO loop with an enclosed barrier? That is, as the loop iterations run out, the threads with no iterations left bypass the barrier and congregate at the END PARALLEL DO. In this case not all of the threads reach the barrier. This problem cannot be circumvented by calling a barrier routine as in your example.

ClayB · ‎06-24-2005

Jim -

I'm not sure I understand your problem, but let me see if I can describe it and you can tell me if I've got it correct.

Your parallel do-loop contains an explicit barrier in the middle to synchronize threads. This works fine if the number of iterations is evenly divisible by the number of threads. However, if threads receive different numbers of iterations, the threads with fewer iterations will reach the implicit barrier at the end of the loop while the other thread(s) stop at the explicit barrier. Something like this...

!$OMP PARALLEL DO
      DO i = 1, some_prime_number
!$OMP BARRIER
      ENDDO

If this is the case, why do you need a barrier in the middle of the loop? Is there some way to restructure the loop so that it does not require the barrier in the middle? Is the barrier dependent on the scheduling of the loop iterations to threads?

If you've added the barrier to get over some data dependency within the loop, the loop may not be able to be run in parallel or you should look for some other solution (like private variables or rewriting the loop to remove the dependence).

Can you provide some more details on what you've got? This sounds like an interesting problem.

--clay

jim_dempsey · ‎06-24-2005

The DO i = 1, some_prime_number would be an example of how some threads end up at an implicit barrier (end do) while others may end up at an explicit barrier (!$omp barrier).

There are many reasons for the use of a barrier. One such reason is the application may need to take a snapshot of the data processed by all threads but only when the data is not in flux.

Something along the line of:

!$OMP PARALLEL DO
DO i = 1, some_prime_number
call DoWork(i)
!$OMP BARRIER
!$OMP MASTER

call TakeSnapshot

!$OMP END MASTER

!$OMP BARRIER
ENDDO
!$OMP END PARALLEL DO

To correctly handle the barrier problem I need to insert some spaghetti code that is aware of the number of threads:

!$OMP PARALLEL DO
DO i = 1, some_prime_number
call DoWork(i)

if((some_prime_number - i) .lt. NumberOfThreads) goto 100
!$OMP BARRIER
!$OMP MASTER

call TakeSnapshot

!$OMP END MASTER

!$OMP BARRIER

100 continue
ENDDO
!$OMP END PARALLEL DO

call TakeSnapshot

Although the above would likely work, it is not very clean coding, i.e. it runs contrary to OpenMP's design goals.

A better fix would be to redefine BARRIER from

Wait for all threads to congregate here

to

Wait for all active threads to congregate here

When a thread blocks at an implicit barrier (!$OMP END PARALLEL DO), it is removed from the active thread count and is no longer counted in the barrier's waiting thread count. With this change, the first (non-spaghetti) example would work.

Jim Dempsey

leihuang · ‎06-24-2005

According to the OpenMP specification, a barrier region binds to the innermost enclosing parallel region. To my knowledge, in your my_barrier() example, the barrier actually stops all the threads in the parallel region, which is not what is wanted if the barriers are meant for a subteam of threads. Is my understanding correct? Thanks.

jim_dempsey · ‎06-24-2005

The design purpose of barrier is to get all active threads in the context of the barrier to synchronize at the barrier.

The nuance is: What constitutes "active threads in the context of the barrier"?

My argument is: as a parallel DO loop expires, the count of active threads in the context of the barrier diminishes. Therefore the barrier should block only for those remaining active threads in the context of the barrier. This is not an unreasonable request.

There is no adverse effect in changing the definition of barrier to what I suggest. Ummm... unless your intention in using the barrier is to create a deadlock.

Now for future planning...

Assume at some future date the number of threads available to parallel sections (e.g. parallel do) is dynamically variable. A hypothetical example: at the start of, say, a Parallel DO a request is made to the OS of "I want up to 8 threads" and the OS replies "here are 7 threads". A subsequent start of the Parallel DO might obtain 4 threads. During the processing of the Parallel DO the OS can issue a call to a callback routine to notify the loop control structure that the OS has additional threads available, or that the OS is requesting the application to relinquish some of its threads. This is a cooperative system.

Jim Dempsey

ClayB · ‎07-07-2005

jim_dempsey@ameritech.net wrote:

The DO i = 1, some_prime_number would be an example of how some threads end up at an implicit barrier (end do) while others may end up at an explicit barrier (!$omp barrier).

There are many reasons for the use of a barrier. One such reason is the application may need to take a snapshot of the data processed by all threads but only when the data is not in flux.

Jim -

Your description is a condition that I'd not considered, but I can easily see it being needed. One quibble that I'd have with your solutions, though, is that the master thread might be one that is given fewer iterations than one or more of the others. The use of !$OMP MASTER could then lead to deadlock.

Why not use the implicit barrier of the !$OMP SINGLE to solve both your explicit barrier and master thread problems? Something like...

Code:

!$OMP PARALLEL DO
      DO i = 1, some_prime_number
        call DoWork(i)
!! The first SINGLE region acts like an explicit barrier
!$OMP SINGLE
       call DoNothing
!$OMP END SINGLE
!! Once data has quiesced, one thread takes the snapshot
!$OMP SINGLE
       call TakeSnapshot
!$OMP END SINGLE

      ENDDO
!$OMP END PARALLEL DO

With this solution, it won't matter whether or not the master thread participates to the end and it won't matter how many threads are active for a round of iterations (and how many are waiting at the implicit barrier at the end of the parallel region).

--clay

jim_dempsey · ‎07-07-2005

Thanks for pointing out that the master thread might get retired before other threads in the team. Good bug catch.

The SINGLE won't work either because there is an implicit BARRIER at the end of the SINGLE section. If some of the threads have retired (waiting at the END PARALLEL DO) then the application hangs as before.

The SINGLE will work _provided_ no threads in the team are waiting at the end parallel do. To account for this, the programmer must bypass the SINGLE section when the remaining loop count is less than the number of threads available to the section. Somewhat ugly coding to handle something that could be neatly tucked into the definition of BARRIER (and into SINGLE).

Jim Dempsey

ClayB · ‎07-11-2005

Jim -

You're right. This thought crossed my mind a few hours after I posted and I should have tried the code before posting this.

Now that I have had the chance to try my solution, it appears to be worse than we both thought. My Intel Fortran compiler will not even allow a SINGLE or MASTER region to be part of a work-sharing construct. The error message says that such use is "invalid" though I don't see anything forbidding it in either 2.0 spec. [There is a sentence that states all work-sharing and BARRIER directives must be encountered by all threads (or none at all) and in the same order in the specs. Not exactly the situation I've set up, but could be interpreted to apply.] Thus, it would appear that neither of our solutions will work, at least with Intel compilers. Does the compiler you've been using accept a SINGLE or MASTER directive within a work-sharing construct?

--clay
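
One possible way around the problem discussed in this thread, offered as a sketch under stated assumptions and not as what either poster ended up using: the OpenMP nesting rules, as I read them, do not permit a barrier or SINGLE closely nested inside a work-sharing loop, which would be consistent with the compiler error reported above. The snapshot pattern can instead be written as a plain parallel region in which every thread runs the same number of "rounds", so all threads encounter the barrier and the single region the same number of times even when the iteration count is prime. The sketch is in C to match the first example in the thread; do_work, take_snapshot, run_rounds and N_ITEMS are hypothetical names, and the serial fallbacks are only there so it builds without OpenMP enabled.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP enabled. */
static int omp_get_thread_num(void)  { return 0; }
static int omp_get_num_threads(void) { return 1; }
#endif

#define N_ITEMS 13   /* a prime, so iterations rarely divide evenly */

static int data[N_ITEMS];

/* Hypothetical stand-in for DoWork(i): marks item i as processed. */
static void do_work(int i) { data[i] = 1; }

/* Hypothetical stand-in for TakeSnapshot: counts processed items. */
static int take_snapshot(void)
{
    int count = 0;
    for (int i = 0; i < N_ITEMS; i++)
        count += data[i];
    return count;
}

/* Returns the number of snapshots that saw an unexpected item count. */
int run_rounds(void)
{
    int bad_snapshots = 0;

    for (int i = 0; i < N_ITEMS; i++)
        data[i] = 0;

    #pragma omp parallel shared(bad_snapshots)
    {
        int myid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        /* Every thread computes the same round count. */
        int rounds = (N_ITEMS + nthreads - 1) / nthreads;
        assert(rounds * nthreads >= N_ITEMS);

        for (int r = 0; r < rounds; r++) {
            int i = r * nthreads + myid;
            if (i < N_ITEMS)          /* idle in the last round if no item */
                do_work(i);

            /* All threads reach this barrier in every round. */
            #pragma omp barrier

            /* One thread takes the snapshot while the data is quiescent;
               the implicit barrier at the end of single holds the others. */
            #pragma omp single
            {
                int expected = (r + 1) * nthreads;
                if (expected > N_ITEMS)
                    expected = N_ITEMS;
                if (take_snapshot() != expected)
                    bad_snapshots++;
            }
        }
    }
    return bad_snapshots;
}
```

The design choice here is to distribute iterations by hand (item r * nthreads + myid in round r) so that a thread with no item left in the final round still takes part in the synchronization. Whether one snapshot per round of nthreads items matches the intent of the original Fortran loop is an assumption.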