Problem offloading loops in an OpenMP region

mguy44 · ‎11-26-2014

Hello,

Here is an example that contains some features of a larger code I want to run on MIC in offload mode. I'm begining with offload mode.
It doesn't work and I do not understand why.
That is an OpenMP region with several loops that I want to offload on the MIC, some of the data should be kept on the MIC from one to
the other offloaded loop. Around them, there are some other instructions that stay on the host.
Here it is.

      PROGRAM BASIC_OMP

!$    USE OMP_LIB
      IMPLICIT NONE
      INTEGER,  PARAMETER  ::  N = 512

!dir$ attributes offload:mic:: U, V
      REAL(8),  DIMENSION(:), ALLOCATABLE  ::  U, V
!dir$ attributes align:64 :: U, V

      REAL(8)  ::  LAMBDA = 0.0_8
      REAL(8)  ::  scal = 5.0_8 / 12.0_8
!
      INTEGER   ::  NBTHRDS, MYTHRD, I, IERR=0
!dir$ attributes offload:mic:: LAMBDA, scal, NBTHRDS, MYTHRD, I, IERR, N



#define ALLOC alloc_if(.true.)
#define FREE free_if(.true.)
#define RETAIN free_if(.false.)
#define REUSE alloc_if(.false.)

      write(6,'(A)') 'Starting Compute'

!$OMP PARALLEL DEFAULT (NONE) &
!$OMP SHARED (U, V) &
!$OMP SHARED (scal, IERR, LAMBDA) &
!$OMP PRIVATE(NBTHRDS, MYTHRD, I)
!
!$OMP MASTER
      write (6,'(A)') 'on CPU, step 1'
!$OMP END MASTER
!
!dir$ offload begin target(mic:0) in(U:length(N) ALLOC RETAIN) in(V:length(N) ALLOC RETAIN) in(scal)

!$OMP MASTER
      write (6,'(A)') 'on MIC, step 2'
!$OMP END MASTER
      NBTHRDS = 1
      MYTHRD  = 0
!$    NBTHRDS = OMP_GET_NUM_THREADS ()
!$    MYTHRD  = OMP_GET_THREAD_NUM  ()
!$OMP MASTER
!$    WRITE (6,'(A,L3)') 'OMP_IN_PARALLEL : ', OMP_IN_PARALLEL ()
      WRITE (6,'(2(A,I5))') 'NBTHRDS = ', NBTHRDS, ' MYTHRD = ', MYTHRD
!$OMP END MASTER

!$OMP MASTER
      ALLOCATE (U(N), V(N), STAT = IERR)
      IF (IERR /= 0) THEN
         WRITE (6,'(A)') ' Allocation Problem U, V'
         WRITE (6,'(A,I8)') 'IERR = ', IERR
         STOP
      END IF
!$OMP END MASTER
!$OMP BARRIER
!$OMP DO PRIVATE (I) SCHEDULE(STATIC)
      DO I = 1, N
         U (I) = scal
         V (I) = 0.2_8 + REAL (I,8)
      END DO
!$OMP END DO 

!$OMP MASTER
      write (6,'(A)') 'on MIC, end step 2'
!$OMP END MASTER


!dir$ end offload

!$OMP MASTER
      write (6,'(A)') 'on CPU, step 3, between offloads'
!$OMP END MASTER

      call flush (6)
!$OMP BARRIER

!dir$ offload begin target(mic:0) in(U:length(n) REUSE FREE) in(V:length(n) REUSE FREE) inout(LAMBDA)

!$OMP MASTER
      write (6,'(A)') 'on MIC, step 4'
      write (6,*) 'L  ', LAMBDA
      write (6,*) 'U  ', U(1)
      write (6,*) 'V  ', V(1)
!$OMP END MASTER

!$OMP DO REDUCTION (+:LAMBDA) PRIVATE (I) SCHEDULE(STATIC)
      DO I = 1, N
         LAMBDA = LAMBDA + U (I) * V (I)
      END DO
!$OMP END DO

!$OMP MASTER
      write (6,'(A)') 'on MIC, end step 4'
      call flush (6)
!$OMP END MASTER

!$OMP BARRIER

!dir$ end offload

!$OMP MASTER
      write (6,'(A)') 'on CPU, step 5'
      write (6,'(A,E22.15)') 'LAMBDA = ', LAMBDA
!$OMP END MASTER

!$OMP END PARALLEL

      STOP
      END PROGRAM BASIC_OMP

Compilation and running on a Xeon based host with 2 MIC 5110P (I use only one).
ifort -O3 -openmp -g -traceback basic_omp.F90 -o basic_offload.out

On the host, I set the following env. variables :
export OMP_NUM_THREADS=2
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=118

The execution gives me :

./basic_offload.out
Starting Compute
on CPU, step 1
on MIC, step 2
OMP_IN_PARALLEL :   F
NBTHRDS =     1 MYTHRD =     0
on MIC, end step 2
on MIC, step 2
OMP_IN_PARALLEL :   F
NBTHRDS =     1 MYTHRD =     0
 Allocation Problem U, V
IERR =      151
on CPU, step 3, between offloads
offload error: process on the device 0 unexpectedly exited with code 0

What I do not understand :
- the code is compiled as an OpenMP parallelized application. Inside the OpenMP region, OpenMP tells me it is not in parallel

- therefore, the number of threads is false, everything is printed twice (2 threads on the host), allocation has to be done by the master
thread only, but the error tells me the opposite.

- why does it not behave like I want it to ?

N.B. : With the -no-offload option, everything's fine on the host.

Thank you in advance for your comments.

Regards,

Guy.

James_C_Intel2 · ‎11-26-2014

What I do not understand :
- the code is compiled as an OpenMP parallelized application. Inside the OpenMP region, OpenMP tells me it is not in parallel

- therefore, the number of threads is false, everything is printed twice (2 threads on the host), allocation has to be done by the master
thread only, but the error tells me the opposite.

- why does it not behave like I want it to ?

I don't think you want any of the parallelism on the host side. Everything on the host is inside master sections!

On the MIC side you are on a different device, and there is no "omp parallel" executed on that device, so you only have a single thread and are not in parallel (as OpenMP tells you :-)), so I think you want code more like this (***UNTESTED***) version.

     PROGRAM BASIC_OMP
 
!$    USE OMP_LIB
      IMPLICIT NONE
      INTEGER,  PARAMETER  ::  N = 512
 
!dir$ attributes offload:mic:: U, V
      REAL(8),  DIMENSION(:), ALLOCATABLE  ::  U, V
!dir$ attributes align:64 :: U, V
 
      REAL(8)  ::  LAMBDA = 0.0_8
      REAL(8)  ::  scal = 5.0_8 / 12.0_8
!
      INTEGER   ::  NBTHRDS, MYTHRD, I, IERR=0
!dir$ attributes offload:mic:: LAMBDA, scal, NBTHRDS, MYTHRD, I, IERR, N

#define ALLOC alloc_if(.true.)
#define FREE free_if(.true.)
#define RETAIN free_if(.false.)
#define REUSE alloc_if(.false.)
 
      write(6,'(A)') 'Starting Compute'
      write (6,'(A)') 'on CPU, step 1'
!
!dir$ offload begin target(mic:0) in(U:length(N) ALLOC RETAIN) in(V:length(N) ALLOC RETAIN) in(scal)
 
!$OMP PARALLEL DEFAULT (NONE) &
!$OMP SHARED (U, V) &
!$OMP SHARED (scal, IERR, LAMBDA) &
!$OMP PRIVATE(NBTHRDS, MYTHRD, I)
!
!$OMP MASTER
      write (6,'(A)') 'on MIC, step 2'
!$OMP END MASTER
      NBTHRDS = 1
      MYTHRD  = 0
!$    NBTHRDS = OMP_GET_NUM_THREADS ()
!$    MYTHRD  = OMP_GET_THREAD_NUM  ()
!$OMP MASTER
!$    WRITE (6,'(A,L3)') 'OMP_IN_PARALLEL : ', OMP_IN_PARALLEL ()
      WRITE (6,'(2(A,I5))') 'NBTHRDS = ', NBTHRDS, ' MYTHRD = ', MYTHRD
!$OMP END MASTER
 
!$OMP MASTER
      ALLOCATE (U(N), V(N), STAT = IERR)
      IF (IERR /= 0) THEN
         WRITE (6,'(A)') ' Allocation Problem U, V'
         WRITE (6,'(A,I8)') 'IERR = ', IERR
         STOP
      END IF
!$OMP END MASTER
!$OMP BARRIER
!$OMP DO PRIVATE (I) SCHEDULE(STATIC)
      DO I = 1, N
         U (I) = scal
         V (I) = 0.2_8 + REAL (I,8)
      END DO
!$OMP END DO
 
!$OMP MASTER
      write (6,'(A)') 'on MIC, end step 2'
!$OMP END MASTER
!$OMP END PARALLEL 
 
!dir$ end offload
 
      write (6,'(A)') 'on CPU, step 3, between offloads'
      call flush (6)
 
!dir$ offload begin target(mic:0) in(U:length(n) REUSE FREE) in(V:length(n) REUSE FREE) inout(LAMBDA)

!$OMP PARALLEL DEFAULT (NONE) &
!$OMP SHARED (U, V) &
!$OMP SHARED (scal, IERR, LAMBDA) &
!$OMP PRIVATE(NBTHRDS, MYTHRD, I)
 
!$OMP MASTER
      write (6,'(A)') 'on MIC, step 4'
      write (6,*) 'L  ', LAMBDA
      write (6,*) 'U  ', U(1)
      write (6,*) 'V  ', V(1)
!$OMP END MASTER
 
!$OMP DO REDUCTION (+:LAMBDA) PRIVATE (I) SCHEDULE(STATIC)
      DO I = 1, N
         LAMBDA = LAMBDA + U (I) * V (I)
      END DO
!$OMP END DO
 
!$OMP MASTER
      write (6,'(A)') 'on MIC, end step 4'
      call flush (6)
!$OMP END MASTER
 
!$OMP END PARALLEL
 
!dir$ end offload
 
      write (6,'(A)') 'on CPU, step 5'
      write (6,'(A,E22.15)') 'LAMBDA = ', LAMBDA
 
      STOP
      END PROGRAM BASIC_OMP

I cannot comment on the data transfer aspects of the code, they may still be wrong, but I hope this gets you a little further.

mguy44 · ‎11-26-2014

Thank you for your message.

OK I understand why there's only one OpenMP thread on the MIC. It seems more complicated that I thought.

There is parallelism on the host side but I replace it by a simple thing.

In fact, I try to get a very simple example of a larger code (hybrid MPI/OpenMP fortran CFD application) : to learn how to do and next try to implement this in that one.

I do not want to go too deep in the explanations about the real application behind this but it may help. So in that larger one, for each MPI process there's a single OpenMP region that contains the loop in time with, inside it, all the computations and others things. Moreover, if I run in native mode, initializations and other things take too much memory on the MIC. In the loop in time, at each time step, on one side there are calls to "cpu kernels" (and only them) that I want to offload and on other side there are calls to less computational routines that do something else (MPI communications, statistics, I/O ...) that do not scale enough for running efficiently on a MIC. But all of them are parallelized with OpenMP, very few of them run as a single thread. So I do not wish to split my OpenMP region in a lot of smaller ones at each time step.

In my example the two loops stand for the kernels, and the "write" stand for the other routines. Instead of the "write", one can put an other OpenMP loop there, but not offloaded, it's not a problem.

Maybe it is not possible to offload some parts of an OpenMP region and having an OpenMP threads team (no link between them ?) on both sides : host and MIC ?

James_C_Intel2 · ‎11-27-2014

In your current code (ignoring offload for now), is there only really one thread inside the OpenMP parallel region that is doing the serious computation, while the other threads are doing asynchronous system work (maybe transferring halo regions or whatever)? That seems to be what you're saying...

If so, then you can keep all of the current code, and simply do the offload from the one thread that was doing the computation, then go parallel on the Xeon Phi card. The way you had your original code, every single OpenMP thread on the master side is trying to do an offload (which is not a good idea!), whereas you seem to be saying that's not what you want anyway!

Something like this (pardon my Fortran, I used to write FORTRAN IV and Fortran 77, but haven't written any for 20 years or so :-))

!$OMP PARALLEL, PRIVATE(THREADID,TIMESTEP), NUMTHREADS(3)
THREADID = OMP_GET_THREAD_ID()

DO TIMESTEP=1,TEND
IF (THREADID .EQ. 0) THEN
    ! something or other
ELIF (THREADID .EQ. 1) THEN
    ! something or other else 
ELIF
    ! The real computation
!$DIR OFFLOAD BEGIN TARGET(mic:0) ...etc...
!$OMP PARALLEL
    ! Parallel computation on the MIC card
!$OMP END PARALLEL
!$DIR OFFLOAD END
ENDIF

!$OMP BARRIER
END DO
!$OMP END PARALLEL

mguy44 · ‎11-27-2014

Hello,

I understand your example , it's very clear.

Unfortunately, in the OpenMP region, all the work is shared between threads through OpenMP parallel constructs. In general they are OMP DO constructs for nested loops. Nowhere, there's a thread that is working alone with sufficient stuff to do. I'd like to offload whole parallel constructs of this kind on the MIC. I can put them inside a pair of MASTER or SINGLE directives, but what's about the overhead of creating lots of parallel region on the MIC ?

On my own I try to modify my example above with nested parallelism but I lose allocated arrays on the MIC between the two offload sections :

      PROGRAM NESTED_OMP

!$    USE OMP_LIB
      IMPLICIT NONE
      INTEGER,  PARAMETER  ::  N = 512

!DIR$ ATTRIBUTES OFFLOAD:mic:: U, V
      REAL(8),  DIMENSION(:), ALLOCATABLE  ::  U, V
!DIR$ ATTRIBUTES ALIGN:64  :: U, V
!
      INTEGER   ::  NBTHRDS, MYTHRD, IERR=0
!DIR$ ATTRIBUTES OFFLOAD:mic:: NBTHRDS, MYTHRD, IERR, N


#define ALLOC alloc_if(.true.)
#define FREE free_if(.true.)
#define RETAIN free_if(.false.)
#define REUSE alloc_if(.false.)

      WRITE (6,'(A)') 'Starting Computation'
! =================================================
! on the HOST beginning of the main parallel region
! =================================================
!$OMP PARALLEL DEFAULT (NONE) &
!$OMP SHARED (U, V) &
!$OMP PRIVATE(NBTHRDS, MYTHRD, IERR)
!
      NBTHRDS = 1
      MYTHRD  = 0
!$OMP MASTER
      WRITE (6,'(A)') 'on CPU, step 1'
!$    NBTHRDS = OMP_GET_NUM_THREADS ()
!$    MYTHRD  = OMP_GET_THREAD_NUM  ()
!$    WRITE (6,'(A,L3)') 'OMP_IN_PARALLEL : ', OMP_IN_PARALLEL ()
      WRITE (6,'(2(A,I5))') 'NBTHRDS = ', NBTHRDS, ' MYTHRD = ', MYTHRD
!$OMP END MASTER
!$OMP BARRIER
!
! ==========================================
! On the MIC, nested parallelism for offload
! ==========================================
!$OMP MASTER
!DIR$ OMP OFFLOAD target(mic:0) in(U:length(N) ALLOC RETAIN) in(V:length(N) ALLOC RETAIN)
!$OMP PARALLEL DEFAULT (NONE) &
!$OMP SHARED (U, V) &
!$OMP PRIVATE(IERR, NBTHRDS, MYTHRD)

!$OMP MASTER
      WRITE (6,'(A)') 'on MIC, beginning step 2'
!$OMP END MASTER
      NBTHRDS = 1
      MYTHRD  = 0
!$    NBTHRDS = OMP_GET_NUM_THREADS ()
!$    MYTHRD  = OMP_GET_THREAD_NUM  ()
!$OMP MASTER
!$    WRITE (6,'(A,L3)') 'OMP_IN_PARALLEL : ', OMP_IN_PARALLEL ()
      WRITE (6,'(2(A,I5))') 'NBTHRDS = ', NBTHRDS, ' MYTHRD = ', MYTHRD
!$OMP END MASTER

!$OMP MASTER
      ALLOCATE (U(N), V(N), STAT = IERR)
      IF (IERR /= 0) THEN
         WRITE (6,'(A,I8)') 'Allocation Problem U, V, IERR =', IERR
         STOP
      END IF
!$OMP END MASTER
!$OMP BARRIER
!$OMP MASTER
      WRITE (6,'(A)') 'on MIC, still step 2'
      WRITE (6,*) 'U  ', ALLOCATED (U)
      WRITE (6,*) 'V  ', ALLOCATED (V)
      WRITE (6,'(A)') 'on MIC, end step 2'
!$OMP END MASTER

! =====================================
! on the MIC, end of nested parallelism
! =====================================
!$OMP END PARALLEL
!$OMP END MASTER

! ================
! back to the HOST
! ================

!$OMP MASTER
      WRITE (6,'(A)') 'on CPU, step 3, between offloads'
      WRITE (6,*) 'U  ', ALLOCATED (U)
      WRITE (6,*) 'V  ', ALLOCATED (V)
      WRITE (6,'(A)') 'on CPU, end step 3'
!$OMP END MASTER

!$OMP BARRIER

! =======================================================
! Second nested parallelism on the MIC for second offload
! =======================================================
!$OMP MASTER
!DIR$ OMP OFFLOAD target(mic:0) in(U:length(n) REUSE FREE) in(V:length(n) REUSE FREE)
!$OMP PARALLEL DEFAULT (NONE) &
!$OMP SHARED (U, V) &
!$OMP PRIVATE(NBTHRDS, MYTHRD, IERR)

!$OMP MASTER
      WRITE (6,'(A)') 'on MIC, step 4'
      WRITE (6,*) 'U  ', ALLOCATED (U)
      WRITE (6,*) 'V  ', ALLOCATED (V)
!$OMP END MASTER

!$OMP MASTER
      DEALLOCATE (U,V,STAT=IERR)
      IF (IERR /= 0) THEN
         WRITE (6,'(A,I8)') 'DeAllocation Problem U, V, IERR =', IERR
         STOP
      END IF
      WRITE (6,'(A)') 'on MIC, end step 4'
      CALL FLUSH (6)
!$OMP END MASTER

!$OMP END PARALLEL
! ===========================================
! on the MIC end of second nested parallelism
! ===========================================
!$OMP END MASTER

! =====================
! on the HOST, for good
! =====================

!$OMP MASTER
      WRITE (6,'(A)') 'on CPU, step 5'
!$OMP END MASTER

!$OMP END PARALLEL
! =======================================
! on the HOST end of main parallel region
! =======================================

      STOP
      END PROGRAM NESTED_OMP

then I compile :

ifort -O3 -openmp nested_omp.F90 -o nested_offload.out

Here is what I got at runtime :

export OMP_NUM_THREADS=4
export MIC_ENV_PREFIX=MIC
export MIC_OMP_NUM_THREADS=118
./nested_offload.out 
Starting Computation
on CPU, step 1
OMP_IN_PARALLEL :   T
NBTHRDS =     4 MYTHRD =     0
on MIC, beginning step 2
OMP_IN_PARALLEL :   T
NBTHRDS =   118 MYTHRD =     0
on MIC, still step 2
 U   T
 V   T
on MIC, end step 2
on CPU, step 3, between offloads
 U   F
 V   F
on CPU, end step 3
on MIC, step 4
 U   F
 V   F
DeAllocation Problem U, V, IERR =     153
offload error: process on the device 0 unexpectedly exited with code 0

So in step 2, U and V are allocated, I use the RETAIN word to try to keep them till the next offload zone, but they are lost in step 4, although they are SHARED everywhere. Maybe that doesn't matter/make sense here ?

James_C_Intel2 · ‎11-27-2014

That looks fairly plausible from a control flow and parallelism point of view. In effect you're doing something like this

!$OMP PARALLEL

! A parallel loop on the master
!$OMP DO
do ...
end do

! Now a loop on the KNC
! We would need another barrier here if we weren't immediately after the implicit
! barrier at the end of the previous omp do loop.
!$OMP MASTER
!$DIR OFFLOAD BEGIN TARGET(mic:0) ...

!$OMP PARALLEL DO
do ...
...
end do
!$DIR OFFLOAD END
!$OMP END MASTER
!$OMP BARRIER

So the critical point is that you ensure that all the threads on the host reach the point where you're going to execute the master statement to offload before you execute that, and that they all wait for the offload to complete.

The cost of creating a parallel region on the Phi is not trivial, but not exceptionally huge once you have done it once. (Creating all the underlying pthreads is expensive, but they're not killed and re-created between parallel regions).

I'm afraid I can't help with the offload data movement, which is a deep art that I am not qualified in.

jimdempseyatthecove · ‎11-27-2014

Perhaps the following sketch may be useful:

On the host use the OMP TASK construct to create two tasks: HostTask, MicTask. The MicTask performs the offloads.

The HostTask contains the !$OMP PARALLEL... constructions for host processing
Inside the offload of the MicTask the code contains the !$OMP PARALLEL... constructions for MIC processing

The host side of the MicTask can use the offload signal and wait such that the MicTask (host side) can manage any synchronization (asynchronously) with the HostTask. (essentially performing an enqueue/dequeue of tasks to be performed on the MIC).

While this will complicate the means of tasking between host and MIC, it will eliminate any overhead of entering and exiting outer parallel regions.

Jim Dempsey

jimdempseyatthecove · ‎11-27-2014

Additional comment.

I am not using the latest compiler. What I have permits the asynchronous offload (signal/wait) permitting the host to continue running while MIC side is running, but it does not permit a second offload (data transfer only) to occur while the pending asynchronous offload is running (same MIC). Meaning, the version of compiler and MPSS I have won't permit "double buffering". This may have changed, dig deep into the latest documentation.

Even without this capability, if it really was necessary. Meaning you had a lot of dead time of each end taking turns waiting for the other end and your analysis indicates you could recoup a lot of dead time if only you could double buffer. You could potentially use an inter-process messaging system, like MPI to concurrently transport data amongst running processes, each respectively remaining inside the parallel region.

Keep in mind that MPI can be used for simple messaging as opposed to using coarray and ranks.

Jim Dempsey

mguy44 · ‎11-27-2014

Hello Jim,

Thank you for your messages.

I do not know OpenMP tasks, I have to learn about using them, especially when you talk about performing an enqueue/dequeue of tasks to be performed on the MIC. For the moment, it sounds complicated to me.

mguy44 · ‎11-27-2014

James Cownie (Intel) wrote:

That looks fairly plausible from a control flow and parallelism point of view. In effect you're doing something like this
!$OMP PARALLEL

! A parallel loop on the master
!$OMP DO
do ...
end do

! Now a loop on the KNC
! We would need another barrier here if we weren't immediately after the implicit
! barrier at the end of the previous omp do loop.
!$OMP MASTER
!$DIR OFFLOAD BEGIN TARGET(mic:0) ...

!$OMP PARALLEL DO
do ...
...
end do
!$DIR OFFLOAD END
!$OMP END MASTER
!$OMP BARRIER
So the critical point is that you ensure that all the threads on the host reach the point where you're going to execute the master statement to offload before you execute that, and that they all wait for the offload to complete.

The cost of creating a parallel region on the Phi is not trivial, but not exceptionally huge once you have done it once. (Creating all the underlying pthreads is expensive, but they're not killed and re-created between parallel regions).

I'm afraid I can't help with the offload data movement, which is a deep art that I am not qualified in.

Synchronizing all OpenMP threads before and after the offload section is not a problem. I think that putting an OpenMP BARRIER should solve the problem. I modify my example with your suggestion. The only remaining problem is the loss of the dynamicaly allocated arrays U and V between the two offload sections. I'm working on it.

jimdempseyatthecove · ‎11-27-2014

The arrays can still be dynamically allocatable. You would have to measure the cost of having the host side MicTask performing signal/wait (plus having MIC side enter/exit parallel region) verses using MPI or other messaging system (MPI is build for low latency/hig bandwidth transfers). Your U and V array descriptors may have to be static in both domains (they already are). For the double buffering you might require two sets:

!DIR$ ATTRIBUTES OFFLOAD:mic:: U0, V0, U1, V1
      REAL(8),  DIMENSION(:), ALLOCATABLE  ::  U0, V0, U1, V1
!DIR$ ATTRIBUTES ALIGN:64  :: U0, V0, U1, V2
!DIR$ ATTRIBUTES OFFLOAD:mic:: MicTaskMic
subroutine MicTaskMic
      INTEGER :: Buffer
      Buffer = -1 ! not yet received 1st buffer
      do while(Buffer .ne. -2) ! not done
         Buffer = YourMPIreader()
         if(Buffer .ge. 0) then
           !$omp taskwait ! 1st iteration, no wait, subsequent interations possible wait
           if(iand(Buffer, 1) .eq. 0) then
              !$omp task
                call YourProcessBuffer(U0,V0)
             !$omp end task
           else
                !$omp task
                call YourProcessBuffer(U1,V1)
             !$omp end task
           endif
         endif
      end do
!$omp parallel
!$omp master
!$omp end master
!$omp end parallel
end subroutine MicTaskMic
...

!$omp parallel
!$omp master
!$omp task
call MicTask
!$omp end task
call HostTask
!$omp taskwait
!$omp end master
!$omp end parallel
 
subroutine HostTask
! what you formerly viewed as host code outside of first level parallel region
...
end subroutine HostTask

subroutine MicTask
!DIR$ OFFLOAD BEGIN TARGET(mic:0)...
call MicTaskMic
...

The YourMPIreader can block for message received, and it alternates receiving into the 0'th the 1'th buffers.

The above will need a lot of work.

>>I think that putting an OpenMP BARRIER should solve the problem

Not necessarily.

OpenMP barrier on Xeon Phi with (up to) 244 threads can get rather expensive. On an example program (not synthetic benchmark) ~25% of the time was waiting at a barrier. This was with all threads presumably having the same work (at least by counting flops they did). Actual runtime per thread per parallel region, assuming same work, will vary greatly due to cache hit/miss differences between threads.

Jim Dempsey

mguy44 · ‎12-04-2014

Hello,

I'm working on your suggestions and I continue my tests. I have a question about one small code derived from above. Sometimes it executes fine, sometimes I get an error message :

COI sink startup: unknown exception in initializer
terminate called after throwing an instance of 'COIRESULT'
offload error: process on the device 0 was terminated by signal 6 (SIGABRT)

This message seems to appear at the beginning of the first offloaded region. I try running the code through idb but it doesn't see anything even when I get the message above ("threads exited normally").

Could someone give me some explanations or a pointer to a documentation about this message?