
offload inside parallel region: problem with private allocatable

Domenico_B_
Beginner

hi,

I'm trying to offload some computation inside a parallel OMP region, and I have problems with a PRIVATE allocatable array. I paste here a simple example that shows the problem. The first time, the offloaded code works as it should. The second time, the ALLOCATABLE variable p4 is not updated on the MIC.

The output is:

p1:           1
 p2:           2
 p3:           1           3
 p3:           2           4
 p3:           3           5
 p3:           4           6
 p3:           5           7
 p3:           6           8
 p3:           7           9
 p3:           8          10
 p3:           9          11
 p3:          10          12
 p4:           1          13
 p4:           2          14
 p4:           3          15
 p4:           4          16
 p4:           5          17
 p4:           6          18
 p4:           7          19
 p4:           8          20
 p4:           9          21
 p4:          10          22
 
 ----------------
 
 p1:           1
 p2:          23
 p3:           1          24
 p3:           2          25
 p3:           3          26
 p3:           4          27
 p3:           5          28
 p3:           6          29
 p3:           7          30
 p3:           8          31
 p3:           9          32
 p3:          10          33
 p4:           1          13
 p4:           2          14
 p4:           3          15
 p4:           4          16
 p4:           5          17
 p4:           6          18
 p4:           7          19
 p4:           8          20
 p4:           9          21
 p4:          10          22

 

I compile with:

ifort -g -O0 -warn all -qoffload=mandatory  -qopenmp prova.f

I checked that the offloads actually take place. There is no problem if the p4 variable is SHARED. I have also tried all combinations of IN clauses and ATTRIBUTES OFFLOAD declarations.

Thank you very much.

 

      implicit none
      integer, parameter :: p1=1
      integer :: p2
      integer :: p3(10)
      integer, allocatable :: p4(:)
      integer :: i

      p2=2

      do i=1,10
        p3(i)=2+i
      enddo

      allocate(p4(10))

      do i=1,10
        p4(i)=12+i
      enddo

!DIR$ OFFLOAD BEGIN target(mic:0)
      print *,'p1:',p1
      print *,'p2:',p2
      do i=1,10
        print *,'p3:',i,p3(i)
      enddo
      do i=1,10
        print *,'p4:',i,p4(i)
      enddo
      print *,''
      print *,'----------------'
      print *,''
!DIR$ END OFFLOAD

!$OMP PARALLEL DEFAULT(none) NUM_THREADS(1)
!$OMP1 PRIVATE(i,p2,p3,p4)

      p2=23

      do i=1,10
        p3(i)=23+i
      enddo

      do i=1,10
        p4(i)=33+i
      enddo

!DIR$ OFFLOAD BEGIN target(mic:0)
      print *,'p1:',p1
      print *,'p2:',p2
      do i=1,10
        print *,'p3:',i,p3(i)
      enddo
      do i=1,10
        print *,'p4:',i,p4(i)
      enddo
      print *,''
      print *,'----------------'
      print *,''
!DIR$ END OFFLOAD

!$OMP END PARALLEL

      stop
      end

 

Frances_R_Intel
Employee

What you are seeing is the difference between how stack and heap data are handled.

By default, data stored on the stack are not retained on the coprocessor between offload sections. So p1, p2 and p3 are created on the coprocessor when the first offload section starts and destroyed when that offload section terminates. For the second offload section, brand new versions of p1, p2 and p3 are created on the coprocessor. It doesn't matter that in the second offload section, p1, p2 and p3 have different addresses because of their private status. They are all newly created on the coprocessor anyway.

By default, data stored on the heap are retained between offload sections. So, when you enter the second offload section, there is already a p4 on the coprocessor. Unfortunately, that p4 on the coprocessor maps to the original p4, not to the private copy of p4. So, when you print out p4 on the coprocessor, you are seeing the original values. When you use shared instead of private variables, the p4 array in the parallel section has the same address as it does outside the section and therefore maps to the p4 array on the coprocessor.
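If you do need to keep this structure, one thing that may be worth trying (I have not tested it against your exact example) is to list p4 explicitly with alloc_if/free_if modifiers on the offloads, so that a fresh coprocessor copy is allocated, transferred and freed each time instead of reusing the retained one. Roughly:

!DIR$ OFFLOAD BEGIN target(mic:0)
!DIR$& in(p4 : alloc_if(.true.) free_if(.true.))
      do i=1,10
        print *,'p4:',i,p4(i)
      enddo
!DIR$ END OFFLOAD

(The second line is a fixed-form directive continuation; in free-form source you would just continue the clause list with an ampersand.)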

That said, I don't think you really want to write the code the way you have it. You do not want to have multiple offload sections in separate threads all trying to hit the same coprocessor. What you would want to do is offload a section of code to the coprocessor and then start up your OMP threads there.
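Using the variables from your example, the shape of that would be roughly as follows (a sketch only; the loop body and the thread count are placeholders):

!DIR$ OFFLOAD BEGIN target(mic:0) in(p3) inout(p4)
!$OMP PARALLEL DO PRIVATE(i) NUM_THREADS(4)
      do i=1,10
        p4(i)=p4(i)+p3(i)
      enddo
!$OMP END PARALLEL DO
!DIR$ END OFFLOAD

The host thread enters the offload once, the OMP team is created and joined on the coprocessor, and there is only one offload section touching the card.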

 

Domenico_B_
Beginner

Thank you very much. That helps.

Now, just to be sure I understand correctly: allocatable arrays go on the heap and statically sized variables on the stack?

Your suggestion about the coding strategy raises the following questions:

Suppose I have a server with 24 cores and 2 MICs. I can arrange the code to have an OMP calculation on the host and a concurrent (to some extent) OMP calculation on the MIC. Now my questions:

1. Shall I offload to the MIC from outside the OMP region on the host?

2. In order to exploit both MIC cards, I would like to have two equivalent 'jobs', each running on 12 cores + 1 MIC, and then average their results (this is a Monte Carlo code, so averaging over multiple realizations is good!). How can I do that? Shall I run two completely separate jobs? Shall I use some MPI-OpenMP combination? Any suggestion would be greatly appreciated.

jimdempseyatthecove
Honored Contributor III

The problem you have to resolve, with code on your part, is thread team coordination. Any host thread, whether inside or outside a parallel region, that enters an offload region is ostensibly entering that region under reentrancy rules. When the offload is made from outside a parallel region on the host, the degree of reentrancy is 1 (not reentrant at the start). If it is entered from within a parallel region on the host, the degree of reentrancy is higher, potentially as high as the number of threads in that host parallel region. Furthermore, for each concurrent offload to instantiate its own OMP thread team, nested parallelism may need to be enabled on the MIC (though this may be true by default).

Now, the problem you have to control is that you do not want 12 or 24 host threads concurrently, each inside its own offload on the MIC and each using 240 threads (2880 or 5760 threads in total). There are ways to manipulate OpenMP to get what you want, but it is not easy (until you see how it is done). There is a thread on the MIC forum where a person wanted to make concurrent offloads containing MKL parallel computations. Note that the parallel MKL library internally uses OpenMP, so that is an offload performed within a parallel region on the host, containing OpenMP parallel region(s) inside the offload. In other words, essentially what you want to do. It would be worth your while to see how he resolved this (with a little help from us on the forum).
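Something along these lines (untested; the thread counts are placeholders, and as noted above you may need nested parallelism enabled on the coprocessor):

      use omp_lib
      integer :: icard
! two host threads, one per coprocessor card
!$OMP PARALLEL NUM_THREADS(2) PRIVATE(icard)
      icard = omp_get_thread_num()
!DIR$ OFFLOAD BEGIN target(mic:icard)
! bound the team size on the card instead of leaving it at the default
!$OMP PARALLEL NUM_THREADS(120)
! ... this card's share of the Monte Carlo samples ...
!$OMP END PARALLEL
!DIR$ END OFFLOAD
!$OMP END PARALLEL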

For Monte Carlo, I suggest you consider using a double-buffering technique. Perform the first offload to get the first batch of results; then, before processing that batch, launch the next asynchronous offload using the signal clause; process the prior batch while the new offload runs; then wait for the signal and loop back to launch the next asynchronous offload.
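A skeleton of that pattern, using the asynchronous form of the OFFLOAD directive on a call (mc_batch, accumulate and nbatch are placeholders for your own routines and batch count; I have not compiled this):

!DIR$ ATTRIBUTES OFFLOAD : mic :: mc_batch
      real :: batch_a(1000), batch_b(1000)
      integer :: sig, ibatch

! launch the first batch asynchronously on the coprocessor
!DIR$ OFFLOAD target(mic:0) signal(sig) out(batch_a)
      call mc_batch(batch_a)

      do ibatch = 2, nbatch
! wait for the batch in flight, keep a copy, relaunch at once
!DIR$ OFFLOAD_WAIT target(mic:0) wait(sig)
        batch_b = batch_a
!DIR$ OFFLOAD target(mic:0) signal(sig) out(batch_a)
        call mc_batch(batch_a)
! process the previous batch on the host while the card works
        call accumulate(batch_b)
      enddo

! collect the last batch
!DIR$ OFFLOAD_WAIT target(mic:0) wait(sig)
      call accumulate(batch_a)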

If you find that your host threads are mostly waiting, then consider using the omp task feature to launch some producers.

Jim Dempsey

Domenico_B_
Beginner

Dear Jim,

Thank you. Could you please point me to the forum thread you are referring to?

I found this one:

https://software.intel.com/en-us/forums/topic/518039

but it does not seem to be the right one.

Also, can you suggest where to get additional info on "degrees of reentrancy"?

jimdempseyatthecove
Honored Contributor III

I tried locating the forum message but was unable to get a hit on my search terms.

Jim Dempsey
