help with asynchronous calculation

conor_p_ · ‎06-26-2015

Hello, I would like to run an asynchronous calculation, but I am having a hard time understanding what the Intel User and Reference Guide says about this. I have code that looks like the following.

    signal_value  = 1
    !dir$ offload  target(mic:0) signal(signal_value),&
    !dir$ in(position: alloc_if(.false.),free_if(.false.)),&
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&
    !dir$ nocopy(nlist: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(numneigh: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))
    !---asynchronous computation subroutine on MIC
    call lj_cut_coul_dsf_nonewton(step)

    !---asynchronous computation on CPU host (next 4 subroutine calls)
    call bond_harmonic(step)
    call angle_harmonic(step)
    call dihedral_opls(step)
    call improper_harmonic(step)


    !dir$ offload_wait target(mic:0) WAIT (signal_value)

In this code, lj_cut_coul_dsf_nonewton consists of a block of code inside OpenMP directives that I would like to run asynchronously on the Xeon Phi coprocessor. Everything in this subroutine runs on the coprocessor, and the data clauses for all the necessary arrays appear in the offload statement above.

    !dir$ attributes offload:mic :: lj_cut_coul_dsf_nonewton
    subroutine lj_cut_coul_dsf_nonewton(step)

    !$omp parallel do default(firstprivate),&
    !$omp& shared(position,ff,nlist,numneigh,q,lj1,lj2,lj3,lj4)
    ! ... calculate nonbonded forces for molecular dynamics on MIC ...
    !$omp end parallel do

    end subroutine

As shown in the comment, I want bond_harmonic, angle_harmonic, dihedral_opls, and improper_harmonic to run asynchronously on the host CPU. However, when I compile the code, I get errors saying that global variables inside bond_harmonic, angle_harmonic, dihedral_opls, and improper_harmonic need to be declared with an offload target attribute. This makes me think that I do not properly understand what the code is doing. I should not have to declare, and most importantly allocate memory for, these arrays/variables, since they are never going to be on the coprocessor and are only supposed to be used asynchronously on the CPU. Could someone tell me if my understanding is correct, or where I am going wrong, before I go about changing my code?

conor_p_ · ‎06-26-2015

Also, I have been staring at this for a while and apologize if this is something dumb, but I am getting the errors

"a global variable within a procedure with the offload:target attribute must have the offload:target attribute [type]"

"a global variable within a procedure with the offload:target attribute must have the offload:target attribute [size]"

Now, I will post the code lj_cut_coul_dsf_nonewton, which is the only procedure with an offload attribute, where I have no variables called type or size. What could be generating this error?

 !dir$ attributes offload:mic :: lj_cut_coul_dsf_nonewton
  subroutine lj_cut_coul_dsf_nonewton(step)
    implicit none
    real*4 :: force,forcelj,forcecoul
    real*4 :: x1,y1,z1,x2,y2,z2
    real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i,dri
    double precision :: ffx,ffy,ffz
    real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
    real*4 :: boxdx,boxdy,boxdz
    integer :: i,j,l,step
    integer :: itype,jtype,neigh
    integer :: tid,num
    integer :: offset,ioffset,neigh_off
    integer :: T1,T2,clock_rate,clock_max
    

    !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(firstprivate),&
    !$omp& shared(position,ff,nlist,numneigh,q,lj1,lj2,lj3,lj4)
    do i = 1 ,np
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z; itype = position(i)%type
       qtmp = q(i)
       ioffset = (itype-1)*numAtomType
       neigh_off = neigh_alloc*(i-1)
       num = numneigh(i)
     
       ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0   

       !dir$ vector aligned
       !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
       do j= 1,num
          
          neigh = nlist(neigh_off+j)
         
          dx = x1-position(neigh)%x
          dy = y1-position(neigh)%y
          dz = z1-position(neigh)%z
          jtype = position(neigh)%type
         
          boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
          boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
          boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
          boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)

          dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
          dr2 = dx*dx + dy*dy + dz*dz

          !---lennard jones interactions
          dr2i = 1.0d0/dr2
          dr6i = dr2i*dr2i*dr2i
          if(dr2.gt.rcut2)dr6i=0.0d0
          offset = ioffset + jtype
          forcelj = dr6i*(lj1(offset)*dr6i-lj2(offset))
          potential = potential + dr6i*(dr6i*lj3(offset)-lj4(offset))      


          !---electrostatic calculations     
          r = sqrt(dr2)
          dri = 1.0d0/r
          prefactor =  qtmp*q(neigh)*dri
          if(dr2.gt.cut_coulsq)prefactor =0.0d0
          erfcd = exp(-alpha*alpha*r*r)
          t = 1.0 / (1.0 + EWALD_P*alpha*r)
          erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
          forcecoul = prefactor * (erfcc*dri + 2.0*alpha*MY_PIS_INV*erfcd +&
               r*f_shift) * r
          e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)             
             
          force   = (forcecoul+forcelj)*dr2i

          ffx = ffx + dx*force
          ffy = ffy + dy*force
          ffz = ffz + dz*force
          
       enddo
       ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
    enddo
    !$omp end parallel do 
    
  end subroutine lj_cut_coul_dsf_nonewton

jimdempseyatthecove · ‎06-27-2015

Conor,

The signal argument is not a value. It is an arbitrary variable whose address is used to disambiguate sessions. If, for example, the signal variable is a stack-local variable in a subroutine that is called from multiple threads within a parallel region, then each thread, using the same named variable (possibly with the same value), has a different address for the variable. However, if you mistakenly place the signal variable in a module or give it the SAVE attribute, then you will have issues if you have concurrent asynchronous offloads attempting to use the same variable (address). Signal variables can be used within a module, as illustrated in the user guide, but then they must be used exclusively (non-concurrently) for a single purpose (per MIC).

Now for your issue at hand.

The subroutine you listed above contains array references (at least to position(:) and nlist(:)) that are neither declared nor brought in through a use statement. Yet you have implicit none and no error report.

If this is a copy-and-paste issue in the posting above, and these arrays exist in a module, then you first have to determine whether these arrays are to be accessible by both the host and the MIC, or used only within the MIC. When they are used by both, you will need an offload transfer to synchronize the data.

The variables and arrays declared in a module can be given attributes to indicate that they reside on the MIC as well as on the host. While the names of these variables/arrays can exist in both places (host/MIC), the physical storage locations differ. You are required to transfer data between the two areas when applicable. Additionally, allocatable arrays that are attributed in both places must be allocated in both places. Please see some of the example programs.

Jim Dempsey

Mudit_Sharma · ‎06-27-2015

Hi, I would like to know about LMASK, to suggest which method will be better for vectorization.

Thanks!

Edit: Sorry for the irrelevance to this topic. I was on another page of the Zone.

Sorry!

conor_p_ · ‎06-27-2015

Thanks for your help, Jim! However, I don't think I quite followed what you were saying about signal_tag. I remember that when I was trying to do asynchronous offloads, I had to initialize the value of the signal tag. I asked this question in the following thread
https://software.intel.com/en-us/comment/1795195#comment-1795195

If you scroll to the bottom, Kevin mentions "The signal tag must be initialized to a non-zero unique (from other signal variable's value where more than one signal is used) value. Add a non-zero initialization of signal1 before (line 51) the use in line 55." Is this not the case for the asynchronous calculation? Would the following, where signal_value is neither declared nor initialized, be correct?

subroutine force_wrapper(step,neighbor_flag)

    implicit none
    integer :: step
    integer :: neighbor_flag

   
    !dir$ offload  target(mic:0) signal(signal_value),&
    !dir$ in(position: alloc_if(.false.),free_if(.false.)),&
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&
    !dir$ nocopy(nlist: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(numneigh: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))
    call lj_cut_coul_dsf_nonewton(step)

    !---asynchronous computation
    call bond_harmonic(step)
    call angle_harmonic(step)
    call dihedral_opls(step)
    call improper_harmonic(step)


    !dir$ offload_wait target(mic:0) WAIT (signal_value)

Also, regarding the position and nlist error, I apologize. I should have been more descriptive. Those variables are indeed declared and allocated as global arrays in separate modules. In fact, I have previously been running lj_cut_coul_dsf_nonewton successfully as follows, where the subroutine itself initiates the offload

 subroutine lj_cut_coul_dsf_nonewton(step)
    implicit none
    real*4 :: force,forcelj,forcecoul
    real*4 :: x1,y1,z1,x2,y2,z2
    real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i,dri
    double precision :: ffx,ffy,ffz
    real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
    real*4 :: boxdx,boxdy,boxdz
    integer :: i,j,l,step
    integer :: itype,jtype,neigh
    integer :: tid,num
    integer :: offset,ioffset,neigh_off
    integer :: T1,T2,clock_rate,clock_max
    
    call system_clock(T1,clock_rate,clock_max)
    
    !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&
    !dir$ nocopy(nlist: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(numneigh: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))
  
    
    !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(firstprivate),&
    !$omp& shared(position,ff,nlist,numneigh,q,lj1,lj2,lj3,lj4)
    do i = 1 ,np
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z; itype = position(i)%type
       qtmp = q(i)
       ioffset = (itype-1)*numAtomType
       neigh_off = neigh_alloc*(i-1)
       num = numneigh(i)
     
       ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0   

       !dir$ vector aligned
       !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
       do j= 1,num
          
          neigh = nlist(neigh_off+j)
         
          dx = x1-position(neigh)%x
          dy = y1-position(neigh)%y
          dz = z1-position(neigh)%z
          jtype = position(neigh)%type
         
          boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
          boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
          boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
          boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)

          dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
          dr2 = dx*dx + dy*dy + dz*dz

          !---lennard jones interactions
          dr2i = 1.0d0/dr2
          dr6i = dr2i*dr2i*dr2i
          if(dr2.gt.rcut2)dr6i=0.0d0
          offset = ioffset + jtype
          forcelj = dr6i*(lj1(offset)*dr6i-lj2(offset))
          potential = potential + dr6i*(dr6i*lj3(offset)-lj4(offset))      


          !---electrostatic calculations     
          r = sqrt(dr2)
          dri = 1.0d0/r
          prefactor =  qtmp*q(neigh)*dri
          if(dr2.gt.cut_coulsq)prefactor =0.0d0
          erfcd = exp(-alpha*alpha*r*r)
          t = 1.0 / (1.0 + EWALD_P*alpha*r)
          erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
          forcecoul = prefactor * (erfcc*dri + 2.0*alpha*MY_PIS_INV*erfcd +&
               r*f_shift) * r
          e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)             
             
          force   = (forcecoul+forcelj)*dr2i

          ffx = ffx + dx*force
          ffy = ffy + dy*force
          ffz = ffz + dz*force
          
       enddo
       ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
    enddo
    !$omp end parallel do 
    !dir$ end offload
    call system_clock(T2,clock_rate,clock_max)

   ! print*,'elapsed time in force',real(T2-T1)/real(clock_rate)
    time_nonbond = time_nonbond + real(T2-T1)/real(clock_rate)
    potential = 0.50d0*potential
    e_coul = 0.50d0*e_coul
    
  end subroutine lj_cut_coul_dsf_nonewton

As you can see, the only change I made in the code was to declare lj_cut_coul_dsf_nonewton as

!dir$ attributes offload:mic :: lj_cut_coul_dsf_nonewton
subroutine lj_cut_coul_dsf_nonewton(step)

I then removed the offload directive in this subroutine, and I just call it during the asynchronous computation. However, now I am getting these errors. Could this possibly be a compiler-specific error? I am currently using intel/13.1.1.163

Frances_R_Intel · ‎06-29-2015

The confusion with the signal variable is a Fortran versus C/C++ issue. In both cases the value passed to signal( ) must be unique. I added a more complete explanation to that forum issue you cited but the short answer is that, for Fortran, signal( ) expects to be passed an integer value. What I recommend is that the value you use be LOC(array_name), where array_name is the name of some array being used in the offload directive. This means you don't need to remember that for some particular offload, you use a signal variable set to 6 or 10 or whatever; you only need to remember that you were moving an array named array_name.
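
As a rough sketch of that recommendation (not taken from Conor's code: the program name, array size, and loop bodies are made up for illustration), the same LOC(ff) expression is used as the tag in both signal( ) and wait( ). It assumes the Intel offload directives behave as described in this thread; a compiler that does not recognize them treats the `!dir$` lines as comments and runs everything on the host.

```fortran
program async_signal_sketch
  implicit none
  integer, parameter :: n = 1000   ! illustrative size, not from the thread
  real :: ff(n)
  real :: host_sum
  integer :: i, j

  ff = 0.0
  host_sum = 0.0

  ! Start the offload asynchronously; LOC(ff) serves as the integer tag.
  !dir$ offload begin target(mic:0) signal(loc(ff)) inout(ff)
  do i = 1, n
     ff(i) = real(i)
  end do
  !dir$ end offload

  ! Host-only work that does not touch ff while the offload is in flight.
  do j = 1, n
     host_sum = host_sum + 1.0
  end do

  ! Block until the offload tagged with LOC(ff) has completed.
  !dir$ offload_wait target(mic:0) wait(loc(ff))

  print *, 'ff(n) =', ff(n), ' host_sum =', host_sum
end program async_signal_sketch
```

Because the tag is derived from the array itself, two offloads that move different arrays get different tags without any bookkeeping of hand-picked integers.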

Now, as to the message about type and size needing to have the offload attribute: I don't see where size is, but I did find type, and I think it comes down to the variable 'position' not being declared. You have said that position is declared inside a module, but I don't see the 'use' statement anywhere. If the compiler does not see any declaration for the variable named position, it will do the best it can, trying to figure out what you mean. And in this case, the best it can do is interpret position(i) as a function call (of unknown type, but a function call nonetheless). So then, in position(i)%type, what the heck is type? Well, that must be a variable name. In your example showing an earlier version of lj_cut_coul_dsf_nonewton, you have given the compiler a clue that 'position' is a variable name because it is used as such in the offload statement. I'm kind of surprised that the code ran correctly, but not that it compiled. If you add the use statement for your module to your current lj_cut_coul_dsf_nonewton, then the error message about type should disappear.

As to the global variables inside bond_harmonic, angle_harmonic, dihedral_opls, and improper_harmonic - I think Jim is right on target about this. Are those variables part of a module? Does that module have the offload attribute? It's an all or nothing thing when it comes to the module. Either the whole thing has the offload attribute or it doesn't.

As to the offload statement in force_wrapper: you are declaring all the variables as alloc_if(.false.). Did you allocate the space on the coprocessor somewhere before this? If not, you could be in trouble.
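
For illustration, one common way to set this up is a one-time offload_transfer with alloc_if(.true.) and free_if(.false.), after which later offloads can reuse the coprocessor buffer with alloc_if(.false.). The sketch below is an assumption about how that could look, not code from this thread: the array q is borrowed from the posted clauses, while the program name, size, and loop body are invented.

```fortran
program persistent_alloc_sketch
  implicit none
  integer, parameter :: n = 1000   ! illustrative size, not from the thread
  real, allocatable :: q(:)
  real :: total
  integer :: i

  allocate(q(n))
  q = 1.0
  total = 0.0

  ! One-time setup: allocate q on the coprocessor, copy it in, and keep it.
  !dir$ offload_transfer target(mic:0) in(q: alloc_if(.true.) free_if(.false.))

  ! Later offloads reuse the existing coprocessor allocation.
  !dir$ offload begin target(mic:0) in(q: alloc_if(.false.) free_if(.false.)) inout(total)
  do i = 1, n
     total = total + q(i)
  end do
  !dir$ end offload

  ! Cleanup: release the coprocessor copy without transferring any data.
  !dir$ offload_transfer target(mic:0) nocopy(q: alloc_if(.false.) free_if(.true.))

  print *, 'total =', total
  deallocate(q)
end program persistent_alloc_sketch
```

If the first transfer is missing, the alloc_if(.false.) clauses refer to coprocessor memory that was never allocated, which is the trouble mentioned above.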

jimdempseyatthecove · ‎06-30-2015

Frances,

>> It's an all or nothing thing when it comes to the module. Either the whole thing has the offload attribute or it doesn't.

All the examples I've seen illustrate attributing individual elements within a module (e.g. an array or a contained routine). Are you indicating that one can also attribute the "module foo" itself? (too?)

Jim Dempsey

Frances_R_Intel · ‎06-30-2015

You know, Jim, I can sometimes say the silliest things in a very public way. Yes, you are right, it is the elements inside the module and not the module itself that have the offload attribute.
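
A minimal sketch of what per-element attribution could look like, together with the use statement discussed earlier. The module names md_data and md_forces, the routine scale_forces, and the array sizes are hypothetical; only the names position and ff come from the posted code, and they are simplified here to plain real arrays.

```fortran
module md_data                     ! hypothetical module name
  implicit none
  integer, parameter :: np = 4     ! illustrative size
  real :: position(np), ff(np)
  ! Attribute the individual variables, not the module as a whole.
  !dir$ attributes offload:mic :: position, ff
end module md_data

module md_forces                   ! hypothetical module name
  implicit none
contains
  !dir$ attributes offload:mic :: scale_forces
  subroutine scale_forces(factor)
    use md_data                    ! makes position and ff visible here
    real, intent(in) :: factor
    integer :: i
    do i = 1, np
       ff(i) = factor*position(i)
    end do
  end subroutine scale_forces
end module md_forces

program module_attribute_sketch
  use md_data
  use md_forces
  implicit none

  position = 1.0
  ff = 0.0

  ! Offload the attributed routine; position goes in, ff comes back.
  !dir$ offload target(mic:0) in(position) out(ff)
  call scale_forces(2.0)

  print *, ff
end program module_attribute_sketch
```

Host-only routines such as bond_harmonic would stay without the attribute, and any module variables they touch would not need it unless an offloaded routine also references them.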

So, Conor, have you managed to figure out why you are getting errors saying that global variables inside bond_harmonic, angle_harmonic, dihedral_opls, and improper_harmonic need to be declared with an offload target attribute or are you still seeing that?