
double precision vs. single precision and vector alignment on MIC

conor_p_
Beginner

Hello everyone, I have a subroutine that serves as the force calculation in an n-body simulation, shown below. There are a couple of arrays involved that I will describe first, which will hopefully make the subroutine a little clearer.

1. array position. It contains the positions of the particles and their atom types. It is of length np, the number of particles:

type aos
    real*4 :: x,y,z
    integer :: type
end type aos

type(aos) :: position(np) 

2. array neigh (nlist in the code below). It contains a list of indices of the particles that are within a certain distance of a given particle. It is of type integer.

3. arrays nnstart and nnfinish (nnfirst and nnlast in the code below). nnstart(i) and nnfinish(i) give the starting and ending locations of the data for particle i in array neigh: if I am particle i, my neighbors are the entries of neigh between indices nnstart(i) and nnfinish(i). These are also integer arrays.

4. array q. An array of real*4 holding the charge of each atom.

5. arrays lj1, lj2, lj3, lj4. These are all arrays of real*4; they provide the parameters for the atomic interactions.

6. array ff. This is the array that stores the force on each particle. It is of the following type:

type ff_data
     real*4 :: x,y,z
end type ff_data

All of these arrays are globals and are declared for the MIC as follows:

!dir$ attributes offload:mic::position
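For reference, the globals look roughly like this (a trimmed sketch only; the real declarations live in global.f90, and the module name, exact kinds, and the allocatable attribute here are just illustrative):

module global
  implicit none

  type aos
     real*4  :: x, y, z
     integer :: type
  end type aos

  type ff_data
     real*4 :: x, y, z
  end type ff_data

  !dir$ attributes offload:mic :: position, ff
  !dir$ attributes offload:mic :: nlist, nnfirst, nnlast
  !dir$ attributes offload:mic :: q, lj1, lj2, lj3, lj4

  type(aos),     allocatable :: position(:)
  type(ff_data), allocatable :: ff(:)
  integer,       allocatable :: nlist(:), nnfirst(:), nnlast(:)
  real*4,        allocatable :: q(:), lj1(:), lj2(:), lj3(:), lj4(:)
end module global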

My code was compiled with the -align array64byte flag. However, including the !dir$ vector aligned directive on line 43 to assist vectorization of the inner loop produces a seg fault. I am very curious why, despite being compiled with -align array64byte, my arrays are apparently not aligned. Any suggestions?

My second question: I have run this subroutine (shown below in its single-precision form) in both full single precision and full double precision; in the single-precision case all array components are real*4 rather than real*8. However, I only see about a 20% speedup. Given that only half as many bytes need to move for SP vs. DP, and that the vector registers should do twice as many operations per instruction, I expected roughly a factor-of-two speedup. Is there something wrong with this reasoning, or is there something going on with this code?

Finally, does anyone have suggestions for my current nearest-integer function on line 54? It was given to me as a suggestion on this forum, but I don't know whether there are faster alternatives. (See also the note just after the code below.)

subroutine lj_cut_coul_dsf_nonewton_SP(step,printflag)
    implicit none
    real*4 :: force,forcelj,forcecoul
    real*4 :: x1,y1,z1,x2,y2,z2
    real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i
    double precision ::  ffx,ffy,ffz
    real*4 :: fudge_factor
    real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
    real*4 :: boxdx,boxdy,boxdz
    real*4 :: scalelj, scalecoul
    integer :: i,j,l,step
    integer :: itype,jtype,neigh
    integer :: tid,num
    integer :: offset
    integer :: printflag
    integer :: T1,T2,clock_rate,clock_max

    call system_clock(T1,clock_rate,clock_max)

    !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&
    !dir$ in(nlist: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(nnfirst: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(nnlast: alloc_if(.false.) free_if(.false.)),&
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))

    !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(private),&
    !$omp& shared(box,ibox,rcut2,cut_coulsq,alpha,EWALD_P,MY_PIS,np,numAtomType),&
    !$omp& shared(A1,A2,A3,A4,A5),&
    !$omp& shared(e_shift,f_shift),&
    !$omp& shared(position,ff,nlist,nnfirst,nnlast,q,lj1,lj2,lj3,lj4)
    do i = 1 ,np
       !---load data for particle i: positions, type, and charge
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z; itype = position(i)%type
       qtmp = q(i)

       ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0

       !dir$ vector aligned
       !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
       !... loop over neighbors for particle i
       do j= nnfirst(i),nnlast(i)
          
          !---load neighboring particle index and its atom type
          neigh = nlist(j)
          jtype = position(neigh)%type   ! needed for the interaction-table offset below

          !---calculate distance between particles using a vectorizable nearest integer function
          dx = x1-position(neigh)%x; dy = y1-position(neigh)%y; dz = z1-position(neigh)%z
          boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
          boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
          boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
          boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)
          dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
          dr2 = dx*dx + dy*dy + dz*dz

          scalelj = 0.0d0
          if(dr2.lt.rcut2)scalelj = 1.0d0

          dr2i = 1.0d0/dr2
          dr6i = dr2i*dr2i*dr2i
          
          offset  = numAtomType*(itype-1)+jtype
          forcelj = scalelj*dr6i*(lj1(offset)*dr6i-lj2(offset))
          potential = potential + scalelj*dr6i*(dr6i*lj3(offset)-lj4(offset))      


          scalecoul = 0.0d0
          if(dr2.lt.cut_coulsq)scalecoul = 1.0d0
                 
          r = sqrt(dr2)
          prefactor =  scalecoul*qtmp*q(neigh)/r
          erfcd = exp(-alpha*alpha*r*r)
          t = 1.0 / (1.0 + EWALD_P*alpha*r)
          erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
          forcecoul = prefactor * (erfcc/r + 2.0*alpha/MY_PIS * erfcd +&
               r*f_shift) * r
          
          e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)             
          
          
          force   = (forcecoul+forcelj)*dr2i


          ffx = ffx + dx*force
          ffy = ffy + dy*force
          ffz = ffz + dz*force
          
       enddo
       ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
    enddo
    !$omp end parallel do 
    !dir$ end offload

    potential = 0.50d0*potential
    e_coul = 0.50d0*e_coul

    call system_clock(T2,clock_rate,clock_max)

    if(printflag.eq.1)then
       print*,'elapsed time w/o newton SP',real(T2-T1)/real(clock_rate)
       print*,'calculated potentials:',potential*reps, e_coul*reps
    endif
    
  end subroutine lj_cut_coul_dsf_nonewton_SP
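For context on the nearest-integer function: the add-and-subtract trick in the inner loop uses 1/epsilon(x), which is 2**23 for real*4, so under round-to-nearest it snaps boxdx to the nearest whole number. It is essentially equivalent to the usual minimum-image form written with the standard anint intrinsic (sketch below; exact halves may round differently, and whether anint vectorizes as well is exactly the open question):

          ! equivalent minimum-image form using the standard intrinsic
          dx = x1 - position(neigh)%x
          dx = dx - box*anint(dx*ibox)   ! anint(x) rounds to the nearest whole number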

 

TimP
Honored Contributor III

Your vector aligned directive asserts that nlist(nnfirst(i):nnlast(i)) is 64-byte aligned. Thus, if you set nlist(1) to be aligned by -align array64byte, you are in effect asserting that nnfirst can take on only the values 1, 17, 33, ... (16 four-byte elements per 64-byte boundary), which is OK if that is what you mean. It may be reasonable to assume that the compiler will not apply alignment assertions to position(:) or q(:), which I guess are implicit default real, as their vectorization has to be done by gather instructions. I don't know that you can expect a significant advantage from the alignment assertion when so much of your work is done in gathers.
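To spell out the arithmetic (a sketch, assuming 4-byte integers and the 64-byte alignment that -align array64byte provides), a check along these lines could be added before trusting the assertion:

! nlist(1) on a 64-byte boundary means nlist(k) is 64-byte aligned only
! when mod(k-1,16) == 0, since 16 four-byte elements span 64 bytes.
if (any(mod(nnfirst(1:np)-1, 16) /= 0)) then
   print *, 'nnfirst does not always point at 64-byte-aligned elements'
endif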

Even if the loop were fully vectorized (what do your opt-report and vec-report show?), with so many conditionals and divides, and an unnecessarily high-latency polynomial, you couldn't expect single precision to be twice as fast as double, even if you weren't using offload mode. VTune profiling should show you some of the time-consuming obstacles and help you decide on possible faster alternatives. One of the more obvious is to replace r*r by dr2, which surely ought to be OK in double precision if you can consider single.
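Concretely, in the posted loop that one substitution would just be (a sketch using the variables already there):

          erfcd = exp(-alpha*alpha*dr2)   ! reuse dr2 rather than recomputing r*r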

Compiler options are offered to give you e.g. 40 bit precision for divide and sqrt if you don't need 52.

If you would post actual code which can be compiled, we could look into some of your questions, including whether ieee_rint() can be optimized by recent compilers.

conor_p_
Beginner

Tim, thank you for pointing out nnfirst through nnlast! I was staring so intently at position and ff that I forgot about those two. I am thinking about changing the algorithm to something like

num = numneigh(i)                        ! number of neighbors of particle i

do j = 1,num
      neigh = nlist( (i-1)*MAX_NEIGH + j )   ! fixed stride of MAX_NEIGH entries per particle
      ....
enddo

If I make MAX_NEIGH a multiple of 16, say 1024, then there should be no alignment issue, correct? I am thinking this will allow, for each i, the packing of 16 neighbors at a time. As long as MAX_NEIGH isn't too large, I expect this should help performance (my reasoning is sketched below).
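If I understand the alignment requirement right, the reasoning would be (a sketch; MAX_NEIGH as a named parameter is assumed):

integer, parameter :: MAX_NEIGH = 1024   ! a multiple of 16
! Row i of nlist starts at element (i-1)*MAX_NEIGH + 1, i.e. at byte offset
! (i-1)*MAX_NEIGH*4 from nlist(1).  With MAX_NEIGH a multiple of 16
! (16 x 4 bytes = 64 bytes) and nlist(1) 64-byte aligned by -align array64byte,
! every row start is 64-byte aligned, so "!dir$ vector aligned" on the
! j = 1,num loop should be safe as far as nlist is concerned.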

opt-report and vec-report are reporting vectorization of the inner loop, along with "gather generated for variable q: indirect access".

Could you explain further what you meant by an unnecessarily high-latency polynomial? Am I doing something unnecessary in the calculation?

I have trimmed my code down to the modules necessary for compilation and execution. I apologize for the number of files; this is part of a much larger molecular dynamics code I am writing. The ONLY module of interest here is mod_force.f90. It contains the subroutine in question, lj_cut_coul_dsf_nonewton_sp.

lj_cut_coul_dsf_nonewton_dp is the exact same subroutine and sits right below lj_cut_coul_dsf_nonewton_sp in mod_force, except that it uses DP arrays, for timing comparisons.

compile: ifort -align array64byte -openmp global.f90 mod_memory.f90 get_started.f90 mod_Grofile.f90 mod_pbc.f90 mod_readfile.f90 mod_nondimensionalize.f90 mod_assign_types.f90 mod_buildbond_lists.f90 mod_neighbor.f90 mod_force.f90 mod_restart.f90 MD.f90 -O3 

Make sure input.txt, mol.txt, and tta.txt are in the same folder.

TimP
Honored Contributor III

I discussed polynomial evaluation yet again at https://sites.google.com/site/tprincesite/ (file about expression evaluation).

You needn't be put off by the references to Fortran standard if you wish to simply skip down to the subject above. 

I am about to attempt to build the application.

TimP
Honored Contributor III

I'm getting "Subscript #1 of the array LJ1 has value -858993388 ....." and it can run only when built at -O1 without checking. My MIC is an old 4GB model.

If your polynomial proved to be a bottleneck (by no means a given at this stage), you might write something like

t * erfcd * (A1+t*A2 + t**2*(A3+t*(A4+t*A5)))

Once you have things working, you can try fiddling with the imf-domain-exclusion list to speed up divide and sqrt (e.g. by setting -fast=2).
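(For reference, and only as a sketch since the exact option spellings should be checked against your compiler documentation, the kind of command line this points at might look something like

ifort -O3 -align array64byte -openmp -fp-model fast=2 -fimf-domain-exclusion=15 ... mod_force.f90 ... MD.f90

with the same source list as in your compile line above.)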

conor_p_
Beginner

I really apologize for that, Tim. To post my input files, I had to change the file extensions from what they usually are, and something got mixed up in the process. The code is fine. If you are willing to take one more stab at it, delete the mol.txt and mod_force.f90 you have and use the attached ones, and all should be well. I will also investigate the optimizations you suggested.
