Preventing execution of remainder loop on Xeon Phi coprocessor

conor_p_ · 02-20-2015

Hey everyone, consider the sample code below.

Compiling with ifort -O3 -align array64byte -openmp -vec-report6 reports, in effect, that nlist is aligned, that the SIMD directive produced vectorized code, and that position is 64-bit indexed in the offloaded inner loop at line 93. In the remainder loop, however, nothing is aligned (as expected), although the remainder code is vectorized. The !dir$ vector aligned directive prevents the creation of a peel loop, which is what I want.

The algorithm is constructed so that num at line 93 is always a multiple of 16. Thus, the do j = 1,num loop should never have any remainder. I know that remainder loops typically execute at a less than optimal level on a MIC, so I would like to prevent the creation and/or execution of this one. My question: is there a way to prevent the compiler from generating this remainder loop? Or do I not have to worry about it, since I made num always a multiple of 16 and the remainder loop will therefore never be executed? That would be my guess, but I am not quite sure whether there is a catch or scenario I am not aware of.

program test_alignment
  use ifport
  implicit none
  type aos
     real*4 :: x,y,z
     integer :: type
  end type aos
  
  type(aos),allocatable :: position(:)
  real*4 :: x1,y1,z1
  real*4 :: box,hbox,ibox,dx,dy,dz,dr2,dr2i,dr6i,dr12i
  real*4 :: rcut,rcut2
  double precision :: energy
  integer :: i,j,k,np
  integer :: offset,num
  integer :: neigh_alloc
  integer :: neigh,nnpt
  integer, allocatable :: nlist(:),numneigh(:)
  integer :: T1,T2,clock_rate,clock_max


  box = 80.00d0
  hbox = 0.50d0*box
  ibox = 1.0d0/box

  np = 60000
  allocate(position(0:np))
  allocate(numneigh(np))

  neigh_alloc = 10000
  allocate(nlist(10000*np))
  do i = 1,np
     position(i)%x = box*rand()
     position(i)%y = box*rand()
     position(i)%z = box*rand()
  enddo
  position(0)%x = 5*box
  position(0)%y = 5*box
  position(0)%z = 5*box

  rcut = 12.0d0
  rcut2 = rcut*rcut
  
  !$omp parallel do schedule(dynamic) default(firstprivate),&
  !$omp& shared(position,numneigh,nlist)
  do i =1,np
     x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z

     numneigh(i) = 0
     offset = (i-1)*neigh_alloc
     nnpt=0
     do j = 1,np
        if(i.eq.j)cycle
        
        dx = x1-position(j)%x
        dy = y1-position(j)%y
        dz = z1-position(j)%z

        dx = dx-box*nint(dx*ibox)
        dy = dy-box*nint(dy*ibox)
        dz = dz-box*nint(dz*ibox)

        dr2 = dx*dx + dy*dy + dz*dz

        if(dr2.lt.rcut2)then
           nnpt = nnpt + 1
           nlist(offset+nnpt) = j
        endif
     enddo

     do while(mod(nnpt,16).ne.0)
        nnpt = nnpt+1
        nlist(offset+nnpt) = 0
     enddo
     numneigh(i) = nnpt
  end do
  !$omp end parallel do

  energy =0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !dir$ offload begin target(mic:0) in(position,numneigh,nlist)

  !$omp parallel do reduction(+:energy) schedule(dynamic),&
  !$omp& default(firstprivate),&
  !$omp& shared(position,numneigh,nlist)
  do i = 1,np-1
     x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
     num = numneigh(i)
     offset = (i-1)*neigh_alloc

     !dir$ vector aligned
     !dir$ simd
     do j=1,num
        neigh = nlist(offset+j)
        
        dx = x1-position(neigh)%x
        dy = y1-position(neigh)%y
        dz = z1-position(neigh)%z
        dx = dx-box*nint(dx*ibox)
        dy = dy-box*nint(dy*ibox)
        dz = dz-box*nint(dz*ibox)

        dr2 = dx*dx + dy*dy + dz*dz
        dr2i = 1.0d0/dr2
        
        dr6i = 0.0d0
        if(dr2.lt.rcut2)dr6i=dr2i*dr2i*dr2i

        energy = energy + dr6i*(dr6i-1.0d0)
     enddo
  enddo
  !$omp end parallel do
  !dir$ end offload

  call system_clock(T2,clock_rate,clock_max)
  print*,'elapsed time',real(T2-T1)/real(clock_rate)
  print*,'what is energy',energy*0.5

end program test_alignment

jimdempseyatthecove · 02-21-2015

How about the obvious:

if(num == 16) then
     !dir$ vector aligned
     !dir$ simd
     do j=1,16
        neigh = nlist(offset+j)
        
        dx = x1-position(neigh)%x
        dy = y1-position(neigh)%y
        dz = z1-position(neigh)%z
        dx = dx-box*nint(dx*ibox)
        dy = dy-box*nint(dy*ibox)
        dz = dz-box*nint(dz*ibox)

        dr2 = dx*dx + dy*dy + dz*dz
        dr2i = 1.0d0/dr2
        
        dr6i = 0.0d0
        if(dr2.lt.rcut2)dr6i=dr2i*dr2i*dr2i

        energy = energy + dr6i*(dr6i-1.0d0)
     enddo
else
  stop "BUG num .ne. 16"
endif

Jim Dempsey

jimdempseyatthecove · 02-21-2015

You may want to define a parameter for the value num is expected to have (16 in the case above). You may also want other counts, as long as they are multiples of the number of REAL elements of the kind you use that fit in one cache line.
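
One way to apply the parameter idea when num is a run-time multiple of 16 rather than exactly 16 is to block the loop, so that the inner trip count is a compile-time constant. The following is only a minimal sketch in standard Fortran, not code from the original program: the names vlen, vals, jb and jj are made up for illustration, and vlen = 16 assumes 4-byte elements in a 64-byte vector. Whether the compiler then omits the remainder loop is not guaranteed, so it is worth checking the vectorization report.

```fortran
program blocked_loop_sketch
  implicit none
  integer, parameter :: vlen = 16      ! assumed: 16 four-byte elements per 64-byte vector
  integer :: num                       ! run-time trip count, padded to a multiple of vlen
  real, allocatable :: vals(:)         ! hypothetical stand-in for the per-neighbour work
  real :: total_plain, total_blocked
  integer :: j, jb, jj

  num = 3*vlen
  if (mod(num, vlen) /= 0) stop 'num must be a multiple of vlen'

  allocate(vals(num))
  do j = 1, num
     vals(j) = real(j)
  end do

  ! Plain loop: the compiler only knows the trip count at run time.
  total_plain = 0.0
  do j = 1, num
     total_plain = total_plain + vals(j)
  end do

  ! Blocked loop: the outer loop steps by vlen and the inner loop always
  ! runs exactly vlen iterations, so the inner loop has no leftover part.
  total_blocked = 0.0
  do jb = 1, num, vlen
     do jj = 0, vlen - 1
        total_blocked = total_blocked + vals(jb + jj)
     end do
  end do

  ! The values are small integers, so both summation orders are exact.
  if (total_plain /= total_blocked) stop 1
  print *, 'plain =', total_plain, ' blocked =', total_blocked
end program blocked_loop_sketch
```

In the original code the same blocking would go around the do j = 1,num loop, with the padding loop already guaranteeing that num is a multiple of vlen.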

Jim Dempsey

jimdempseyatthecove · 02-21-2015

If the above loop is non-optimal, then it would be due to the (offset+j) index computation.

In this case, change the do loop to

do j = offset+1, offset+16
   neigh = nlist(j)
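
As a quick check that moving offset into the loop bounds visits the same elements, here is a small standalone sketch. It is an illustration only: the sizes (neigh_alloc = 64, two particles) and the integer sums are invented for the example, and num is taken as a general multiple of 16 instead of the fixed 16 used above.

```fortran
program index_range_sketch
  implicit none
  integer, parameter :: vlen = 16          ! assumed vector length in elements
  integer, parameter :: neigh_alloc = 64   ! hypothetical slots per particle
  integer :: nlist(2*neigh_alloc)
  integer :: i, j, offset, num
  integer :: sum_offset_in_body, sum_offset_in_bounds

  ! Fill the list with recognisable values.
  do j = 1, size(nlist)
     nlist(j) = j
  end do

  i = 2                          ! second particle
  num = 2*vlen                   ! padded neighbour count, a multiple of vlen
  offset = (i-1)*neigh_alloc

  ! Original form: the index offset+j is computed inside the loop body.
  sum_offset_in_body = 0
  do j = 1, num
     sum_offset_in_body = sum_offset_in_body + nlist(offset + j)
  end do

  ! Suggested form: the loop variable is the list index itself.
  sum_offset_in_bounds = 0
  do j = offset + 1, offset + num
     sum_offset_in_bounds = sum_offset_in_bounds + nlist(j)
  end do

  ! Both loops must touch exactly the same num elements.
  if (sum_offset_in_body /= sum_offset_in_bounds) stop 1
  print *, 'sums match:', sum_offset_in_body, sum_offset_in_bounds
end program index_range_sketch
```

Note that the upper bound is offset+num, not offset+1+num, so the trip count stays at num.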

Jim Dempsey