vectorization intensity 0.0 in SIMD vectorized loop for MIC application - Page 2

conor_p_ · ‎11-03-2014

Hello everyone,

I am writing code for an n-body simulation that I would like to put on a xeon phi. The subroutine I am working on involves calculating the energy of the system via a pairwise summation. The O(n2) algorithm would look like this, and is simply a double sum over all the particle pairs and np = number of particles. If a two particles are closer than a distance rcut2, their energy is added.

subroutine nearest_int
    implicit none
    double precision :: dx,dy,dz
    double precision :: x1,y1,z1
    double precision :: x2,y2,z2
    double precision :: dr2,dr2i,dr6i,dr12i
    integer  :: i,j
    integer :: T1,T2,clock_rate,clock_max
    
    potential =  0.0d0


    call system_clock(T1,clock_rate,clock_max)

    !dir$ offload begin target(mic:0) in(position)

    !$omp parallel do schedule(dynamic) reduction(+:potential) default(private),&
    !$omp& shared(position,rcut2,np)
    do i = 1,np
       
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z

       !dir$ simd reduction(+:potential)  
       do j = i+1,np
                    
          x2 = position(j)%x; y2 = position(j)%y; z2 = position(j)%z
          dx = x2-x1
          dy = y2-y1
          dz = z2-z1

          dr2 = dx*dx + dy*dy + dz*dz
      
          if(dr2.lt.rcut2)then
             
          
             dr2i = 1.0d0/dr2
             dr6i = dr2i*dr2i*dr2i
             dr12i = dr6i*dr6i
             
        
             potential = potential + 4.0d0*(dr12i-dr6i)
          endif
       enddo
    enddo
    !$omp end parallel do
   !dir$ end offload


    call system_clock(T2,clock_rate,clock_max)
    print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
  end subroutine nearest_int

Here, position is an array of structures as follows

type atom
    double precision :: x,y,z
end type atom

type(atom) :: position

Now when I run my code using the O(n2) algorithm, the vectorization intensity is 6.51 which is good given that gather/scatter operations are being applied. Screenshots of the vtune summary are attached png's below entitled n2-1.png and n2-2.png. Now since O(n2) gives poor scaling to larger systems, an O(N) algorithm is preferred. To do this, we store in an array the particles that are close (1.2*rcut to be exact) and how many neighbors each particle has. The O(n2) algorithm now transforms to the following which involves using pointers to access the position array.

subroutine nearest_int
    implicit none
    double precision :: dx,dy,dz
    double precision :: x1,y1,z1
    double precision :: x2,y2,z2
    double precision :: dr2,dr2i,dr6i,dr12i
    integer  :: i,j
    integer :: T1,T2,clock_rate,clock_max
    integer :: neigh
    
    potential =  0.0d0


    call system_clock(T1,clock_rate,clock_max)

    !dir$ offload begin target(mic:0) in(position,vlistl,numneigh)

    !$omp parallel do schedule(dynamic) reduction(+:potential) default(private),&
    !$omp& shared(position,neigh_alloc,vlistl,numneigh,rcut2,np)
    do i = 1,np
       
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z

       !dir$ simd reduction(+:potential)  
       do j = 1,numneigh(i)
          
         neigh = vlistl(j + neigh_alloc*(i-1))
         x2 = position(neigh)%x; y2 = position(neigh)%y; z2 = position(neigh)%z
          
          dx = x2-x1
          dy = y2-y1
          dz = z2-z1

          dr2 = dx*dx + dy*dy + dz*dz
      
          if(dr2.lt.rcut2)then
             
          
             dr2i = 1.0d0/dr2
             dr6i = dr2i*dr2i*dr2i
             dr12i = dr6i*dr6i
             
        
             potential = potential + 4.0d0*(dr12i-dr6i)
          endif
       enddo
    enddo
    !$omp end parallel do
   !dir$ end offload


    call system_clock(T2,clock_rate,clock_max)
    print*,'elapsed time nint:',real(T2-T1)/real(clock_rate),potential
  end subroutine nearest_int

In my code I allocated vlistl as follows

neigh_alloc = 500
allocate(vlistl(500*np))

Now, although the -vec-report6 is telling me that the inner loop is indeed vectorized, I get a vectorization intensity of zero and horrible performance (barely beats serial, although this would make sense if the vectorization intensity is zero). The screen shots from the vtune analysis are given below in n-1.png and n-2.png. Here are my questions:

1. why I am getting this vectorization intensity inside a vectorized loop, and what can I do to improve my performance here?

2. If I can get this code to work on a MIC, I know to expect latency issues at line 27 (neigh = vlistl(j + neigh_alloc*(i-1)) in the O(n) algorithm. I would like to prefetch here. I know that the gather can bring in up to 8 pieces of data (I'm working in DP), on 16 cache lines. Can someone tell me the appropriate way to prefetch here to help mask the latency? I have fiddled with placing the following loop immediately after 27, but it didn't change the performance

do k = 0, 15
     call mm_prefecth(position(vlistl(j + neigh_alloc*(i-1) + k+8)%x,1)
enddo

3.I compiled with the -align array64byte flag so I believe all arrays should be aligned on 64byte boundaries. Does this mean that the arrays are also padded to a multiple of the cache line size? If not, how would I do this?

The simd pragma is supposedly getting the loop to vectorize. I sort my particles so that particles that are close in space are close in memory. I know the AOS structure isn't as good as SOA for vectorization, however I still was getting good performance using AOS in the O(n2 algorithm). Also, I know this is the data structure intel has implemented in various softwares employing this algorithm (lammps for example, although this is in c++ and not fortran). The full code compiles with (modules are attached). The subroutine of interest is the sole subroutine in module mod_force.f90. The numneigh and vlistl array are created in the subroutine build_neighbor_n2 in mod_neighbor.f90 and are allocated in subroutine init_list in mod_neighbor.f90. All arrays are defined as globals in the module global.f90

ifort -align array64byte -openmp global.f90 get_started.f90 mod_init_posit.f90 mod_neighbor.f90 mod_force.f90 MD.f90 -O3 -o new.out

Sorry for the long question/description. But I wanted to give a decent account of what I have tried already. Any help is appreciated.

jimdempseyatthecove · ‎11-07-2014

Using your AOS with the expanded vlist (array of type vlist_t = index, x, y, z)

do i = 1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
    k = neigh_alloc*(i-1)
    do j = 1,numneigh(i)
               x2 = vlistl(j + k)%x
               y2 = vlistl(j + k)%y
               z2 = vlistl(j + k)%z

               ....compute stuff....
    enddo
enddo

Or better:

do i = 1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
    k = neigh_alloc*(i-1)
    do j = 1+k,numneigh(i)+k
               x2 = vlistl(j)%x
               y2 = vlistl(j)%y
               z2 = vlistl(j)%z

               ....compute stuff....
    enddo
enddo

This reduces one level of indirection.

Additionally, this may improve the compiler's ability to recognize a gather (that resides within 4 cache lines).

Jim Dempsey