Software Archive

help with hiding memory latency in linked cell algorithm

conor_p_
Beginner

Hey guys,

I am wondering if you could give me any suggestions on how to hide the memory latency I am encountering in my Xeon Phi algorithm (line 120). The code loops over all cells, and for each cell loops over its 13 "bigger" neighbors. For each cell pair, I loop over the atoms present and calculate the pairwise energy. I have tried every trick I can find: I use the -align array64byte compilation flag (I am a Fortran coder), I have inserted !dir$ vector aligned outside my outermost loop, and I am using !dir$ simd on the inner calculation loop. I have also experimented with !dir$ unroll_and_jam and !dir$ prefetch without any success. Here is a look at my algorithm below. In the MIC code, line 120 is the bottleneck (this is where I retrieve data from the x, y, z arrays). Depending on the simulation parameters (number of particles, density, and cutoff), the Xeon Phi code is anywhere from 25% to 200% slower than 16 OpenMP threads on the host. I didn't post the other modules (all they do is set up the arrays used in global) because I didn't want to clutter the post, but if you would like me to post them so that you can compile and run the code yourself, let me know.

program lennard
  
  use ifport
  use position
  use global
  use modlist
  use omp_lib
  implicit none
  double precision :: potential, pressure
  integer :: criterium
  double precision :: energy,filter,box,rcut2
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: l,i,j,k
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,T3,T4,clock_rate,clock_max
  integer :: count,cell_neigh
  integer :: num,tid,count1,count2,count3
  integer :: rep,np
  integer :: local_num,local_atom_num
  integer :: local_nlist(1000)
  integer :: local_start(300),local_end(300)

  



  param%rcut  = 3.0d0
  param%rcut2 = param%rcut**2
  param%dens  = 1.00d0
  param%np    = 60000
  param%vol   = param%np/param%dens
  param%box   = param%vol**(1.0/3.0)
  param%hbox  = param%box/(2.0d0)

  np = param%np
  box = param%box
  rcut2 = param%rcut2

  allocate(x(param%np),y(param%np),z(param%np))
  allocate(flat(3*param%np))
  print*,'distance cut off:',param%rcut
  print*,'density:',param%dens
  print*,'volume:',param%vol
  print*,'box:',param%box
  print*,'hbox:',param%hbox
  print*,'number of particles:',param%np



  call seed(10)
  !call srand(10)

  !--- initialize particles on a cubic grid
  call init_position('cubic grid', 'configuration.dat')
  
  !--- initialize list variables
  call init_list()
  call compute_cell_neighbors()
  call total_cell_sort_init(0)
  call build_r()
 

  !$omp parallel
  tid = omp_get_thread_num()
  if(tid.eq.0)then
     num = omp_get_num_threads()
     print*,'USING THIS MANY:',num
  endif
  !$omp end parallel
    
  
  do rep = 1,3
     print*,'beginning rep:',rep
     
     energy = 0.0d0
     
     
     count1 = 0
     
     !-----------------------------------------------------------------------------!
     ! case 1: MIC computation with data transfer                                  !
     !-----------------------------------------------------------------------------!
     call system_clock(T3,clock_rate,clock_max)
     
     !dir$ offload_transfer target(mic:0),&
     !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(end: alloc_if(.true.) free_if(.false.)),&
     !dir$& in(cnum: alloc_if(.true.) free_if(.false.))
     
     !dir$ offload begin target(mic:0) nocopy(x,y,z,min_x,min_y,min_z,start,end,cnum) in(listvar) inout(energy)
     
     !$omp parallel do reduction(+:energy),&
     !$omp& default(private) schedule(dynamic) shared(param,min_x,min_y,min_z,start,end,cnum,x,y,z,listvar)

     !DIR$ VECTOR ALIGNED
     do i = 0,listvar%ncellT-1
        c1s = start(i); c1e = end(i)
        
        
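        ! loop over the 13 larger-numbered neighbor cells of cell i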
        do j = i*13,13*i+12
           
           cell_neigh = cnum(j)
           c2s = start(cell_neigh); c2e = end(cell_neigh)
           minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
                      
           do k= c1s,c1e
              x1 = x(k); y1 = y(k); z1 = z(k)
              
              !dir$ SIMD
              do l = c2s,c2e
                 x2 = x(l); y2 = y(l); z2 = z(l)
                 
                 dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
                 dr2 = dx*dx + dy*dy + dz*dz
                 
                 filter = 0.0d0
                 if(dr2.lt.param%rcut2)then
                    filter = 1.0d0
                 endif
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
                 
                 energy = energy + 4.0d0*(dr12i-dr6i)*filter
                 
                 
              enddo
           enddo
        enddo
        
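        ! pair interactions between atoms within cell i itself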
        do k = c1s,c1e-1
           x1 = x(k); y1 = y(k); z1 = z(k)

           !dir$ SIMD
           do l = k+1,c1e
               x2 = x(l); y2 = y(l); z2 = z(l)
              dx = x2-x1; dy = y2-y1; dz = z2-z1
              
              dr2 = dx*dx + dy*dy + dz*dz
              
              filter = 0.0d0
              if(dr2.lt.param%rcut2)then
                 filter = 1.0d0
              endif
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)*filter
           enddo
        enddo
     enddo
     !$omp end parallel do
     !dir$ end offload
     call system_clock(T4,clock_rate,clock_max)
     print*,'timing for MIC computation case 2:',real(T4-T3)/real(clock_rate)      
     print*,'energy case 2:',energy
     print*
     print*
     
  end do

  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!
   
   
   
   energy = 0.0d0
   
   call system_clock(T1,clock_rate,clock_max)
   
   !$omp parallel do schedule(dynamic) reduction(+:energy),&
   !$omp& default(private) shared(param,min_x,min_y,min_z,cnum,start,end,x,y,z,listvar)
   do i = 0,listvar%ncellT-1

     c1s = start(i); c1e = end(i)

   
     do j = i*13,13*i+12
   
        cell_neigh = cnum(j)
     
        c2s = start(cell_neigh); c2e = end(cell_neigh)
        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
        
        do k= c1s,c1e
           x1 = x(k); y1 = y(k); z1 = z(k)
         
           do l = c2s,c2e
              x2 = x(l); y2 = y(l); z2 = z(l)
      
              dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz          
              dr2 = dx*dx + dy*dy + dz*dz
                                                           
          
              if(dr2.lt.param%rcut2)then
                 dr = sqrt(dr2)
                 dri = 1.0d0/dr
                 dr2i = 1.0d0*dri*dri
                 dr6i = dr2i*dr2i*dr2i
                 dr12i = dr6i*dr6i
              
                 energy = energy + 4.0d0*(dr12i-dr6i)
              endif
        
           enddo
        enddo
     enddo
     do k = c1s,c1e-1
        x1 = x(k); y1 = y(k); z1 = z(k)

        do l = k+1,c1e
           x2 = x(l); y2 = y(l); z2 = z(l)
           dx = x2-x1; dy = y2-y1; dz = z2-z1
           dr2 = dx*dx + dy*dy + dz*dz
          
          
           if(dr2.lt.param%rcut2)then
              
              dr = sqrt(dr2)
              dri = 1.0d0/dr
              dr2i = 1.0d0*dri*dri
              dr6i = dr2i*dr2i*dr2i
              dr12i = dr6i*dr6i
              
              energy = energy + 4.0d0*(dr12i-dr6i)
           endif
        enddo
     enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)
  
  print*,'time case 4 with filter:',real(T2-T1)/real(clock_rate)
  print*,'energy case 4:',energy
  print*
  print*
  
 
  stop 
end program lennard

 

11 Replies
McCalpinJohn
Honored Contributor III

I would start with some manual prefetches for the loads on line 120 immediately after line 112 where the loop bounds for the loop starting at line 119 are first computed.  Similarly the loads on line 116 can be prefetched immediately after the loop bounds are computed on line 106.  The compiler probably inserts prefetches, but it may not hoist them upstream as far as possible (which is especially important for short loops).

conor_p_
Beginner

Can you tell me what those directives look like? I am not sure that I fully understand the format of the prefetch directive. The way I am interpreting your response, it would be on lines 106 and 112, or did you have something else in mind?

!dir$ prefetch:x:0:1
!dir$ prefetch:y:0:1
!dir$ prefetch:z:0:1

 

McCalpinJohn
Honored Contributor III

I don't think that I have used the directive-based prefetching in Fortran -- I usually use the intrinsics in C, but it looks like these exist in Fortran as well.

I don't understand exactly what the code is doing -- the loads at line 120 look like they are repeated for every pass through the loop on k at line 115.  If the ranges c1s to c1e and c2s to c2e are short, then everything loaded should fit in the L1 cache and it should be safe to prefetch all the x(l),y(l),z(l) values loaded at line 120 using a replicated loop inserted after line 112.

Something like:

         
109           do j = i*13,13*i+12
110
111              cell_neigh = cnum(j)
112              c2s = start(cell_neigh); c2e = end(cell_neigh)
                 do l = c2s,c2e
                    call MM_PREFETCH(x(l))
                    call MM_PREFETCH(y(l))
                    call MM_PREFETCH(z(l))
                 end do
113              minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
114
115              do k = c1s,c1e
116                 x1 = x(k); y1 = y(k); z1 = z(k)
117
118                 !dir$ SIMD
119                 do l = c2s,c2e
120                    x2 = x(l); y2 = y(l); z2 = z(l)

 

jimdempseyatthecove
Honored Contributor III

John,

Correct me if I am wrong. Prefetch instructions are only functional (a no-op otherwise) when the virtual memory page-table entry (TLB) is also present in cache. In other words, if the page-table reference is not also in cache, then the prefetches are counterproductive. To correct for this, at some point earlier, at least one location in x, y, and z per page must have been fetched (not prefetched) prior to any prefetch within that page, the page size being 4 KB or 2 MB. Outside the j loop, and/or periodically inside the j loop, he may need to touch a location within the next page.
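
As a rough, untested sketch of that page-touch idea (the variable and array names follow the posted code; the stride of 512 assumes 4 KiB pages and double precision data, and the dummy accumulator is a hypothetical scalar that is only there so the compiler cannot discard the loads):

   ! touch one element per 4 KiB page (512 doubles) with an ordinary load,
   ! so the page-table entries are cached before the software prefetches are issued
   dummy = 0.0d0
   do l = c2s, c2e, 512
      dummy = dummy + x(l) + y(l) + z(l)
   end do
   ! 'dummy' must be consumed somewhere (e.g. added to the result with weight
   ! zero), otherwise the compiler may eliminate the loop entirely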

Jim Dempsey

conor_p_
Beginner

So I ran the above code with the prefetch directives inserted, and unfortunately there was no noticeable change in the computation time.

McCalpinJohn
Honored Contributor III

Software prefetches on Xeon Phi will get dropped if they miss in the TLB, so large (2MiB) pages are much more helpful on Xeon Phi than on some other processors.

Xeon Phi has a rather unusual DTLB structure. The Level 1 DTLB is normal -- it holds up to 64 entries for 4KiB pages (mapping 256 KiB) and up to 8 entries for 2 MiB pages (mapping 16 MiB). The Level 2 TLB is very different -- it holds zero entries for 4KiB pages and up to 64 entries for 2MiB pages (mapping 128 MiB). For the case of 4KiB pages, the Level 2 TLB holds Page Directory Entries (PDEs) rather than Page Table Entries (PTEs). So if a load to a 4 KiB page misses in the L1 DTLB, it automatically misses in the L2 TLB, which activates the hardware page table walker. This does the usual 4-level page table walk, but is very likely to find the PDE in the L2 TLB. Each PDE entry maps 2 MiB (the same as a 2 MiB PTE), so this structure allows the L2 TLB to map 128 MiB with 4KiB pages -- at the cost of additional (typically very fast) page table walks. This is an extraordinarily neat trick, but it is a problem on Xeon Phi because Xeon Phi is the only recent Intel processor that drops software prefetches when they cause page table walks.

For the code under test, it is not clear how much locality is going to be present in the load streams.  If there is modest locality, then 2 MiB pages should enable most of the software prefetches to succeed.  If not, then the processor is going to stall a lot.  Multiple threads can help emulate out-of-order processing, but there is only one page table walker shared by the four thread contexts, so threads won't help if that is the main problem.
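
If you want to experiment with 2 MiB pages for the offloaded arrays, one knob that should apply here (assuming the standard behavior of the Intel offload runtime) is the MIC_USE_2MB_BUFFERS environment variable, which backs offload buffers at or above the given size with 2 MiB pages, e.g.:

   export MIC_USE_2MB_BUFFERS=64K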

It is not clear to me how much of the performance is due to poor latency tolerance and how much due to difficulty in vectorization.

conor_p_
Beginner

Do you have any suggestions for how I could figure that out? I don't know why I would not have good vectorization here; I am looping down chunks of x, y, and z with unit stride. Is it possible this algorithm just isn't going to work well on a Xeon Phi?

conor_p_
Beginner

I have attached the three other modules used by this program; there is no need to study them in detail. If anyone from the Intel team would be willing to throw this into VTune briefly (I do not have access to it), I would greatly appreciate it. I am getting desperate here. Currently the MIC runs take approximately 2.4e-2 s, whereas the 16 OpenMP threads take 1.0e-2 s.

compile: ifort -align array64byte -openmp global.f90 position.f90 modlist.f90 MC.f90 -O3 -o new.out
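
One thing I could also try, assuming this compiler version still accepts the older -vec-report option, is to ask for a vectorization report on the hot loops with the same compile line:

compile with vectorization report: ifort -align array64byte -openmp -vec-report3 global.f90 position.f90 modlist.f90 MC.f90 -O3 -o new.out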

 

jimdempseyatthecove
Honored Contributor III

John,

Thanks for the detailed description. The complexity of it sounds like it is ripe for a visualization tool to integrate with the Intel Analyzer (or as a new tool). The current tools will tell you the source lines where cache misses occur, but (to my knowledge) not the location and number of prefetches voided. And even if they did show the counts of prefetch hits/voids, they still wouldn't tell you the locations (relative, symbolically) such that you could determine where and when to insert a hard read-ahead to force the load of a PDE at the expense of a stall for memory. Note, if there were a specialized form of PREFETCH (additional bits are available) that would load the PDE for a virtual address, taking any memory hits as required to load the PDE while not loading the memory at the virtual address, then you would take one less memory read hit to get the PDE loaded and not burn a register to receive data. A proper visualization tool would assist in determining where and when to place the PDE prefetches.

Note, I am not sure whether this is Conor's experience or not. When a program is double-buffered, where say one state is advanced with buffer A as input and buffer B as output, and the next state is advanced with buffer B as input and buffer A as output (ping-pong style), and where each buffer is larger than 128 MB (per core), then as you transition states it is highly probable that none of the next PDEs to be used are residing in cache. This in turn means all of the first-entry-into-page prefetches are counterproductive. For example, assume 4 KiB pages and an array of doubles; the page size is 512 doubles. If the estimated best L2 distance for prefetch is 256 cells in advance, then on average half of your prefetches are voided. This is likely the best reason for using the 2 MiB page size. However, using the 2 MiB page size reduces the number of L1 TLB entries from 64 to 8. An example would be a 5-point stencil with HT threads in each core processing adjacent stencils (thus increasing reuse of the L1 cache). This requires 14 TLB entries for one layer of the 4-wide stencil, plus another 14 for prefetch, plus some uncertain amount for thread skew in processing the stencils. That comes to at least 28 TLB entries, and twice that might be needed to handle HT sibling thread skew.

Jim Dempsey

McCalpinJohn
Honored Contributor III

Note that for 2 MiB pages, missing in the L1 DTLB (8 2MiB entries) and hitting in the L2 TLB (64 2MiB entries) is a TLB hit, so software prefetches are not squashed.  The extra latency to get the 2 MiB translation from the L2 TLB is only a few cycles, which should be negligible in any realistic case.

The Xeon Phi has several performance counter events that may be relevant, but some will require carefully constructed tests to understand fully.  Fortunately some events look easy -- since you can count the VPREFETCH0 and VPREFETCH1 software prefetch instructions in several ways at the L1 and L2 cache levels, it should be possible to set up some tests to estimate how many are not getting issued (due to page table walks).    The TLB miss counters should also be valuable, but I need to check to see exactly how "DATA_PAGE_WALK" and "LONG_DATA_PAGE_WALK" differ between 4KiB and 2MiB page accesses.

There are many places in the memory subsystem where increasing the number of threads results in reduced overall performance, so comparing performance with 1,2,3,4 threads per core is almost always something that should be included early in the performance characterization work.
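
For the offload runs, the coprocessor's OpenMP thread count and affinity can be varied without recompiling, assuming the usual offload environment-variable mechanism (the numbers below are just examples for a 60-core card):

   export MIC_ENV_PREFIX=MIC
   export MIC_OMP_NUM_THREADS=120    # e.g. 2 threads per core
   export MIC_KMP_AFFINITY=compact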

conor_p_
Beginner

Sorry gentlemen, I am a chemical engineering PhD, not a computer scientist, and what you are discussing is way over my head. However, I will note that the size of these arrays (60,000 elements) is such that they should already fit in cache.
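
(For scale: 3 arrays x 60,000 doubles x 8 bytes is roughly 1.4 MB in total, so the data fit comfortably in the card's aggregate L2, although not in a single core's 32 KB L1 or 512 KB L2.)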
