Hey guys,
I am wondering if you could give me any suggestions on how to hide the memory latency I am encountering in my Xeon Phi algorithm (line 120). The code loops over all cells, and for each cell loops over its 13 "upper" neighbor cells (half of the 26 neighbors, so each pair is visited once). For each cell pair I loop over the atoms present and calculate the pairwise energy. I have tried every trick I can find: I compile with -align array64byte (I am a Fortran coder), I have inserted !dir$ vector aligned outside my outermost loop, and I am using !dir$ simd on the inner calculation loop. I have also experimented with !dir$ unroll_and_jam and !dir$ prefetch, without any success. My algorithm is below; in the MIC code, line 120 is the bottleneck (this is where I load data from the x, y, z arrays). Depending on the simulation parameters (number of particles, density, and cutoff), the Xeon Phi code is anywhere from 25% to 200% slower than 16 OpenMP threads on the host. I didn't post the other modules, which only set up the arrays used in global, because I didn't want to clutter the post. If you would like me to post them so that you can compile and run the code yourself, let me know.
program lennard
  use ifport
  use position
  use global
  use modlist
  use omp_lib
  implicit none

  double precision :: potential, pressure
  integer          :: criterium
  double precision :: energy,filter,box,rcut2
  double precision :: minx,miny,minz
  double precision :: dx,dy,dz
  double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
  double precision :: x1,y1,z1,x2,y2,z2
  integer :: l,i,j,k
  integer :: c1s,c1e,c2s,c2e
  integer :: T1,T2,T3,T4,clock_rate,clock_max
  integer :: count,cell_neigh
  integer :: num,tid,count1,count2,count3
  integer :: rep,np
  integer :: local_num,local_atom_num
  integer :: local_nlist(1000)
  integer :: local_start(300),local_end(300)

  param%rcut  = 3.0d0
  param%rcut2 = param%rcut**2
  param%dens  = 1.00d0
  param%np    = 60000
  param%vol   = param%np/param%dens
  param%box   = param%vol**(1.0/3.0)
  param%hbox  = param%box/(2.0d0)
  np    = param%np
  box   = param%box
  rcut2 = param%rcut2

  allocate(x(param%np),y(param%np),z(param%np))
  allocate(flat(3*param%np))

  print*,'distance cut off:',param%rcut
  print*,'density:',param%dens
  print*,'volume:',param%vol
  print*,'box:',param%box
  print*,'hbox:',param%hbox
  print*,'number of particles:',param%np

  call seed(10)
  !call srand(10)

  !--- initialize particles on a cubic grid
  call init_position('cubic grid', 'configuration.dat')

  !--- initialize list variables
  call init_list()
  call compute_cell_neighbors()
  call total_cell_sort_init(0)
  call build_r()

  !$omp parallel
  tid = omp_get_thread_num()
  if(tid.eq.0)then
    num = omp_get_num_threads()
    print*,'USING THIS MANY:',num
  endif
  !$omp end parallel

  do rep = 1,3
    print*,'beginning rep:',rep
    energy = 0.0d0
    count1 = 0

    !------------------------------------------------------------------------!
    ! case 1: MIC computation with data transfer                             !
    !------------------------------------------------------------------------!
    call system_clock(T3,clock_rate,clock_max)

    !dir$ offload_transfer target(mic:0),&
    !dir$& in(x: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(y: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(z: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(min_x: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(min_y: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(min_z: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(start: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(end: alloc_if(.true.) free_if(.false.)),&
    !dir$& in(cnum: alloc_if(.true.) free_if(.false.))

    !dir$ offload begin target(mic:0) nocopy(x,y,z,min_x,min_y,min_z,start,end,cnum) in(listvar) inout(energy)
    !$omp parallel do reduction(+:energy),&
    !$omp& default(private) schedule(dynamic) shared(param,min_x,min_y,min_z,start,end,cnum,x,y,z,listvar)
    !DIR$ VECTOR ALIGNED
    do i = 0,listvar%ncellT-1
      c1s = start(i); c1e = end(i)                       ! line 106
      do j = i*13,13*i+12                                ! line 109
        cell_neigh = cnum(j)
        c2s = start(cell_neigh); c2e = end(cell_neigh)   ! line 112
        minx = min_x(j); miny = min_y(j); minz = min_z(j)
        do k = c1s,c1e                                   ! line 115
          x1 = x(k); y1 = y(k); z1 = z(k)                ! line 116
          !dir$ SIMD
          do l = c2s,c2e                                 ! line 119
            x2 = x(l); y2 = y(l); z2 = z(l)              ! line 120: the bottleneck loads
            dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
            dr2 = dx*dx + dy*dy + dz*dz
            filter = 0.0d0
            if(dr2.lt.param%rcut2) filter = 1.0d0
            dr    = sqrt(dr2)
            dri   = 1.0d0/dr
            dr2i  = dri*dri
            dr6i  = dr2i*dr2i*dr2i
            dr12i = dr6i*dr6i
            energy = energy + 4.0d0*(dr12i-dr6i)*filter
          enddo
        enddo
      enddo
      ! pairs within the same cell
      do k = c1s,c1e-1
        x1 = x(k); y1 = y(k); z1 = z(k)
        !dir$ SIMD
        do l = k+1,c1e
          x2 = x(l); y2 = y(l); z2 = z(l)
          dx = x2-x1; dy = y2-y1; dz = z2-z1
          dr2 = dx*dx + dy*dy + dz*dz
          filter = 0.0d0
          if(dr2.lt.param%rcut2) filter = 1.0d0
          dr    = sqrt(dr2)
          dri   = 1.0d0/dr
          dr2i  = dri*dri
          dr6i  = dr2i*dr2i*dr2i
          dr12i = dr6i*dr6i
          energy = energy + 4.0d0*(dr12i-dr6i)*filter
        enddo
      enddo
    enddo
    !$omp end parallel do
    !dir$ end offload

    call system_clock(T4,clock_rate,clock_max)
    print*,'timing for MIC computation case 2:',real(T4-T3)/real(clock_rate)
    print*,'energy case 2:',energy
    print*
    print*
  end do

  !-----------------------------------------------------------!
  ! case four: openmp                                         !
  !-----------------------------------------------------------!
  energy = 0.0d0
  call system_clock(T1,clock_rate,clock_max)
  !$omp parallel do schedule(dynamic) reduction(+:energy),&
  !$omp& default(private) shared(param,min_x,min_y,min_z,cnum,start,end,x,y,z,listvar)
  do i = 0,listvar%ncellT-1
    c1s = start(i); c1e = end(i)
    do j = i*13,13*i+12
      cell_neigh = cnum(j)
      c2s = start(cell_neigh); c2e = end(cell_neigh)
      minx = min_x(j); miny = min_y(j); minz = min_z(j)
      do k = c1s,c1e
        x1 = x(k); y1 = y(k); z1 = z(k)
        do l = c2s,c2e
          x2 = x(l); y2 = y(l); z2 = z(l)
          dx = x2-x1-minx; dy = y2-y1-miny; dz = z2-z1-minz
          dr2 = dx*dx + dy*dy + dz*dz
          if(dr2.lt.param%rcut2)then
            dr    = sqrt(dr2)
            dri   = 1.0d0/dr
            dr2i  = dri*dri
            dr6i  = dr2i*dr2i*dr2i
            dr12i = dr6i*dr6i
            energy = energy + 4.0d0*(dr12i-dr6i)
          endif
        enddo
      enddo
    enddo
    do k = c1s,c1e-1
      x1 = x(k); y1 = y(k); z1 = z(k)
      do l = k+1,c1e
        x2 = x(l); y2 = y(l); z2 = z(l)
        dx = x2-x1; dy = y2-y1; dz = z2-z1
        dr2 = dx*dx + dy*dy + dz*dz
        if(dr2.lt.param%rcut2)then
          dr    = sqrt(dr2)
          dri   = 1.0d0/dr
          dr2i  = dri*dri
          dr6i  = dr2i*dr2i*dr2i
          dr12i = dr6i*dr6i
          energy = energy + 4.0d0*(dr12i-dr6i)
        endif
      enddo
    enddo
  enddo
  !$omp end parallel do
  call system_clock(T2,clock_rate,clock_max)
  print*,'time case 4 with filter:',real(T2-T1)/real(clock_rate)
  print*,'energy case 4:',energy
  print*
  print*

  stop
end program lennard
I would start with some manual prefetches for the loads on line 120 immediately after line 112 where the loop bounds for the loop starting at line 119 are first computed. Similarly the loads on line 116 can be prefetched immediately after the loop bounds are computed on line 106. The compiler probably inserts prefetches, but it may not hoist them upstream as far as possible (which is especially important for short loops).
Can you tell me what those directives would look like? I am not sure I fully understand the format of the prefetch directive. The way I am interpreting your response, they would go on lines 106 and 112, or did you have something else in mind?
!dir$ prefetch x:0:1
!dir$ prefetch y:0:1
!dir$ prefetch z:0:1
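(For placement, a minimal sketch, assuming the directive applies to the DO loop that immediately follows it; the hint value 1, nominally the L2/T1 variant, and the distance of 16 iterations are illustrative guesses, not tuned values:)

!dir$ prefetch x:1:16
!dir$ prefetch y:1:16
!dir$ prefetch z:1:16
do l = c2s,c2e
   x2 = x(l); y2 = y(l); z2 = z(l)
   ! ... pair energy as in the code above ...
enddo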
I don't think I have used directive-based prefetching in Fortran -- I usually use the intrinsics in C, but it looks like these exist in Fortran as well.
I don't understand exactly what the code is doing -- the loads at line 120 look like they are repeated for every pass through the loop on k at line 115. If the ranges c1s to c1e and c2s to c2e are short, then everything loaded should fit in the L1 cache and it should be safe to prefetch all the x(l),y(l),z(l) values loaded at line 120 using a replicated loop inserted after line 112.
Something like:
109     do j = i*13,13*i+12
110
111        cell_neigh = cnum(j)
112        c2s = start(cell_neigh); c2e = end(cell_neigh)
           do l = c2s,c2e
              call MM_PREFETCH(x(l))
              call MM_PREFETCH(y(l))
              call MM_PREFETCH(z(l))
           end do
113        minx = min_x(j); miny = min_y(j) ; minz = min_z(j)
114
115        do k = c1s,c1e
116           x1 = x(k); y1 = y(k); z1 = z(k)
117
118  !dir$ SIMD
119           do l = c2s,c2e
120              x2 = x(l); y2 = y(l); z2 = z(l)
John,
Correct me if I am wrong: prefetch instructions are only functional (no-ops otherwise) when the virtual-memory page table entry (TLB) is also present in cache. In other words, if the page table reference is not also in cache, then the prefetches are counterproductive. To correct for this, at some point earlier at least one location in x, y, and z per page must have been fetched (not prefetched) prior to any prefetch within that page (page size being 4 KB or 4 MB). Outside the j loop, and/or periodically inside the j loop, he may need to touch a location within the next page.
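(A minimal sketch of this "touch one location per page" idea, assuming 4 KiB pages, i.e. 512 doubles per page; warm and p are illustrative variables that would need declaring, and the stride is an assumption, not part of the original code:)

warm = 0.0d0
do p = 1, np, 512                    ! one real load per 4 KiB page of each array
   warm = warm + x(p) + y(p) + z(p)  ! these loads walk the page tables, priming the TLB/PDEs
enddo
if (warm .ne. warm) print*, warm     ! reference the sum so the loop is not optimized away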
Jim Dempsey
So I ran the above code with the prefetch directives inserted, and unfortunately there was no noticeable change in the computation time.
Software prefetches on Xeon Phi will get dropped if they miss in the TLB, so large (2 MiB) pages are much more helpful on Xeon Phi than on some other processors.
Xeon Phi has a rather unusual DTLB structure. The Level 1 DTLB is normal -- it holds up to 64 entries for 4 KiB pages (mapping 256 KiB) and up to 8 entries for 2 MiB pages (mapping 16 MiB). The Level 2 TLB is very different -- it holds zero entries for 4 KiB pages and up to 64 entries for 2 MiB pages (mapping 128 MiB). For the case of 4 KiB pages, the Level 2 TLB instead holds Page Directory Entries (PDEs) rather than Page Table Entries (PTEs). So if a load to a 4 KiB page misses in the L1 DTLB, it automatically misses in the L2 TLB, which activates the hardware page table walker. This does the usual 4-level page table walk, but is very likely to find the PDE in the L2 TLB. Each PDE entry maps 2 MiB (the same as a 2 MiB PTE), so this structure allows the L2 TLB to map 128 MiB even with 4 KiB pages -- at the cost of additional (typically very fast) page table walks. This is an extraordinarily neat trick, but it is a problem on Xeon Phi because Xeon Phi is the only recent Intel processor that drops software prefetches when they cause page table walks.
For the code under test, it is not clear how much locality is going to be present in the load streams. If there is modest locality, then 2 MiB pages should enable most of the software prefetches to succeed. If not, then the processor is going to stall a lot. Multiple threads can help emulate out-of-order processing, but there is only one page table walker shared by the four thread contexts, so threads won't help if that is the main problem.
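(As an aside, and if I remember the offload runtime correctly, the easiest way to test the large-page hypothesis here is the host-side environment variable MIC_USE_2MB_BUFFERS, e.g. MIC_USE_2MB_BUFFERS=64K, which asks the offload runtime to back transferred buffers larger than the given threshold with 2 MiB pages; the 64K threshold is an illustrative value, not a recommendation.)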
It is not clear to me how much of the performance is due to poor latency tolerance and how much due to difficulty in vectorization.
Do you have any suggestions for how I could figure that out? I don't know why I would not get good vectorization here: I am looping down chunks of x, y, z with unit stride. Is it possible this algorithm just isn't going to work well on a Xeon Phi?
I have attached the three other modules used by this program; they do not need to be looked at in any detail, they are just there so the code compiles and runs. If anyone from the Intel team would be willing to throw this into VTune briefly (I do not have access to it), I would greatly appreciate it. I am getting desperate here. Currently the MIC runs take approximately 2.4e-2 s, whereas 16 OpenMP threads take 1.0e-2 s.
compile: ifort -align array64byte -openmp global.f90 position.f90 modlist.f90 MC.f90 -O3 -o new.out
John,
Thanks for the detailed description. Its complexity sounds ripe for a visualization tool to integrate with the Intel analyzer (or for a new tool). The current tools will tell you the source lines where cache misses occur, but (to my knowledge) not the location and number of voided prefetches. And even if they did show the counts of prefetch hits/voids, they still wouldn't tell you the locations (symbolically, relative to the source) such that you could determine where and when to insert a hard read-ahead to force the load of a PDE at the expense of a stall for memory. Note, if there were a specialized form of PREFETCH (additional encoding bits are available) that would load the PDE for a virtual address, taking any memory hits required to load the PDE while not loading the memory at the virtual address itself, then you would take one less memory read hit to get the PDE loaded .and. not burn a register to receive data. A proper visualization tool would assist in determining where and when to place the PDE prefetches.
Note, I am not sure if this matches Connor's experience or not. Consider a program that is double-buffered, where one state is advanced with buffer A as input and buffer B as output, and the next state is advanced with buffer B as input and buffer A as output (ping-pong style), and where each buffer is larger than 128 MB (per core). As you transition states, it is highly probable that none of the next PDEs to be used are resident in cache, which in turn means all of the first-entry-into-page prefetches are counterproductive. For example, assume 4 KiB pages and an array of doubles: a page holds 512 doubles, so if the estimated best L2 prefetch distance is 256 cells in advance, then prefetches issued from the second half of each page target the next (not yet mapped) page, and on average 1/2 of your prefetches are voided. This is likely the best reason for using a 2 MiB page size. *** However, using 2 MiB pages reduces the number of L1 TLB entries from 64 to 8. An exemplar would be a 5-point stencil with HT threads in each core processing adjacent stencils (thus increasing reuse of the L1 cache). This requires 14 TLB entries for one layer of the 4-wide stencil, plus another 14 for prefetch, plus some uncertain amount for thread skew in processing the stencils. That is at least 28 TLB entries, and 2x that might be needed to handle HT sibling-thread skew.
Jim Dempsey
Note that for 2 MiB pages, missing in the L1 DTLB (8 2MiB entries) and hitting in the L2 TLB (64 2MiB entries) is a TLB hit, so software prefetches are not squashed. The extra latency to get the 2 MiB translation from the L2 TLB is only a few cycles, which should be negligible in any realistic case.
The Xeon Phi has several performance counter events that may be relevant, but some will require carefully constructed tests to understand fully. Fortunately some events look easy -- since you can count the VPREFETCH0 and VPREFETCH1 software prefetch instructions in several ways at the L1 and L2 cache levels, it should be possible to set up some tests to estimate how many are not getting issued (due to page table walks). The TLB miss counters should also be valuable, but I need to check to see exactly how "DATA_PAGE_WALK" and "LONG_DATA_PAGE_WALK" differ between 4KiB and 2MiB page accesses.
There are many places in the memory subsystem where increasing the number of threads results in reduced overall performance, so comparing performance with 1,2,3,4 threads per core is almost always something that should be included early in the performance characterization work.
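(For the thread-count sweep: if I recall the Xeon Phi OpenMP knobs correctly, KMP_PLACE_THREADS, e.g. KMP_PLACE_THREADS=60c,2t together with a compact KMP_AFFINITY, selects the number of cores and threads per core, and is passed to the coprocessor side by adding the MIC_ prefix for offload runs; the 60c,2t value is illustrative only.)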
Sorry gentlemen, I am a chemical engineering PhD, not a computer scientist, and what you are discussing is way over my head. However, I will note that the size of these arrays (60000 elements) is such that they should already fit in cache.
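(Back-of-the-envelope on that claim: 60000 doubles x 8 bytes is about 480 KB per array, so x, y, and z together are roughly 1.4 MB. That fits in the card's aggregate L2, but each Xeon Phi core's private L2 is only 512 KiB, so no single core can hold all three arrays, and the 32 KiB L1 holds far less.)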