Hey everyone, consider the sample code below.
Compiling with ifort -O3 -align array64byte -openmp -vec-report6 reports, in effect, that in the offloaded inner loop at line 93 (the do j=1,num loop) nlist is aligned, the vectorization was generated by the SIMD directive, and position is 64-bit indexed. In the remainder loop, as expected, nothing is aligned, but the remainder code is still vectorized. The !dir$ vector aligned directive prevents the creation of a peel loop, as I want.
The algorithm is constructed so that num at line 93 is always a multiple of 16, so the do j = 1,num loop should never have a remainder. I know that remainder loops typically execute at a less than optimal level on a MIC, so I would like to prevent its creation and/or execution. My question: is there a way to prevent the compiler from generating this remainder loop? Or do I not have to worry about it, since num is always a multiple of 16 and the remainder loop will never be executed? That would be my guess, but I am not quite sure whether there is a catch or scenario I am not aware of.
program test_alignment
  use ifport
  implicit none

  type aos
    real*4 :: x,y,z
    integer :: type
  end type aos

  type(aos),allocatable :: position(:)
  real*4 :: x1,y1,z1
  real*4 :: box,hbox,ibox,dx,dy,dz,dr2,dr2i,dr6i,dr12i
  real*4 :: rcut,rcut2
  double precision :: energy
  integer :: i,j,k,np
  integer :: offset,num
  integer :: neigh_alloc
  integer :: neigh,nnpt
  integer, allocatable :: nlist(:),numneigh(:)
  integer :: T1,T2,clock_rate,clock_max

  box = 80.00d0
  hbox = 0.50d0*box
  ibox = 1.0d0/box
  np = 60000
  allocate(position(0:np))
  allocate(numneigh(np))
  neigh_alloc = 10000
  allocate(nlist(10000*np))

  do i = 1,np
    position(i)%x = box*rand()
    position(i)%y = box*rand()
    position(i)%z = box*rand()
  enddo
  position(0)%x = 5*box
  position(0)%y = 5*box
  position(0)%z = 5*box
  rcut = 12.0d0
  rcut2 = rcut*rcut

  !$omp parallel do schedule(dynamic) default(firstprivate),&
  !$omp& shared(position,numneigh,nlist)
  do i =1,np
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
    numneigh(i) = 0
    offset = (i-1)*neigh_alloc
    nnpt=0
    do j = 1,np
      if(i.eq.j)cycle
      dx = x1-position(j)%x
      dy = y1-position(j)%y
      dz = z1-position(j)%z
      dx = dx-box*nint(dx*ibox)
      dy = dy-box*nint(dy*ibox)
      dz = dz-box*nint(dz*ibox)
      dr2 = dx*dx + dy*dy + dz*dz
      if(dr2.lt.rcut2)then
        nnpt = nnpt + 1
        nlist(offset+nnpt) = j
      endif
    enddo
    do while(mod(nnpt,16).ne.0)
      nnpt = nnpt+1
      nlist(offset+nnpt) = 0
    enddo
    numneigh(i) = nnpt
  end do
  !$omp end parallel do

  energy =0.0d0
  call system_clock(T1,clock_rate,clock_max)

  !dir$ offload begin target(mic:0) in(position,numneigh,nlist)
  !$omp parallel do reduction(+:energy) schedule(dynamic),&
  !$omp& default(firstprivate),&
  !$omp& shared(position,numneigh,nlist)
  do i = 1,np-1
    x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
    num = numneigh(i)
    offset = (i-1)*neigh_alloc
    !dir$ vector aligned
    !dir$ simd
    do j=1,num
      neigh = nlist(offset+j)
      dx = x1-position(neigh)%x
      dy = y1-position(neigh)%y
      dz = z1-position(neigh)%z
      dx = dx-box*nint(dx*ibox)
      dy = dy-box*nint(dy*ibox)
      dz = dz-box*nint(dz*ibox)
      dr2 = dx*dx + dy*dy + dz*dz
      dr2i = 1.0d0/dr2
      dr6i = 0.0d0
      if(dr2.lt.rcut2)dr6i=dr2i*dr2i*dr2i
      energy = energy + dr6i*(dr6i-1.0d0)
    enddo
  enddo
  !$omp end parallel do
  !dir$ end offload

  call system_clock(T2,clock_rate,clock_max)
  print*,'elapsed time',real(T2-T1)/real(clock_rate)
  print*,'what is energy',energy*0.5
end program test_alignment
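One portable way to express "num is always a multiple of 16" in the source itself is to block the inner loop so that its trip count is a compile-time constant. The following is only a minimal sketch in standard Fortran, not the posted program: the names vlen, maxn and jb and the sample data are hypothetical, and whether a given compiler then omits the remainder loop is an assumption that should be checked against the vectorization report.

```fortran
program blocked_inner_loop
  implicit none
  integer, parameter :: vlen = 16   ! assumed lanes per vector (hypothetical name)
  integer, parameter :: maxn = 64   ! size of the illustrative array
  real    :: dr2(maxn)
  real    :: e_plain, e_blocked, dr6i
  integer :: num, j, jb

  num = 48                          ! padded neighbor count, a multiple of vlen

  ! Fill with sample squared distances (illustrative data only)
  do j = 1, num
     dr2(j) = 1.0 + 0.125*real(j)
  end do

  ! Plain loop: the trip count num is only known at run time
  e_plain = 0.0
  do j = 1, num
     dr6i = (1.0/dr2(j))**3
     e_plain = e_plain + dr6i*(dr6i - 1.0)
  end do

  ! Blocked loop: the inner trip count is the constant vlen
  e_blocked = 0.0
  do jb = 1, num, vlen
     do j = jb, jb + vlen - 1
        dr6i = (1.0/dr2(j))**3
        e_blocked = e_blocked + dr6i*(dr6i - 1.0)
     end do
  end do

  ! Both loops visit the same elements, so the sums should agree
  if (abs(e_plain - e_blocked) > 1.0e-6*abs(e_plain)) error stop "blocked loop mismatch"
  print *, 'plain and blocked sums agree'
end program blocked_inner_loop
```

The blocked form relies on the padding step already present in the neighbor-list build; if num were ever not a multiple of vlen, the inner loop would read past the valid entries.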
How about the obvious:
if(num == 16) then
  !dir$ vector aligned
  !dir$ simd
  do j=1,16
    neigh = nlist(offset+j)
    dx = x1-position(neigh)%x
    dy = y1-position(neigh)%y
    dz = z1-position(neigh)%z
    dx = dx-box*nint(dx*ibox)
    dy = dy-box*nint(dy*ibox)
    dz = dz-box*nint(dz*ibox)
    dr2 = dx*dx + dy*dy + dz*dz
    dr2i = 1.0d0/dr2
    dr6i = 0.0d0
    if(dr2.lt.rcut2)dr6i=dr2i*dr2i*dr2i
    energy = energy + dr6i*(dr6i-1.0d0)
  enddo
else
  stop "BUG num .ne. 16"
endif
Jim Dempsey
You may want to define a parameter for the value num is expected to have, which is 16 in the case above. You may also want other counts, chosen as multiples of the number of elements of the REAL kind you use that fit in one cache line.
Jim Dempsey
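As a rough illustration of the parameter idea above, the element count per cache line can be derived from the element size rather than hard-coded. This sketch assumes a 64-byte line (taken from the -align array64byte flag in the original post); the names line_bytes, lanes_sp and lanes_dp are hypothetical.

```fortran
program lanes_per_cache_line
  use, intrinsic :: iso_fortran_env, only: real32, real64
  implicit none
  integer, parameter :: line_bytes = 64   ! assumed cache-line size in bytes
  integer :: lanes_sp, lanes_dp

  ! storage_size returns the size in bits, so divide by 8 for bytes
  lanes_sp = line_bytes / (storage_size(1.0_real32) / 8)
  lanes_dp = line_bytes / (storage_size(1.0_real64) / 8)

  print *, 'single-precision elements per line:', lanes_sp
  print *, 'double-precision elements per line:', lanes_dp

  ! Sanity check for the usual 4-byte and 8-byte reals
  if (lanes_sp /= 16 .or. lanes_dp /= 8) error stop "unexpected element size"
end program lanes_per_cache_line
```

With real*4 data as in the original program this gives the 16 used for padding; switching the coordinates to double precision would call for padding to a multiple of 8 instead.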
If the above loop is non-optimal, then it would be due to the (offset+j) indexing.
In this case, change the do loop to
do j=offset+1, offset+16
  neigh = nlist(j)
Jim Dempsey
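To see that moving the offset into the loop bounds does not change which entries are read, here is a small self-checking sketch. The array contents and the names vlen, s_indexed and s_folded are made up for illustration; the upper bound is written as offset + vlen so that the loop covers exactly vlen entries.

```fortran
program fold_offset
  implicit none
  integer, parameter :: vlen = 16   ! assumed block length (hypothetical name)
  integer :: nlist(64)
  integer :: offset, j, s_indexed, s_folded

  ! Illustrative neighbor list contents
  do j = 1, size(nlist)
     nlist(j) = 3*j
  end do
  offset = 32

  ! Original form: the offset is added inside the loop body
  s_indexed = 0
  do j = 1, vlen
     s_indexed = s_indexed + nlist(offset + j)
  end do

  ! Folded form: the offset is part of the loop bounds
  s_folded = 0
  do j = offset + 1, offset + vlen
     s_folded = s_folded + nlist(j)
  end do

  if (s_indexed /= s_folded) error stop "index mismatch"
  print *, 'both loops visit the same entries'
end program fold_offset
```

Whether the folded form actually vectorizes better is something only the compiler report for the real loop can confirm.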
