Hello,
I am receiving a very strange segfault that seems to depend on the version of ifort I am using. I am including the subroutine that is producing the segfault. I have run the code with a number of different compiler versions; some produce a segfault and some do not. At first I thought this might be a heap/stack issue, since it didn't segfault with gfortran, so I used ulimit -s unlimited for all of these runs. This code does use a MIC, although the subroutine in question does not yet (I had commented out all OpenMP and offload directives), and some compilers produced warnings, which I include just in case. The code was compiled with no optimizations. If you have any suggestions, I would greatly appreciate them. I am extremely perplexed. This is part of a much larger molecular dynamics code, so I am hesitant to post the whole thing, but let me know if that's necessary.
x86_64-k1om-linux-ld: warning: libsvml.so, needed by /apps/rhel6/intel/composer_xe_
x86_64-k1om-linux-ld: warning: libirng.so, needed by /apps/rhel6/intel/composer_xe_
x86_64-k1om-linux-ld: warning: libintlc.so.5, needed by /apps/rhel6/intel/composer_xe_
subroutine build_neighbor_nonewton_MIC(step)
implicit none
real*4 :: x1,y1,z1,x2,y2,z2
real*4:: dx,dy,dz,dr2
real*4 :: boxdx,boxdy,boxdz
integer :: i,j,k,z,m
integer :: c1s,c1e,c2s,c2e
integer :: cell_neigh,step
integer :: num1tmp,num2tmp,num3tmp
integer :: neigh_flag
integer :: nnpt,n0,neigh
integer :: bin
integer :: T1,T2,clock_rate,clock_max
integer :: cached_bin,ncache
numneigh(:) = 0
nlist(:) = 0
cached_bin = -1
ncache = 0
nnpt = 0
print*,'welcome to mic build'
call system_clock(T1,clock_rate,clock_max)
! !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&
! !dir$ inout(nlist: alloc_if(.false.) free_if(.false.)),&
! !dir$ inout(numneigh: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(pptr_cache: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(x_cache: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(y_cache: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(z_cache: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(start: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(endposit: alloc_if(.false.) free_if(.false.)),&
! !dir$ nocopy(cnum: alloc_if(.false.) free_if(.false.))
! !$omp parallel do schedule(dynamic),&
! !$omp& shared(position,num3bond),&
! !$omp& shared(cnum,start,endposit,x_cache,y_cache,z_cache,pptr_cache),&
! !$omp& shared(nlist,numneigh),&
! !$omp& shared(np,rv2,box,ibox)
do i = 1,np
x1 = position(i)%x
y1 = position(i)%y
z1 = position(i)%z
num3tmp = num3bond(i)
nnpt = 0
n0 = (i-1)*512
bin = xyz2bin(x1,y1,z1)
c1s = start(bin); c1e = endposit(bin)
if(bin.ne.cached_bin)then
ncache = 0
do k = bin*26,26*bin+25
cell_neigh = cnum(k)
c2s = start(cell_neigh); c2e = endposit(cell_neigh)
do z= c2s,c2e
ncache = ncache + 1
pptr_cache(ncache)= z
x_cache(ncache)= position(z)%x
y_cache(ncache)= position(z)%y
z_cache(ncache)= position(z)%z
enddo
enddo
do z = c1s,c1e
ncache = ncache + 1
pptr_cache(ncache)= z
x_cache(ncache)= position(z)%x
y_cache(ncache)= position(z)%y
z_cache(ncache)= position(z)%z
enddo
cached_bin = bin
endif
!dir$ simd
do k = 1,ncache
neigh = pptr_cache(k)
neigh_flag = bonded_test_inline(num3tmp,i,neigh,specbond,np)
if(neigh_flag.eq.0.and.neigh.ne.i)then
dx = x1-x_cache(k)
dy = y1-y_cache(k)
dz = z1-z_cache(k)
dx = dx-box*nint(dx*ibox)
dy = dy-box*nint(dy*ibox)
dz = dz-box*nint(dz*ibox)
dr2 = dx*dx+dy*dy+dz*dz
if(dr2.lt.rv2)then
nnpt = nnpt+1
nlist(n0+nnpt) = neigh
endif
endif
enddo
numneigh(i) = nnpt
enddo
! !$omp end parallel do
! !dir$ end offload
call system_clock(T2,clock_rate,clock_max)
print*,'what is elapsed time',real(T2-T1)/real(clock_rate)
do i = 1,10
print*,numneigh(i)
enddo
print*,'we have now exited'
end subroutine build_neighbor_nonewton_MIC
!dir$ attributes inline :: bonded_test_inline
integer function bonded_test_inline(num3tmp,part,l,bonds,np)
implicit none
integer :: m,j,l,num3tmp,np,part
integer :: bonds(12,np)
bonded_test_inline = 0
do m = 1,num3tmp
if(bonds(m,part).eq.l)then
bonded_test_inline = 1
endif
enddo
end function bonded_test_inline
integer function xyz2bin(x,y,z)
implicit none
real*4 :: x,y,z
integer :: ix,iy,iz,numlength
if(x.lt.box)then
ix = int(x/rn)
else
ix = ncellD-1
endif
if(y.lt.box)then
iy = int(y/rn)
else
iy = ncellD-1
endif
if(z.lt.box)then
iz = int(z/rn)
else
iz = ncellD-1
endif
xyz2bin = ix+iy*ncellD+iz*ncellD*ncellD
return
end function xyz2bin
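For readers following along outside Fortran, the binning helper is simple enough to sketch in C. This is a minimal illustration, not the poster's code: the constants NCELLD, BOX, and RN are hypothetical stand-ins for the module variables ncellD, box, and rn used by xyz2bin.

```c
#include <assert.h>

/* Hypothetical stand-ins for the module variables used by xyz2bin:
   BOX is the box edge length, NCELLD the number of cells per edge,
   RN the cell edge length. */
#define NCELLD 4
#define BOX    8.0f
#define RN     (BOX / NCELLD)

/* Flatten a 3-D cell coordinate into a 1-D bin index, clamping a
   coordinate sitting on or past the upper box face into the last
   cell, exactly as the Fortran version does. */
static int xyz2bin(float x, float y, float z)
{
    int ix = (x < BOX) ? (int)(x / RN) : NCELLD - 1;
    int iy = (y < BOX) ? (int)(y / RN) : NCELLD - 1;
    int iz = (z < BOX) ? (int)(z / RN) : NCELLD - 1;
    return ix + iy * NCELLD + iz * NCELLD * NCELLD;
}
```

The x index varies fastest, so particles in the same cell row are adjacent in bin order.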
Sorry, I forgot to mention: when I compile with make opt=debug, which looks like
cflags = -pg
FFLAGS = $(INC) -CB -g -traceback
LDFLAGS = -JModules
with the 13.1.1.163 version, then I don't even get a segfault! Very perplexing.
Ok, so I found the "error." If I do not include the !dir$ simd directive on line 87, the code runs fine. Does anyone have any idea why different versions of the compiler treat this simd directive differently, and what I can do to prevent this segfault? I would like to run this subroutine on a Xeon Phi coprocessor, and I need that SIMD directive to get efficient vectorization.
The !dir$ simd directive on line 87 is a contract you made with the compiler stating that the arrays indexed by the loop control variable are aligned; on MIC that means to a 64-byte boundary. When you compile without optimization, the !dir$ simd has no effect. Also, with the flow control (the if tests) it is hard to imagine the code would vectorize.
Jim Dempsey
Thanks, Jim. I totally agree about the code alignment. When I compile with full optimizations, including the !dir$ simd, I compile with
-O3 -align array64byte
So all arrays should be aligned. nlist, which is used in the conditional branch, is allocated as follows:
neigh_alloc = 512
allocate(nlist(1024*np))
Thus each particle i gets 512 elements to work with in the nlist array, and I don't think there should be an alignment issue there. x_cache, y_cache, z_cache, and pptr_cache are similarly allocated with 2056 elements each. Even when I get rid of the inline function call and just do
!dir$ simd
do k = 1,ncache
neigh = pptr_cache(k)
neigh_flag = 0
if(neigh_flag.eq.0.and.neigh.ne.i)then
dx = x1-x_cache(k)
dy = y1-y_cache(k)
dz = z1-z_cache(k)
dx = dx-box*nint(dx*ibox)
dy = dy-box*nint(dy*ibox)
dz = dz-box*nint(dz*ibox)
dr2 = dx*dx+dy*dy+dz*dz
if(dr2.lt.rv2)then
nnpt = nnpt+1
nlist(n0+nnpt) = neigh
endif
endif
enddo
I still get the segfault. So why wouldn't my arrays be aligned here? Are you familiar with any method to get a loop like this to vectorize with the conditional statement?
Insert sanity checks in front of the !dir$ simd:
if (mod(loc(x_cache(1)),64) .ne. 0) stop "mod(loc(x_cache(1)),64) .ne. 0"
if (mod(loc(y_cache(1)),64) .ne. 0) stop "mod(loc(y_cache(1)),64) .ne. 0"
if (mod(loc(z_cache(1)),64) .ne. 0) stop "mod(loc(z_cache(1)),64) .ne. 0"
if (mod(loc(nlist(n0+nnpt+1)),64) .ne. 0) stop "mod(loc(nlist(n0+nnpt+1)),64) .ne. 0"
Your loop will not vectorize well due to the nested if tests. Part of the loop may be able to perform a vector gather. Note that you are now missing the bonded_test_inline call, so neigh_flag is always 0.
Consider this change:
real*4 :: dr2(2056) ! same size as pptr_cache
...
!dir$ simd
do k = 1,ncache
   dx = x1-x_cache(k)
   dy = y1-y_cache(k)
   dz = z1-z_cache(k)
   dx = dx-box*nint(dx*ibox)
   dy = dy-box*nint(dy*ibox)
   dz = dz-box*nint(dz*ibox)
   dr2(k) = dx*dx+dy*dy+dz*dz
enddo
do k = 1,ncache
   neigh = pptr_cache(k)
   if(neigh.ne.i)then
      if(dr2(k).lt.rv2)then
         nnpt = nnpt+1
         nlist(n0+nnpt) = neigh
      endif
   endif
enddo
The first loop will vectorize. At most you will perform one useless calculation (neigh == i), but the remainder will all be vectorized.
The second loop produces the new index table.
Jim Dempsey
As always, thank you, Jim. Could you please explain your "sanity check" a little further? I am not really familiar with that syntax and am not quite sure what you are doing there. Also, do you have any suggestions for a vectorizable nearest-integer function? Currently I am trying
dx = x1-x_cache(k)
boxdx = dx*ibox
boxdx = boxdx+sign(1/epsilon(boxdx),boxdx)-sign(1/epsilon(boxdx),dx)
dx = dx-boxdx*box
in lieu of
dx = x1 - x_cache(k)
dx = dx-box*nint(dx*ibox)
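For reference, the 1/epsilon term in the attempt above is the classic "magic constant" rounding trick: for single precision, adding and then subtracting a constant near 2^23 pushes the fraction bits out so the hardware's round-to-nearest mode does the rounding. A hedged C sketch (using 1.5*2^23, valid for |x| < 2^22; note it rounds halves to even, unlike Fortran's NINT, which rounds halves away from zero):

```c
#include <assert.h>

/* Round-to-nearest via the float "magic constant" trick.  Adding a
   value around 2^23 discards the fraction bits under round-to-nearest;
   subtracting it back leaves the rounded value.  Valid for |x| < 2^22.
   volatile prevents the compiler from constant-folding the arithmetic
   at a different precision. */
static float magic_round(float x)
{
    volatile float magic = 12582912.0f;  /* 1.5 * 2^23 */
    volatile float t = x + magic;
    return t - magic;
}
```

Whether this beats a plain ANINT depends on the target; on hardware with a vectorized round instruction the intrinsic is usually at least as fast.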
A "sanity check" means you are self-diagnosing your own assumptions about what your code is supposed to be doing or what it expects/requires. Usually one inserts checks to see if an array is allocated or large enough, or if an index is within bounds. You can think of a "sanity check" as an assert.
The !dir$ simd tells the compiler that you know for certain that the arrays indexed by the loop control variable are vector aligned. The sanity check verifies this. Note, you can place the sanity checks into a conditional-compile section such that the tests are performed only during test builds. An easy way of doing this is to use FPP and conditionally define a macro, perhaps named _ASSERT.
The tests I listed assume the lower bound of the arrays is 1, and use LOC(yourArray(1)) to obtain the address of the beginning of the array. An array aligned to a 64-byte boundary has an address whose modulus of 64 is 0 (0, 64, 128, 192, ...). The test asserts that what you assume to be true is indeed true. These tests need not be in the production code unless the production code is not in control of the allocations; an example of this is a library function that requires its input array to be aligned. In that case the sanity test could emit a meaningful error message in lieu of a seg fault.
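The same modulus-64 test can be sketched in C for readers more familiar with that side. This is a minimal illustration, with C11 aligned_alloc standing in for ifort's -align array64byte allocation:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Return nonzero when p sits on a 64-byte boundary -- the C analogue
   of the Fortran check  mod(loc(a(1)),64) .eq. 0 . */
static int is_aligned64(const void *p)
{
    return (uintptr_t)p % 64 == 0;
}
```

Running it against a buffer from aligned_alloc(64, ...) passes; a pointer offset by one float from such a buffer fails the check, which is the situation a misplaced !dir$ vector aligned would silently assume away.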
RE: suggestion
rbox = ibox ! convert outside the loop
...
dx = x1 - x_cache(k)
dx = dx - box*anint(dx*rbox) ! verify box is correct
Check to see if the intrinsic anint is vectorizable.
Jim Dempsey
The meaning of !dir$ simd is a little different from that.
It says nothing about data alignment. But it does assert that your loop contains no vector dependencies between iterations (roughly, one iteration must not depend on the result of a previous one). It asks the compiler to do everything possible to vectorize the loop, irrespective of whether or not it will run faster. Because the directive disables the compiler's own dependency analysis, the programmer is responsible for specifying reduction variables and variables that are private to each loop iteration (this is exactly analogous to programming for OpenMP threading). If this is not done correctly, you will likely get incorrect results.
In your code, you have several variables that are logically private to each loop iteration but not declared as such: neigh, neigh_flag, dx, dy, dz, dr2. This may result in a race condition - one loop iteration may overwrite the value of a variable while it is in use by a different iteration. Worse, the variable nnpt creates a dependency between iterations that is very likely to lead to incorrect results. Whether or not the overwrites lead to a seg fault is a matter of chance - a wrong value of nnpt could lead to an illegal memory address for nlist, for example. But this is clearly an illegal use of !DIR$ SIMD.
If there was no data compression (the piece of code involving nlist and nnpt), the directive could be fixed by adding a PRIVATE clause for the necessary variables. But I don't think there is a way in the current compilers to enforce safe vectorization of a compression loop. (This is "compression", because nnpt is incremented and nlist is filled for some values of k, but not for others).
To read more about the SIMD directive, in addition to what is in the compiler user and reference guide, see the section "The SIMD Directive" of my article "Explicit Vector Programming in Fortran" at https://software.intel.com/en-us/articles/explicit-vector-programming-in-fortran, especially the last code sample.
Note that the OpenMP 4.0 standard now supports a !$OMP SIMD directive that is closely similar to !DIR$ SIMD. You can read about this at openmp.org. With the Intel compiler, you need to compile with -openmp or -openmp-simd for such directives to be recognized. Building with -align array64byte is also generally a good idea, but it is not required for correctness unless you are asserting alignment. !$OMP SIMD supports an "ALIGNED" clause that allows you to specify data alignment, if you so wish.
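Martyn's point about declaring private and reduction variables can be sketched in C using the OpenMP SIMD construct (the C spelling of !$OMP SIMD). This is an illustrative fragment, not the poster's loop; the pragma is simply ignored by compilers run without OpenMP support, and the scalar result is identical either way:

```c
#include <assert.h>

/* Count cache entries within a cutoff.  dx is private to each SIMD
   lane because it is declared inside the loop body; the counter is
   declared as a reduction.  Note that a compressed store such as
   nlist(n0+nnpt) = neigh could NOT be expressed this way -- that is
   the "compression" pattern Martyn says cannot safely be forced. */
static int count_within_cutoff(const float *x_cache, int n,
                               float x1, float rv2)
{
    int count = 0;
    #pragma omp simd reduction(+:count)
    for (int k = 0; k < n; k++) {
        float dx = x1 - x_cache[k];   /* per-iteration private */
        if (dx * dx < rv2)
            count++;
    }
    return count;
}
```

In Fortran the equivalent would be !$OMP SIMD PRIVATE(dx) REDUCTION(+:count) on the loop.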
Thank you very much, Martyn. I believe I have fixed the race condition: I made x_cache, y_cache, z_cache, and pptr_cache private to each thread, and the block of nlist looped over by each thread is thread-specific. I believe this should fix any race conditions. Am I correct in this?
I ran the code with the following sanity checks and received no warning of the memory not being aligned:
if (mod(loc(x_cache(1)),64) .ne. 0) stop "mod(loc(x_cache(1)),64) .ne. 0"
if (mod(loc(y_cache(1)),64) .ne. 0) stop "mod(loc(y_cache(1)),64) .ne. 0"
if (mod(loc(z_cache(1)),64) .ne. 0) stop "mod(loc(z_cache(1)),64) .ne. 0"
if (mod(loc(nlist(n0+nnpt+1)),64) .ne. 0) stop "mod(loc(nlist(n0+nnpt+1)),64) .ne. 0"
I compiled with -O3 -openmp -align array64byte -vec-report6 and got the output below. The compiler is telling me that x_cache, y_cache, z_cache, and pptr_cache all have unaligned access at some point, while at other points it says that y_cache and z_cache are aligned within the same subroutine. What is going on here? Why does the alignment change throughout the subroutine, and why aren't the arrays aligned?
Source/mod_neighbor.f90(102): (col. 9) remark: *MIC* loop was not vectorized: loop was transformed to memset or memcpy.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(120): (col. 18) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: reference global_mp_y_cache_ has aligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(118): (col. 15) remark: *MIC* LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(120): (col. 18) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(118): (col. 15) remark: *MIC* PEEL LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(120): (col. 18) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(118): (col. 15) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(114): (col. 12) remark: *MIC* loop was not vectorized: not inner loop.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_: strided by 4.
Source/mod_neighbor.f90(132): (col. 15) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: reference global_mp_y_cache_ has aligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(130): (col. 12) remark: *MIC* SIMD LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(132): (col. 15) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(130): (col. 12) remark: *MIC* PEEL LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(132): (col. 15) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(130): (col. 12) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(150): (col. 12) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(151): (col. 12) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(152): (col. 12) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: reference global_mp_dr2array_ has aligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(149): (col. 9) remark: *MIC* SIMD LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(150): (col. 12) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(151): (col. 12) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(152): (col. 12) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: reference global_mp_dr2array_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(149): (col. 9) remark: *MIC* PEEL LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(150): (col. 12) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(151): (col. 12) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(152): (col. 12) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: reference global_mp_dr2array_ has aligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(149): (col. 9) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(168): (col. 9) remark: *MIC* loop was not vectorized: existence of vector dependence.
Source/mod_neighbor.f90(97): (col. 6) remark: *MIC* loop was not vectorized: not inner loop.
Source/mod_neighbor.f90(91): (col. 12) remark: *MIC* loop was not vectorized: not inner loop.
Sorry, I realized the answer the second I posted. Of course just compiling with -align array64byte isn't sufficient; adding the !dir$ vector aligned directive before the appropriate loops produced a vec-report with all arrays aligned. Thanks everyone.
My last question is whether there is a way to see if a specific line of code is vectorized or not. I know -vec-report will tell me which loops are vectorized, but how can I tell if the anint function inside a loop, like the one below, is being vectorized? I know the whole loop is vectorized, but I need to check whether anint is a vectorizable function.
do k = 1,ncache
dx = dx-box*anint(dx*ibox)
...
enddo
Although, it is interesting to note that the compiler reports that a gather is implemented for the global array position, which is an array of structures:
type aos
real*4 :: x,y,z
integer :: type
end type aos
Yet the inclusion of !dir$ vector aligned before any loop containing that array causes a segfault.
Making the x_cache(...) arrays private to eliminate a race condition is a red herring. What it potentially could have done is take an unaligned array and copy it to an aligned buffer. The correct fix is to assure the x, y, z cache arrays are aligned, omit the copy to private, and ignore any superfluous race-condition report. Any race condition of importance would have been on the nlist(...) writes.
To find out if a vectorized anint is used, use VTune and look at the disassembly. Or insert, in front of the do loop,
print *, "Lookie here"
and place a breakpoint on the print statement. This should work with -O3.
When at the break, open the disassembly window and look at the code following the call to the Fortran print routine.
The loop may be hard to identify if you are not targeting for host.
Jim Dempsey
Non-stride-1 vectors (IOW those using gather) are not aligned. You might find this paper interesting: http://research.colfaxinternational.com/file.axd?file=2014%2f11%2fColfax-Optimization.pdf
Look at the section on AoS versus SoA.
Jim Dempsey
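The AoS-versus-SoA point can be illustrated with a small C sketch (illustrative layouts only, not the LAMMPS code): the same positions laid out both ways, where the SoA arrays give unit-stride, alignable loads, while gathering all x coordinates from the AoS layout is a stride-4 access.

```c
#include <assert.h>

#define NP 4

/* AoS: one record per particle.  Consecutive x coordinates are
   sizeof(struct aos) bytes apart, so a vector load of "all x"
   becomes a strided gather. */
struct aos { float x, y, z; int type; };

/* SoA: one array per component gives unit-stride loads that can be
   64-byte aligned. */
struct soa { float x[NP]; float y[NP]; float z[NP]; };

static void aos_to_soa(const struct aos *in, struct soa *out, int n)
{
    for (int i = 0; i < n; i++) {   /* this loop IS the gather */
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}
```

Paying the gather cost once per bin, as the stencil cache in the thread does, amortizes it over every particle that reuses the cache.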
Ok, so I have made a small compilable version of this subroutine. I do not have access to VTune, so any help here would be appreciated. I have created an array called soa which stores my particle positions in SoA fashion, in an attempt to get the loop at line 70 to use aligned gathers. However, the compiler says there's a vector dependence and doesn't mention anything about alignment when I compile with the -vec-report6 flag. Jim, I am pretty sure that cache needs to be local to each thread here. For each particle i, if the bin of i differs from that of particle i-1, the algorithm builds an array of positions called cache. Since each thread takes unique particles, which will have a unique cache vector, I believe it needs to be private. I compile with
ifort -O3 -align array64byte -openmp -fp-model fast=2 -fimf-domain-exclusion=15 global.f90 mod_xyz2bin.f90 mod_neighbor.f90 mod_force.f90 MD.f90 -o new.out
If you compile with the -vec-report6 flag, you will see that cache is aligned. However, the inclusion of the !dir$ vector aligned directive mysteriously causes the code to segfault on the MIC every time. Can someone please explain this to me?
Is there any way to get the loop at line 70 to vectorize so I can get optimal memory transfer? The author of High Performance Parallelism Pearls, chapter 8, claims that it vectorizes; I will attach a copy of his code. The issue is that his is in C, and I am trying to reproduce his algorithm in my Fortran code.
Also, I am hoping that alignment improves the performance here. I ran with 16 OpenMP threads and got 0.02 s on average on the host, but only 0.06 s on the MIC. The serial version executes in about 0.2 s.
Finally about the alignment of arrays of structures such as
type aos
real*4 :: x,y,z
integer :: type
end type aos
I have seen numerous C++ codes use this structure along with a #pragma vector aligned. The Xeon Phi coprocessor code for LAMMPS, which was written by Michael Brown at Intel, uses this very type of code. I will show a very short segment of his code which demonstrates this:
#if defined(__INTEL_COMPILER)
#pragma vector aligned
#pragma simd reduction(+:fxtmp, fytmp, fztmp, fwtmp, sevdwl, \
sv0, sv1, sv2, sv3, sv4, sv5)
#endif
for (int jj = 0; jj < jnum; jj++) {
flt_t forcelj, evdwl;
forcelj = evdwl = (flt_t)0.0;
const int sbindex = jlist[jj] >> SBBITS & 3;
const int j = jlist[jj] & NEIGHMASK;
const flt_t delx = xtmp - x[j].x;
const flt_t dely = ytmp - x[j].y;
const flt_t delz = ztmp - x[j].z;
const int jtype = x[j].w;
As you can see, he is using this type of AoS, and must be generating aligned access to use the vector aligned directive. Is this a C++-versus-Fortran issue, or is there something I am missing?
Here is my subroutine
subroutine build_neighbor_nonewton_MIC(step)
implicit none
real*4 :: x1,y1,z1,x2,y2,z2
real*4:: dx,dy,dz,dr2
real*4 :: boxdx,boxdy,boxdz
integer :: bonds(12)
integer :: start_stencil,end_stencil
integer :: check
integer :: i,j,k,z,m
integer :: c1s,c1e,c2s,c2e
integer :: cell_neigh,step
integer :: num1tmp,num2tmp,num3tmp
integer :: neigh_flag
integer :: nnpt,n0,neigh
integer :: bin,count
integer :: T1,T2,clock_rate,clock_max
integer :: cached_bin,ncache
numneigh(:) = 0
nlist(:) = 0
cached_bin = -1
ncache = 0
do i =1,np
soa(i) = position(i)%x
soa(np+i) = position(i)%y
soa(np+np+i) = position(i)%z
enddo
call system_clock(T1,clock_rate,clock_max)
!dir$ offload begin target(mic:0) in(soa: alloc_if(.false.),free_if(.false.)),&
!dir$ out(nlist: alloc_if(.false.) free_if(.false.)),&
!dir$ inout(numneigh: alloc_if(.false.) free_if(.false.)),&
!dir$ in(atombin: alloc_if(.false.) free_if(.false.)),&
!dir$ in(start: alloc_if(.false.) free_if(.false.)),&
!dir$ in(endposit: alloc_if(.false.) free_if(.false.)),&
!dir$ in(cnum: alloc_if(.false.) free_if(.false.)),&
!dir$ nocopy(pptr_cache: alloc_if(.false.) free_if(.false.)),&
!dir$ nocopy(cache: alloc_if(.false.) free_if(.false.)),&
!dir$ nocopy(dr2array: alloc_if(.false.) free_if(.false.))
!$omp parallel do schedule(dynamic) default(firstprivate),&
!$omp& shared(soa,atombin),&
!$omp& shared(cnum,start,endposit),&
!$omp& shared(nlist,numneigh)
do i = 1,np
x1 = soa(i)
y1 = soa(np+i)
z1 = soa(np*2+i)
n0 = (i-1)*neigh_alloc
bin = atombin(i)
if(bin.ne.cached_bin)then
ncache = 0
start_stencil = bin*27
end_stencil = 27*bin+26
do k = start_stencil,end_stencil
cell_neigh = cnum(k)
c2s = start(cell_neigh); c2e = endposit(cell_neigh)
do z= c2s,c2e
ncache = ncache + 1
pptr_cache(ncache)= z
cache(ncache) = soa(z)
cache(cache_size+ncache) = soa(np+z)
cache(2*cache_size+ncache) = soa(2*np+z)
enddo
enddo
cached_bin = bin
endif
!dir$ vector aligned
!dir$ simd
do k=1,ncache
dx = x1-cache(k)
dy = y1-cache(cache_size+k)
dz = z1-cache(cache_size*2+k)
boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)
dx = dx-boxdx*box
dy = dy-boxdy*box
dz = dz-boxdz*box
dr2array(k) = dx*dx+dy*dy+dz*dz
enddo
nnpt = 0
do k = 1,ncache
neigh = pptr_cache(k)
if(neigh.ne.i)then
if(dr2array(k).lt.rv2)then
nnpt = nnpt+1
nlist(n0+nnpt) = neigh
endif
endif
enddo
numneigh(i) = nnpt
enddo
!$omp end parallel do
!dir$ end offload
call system_clock(T2,clock_rate,clock_max)
print*,'what is elapsed time',real(T2-T1)/real(clock_rate)
print*
end subroutine build_neighbor_nonewton_MIC
Finally, here is the code from High Performance Parallelism Pearls, which claims vectorization of both the stencil loop and the neighptr loop:
MMD_float* stencil_cache = threads->data[tid].stencil_cache;
for (int i = start_atom; i < end_atom; i++) {
int* neighptr = &neighbors[i * maxneighs];
int n = 0;
int local = 0;
int remote = 0;
const MMD_float xtmp = x[i * PAD + 0];
const MMD_float ytmp = x[i * PAD + 1];
const MMD_float ztmp = x[i * PAD + 2];
#ifdef AVX
__m128i mi = _mm_set1_epi32(i);
__m128i mstart = _mm_set1_epi32(start_atom);
__m128i mend = _mm_set1_epi32(end_atom-1); // need to get around lack of cmpge/cmple
__m256 mxtmp = _mm256_set1_ps(xtmp);
__m256 mytmp = _mm256_set1_ps(ytmp);
__m256 mztmp = _mm256_set1_ps(ztmp);
#endif
// If we encounter a new bin, cache its contents.
const int ibin = coord2bin(xtmp, ytmp, ztmp);
if (ibin != cached_bin)
{
ncache = 0;
for (int k = 0; k < nstencil; k++)
{
const int jbin = ibin + stencil[k];
int* loc_bin = &bins[jbin * atoms_per_bin];
for (int m = 0; m < bincount[jbin]; m++)
{
const int j = loc_bin[m];
*((int*)&stencil_cache[0*CACHE_SIZE + ncache]) = j;
stencil_cache[1*CACHE_SIZE + ncache] = x[j * PAD + 0];
stencil_cache[2*CACHE_SIZE + ncache] = x[j * PAD + 1];
stencil_cache[3*CACHE_SIZE + ncache] = x[j * PAD + 2];
ncache++;
}
}
if (ncache >= CACHE_SIZE) printf("ERROR: Too many atoms in the stencil - %d > %d\n", ncache, CACHE_SIZE);
cached_bin = ibin;
}
// Otherwise, we just look at the neighbors in the cache.
int c = 0;
#ifdef AVX
for (; c < (ncache/8)*8; c += 8)
{
const __m256 delx = _mm256_sub_ps(mxtmp, _mm256_load_ps(&stencil_cache[1*CACHE_SIZE + c]));
const __m256 dely = _mm256_sub_ps(mytmp, _mm256_load_ps(&stencil_cache[2*CACHE_SIZE + c]));
const __m256 delz = _mm256_sub_ps(mztmp, _mm256_load_ps(&stencil_cache[3*CACHE_SIZE + c]));
const __m256 rsq = _mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(delx, delx), _mm256_mul_ps(dely, dely)), _mm256_mul_ps(delz, delz));
__m256 mask = _mm256_cmple_ps(rsq, mcutneighsq);
__m128i j1 = _mm_load_si128((__m128i*) &stencil_cache[c + 0]);
__m128 cmask1 = _mm256_castps256_ps128(mask);
__m128 nmask1 = _mm_castsi128_ps(_mm_cmpgt_epi32(j1, mi));
__m128 rmask1 = _mm_castsi128_ps(_mm_or_si128(_mm_cmplt_epi32(j1, mstart), _mm_cmpgt_epi32(j1, mend)));
__m128 lmask1 = _mm_andnot_ps(rmask1, nmask1);
rmask1 = _mm_and_ps(cmask1, rmask1);
lmask1 = _mm_and_ps(cmask1, lmask1);
n += pack_neighbors(&neighptr, j1, _mm_or_ps(lmask1, rmask1));
local += _mm_popcnt_u32(_mm_movemask_ps(lmask1));
remote += _mm_popcnt_u32(_mm_movemask_ps(rmask1));
__m128i j2 = _mm_load_si128((__m128i*) &stencil_cache[c + 4]);
__m128 cmask2 = _mm256_extractf128_ps(mask, 1);
__m128 nmask2 = _mm_castsi128_ps(_mm_cmpgt_epi32(j2, mi));
__m128 rmask2 = _mm_castsi128_ps(_mm_or_si128(_mm_cmplt_epi32(j2, mstart), _mm_cmpgt_epi32(j2, mend)));
__m128 lmask2 = _mm_andnot_ps(rmask2, nmask2);
rmask2 = _mm_and_ps(cmask2, rmask2);
lmask2 = _mm_and_ps(cmask2, lmask2);
n += pack_neighbors(&neighptr, j2, _mm_or_ps(lmask2, rmask2));
local += _mm_popcnt_u32(_mm_movemask_ps(lmask2));
remote += _mm_popcnt_u32(_mm_movemask_ps(rmask2));
}
#endif
for (; c < ncache; c++)
{
const int j = *((int*)&stencil_cache[0*CACHE_SIZE +c]);
if (j <= i && j >= start_atom && j < end_atom) continue;
const MMD_float delx = xtmp - stencil_cache[1*CACHE_SIZE + c];
const MMD_float dely = ytmp - stencil_cache[2*CACHE_SIZE + c];
const MMD_float delz = ztmp - stencil_cache[3*CACHE_SIZE + c];
const MMD_float rsq = delx * delx + dely * dely + delz * delz;
if (rsq <= cutneighsq)
{
neighptr[n++] = j;
if (j >= start_atom && j < end_atom) local++;
else remote++;
}
}
Sorry, forgot to mention that the subroutine under question is the second subroutine in mod_neighbor.f90.
The arrays cache, pptr_cache, and dr2array must either be local (thread-private) or allocated large enough that each thread can use a different zone of the array.
cache and dr2array must be aligned.
It is unclear whether making cache and dr2array private will give the private copies the aligned attribute.
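For context, `!dir$ vector aligned` promises the compiler that every array touched in the loop starts on a 64-byte boundary; if a private copy breaks that promise, the aligned vector loads fault. The check the hardware is effectively making can be sketched in C (names here are illustrative, not from the code above):

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if p sits on a 64-byte boundary (what aligned
 * vector loads on MIC require), 0 otherwise. */
int is_cacheline_aligned(const void *p) {
    return ((uintptr_t)p % 64) == 0;
}
```

Printing this for each private array from every thread before the vectorized loop quickly shows which copies are misaligned.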
Jim Dempsey
Jim, you were spot on. The arrays are allocated so that they are aligned in the serial sections of the code before this subroutine is called. However, when OpenMP allocates the private per-thread arrays, for some reason they are not allocated on 64-byte boundaries. This causes the segfault whenever a !dir$ vector aligned directive is used anywhere in the OpenMP section. I am going to do some digging to see if there is a way to make OpenMP allocate on a 64-byte boundary. Are you aware of any procedure for this?
V15 of the compiler supports the OpenMP SIMD clause, but in your case I do not think you can use it: you don't (can't) use SIMD on the outer part of the loop, but instead want it on an inner loop.
I suggest you use the OpenMP thread number to compute an index into a scratch area within the array (or use a 2D array with one index being the thread number). The former would be faster.
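The thread-number-indexed scratch area can be sketched as follows (C for brevity; all names are illustrative). If each thread's zone is padded to a whole number of 64-byte cache lines and the base allocation is 64-byte aligned, then every per-thread slice is also 64-byte aligned, so vector-aligned loops stay safe:

```c
#include <stdlib.h>

#define CACHELINE 64

/* Round nelems of float up so each thread's zone spans a
 * whole number of 64-byte cache lines. */
size_t padded_elems(size_t nelems) {
    size_t per_line = CACHELINE / sizeof(float);
    return ((nelems + per_line - 1) / per_line) * per_line;
}

/* One shared pool; thread tid (e.g. from omp_get_thread_num())
 * works on the slice starting at tid * padded_elems(nelems). */
float *thread_zone(float *pool, size_t nelems, int tid) {
    return pool + (size_t)tid * padded_elems(nelems);
}
```

Usage: allocate the pool once in serial code with something like `aligned_alloc(64, nthreads * padded_elems(cache_size) * sizeof(float))`, share it across the parallel region, and have each thread call `thread_zone` with its own thread number.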
Jim Dempsey