Intel® Fortran Compiler
Build applications that can scale for the future with optimized code designed for Intel® Xeon® and compatible processors.

ifort compiler version specific seg fault error

conor_p_
Beginner

Hello,

I am receiving a very strange segfault that seems to depend on the version of ifort I am using. I am including the subroutine that produces the segfault. I have run the code with a number of different compiler versions; some produce a segfault and some do not. At first I thought this might be a heap/stack issue, since it didn't segfault with gfortran, so I set ulimit -s unlimited for all these runs. This code does use a MIC, although the subroutine in question does not yet (I had commented out all OpenMP and offload directives), and some compiler versions produced warnings, which I include just in case. The code was compiled with no optimizations. If you have any suggestions, I would greatly appreciate them; I am extremely perplexed. This is part of a much larger molecular dynamics code, so I am hesitant to post the whole thing, but let me know if that's necessary.

15.0.1.133: no segfault; the whole code completed successfully. Linker warnings:
x86_64-k1om-linux-ld: warning: libimf.so, needed by /apps/rhel6/intel/composer_xe_2015.1.133/compiler/lib/mic/liboffload.so.5, not found (try using -rpath or -rpath-link)
x86_64-k1om-linux-ld: warning: libsvml.so, needed by /apps/rhel6/intel/composer_xe_2015.1.133/compiler/lib/mic/liboffload.so.5, not found (try using -rpath or -rpath-link)
x86_64-k1om-linux-ld: warning: libirng.so, needed by /apps/rhel6/intel/composer_xe_2015.1.133/compiler/lib/mic/liboffload.so.5, not found (try using -rpath or -rpath-link)
x86_64-k1om-linux-ld: warning: libintlc.so.5, needed by /apps/rhel6/intel/composer_xe_2015.1.133/compiler/lib/mic/liboffload.so.5, not found (try using -rpath or -rpath-link)
 
13.1.1.163: segfault, with warnings:
warning: ipo: warning #11010 *MIC* file format not recognized for /lib64/lipthread.so.0
warning: ipo: warning #11010 *MIC* file format not recognized for /lib64/lipthread.so.6
warning: ipo: warning #11010 *MIC* file format not recognized for /lib64/lipthread.so.6
 
14.0.2.144: segfault.
 
14.0.0.080: no segfault; the whole code completed successfully.
 
13.0.1.117: segfault. Here is the subroutine:

subroutine build_neighbor_nonewton_MIC(step)
    implicit none

     real*4 :: x1,y1,z1,x2,y2,z2
     real*4:: dx,dy,dz,dr2
     real*4 :: boxdx,boxdy,boxdz
     integer :: i,j,k,z,m
     integer :: c1s,c1e,c2s,c2e
     integer :: cell_neigh,step
     integer :: num1tmp,num2tmp,num3tmp
     integer :: neigh_flag
     integer :: nnpt,n0,neigh
     integer :: bin
     integer :: T1,T2,clock_rate,clock_max
     integer :: cached_bin,ncache


     numneigh(:) = 0
     nlist(:) = 0
     cached_bin = -1
     ncache = 0

   
     nnpt = 0
     print*,'welcome to mic build'

     call system_clock(T1,clock_rate,clock_max)



  !   !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&
  !   !dir$ inout(nlist: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ inout(numneigh: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(pptr_cache: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(x_cache: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(y_cache: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(z_cache: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(start: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(endposit: alloc_if(.false.) free_if(.false.)),&
  !   !dir$ nocopy(cnum: alloc_if(.false.) free_if(.false.))

     

  !   !$omp parallel do schedule(dynamic),&
  !   !$omp& shared(position,num3bond),&
  !   !$omp& shared(cnum,start,endposit,x_cache,y_cache,z_cache,pptr_cache),&
  !   !$omp& shared(nlist,numneigh),&
  !   !$omp& shared(np,rv2,box,ibox)
     do i = 1,np
        x1 = position(i)%x
        y1 = position(i)%y
        z1 = position(i)%z
        num3tmp = num3bond(i)
        nnpt = 0        

        n0 = (i-1)*512
        bin = xyz2bin(x1,y1,z1)
        c1s = start(bin); c1e = endposit(bin)

        if(bin.ne.cached_bin)then
           ncache = 0
        
           do k = bin*26,26*bin+25   
              cell_neigh = cnum(k)            
              c2s = start(cell_neigh); c2e = endposit(cell_neigh)
              
              do z= c2s,c2e
                 ncache = ncache + 1
                 pptr_cache(ncache)= z 
                 x_cache(ncache)= position(z)%x
                 y_cache(ncache)= position(z)%y
                 z_cache(ncache)= position(z)%z
              enddo
           enddo
           do z = c1s,c1e
              ncache = ncache + 1
              pptr_cache(ncache)= z 
              x_cache(ncache)= position(z)%x
              y_cache(ncache)= position(z)%y
              z_cache(ncache)= position(z)%z
           enddo
           
           cached_bin = bin
        endif


        !dir$ simd
        do k = 1,ncache
           neigh = pptr_cache(k)
           neigh_flag = bonded_test_inline(num3tmp,i,neigh,specbond,np)

           if(neigh_flag.eq.0.and.neigh.ne.i)then
              
              dx = x1-x_cache(k)
              dy = y1-y_cache(k)
              dz = z1-z_cache(k)


              dx = dx-box*nint(dx*ibox)
              dy = dy-box*nint(dy*ibox)
              dz = dz-box*nint(dz*ibox)
  
              dr2 = dx*dx+dy*dy+dz*dz
         
              if(dr2.lt.rv2)then 
                 nnpt = nnpt+1
            
                 nlist(n0+nnpt) = neigh
              endif
           endif
        enddo
        numneigh(i) = nnpt
     enddo
  !   !$omp end parallel do
  !   !dir$ end offload

     call system_clock(T2,clock_rate,clock_max)
     print*,'what is elapsed time',real(T2-T1)/real(clock_rate)

     do i = 1,10
        print*,numneigh(i)
     enddo

     print*,'we have now exited'

                
   end subroutine build_neighbor_nonewton_MIC
                 


   !dir$ attributes inline :: bonded_test_inline
   integer function bonded_test_inline(num3tmp,part,l,bonds,np)
     implicit none
     integer :: m,j,l,num3tmp,np,part
     integer :: bonds(12,np)
     bonded_test_inline = 0
     do m = 1,num3tmp
        if(bonds(m,part).eq.l)then
           bonded_test_inline = 1
        endif
     enddo
   end function bonded_test_inline
integer function xyz2bin(x,y,z)
    implicit none
    real*4  :: x,y,z
    integer :: ix,iy,iz,numlength
    
    if(x.lt.box)then    
       ix = int(x/rn)
    else
       ix = ncellD-1
    endif
    
    if(y.lt.box)then    
       iy = int(y/rn)
    else
       iy = ncellD-1
    endif
    
    if(z.lt.box)then    
       iz = int(z/rn)
    else
       iz = ncellD-1
    endif
    
    xyz2bin = ix+iy*ncellD+iz*ncellD*ncellD
    
    return
  end function xyz2bin

 

18 Replies
conor_p_
Beginner

Sorry, I forgot to mention that when I compile with make opt=debug, which uses

cflags = -pg
FFLAGS = $(INC) -CB -g -traceback
LDFLAGS = -JModules

with the 13.1.1.163 version, I don't even get a segfault! Very perplexing.

conor_p_
Beginner

Ok, so I found the "error." If I do not include the !dir$ simd directive on line 87, the code runs fine. Does anyone have any idea why different versions of the compiler treat this SIMD directive differently, and what I can do to prevent this segfault? I would like to run this subroutine on a Xeon Phi coprocessor, and I need that SIMD directive to get efficient vectorization.

jimdempseyatthecove
Honored Contributor III

The !dir$ simd directive on line 87 is a contract you made with the compiler, stating that the arrays indexed by the loop control variable are aligned; on MIC that means a 64-byte boundary. When you compile without optimization, the !dir$ simd has no effect. Also, with the flow control (if tests), it is hard to imagine the code would vectorize.

Jim Dempsey

 

conor_p_
Beginner

Thanks, Jim. I totally agree about the code alignment. When I compile with full optimizations, including the !dir$ simd, I compile with

-O3 -align array64byte 

So all arrays should be aligned. nlist, which is used in the conditional branch, is allocated as follows:

neigh_alloc = 512

allocate(nlist(1024*np))

Thus, each particle i gets 512 elements to work with in the nlist array, and I don't think there should be an alignment issue there. x_cache, y_cache, z_cache, and pptr_cache are similarly allocated, with 2056 elements each. Even when I get rid of the inline function call and just do

       !dir$ simd
        do k = 1,ncache
           neigh = pptr_cache(k)           
           neigh_flag = 0

           if(neigh_flag.eq.0.and.neigh.ne.i)then
              
              dx = x1-x_cache(k)
              dy = y1-y_cache(k)
              dz = z1-z_cache(k)


              dx = dx-box*nint(dx*ibox)
              dy = dy-box*nint(dy*ibox)
              dz = dz-box*nint(dz*ibox)
  
              dr2 = dx*dx+dy*dy+dz*dz
         
              if(dr2.lt.rv2)then 
                 nnpt = nnpt+1
                 nlist(n0+nnpt) = neigh
              endif
           endif
        enddo

I get the segfault. So why wouldn't my arrays be aligned here? Are you familiar with any method to get a loop like this to vectorize despite the conditional statements?

jimdempseyatthecove
Honored Contributor III

Insert sanity checks in front of the !dir$ simd:

if(mod(loc(x_cache(1)),64) .ne. 0) stop "mod(loc(x_cache(1)),64) .ne. 0"
if(mod(loc(y_cache(1)),64) .ne. 0) stop "mod(loc(y_cache(1)),64) .ne. 0"
if(mod(loc(z_cache(1)),64) .ne. 0) stop "mod(loc(z_cache(1)),64) .ne. 0"
if(mod(loc(nlist(n0+nnpt+1)),64) .ne. 0) stop "mod(loc(nlist(n0+nnpt+1)),64) .ne. 0"

Your loop will not vectorize well due to the nested if tests. Part of the loop may be able to perform a vector gather. You are also now missing the bonded_test_inline call, resulting in neigh_flag always being 0.

Consider this change:

real*4 :: dr2(2056) ! same size as pptr_cache
...
!dir$ simd
do k = 1,ncache
  dx = x1-x_cache(k)
  dy = y1-y_cache(k)
  dz = z1-z_cache(k)

  dx = dx-box*nint(dx*ibox)
  dy = dy-box*nint(dy*ibox)
  dz = dz-box*nint(dz*ibox)

  dr2(k) = dx*dx+dy*dy+dz*dz
 enddo

 do k = 1,ncache
    neigh = pptr_cache(k)           
    if(neigh.ne.i)then
       if(dr2(k).lt.rv2)then 
          nnpt = nnpt+1
          nlist(n0+nnpt) = neigh
       endif
    endif
 enddo

The first loop will vectorize. At most you will perform one useless calculation (neigh == i), but the remainder will all be vectorized.

The second loop produces the new index table.

Jim Dempsey

conor_p_
Beginner

As always, thank you, Jim. Could you please explain your "sanity check" a little further? I am not really familiar with that syntax and am not quite sure what you are doing there. Also, do you have any suggestions for a vectorizable nearest-integer function? Currently I am trying

dx = x1-x_cache(k)
boxdx = dx*ibox
boxdx = boxdx+sign(1/epsilon(boxdx),boxdx) -sign(1/epsilon(boxdx),dx)
dx = dx-boxdx*ibox

in lieu of

dx = x1 - x_cache(k)
dx = dx-box*nint(dx*ibox)

 

jimdempseyatthecove
Honored Contributor III

A "sanity check" means you are self-diagnosing your own assumptions about what your code is supposed to be doing or what it expects/requires. Usually one inserts checks to see if an array is allocated or large enough, or if an index is within bounds. You can think of a "sanity check" as an assert.

The !dir$ simd tells the compiler that you know for certain that the arrays indexed by the loop control variables are vector aligned. The sanity check verifies this. Note, you can place the sanity checks into a conditional compile section so that the tests are performed only during test builds. An easy way of doing this is to use FPP and conditionally define a macro, potentially named _ASSERT.

The tests I listed assume the lower bound of the arrays is 1, and then use LOC(yourArray(1)) to obtain the address of the beginning of the array. An array aligned to a 64-byte boundary has an address whose modulus 64 equals 0 (0, 64, 128, 192, ...). The tests assert that what you assume to be true is indeed true. These tests need not be in the production code unless the production code is not in control of the allocations; an example of this is when your code is a library function that requires its input arrays to be aligned. The sanity test in that case could emit a meaningful error message in lieu of a seg fault.
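[Editor's note: as a self-contained illustration of the check described above, here is a minimal sketch. It assumes ifort's LOC extension and the -align array64byte option; the array name is hypothetical.]

```fortran
program check_align
  implicit none
  real*4, allocatable :: x_cache(:)   ! hypothetical array, mirrors the thread
  allocate(x_cache(2056))
  ! A 64-byte-aligned array has a starting address whose modulus 64 is 0.
  ! With ifort, build with -align array64byte so allocatables get this alignment.
  if (mod(loc(x_cache(1)), 64) .ne. 0) then
     stop "x_cache is NOT 64-byte aligned"
  end if
  print *, 'x_cache is 64-byte aligned'
end program check_align
```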

RE: suggestion

rbox = ibox ! convert outside the loop
...
dx = x1 - x_cache(k)
dx = dx - box*anint(dx*rbox) ! verify box is correct

Check to see if the intrinsic anint is vectorizable.

Jim Dempsey

 

Martyn_C_Intel
Employee

The meaning of !dir$ simd is a little different from that.

It says nothing about data alignment. But it does assert that your loop contains no vector dependencies between iterations (roughly, one iteration must not depend on the result of a previous one). It asks the compiler to do everything possible to vectorize the loop, irrespective of whether or not it will run faster. Because the directive disables the compiler's own dependency analysis, the programmer is responsible for specifying reduction variables and variables that are private to each loop iteration (this is exactly analogous to programming for OpenMP threading). If this is not done correctly, you will likely get incorrect results.

In your code, you have several variables that are private to each loop iteration: neigh, neigh_flag, dx, dy, dz, dr2. This may result in a race condition: one loop iteration may overwrite the value of a variable while it is in use by a different iteration. Worse, the variable nnpt creates a dependency between iterations that is very likely to lead to incorrect results. Whether or not the overwrites lead to a seg fault is a matter of chance; a wrong value of nnpt could lead to an illegal memory address for nlist, for example. But this is clearly an illegal use of !DIR$ SIMD.

If there were no data compression (the piece of code involving nlist and nnpt), the directive could be fixed by adding a PRIVATE clause for the necessary variables. But I don't think there is a way in the current compilers to enforce safe vectorization of a compression loop. (This is "compression" because nnpt is incremented and nlist is filled for some values of k but not for others.)

To read more about the SIMD directive, in addition to what is in the compiler user and reference guide, see the section "The SIMD Directive" of my article "Explicit Vector Programming in Fortran" at https://software.intel.com/en-us/articles/explicit-vector-programming-in-fortran, especially the last code sample.

Note that the OpenMP 4.0 standard now supports an !$OMP SIMD directive that is very similar to !DIR$ SIMD. You can read about this at openmp.org. With the Intel compiler, you need to compile with -openmp or -openmp-simd for such directives to be recognized. Building with -align array64byte is also generally a good idea, but it is not required for correctness unless you are asserting alignment. !$OMP SIMD supports an ALIGNED clause that allows you to specify data alignment, if you so wish.
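[Editor's note: a minimal sketch of how the !$OMP SIMD form described above could be applied to the thread's distance loop, with PRIVATE and ALIGNED clauses spelled out. It stores dr2 per iteration (following Jim's two-loop split earlier in the thread) to avoid the nnpt compression dependency; dr2array and the other names are assumptions taken from the thread, not tested code.]

```fortran
! Requires -openmp or -openmp-simd. ALIGNED asserts 64-byte alignment,
! so only use it if the arrays really are aligned (e.g. -align array64byte).
!$omp simd private(dx,dy,dz) aligned(x_cache,y_cache,z_cache,dr2array:64)
do k = 1,ncache
   dx = x1 - x_cache(k)
   dy = y1 - y_cache(k)
   dz = z1 - z_cache(k)
   ! minimum-image convention, as in the original loop
   dx = dx - box*nint(dx*ibox)
   dy = dy - box*nint(dy*ibox)
   dz = dz - box*nint(dz*ibox)
   dr2array(k) = dx*dx + dy*dy + dz*dz   ! per-k store: no cross-iteration dependency
enddo
!$omp end simd
```

The scalar compression loop over dr2array (filling nlist and incrementing nnpt) then runs separately, unvectorized, as in Jim's earlier suggestion.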

conor_p_
Beginner

Thank you very much, Martyn. I believe I have fixed the race condition. I made x_cache, y_cache, z_cache, and pptr_cache private to each thread. Also, the block of nlist looped over by each thread is thread-specific. I believe this should fix any race conditions. Am I correct?

I ran the code with the following checks and none of them fired, so the memory appears to be aligned:

if(mod(loc(x_cache(1)),64) .ne. 0) stop "mod(loc(x_cache(1)),64) .ne. 0"
if(mod(loc(y_cache(1)),64) .ne. 0) stop "mod(loc(y_cache(1)),64) .ne. 0"
if(mod(loc(z_cache(1)),64) .ne. 0) stop "mod(loc(z_cache(1)),64) .ne. 0"
if(mod(loc(nlist(n0+nnpt+1)),64) .ne. 0) stop "mod(loc(nlist(n0+nnpt+1)),64) .ne. 0"

 

I compiled with -O3 -openmp -align array64byte -vec-report6 and got the report below. Here, the compiler tells me that x_cache, y_cache, z_cache, and pptr_cache all have unaligned access at some point. At other points, it says that y_cache and z_cache are aligned within the same subroutine. What is going on here? Why does the alignment change throughout the subroutine, and why aren't the arrays aligned?

Source/mod_neighbor.f90(102): (col. 9) remark: *MIC* loop was not vectorized: loop was transformed to memset or memcpy.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(120): (col. 18) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: reference global_mp_y_cache_ has aligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(118): (col. 15) remark: *MIC* LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(120): (col. 18) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(118): (col. 15) remark: *MIC* PEEL LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(120): (col. 18) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(122): (col. 18) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(123): (col. 18) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(124): (col. 18) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(118): (col. 15) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(114): (col. 12) remark: *MIC* loop was not vectorized: not inner loop.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: gather was generated for the variable .2.53_2global_mp_position_:  strided by 4.
Source/mod_neighbor.f90(132): (col. 15) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: reference global_mp_y_cache_ has aligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(130): (col. 12) remark: *MIC* SIMD LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(132): (col. 15) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(130): (col. 12) remark: *MIC* PEEL LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(132): (col. 15) remark: *MIC* vectorization support: reference global_mp_pptr_cache_ has unaligned access.
Source/mod_neighbor.f90(134): (col. 15) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(135): (col. 15) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: reference global_mp_z_cache_ has aligned access.
Source/mod_neighbor.f90(136): (col. 15) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(130): (col. 12) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(150): (col. 12) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(151): (col. 12) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(152): (col. 12) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: reference global_mp_dr2array_ has aligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(149): (col. 9) remark: *MIC* SIMD LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(150): (col. 12) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(151): (col. 12) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(152): (col. 12) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: reference global_mp_dr2array_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(149): (col. 9) remark: *MIC* PEEL LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(150): (col. 12) remark: *MIC* vectorization support: reference global_mp_x_cache_ has unaligned access.
Source/mod_neighbor.f90(151): (col. 12) remark: *MIC* vectorization support: reference global_mp_y_cache_ has unaligned access.
Source/mod_neighbor.f90(152): (col. 12) remark: *MIC* vectorization support: reference global_mp_z_cache_ has unaligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: reference global_mp_dr2array_ has aligned access.
Source/mod_neighbor.f90(163): (col. 12) remark: *MIC* vectorization support: unaligned access used inside loop body.
Source/mod_neighbor.f90(149): (col. 9) remark: *MIC* REMAINDER LOOP WAS VECTORIZED.
Source/mod_neighbor.f90(168): (col. 9) remark: *MIC* loop was not vectorized: existence of vector dependence.
Source/mod_neighbor.f90(97): (col. 6) remark: *MIC* loop was not vectorized: not inner loop.
Source/mod_neighbor.f90(91): (col. 12) remark: *MIC* loop was not vectorized: not inner loop.

 

conor_p_
Beginner

Sorry, I realized the answer the second I posted. Of course just compiling with -align array64byte isn't sufficient. Adding the !dir$ vector aligned directive before the appropriate loops produced a vec-report with all arrays aligned. Thanks everyone.

My last question is whether there is a way to see specifically if a line of code is vectorized or not. I know -vec-report will tell me which loops are vectorized, but how can I tell if the anint function, like the one below inside a loop, is being vectorized? I know the whole loop is vectorized, but I need to check whether anint is a vectorizable function.

do k =1,ncache
     dx = dx-box*anint(dx*ibox)
     ...
enddo
conor_p_
Beginner

Although, it is interesting to note that the compiler reports that a gather is generated for the global array position, which is an array of structures:

type aos
    real*4 :: x,y,z
    integer :: type
end type aos

yet adding !dir$ vector aligned before any loop containing that array causes a segfault.

jimdempseyatthecove
Honored Contributor III

Making the x_cache(...) arrays private to eliminate a race condition is a red herring. What it potentially could have done is take an unaligned array and copy it to an aligned buffer. The correct fix is to assure the x, y, z cache arrays are aligned, omit the copy to private, and ignore any superfluous race-condition report. Any race condition of importance would have been on the nlist(...) writes.

To find out if vectorization on anint is used, use VTune and look at the disassembly. Or, in front of the do loop insert

print *, "Lookie here"

and place a breakpoint on the print statement. This should work with -O3.

When at break, open the disassembly window, and look at the code following the call to the fortran print routine.

The loop may be hard to identify if you are not targeting the host.

Jim Dempsey

jimdempseyatthecove
Honored Contributor III

Non-stride-1 vectors (in other words, those using gather) are not aligned. You might find this paper interesting: http://research.colfaxinternational.com/file.axd?file=2014%2f11%2fColfax-Optimization.pdf

Look at the section on AoS versus SoA.
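[Editor's note: a minimal sketch of the AoS-versus-SoA distinction being referred to, in Fortran. The type and field names mirror the thread's aos type; the SoA form is an assumption about how one would restructure it, not code from the paper.]

```fortran
! AoS: x, y, z, type interleaved per particle. Accessing all x values
! across particles strides by the structure size, so a vector load of
! x becomes a (possibly unaligned) gather.
type aos
   real*4  :: x, y, z
   integer :: itype
end type aos

! SoA: each coordinate stored contiguously. Accessing x(1:n) is
! stride-1, so it can be an aligned vector load (given aligned
! allocation, e.g. -align array64byte).
type soa
   real*4,  allocatable :: x(:), y(:), z(:)
   integer, allocatable :: itype(:)
end type soa
```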

Jim Dempsey

conor_p_
Beginner

Ok, so I have made a small compilable version of this subroutine. I do not have access to VTune, so any help here would be appreciated. I have created an array called soa which stores my particle positions in SoA fashion, in an attempt to get the loop at line 70 to use aligned gathers. However, the compiler says there's a vector dependence and doesn't mention anything about alignment when I compile with the -vec-report6 flag. Jim, I am pretty sure that cache needs to be local to each thread here. The algorithm creates, for each particle i whose bin differs from that of particle i-1, an array of positions called cache. Since each thread takes unique particles, which will have a unique cache vector, I believe it needs to be private. I compile with

ifort -O3 -align array64byte -openmp -fp-model fast=2 -fimf-domain-exclusion=15 global.f90 mod_xyz2bin.f90 mod_neighbor.f90 mod_force.f90 MD.f90 -o new.out

If you compile with the -vec-report6 flag, you will see that cache is aligned. However, adding the !dir$ vector aligned directive mysteriously causes the code to segfault on the MIC every time. Can someone please explain this to me?

Is there any way to get the loop at line 70 to vectorize so I can get optimal memory transfer? The author of "High Performance Parallelism Pearls", chapter 8, claims that it vectorizes; I will attach a copy of his code. The issue is that his is in C, and I am trying to reproduce his algorithm in my Fortran code.

Also, I am hoping that alignment improves the performance here. I ran with 16 OpenMP threads and got 0.02 s on average, but only 0.06 s on the MIC. The serial version executes in about 0.2 s.

Finally, about the alignment of arrays of structures such as

type aos
     real*4 :: x,y,z
     integer :: type
end type aos

I have seen numerous C++ codes use this structure along with #pragma vector aligned. The Xeon Phi coprocessor code for LAMMPS, which was written by Michael Brown at Intel, uses this very type of code. I will show a very short segment of his code:

        #if defined(__INTEL_COMPILER)
        #pragma vector aligned
	#pragma simd reduction(+:fxtmp, fytmp, fztmp, fwtmp, sevdwl, \
	                       sv0, sv1, sv2, sv3, sv4, sv5)
        #endif
        for (int jj = 0; jj < jnum; jj++) {
          flt_t forcelj, evdwl;
          forcelj = evdwl = (flt_t)0.0;

          const int sbindex = jlist[jj] >> SBBITS & 3;
          const int j = jlist[jj] & NEIGHMASK;
          const flt_t delx = xtmp - x[j].x;
          const flt_t dely = ytmp - x[j].y;
          const flt_t delz = ztmp - x[j].z;
          const int jtype = x[j].w;

As you can see, he is using this type of AoS, and must be generating aligned access to use the vector aligned directive. Is this a C++ vs Fortran issue, or is there something I am missing?

Here is my subroutine

subroutine build_neighbor_nonewton_MIC(step)
    implicit none

     real*4 :: x1,y1,z1,x2,y2,z2
     real*4:: dx,dy,dz,dr2
     real*4 :: boxdx,boxdy,boxdz
     integer :: bonds(12)
     integer :: start_stencil,end_stencil
     integer :: check
     integer :: i,j,k,z,m
     integer :: c1s,c1e,c2s,c2e
     integer :: cell_neigh,step
     integer :: num1tmp,num2tmp,num3tmp
     integer :: neigh_flag
     integer :: nnpt,n0,neigh
     integer :: bin,count
     integer :: T1,T2,clock_rate,clock_max
     integer :: cached_bin,ncache
 


     numneigh(:) = 0
     nlist(:) = 0
     cached_bin = -1
     ncache = 0
  
     do i =1,np
        soa(i) = position(i)%x
        soa(np+i) = position(i)%y
        soa(np+np+i) = position(i)%z
     enddo

 
     call system_clock(T1,clock_rate,clock_max)


     !dir$ offload begin target(mic:0) in(soa: alloc_if(.false.),free_if(.false.)),&
     !dir$ out(nlist: alloc_if(.false.) free_if(.false.)),&
     !dir$ inout(numneigh: alloc_if(.false.) free_if(.false.)),&
     !dir$ in(atombin: alloc_if(.false.) free_if(.false.)),&
     !dir$ in(start: alloc_if(.false.) free_if(.false.)),&
     !dir$ in(endposit: alloc_if(.false.) free_if(.false.)),&
     !dir$ in(cnum: alloc_if(.false.) free_if(.false.)),&
     !dir$ nocopy(pptr_cache: alloc_if(.false.) free_if(.false.)),&
     !dir$ nocopy(cache: alloc_if(.false.) free_if(.false.)),&
     !dir$ nocopy(dr2array: alloc_if(.false.) free_if(.false.))
 
     
     !$omp parallel do schedule(dynamic) default(firstprivate),&
     !$omp& shared(soa,atombin),&
     !$omp& shared(cnum,start,endposit),&
     !$omp& shared(nlist,numneigh)
     do i = 1,np
        x1 = soa(i)
        y1 = soa(np+i)
        z1 = soa(np*2+i)
      
        n0 = (i-1)*neigh_alloc
        bin = atombin(i)
    
        if(bin.ne.cached_bin)then
           ncache = 0
           start_stencil = bin*27
           end_stencil   = 27*bin+26
 
           do k = start_stencil,end_stencil
              cell_neigh = cnum(k)            
              c2s = start(cell_neigh); c2e = endposit(cell_neigh)

              do z= c2s,c2e                                  
                 ncache = ncache + 1
                 pptr_cache(ncache)= z                                         
                 cache(ncache) = soa(z)
                 cache(cache_size+ncache) = soa(np+z)
                 cache(2*cache_size+ncache) = soa(2*np+z)
              enddo
           enddo         
           cached_bin = bin
        endif
        
        !dir$ vector aligned
        !dir$ simd
        do k=1,ncache
           
           dx = x1-cache(k)
           dy = y1-cache(cache_size+k)
           dz = z1-cache(cache_size*2+k)


           boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
           boxdx = (boxdx+sign(1/epsilon(boxdx),boxdx)) - sign(1/epsilon(boxdx),dx)
           boxdy = (boxdy+sign(1/epsilon(boxdy),boxdy)) - sign(1/epsilon(boxdy),dy)
           boxdz = (boxdz+sign(1/epsilon(boxdz),boxdz)) - sign(1/epsilon(boxdz),dz)
           dx = dx-boxdx*box
           dy = dy-boxdy*box
           dz = dz-boxdz*box

           dr2array(k) = dx*dx+dy*dy+dz*dz
        enddo
         
         
        nnpt = 0
        do k = 1,ncache
           neigh = pptr_cache(k)
           if(neigh.ne.i)then
                 
              if(dr2array(k).lt.rv2)then
                 nnpt = nnpt+1
                 nlist(n0+nnpt) = neigh
              endif

           endif
        enddo


        numneigh(i) = nnpt
     enddo

     !$omp end parallel do
     !dir$ end offload


     call system_clock(T2,clock_rate,clock_max)
     print*,'what is elapsed time',real(T2-T1)/real(clock_rate)
     print*

   end subroutine build_neighbor_nonewton_MIC

Finally, here is the code from High Performance Parallelism Pearls that claims vectorization of both the stencil loop and the neighptr loop:

MMD_float* stencil_cache = threads->data[tid].stencil_cache;
    for (int i = start_atom; i < end_atom; i++) {

      int* neighptr = &neighbors[i * maxneighs];

      int n = 0;
      int local = 0;
      int remote = 0;

      const MMD_float xtmp = x[i * PAD + 0];
      const MMD_float ytmp = x[i * PAD + 1];
      const MMD_float ztmp = x[i * PAD + 2];

#ifdef AVX
      __m128i mi = _mm_set1_epi32(i);
      __m128i mstart = _mm_set1_epi32(start_atom);
      __m128i mend = _mm_set1_epi32(end_atom-1); // need to get around lack of cmpge/cmple
      __m256 mxtmp = _mm256_set1_ps(xtmp);
      __m256 mytmp = _mm256_set1_ps(ytmp);
      __m256 mztmp = _mm256_set1_ps(ztmp);
#endif

      // If we encounter a new bin, cache its contents.
      const int ibin = coord2bin(xtmp, ytmp, ztmp);
      if (ibin != cached_bin)
      {
        ncache = 0;
        for (int k = 0; k < nstencil; k++)
        {
          const int jbin = ibin + stencil[k];
          int* loc_bin = &bins[jbin * atoms_per_bin];
          for (int m = 0; m < bincount[jbin]; m++)
          {
            const int j = loc_bin[m];
            *((int*)&stencil_cache[0*CACHE_SIZE + ncache]) = j;
            stencil_cache[1*CACHE_SIZE + ncache] = x[j * PAD + 0];
            stencil_cache[2*CACHE_SIZE + ncache] = x[j * PAD + 1];
            stencil_cache[3*CACHE_SIZE + ncache] = x[j * PAD + 2];
            ncache++;
          }
        }
        if (ncache >= CACHE_SIZE) printf("ERROR: Too many atoms in the stencil - %d > %d\n", ncache, CACHE_SIZE);
        cached_bin = ibin;
      }

      // Otherwise, we just look at the neighbors in the cache.
      int c = 0;
#ifdef AVX
      for (; c < (ncache/8)*8; c += 8)
      {
        const __m256 delx = _mm256_sub_ps(mxtmp, _mm256_load_ps(&stencil_cache[1*CACHE_SIZE + c]));
        const __m256 dely = _mm256_sub_ps(mytmp, _mm256_load_ps(&stencil_cache[2*CACHE_SIZE + c]));
        const __m256 delz = _mm256_sub_ps(mztmp, _mm256_load_ps(&stencil_cache[3*CACHE_SIZE + c]));
        const __m256 rsq = _mm256_add_ps(_mm256_add_ps(_mm256_mul_ps(delx, delx), _mm256_mul_ps(dely, dely)), _mm256_mul_ps(delz, delz));
        __m256 mask = _mm256_cmple_ps(rsq, mcutneighsq);

        __m128i j1 = _mm_load_si128((__m128i*) &stencil_cache[c + 0]);
        __m128 cmask1 = _mm256_castps256_ps128(mask);
        __m128 nmask1 = _mm_castsi128_ps(_mm_cmpgt_epi32(j1, mi));
        __m128 rmask1 = _mm_castsi128_ps(_mm_or_si128(_mm_cmplt_epi32(j1, mstart), _mm_cmpgt_epi32(j1, mend)));
        __m128 lmask1 = _mm_andnot_ps(rmask1, nmask1);
        rmask1 = _mm_and_ps(cmask1, rmask1);
        lmask1 = _mm_and_ps(cmask1, lmask1);
        n += pack_neighbors(&neighptr, j1, _mm_or_ps(lmask1, rmask1));
        local += _mm_popcnt_u32(_mm_movemask_ps(lmask1));
        remote += _mm_popcnt_u32(_mm_movemask_ps(rmask1));
        
        __m128i j2 = _mm_load_si128((__m128i*) &stencil_cache[c + 4]);
        __m128 cmask2 = _mm256_extractf128_ps(mask, 1);
        __m128 nmask2 = _mm_castsi128_ps(_mm_cmpgt_epi32(j2, mi));
        __m128 rmask2 = _mm_castsi128_ps(_mm_or_si128(_mm_cmplt_epi32(j2, mstart), _mm_cmpgt_epi32(j2, mend)));
        __m128 lmask2 = _mm_andnot_ps(rmask2, nmask2);
        rmask2 = _mm_and_ps(cmask2, rmask2);
        lmask2 = _mm_and_ps(cmask2, lmask2);
        n += pack_neighbors(&neighptr, j2, _mm_or_ps(lmask2, rmask2));
        local += _mm_popcnt_u32(_mm_movemask_ps(lmask2));
        remote += _mm_popcnt_u32(_mm_movemask_ps(rmask2));
      }
#endif

      for (; c < ncache; c++)
      {

        const int j = *((int*)&stencil_cache[0*CACHE_SIZE +c]);
        if (j <= i && j >= start_atom && j < end_atom) continue;

        const MMD_float delx = xtmp - stencil_cache[1*CACHE_SIZE + c];
        const MMD_float dely = ytmp - stencil_cache[2*CACHE_SIZE + c];
        const MMD_float delz = ztmp - stencil_cache[3*CACHE_SIZE + c];
        const MMD_float rsq = delx * delx + dely * dely + delz * delz;

        if (rsq <= cutneighsq)
        {
           neighptr[n++] = j;
           if (j >= start_atom && j < end_atom) local++;
           else remote++;
        }

      }

   

 

conor_p_
Beginner

Sorry, forgot to mention that the subroutine under question is the second subroutine in mod_neighbor.f90. 

jimdempseyatthecove
Honored Contributor III

The arrays cache, pptr_cache, and dr2array must either be local .OR. be allocated large enough that each thread can use a different zone of the array.

cache and dr2array must be aligned.

It is unclear whether making cache and dr2array private will cause the private copies to have the aligned attribute.

Jim Dempsey

 

conor_p_
Beginner

Jim, you were spot on. The arrays are allocated so that they are aligned in the serial sections of the code before this subroutine is called. However, when OpenMP allocates the private per-thread arrays, for some reason they are not allocated on 64-byte boundaries. This causes the segfault when a !dir$ vector aligned is used anywhere in the OpenMP section. I am going to do some digging around and see if there is a way to make OpenMP allocate on a 64-byte boundary. Are you aware of any procedure for this?

jimdempseyatthecove
Honored Contributor III

V15 of the compiler supports the OpenMP SIMD clause. But in your case I do not think you can use it, as you don't (can't) use simd on the outer part of the loop, but instead use (desire) it on an inner loop.

I suggest you use the OpenMP thread number to compute an index to a scratch area in the array (or use a 2D array with one index being the thread number). The former would be faster.

Jim Dempsey

 
