Hello everyone, I have a subroutine that serves as the force calculation in an n-body simulation, shown below. There are a couple of arrays that I will describe first, which should make the subroutine a little clearer.
1. Array `position`: holds the position and atom type of each particle. It has length `np`, the number of particles:
```fortran
type aos
   real*4  :: x, y, z
   integer :: type
end type aos
type(aos) :: position(np)
```
2. Array `nlist`: a list of indices of the particles that lie within a certain distance of a given particle. It is an integer array.
3. Arrays `nnfirst` and `nnlast`: `nnfirst(i)` and `nnlast(i)` give the starting and ending locations of particle i's entries in `nlist`. If I am particle i, my neighbors are `nlist(nnfirst(i):nnlast(i))`. These are also integer arrays.
4. Array `q`: a `real*4` array holding the charge of each atom.
5. Arrays `lj1`, `lj2`, `lj3`, `lj4`: `real*4` arrays providing the parameters for the atomic interactions.
6. Array `ff`: stores the force on each particle. It is of the following type:
```fortran
type ff_data
   real*4 :: x, y, z
end type ff_data
```
All of these arrays are globals, and each is made available on the MIC as follows:

```fortran
!dir$ attributes offload:mic :: position
```
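In full, the globals live in a module roughly like the following sketch (the module name, allocatable shapes, and grouping of the attribute lines are my paraphrase, not the exact attached code):

```fortran
! sketch of the shared-data module; every global touched inside the
! offload region carries the offload attribute
module global
   implicit none
   type aos
      real*4  :: x, y, z
      integer :: type
   end type aos
   type ff_data
      real*4 :: x, y, z
   end type ff_data
   type(aos),     allocatable :: position(:)
   type(ff_data), allocatable :: ff(:)
   integer,       allocatable :: nlist(:), nnfirst(:), nnlast(:)
   real*4,        allocatable :: q(:), lj1(:), lj2(:), lj3(:), lj4(:)
   !dir$ attributes offload:mic :: position, ff, nlist, nnfirst, nnlast
   !dir$ attributes offload:mic :: q, lj1, lj2, lj3, lj4
end module global
```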
My code was compiled with the -align array64byte flag. However, adding the `!dir$ vector aligned` directive ahead of the inner neighbor loop, to assist its vectorization, produces a seg fault. I am very curious why my arrays are apparently not aligned, despite being compiled with -align array64byte. Any suggestions? (A quick alignment check I could add is sketched after the code below.)
My second question: I have run this subroutine (shown below in single precision) in both full single precision and full double precision; in the single-precision case every array component is real*4 rather than real*8. However, I only see about a 20% speedup. Given that my bandwidth should double for SP vs. DP, and that my vector registers should be doing twice as many calculations per instruction, I expected a factor-of-two speedup. Is there something wrong with this reasoning, or is there something going on in this code?
Finally, does anyone have suggestions for my current nearest-integer computation, the sign/epsilon expression in the inner loop below? It was given to me as a suggestion on this forum, but I don't know whether there are faster alternatives. (My understanding of why it works is also sketched after the code.)
```fortran
subroutine lj_cut_coul_dsf_nonewton_SP(step,printflag)
   implicit none
   real*4  :: force,forcelj,forcecoul
   real*4  :: x1,y1,z1,x2,y2,z2
   real*4  :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i
   double precision :: ffx,ffy,ffz
   real*4  :: fudge_factor
   real*4  :: qtmp,r,prefactor,erfcc,erfcd,t
   real*4  :: boxdx,boxdy,boxdz
   real*4  :: scalelj,scalecoul
   integer :: i,j,l,step
   integer :: itype,jtype,neigh
   integer :: tid,num
   integer :: offset
   integer :: printflag
   integer :: T1,T2,clock_rate,clock_max

   call system_clock(T1,clock_rate,clock_max)

   !dir$ offload begin target(mic:0) in(position: alloc_if(.false.), free_if(.false.)),&
   !dir$ inout(ff:      alloc_if(.false.), free_if(.false.)),&
   !dir$ in(nlist:      alloc_if(.false.), free_if(.false.)),&
   !dir$ in(nnfirst:    alloc_if(.false.), free_if(.false.)),&
   !dir$ in(nnlast:     alloc_if(.false.), free_if(.false.)),&
   !dir$ in(q:          alloc_if(.false.), free_if(.false.)),&
   !dir$ nocopy(lj1:    alloc_if(.false.), free_if(.false.)),&
   !dir$ nocopy(lj2:    alloc_if(.false.), free_if(.false.)),&
   !dir$ nocopy(lj3:    alloc_if(.false.), free_if(.false.)),&
   !dir$ nocopy(lj4:    alloc_if(.false.), free_if(.false.))

   !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(private),&
   !$omp& shared(box,ibox,rcut2,cut_coulsq,alpha,EWALD_P,MY_PIS,np,numAtomType),&
   !$omp& shared(A1,A2,A3,A4,A5),&
   !$omp& shared(e_shift,f_shift),&
   !$omp& shared(position,ff,nlist,nnfirst,nnlast,q,lj1,lj2,lj3,lj4)
   do i = 1,np
      !---load data for particle i: position, type, and charge
      x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z
      itype = position(i)%type
      qtmp  = q(i)
      ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0

      !dir$ vector aligned
      !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
      !---loop over the neighbors of particle i
      do j = nnfirst(i),nnlast(i)
         !---load the neighboring particle's index and type
         !   (the jtype load was dropped in the forum paste; it is
         !    needed for the lj table offset below)
         neigh = nlist(j)
         jtype = position(neigh)%type

         !---minimum-image separation, using a vectorizable nearest-integer trick
         dx = x1-position(neigh)%x; dy = y1-position(neigh)%y; dz = z1-position(neigh)%z
         boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
         boxdx = (boxdx+sign(1.0/epsilon(boxdx),boxdx)) - sign(1.0/epsilon(boxdx),dx)
         boxdy = (boxdy+sign(1.0/epsilon(boxdy),boxdy)) - sign(1.0/epsilon(boxdy),dy)
         boxdz = (boxdz+sign(1.0/epsilon(boxdz),boxdz)) - sign(1.0/epsilon(boxdz),dz)
         dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
         dr2 = dx*dx + dy*dy + dz*dz

         !---Lennard-Jones contribution, masked by the LJ cutoff
         scalelj = 0.0d0
         if(dr2.lt.rcut2) scalelj = 1.0d0
         dr2i = 1.0d0/dr2
         dr6i = dr2i*dr2i*dr2i
         offset = numAtomType*(itype-1)+jtype
         forcelj   = scalelj*dr6i*(lj1(offset)*dr6i-lj2(offset))
         potential = potential + scalelj*dr6i*(dr6i*lj3(offset)-lj4(offset))

         !---damped-shifted-force Coulomb contribution, masked by its cutoff
         scalecoul = 0.0d0
         if(dr2.lt.cut_coulsq) scalecoul = 1.0d0
         r = sqrt(dr2)
         prefactor = scalecoul*qtmp*q(neigh)/r
         erfcd = exp(-alpha*alpha*r*r)
         t = 1.0/(1.0 + EWALD_P*alpha*r)
         erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
         forcecoul = prefactor * (erfcc/r + 2.0*alpha/MY_PIS*erfcd + r*f_shift) * r
         e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)

         !---accumulate the force on particle i
         force = (forcecoul+forcelj)*dr2i
         ffx = ffx + dx*force
         ffy = ffy + dy*force
         ffz = ffz + dz*force
      enddo
      ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
   enddo
   !$omp end parallel do
   !dir$ end offload

   potential = 0.50d0*potential
   e_coul    = 0.50d0*e_coul

   call system_clock(T2,clock_rate,clock_max)
   if(printflag.eq.1)then
      print*,'elapsed time w/o newton SP',real(T2-T1)/real(clock_rate)
      print*,'calculated potentials:',potential*reps, e_coul*reps
   endif
end subroutine lj_cut_coul_dsf_nonewton_SP
```
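For question 1, here is the kind of alignment check I could add (a minimal sketch; `LOC` is an Intel extension, and the same check would have to run on the coprocessor side to reflect the offloaded copies):

```fortran
! verify that the array starts really sit on 64-byte boundaries
print *, 'nlist    aligned:', mod(loc(nlist(1)),    64) == 0
print *, 'q        aligned:', mod(loc(q(1)),        64) == 0
print *, 'position aligned:', mod(loc(position(1)), 64) == 0
! the directive additionally asserts alignment of the first element
! each iteration touches, i.e. nlist(nnfirst(i)) for every i
print *, 'row 1 start aligned:', mod(loc(nlist(nnfirst(1))), 64) == 0
```

And for question 3, my understanding of why the sign/epsilon expression rounds to the nearest integer: for default real, `1.0/epsilon(x)` is exactly 2**23, and adding then subtracting that constant pushes the fraction bits out of the significand, so what remains is x rounded to the nearest integer (ties to even), valid as long as |x| stays well below 2**23:

```fortran
real*4 :: x, xr
x  = 2.7
! magic-number rounding: (x + 2**23) - 2**23, carrying the sign of x
xr = (x + sign(1.0/epsilon(x), x)) - sign(1.0/epsilon(x), x)   ! xr = 3.0
```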
---
Your `vector aligned` directive asserts that `nlist(nnfirst(i):nnlast(i))` is 64-byte aligned for every i. Thus, if nlist(1) is made aligned by -align array64byte, you are in effect asserting that nnfirst can take on only the values 1, 17, 33, ... (with 4-byte integers, a 64-byte boundary falls every 16 elements), which is OK if that is what you mean. It may be reasonable to assume that the compiler will not apply alignment assertions to position(:) or q(:) anyway, as their indirect accesses have to be vectorized by gather instructions. I don't know that you can expect a significant advantage from an alignment assertion when so much of your work is done in gathers.
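Spelling the arithmetic out: the address of nlist(k) is base + 4*(k-1), so with an aligned base, nlist(k) sits on a 64-byte boundary only when k-1 is a multiple of 16. The directive therefore amounts to asserting something you could verify directly (a sketch):

```fortran
! what "!dir$ vector aligned" on the inner loop effectively asserts:
! every neighbor run starts 16*n+1 elements into nlist
do i = 1, np
   if (mod(nnfirst(i)-1, 16) /= 0) print *, 'unaligned run for particle', i
enddo
```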
Even if the loop were fully vectorized (what do your opt-report and vec-report show?), with so many conditionals and divides, and an unnecessarily high-latency polynomial evaluation, you couldn't expect single precision to be twice as fast as double, even if you weren't running in offload mode. VTune profiling should show you some of the time-consuming obstacles and help you decide on possible faster alternatives. One of the more obvious ones is to replace r*r by dr2, which surely ought to be OK in double precision if you can consider single.
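In code, that one is trivial: since r = sqrt(dr2), the exponential can take the squared distance directly:

```fortran
erfcd = exp(-alpha*alpha*dr2)   ! r*r merely recomputes dr2
```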
Compiler options are offered to give you, e.g., 40-bit precision for divide and sqrt, if you don't need 52 bits.
If you would post actual code which can be compiled, we could look into some of your questions, including whether ieee_rint() can be optimized by recent compilers.
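For reference, here is what that substitution might look like (a sketch; `minimage` is a hypothetical helper of mine, and whether ieee_rint or anint vectorizes well here is exactly what would need checking):

```fortran
! the minimum-image fold written with intrinsic rounding instead of
! the sign/epsilon trick; anint rounds halves away from zero, while
! ieee_rint follows the current rounding mode (nearest-even by default)
elemental function minimage(d, box, ibox) result(dm)
   use, intrinsic :: ieee_arithmetic, only: ieee_rint
   real*4, intent(in) :: d, box, ibox
   real*4 :: dm
   dm = d - box*ieee_rint(d*ibox)   ! or: box*anint(d*ibox)
end function minimage
```

The three folds in the inner loop would then read dx = minimage(dx, box, ibox), and likewise for dy and dz.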
---
Tim, thank you for pointing out nnfirst through nnlast! I was staring so intently at position and ff that I forgot about those two. I am thinking about changing the algorithm to something like:
```fortran
num = numneigh(i)
do j = 1,num
   neigh = nlist((i-1)*MAX_NEIGH + j)
   ....
enddo
```
If I make MAX_NEIGH a multiple of 16, say 1024, then there should be no alignment issue, correct? I am thinking this will allow, for each i, the packing of 16 neighbors at a time. As long as MAX_NEIGH isn't too large, I expect this should help performance.
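Concretely, the padded layout I have in mind (a sketch; `numneigh` holding the real neighbor counts is my assumption from the loop above):

```fortran
! padded neighbor list: row i occupies nlist((i-1)*MAX_NEIGH+1 : i*MAX_NEIGH),
! so with nlist(1) 64-byte aligned and MAX_NEIGH a multiple of 16,
! every row starts on a 64-byte boundary and "vector aligned" is safe
integer, parameter :: MAX_NEIGH = 1024            ! multiple of 16
integer :: nlist(MAX_NEIGH*np), numneigh(np)

do i = 1,np
   num = numneigh(i)
   !dir$ vector aligned
   !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
   do j = 1,num
      neigh = nlist((i-1)*MAX_NEIGH + j)
      ! ...same force evaluation as before...
   enddo
enddo
```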
opt-report and vec-report report vectorization of the inner loop, along with "gather generated for variable q: indirect access".
Could you explain further what you meant by an unnecessarily high-latency polynomial? Am I doing something unnecessary in that calculation?
I have trimmed my code down to the modules necessary for compilation and execution. I apologize for the number of files; this is part of a much larger molecular dynamics code I am writing. The ONLY module of interest here is mod_force.f90. It contains the subroutine in question, lj_cut_coul_dsf_nonewton_sp.
lj_cut_coul_dsf_nonewton_dp is the exact same subroutine, sitting right below lj_cut_coul_dsf_nonewton_sp in mod_force, except that it uses DP arrays, for timing comparisons.
Compile with:

```
ifort -align array64byte -openmp global.f90 mod_memory.f90 get_started.f90 mod_Grofile.f90 mod_pbc.f90 mod_readfile.f90 mod_nondimensionalize.f90 mod_assign_types.f90 mod_buildbond_lists.f90 mod_neighbor.f90 mod_force.f90 mod_restart.f90 MD.f90 -O3
```
Make sure input.txt, mol.txt, and tta.txt are in the same folder.
---
I discussed polynomial evaluation yet again at https://sites.google.com/site/tprincesite/ (the file about expression evaluation).
You needn't be put off by the references to the Fortran standard if you wish to simply skip down to that subject.
About to attempt to build the application.
---
I'm getting the runtime error `Subscript #1 of the array LJ1 has value -858993388 ...`, and the code runs only when built at -O1 without checking. My MIC is an old 4GB model.
If your polynomial proved to be a bottleneck (by no means known at this stage), you might write something like:
```fortran
t * erfcd * (A1 + t*A2 + t**2*(A3 + t*(A4 + t*A5)))
```
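In context (same coefficients, just re-associated; a sketch):

```fortran
! original (Horner: one long serial chain of multiply-adds):
!   erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
! re-associated: t**2 and the two sub-polynomials can evaluate in
! parallel, shortening the critical path at the cost of one multiply
erfcc = t * erfcd * (A1 + t*A2 + t**2*(A3 + t*(A4 + t*A5)))
```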
Once you have things working, you can try fiddling with the imf-domain-exclusion list to speed up divide and sqrt (e.g. by setting -fast=2).
---
I really apologize for that, Tim. To post my input files, I had to change the file extensions from what they usually are, and something got mixed up in the process. The code is fine. If you are willing to take one more stab at it, delete the mol.txt and mod_force.f90 you have, use the attached ones, and all will be well. I will also investigate the optimizations you suggested.
