- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hey everyone, I have the following set of nested loops which I would like to vectorize. If you're familiar with molecular simulations, this is a linked cell energy calculation. I checked the requirements for intel vectorization, and I believe I satisfy them. There aren't multiple branch points. All array accessing, except for envar in the inner most loop, is unit stride. I believe the if statements in these loops can be interpreted as mask-constructs. I keep getting the error 'loop not vectorized: not inner loop," hence the addition of the !dir$ imp simd collapse(4). When i compile with -vec-report2 -guide -openmp the compiler still says "loop not vectorized: not inner loop," despite the collapse command. I would like to port this to the xeon phi coprocessor, so any help with the vectorization here would be greatly appreciated. I understand this is a fair amount of code, but i have tried to add many comments for explanation.
!$omp parallel do schedule(dynamic) reduction(+:energy) default(private) firstprivate(listvar), &
!$omp& shared(r,tr,cv,param,envar,alpha,GaussChar,pf)
!dir$ omp simd collapse(4)
!dir$ vector aligned
!--- loop over all cells present
do i = 0,listvar%ncellT-1
!--- these are pointers to array r which contains the positions of particles along with the particle type.
!--- c1s is the location in r where the first atom in cell i is stored. c1e is where the last atom is stored
c1s = tr(i)%start
c1e = tr(i)%end
!--- loop over cell neighbors
do k=1,26
!---determine this cell
cell_neigh = cv(k,i)%cnum
!--- only calculate energy between particle 1 and particle 2 once.
!---This is done by only calculating energies between cell pairs (i,cell_neigh) where cell_neigh is gt than i
if(cell_neigh.gt.i)then
!---minimum image criterion for distance calculation
minx =cv(k,i)%min_x
miny =cv(k,i)%min_y
minz =cv(k,i)%min_z
!---pointers to positions in array r. Tell where the atoms for cell cell_neigh are contained in r
c2s = tr(cell_neigh)%start
c2e = tr(cell_neigh)%end
!--- loop over particles in each cell
do j = c1s,c1e
x1 = r(j)%x
y1 = r(j)%y
z1 = r(j)%z
type1 = r(j)%type
do l= c2s,c2e
x2 = r(l)%x
y2 = r(l)%y
z2 = r(l)%z
type2 = r(l)%type
!--- obtain displacements
!--- apply minimum image here
dx = x2-x1-minx
dy = y2-y1-miny
dz = z2-z1-minz
dr2 = dx*dx+dy*dy+dz*dz
!--- is distance is less than cutoff, calculate energy
if(dr2.lt.param%rcut2)then
!--- obtain interaction parameters for atoms of type1 and type2
val(:)= envar(type1,type2)%val(:)
dr = sqrt(dr2)
dri = 1.0d0/dr
dr2i = val(1)/dr2
dr6i = dr2i*dr2i*dr2i
dr12i = dr6i*dr6i
!--- lennard jones energy
energy = energy + 4.0d0*val(2)*(dr12i - dr6i)
!--- electrostatic energy
!energy = energy + pf*charge*(erfc(alpha*dr)-GaussChar)
endif
enddo
enddo
endif
enddo
enddo
!$omp end parallel do
Link Copied
- « Previous
-
- 1
- 2
- Next »
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Can you tell me what you ran the code on to obtain those timings? I am running it on a MIC and was only getting about 1080 ms for the first loop. Not much of a speed up.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The first loop
x1 = rnew(1,k); y1 = rnew(2,k); z1 = rnew(3,k) x2 = rnew(1,l); y1 = rnew(2,l); z2 = rnew(3,l)
is likely exceeding an internal capacity of the compiler optimizer to keep track of might be fully vectorizable.
Below shows an excerpt of the fist and second loops. The first loop performs scalar operations to build up vectors using multiple fetch and merge instructions (more work), whereas the second loop uses fully vectorized fetches.
Block 19: 66 vmovsd xmm14, qword ptr [rbp+r14*1+0x6120] // vmovsd is a scalar (1 double) 64 add r15, 0x4 66 vmovsd xmm15, qword ptr [rbp+r14*1+0x6220] // vmovsd is a scalar (1 double) 66 vmovhpd xmm13, xmm14, qword ptr [rbp+r14*1+0x61a0] // vmovhpt is a scalar (1 double) move to the upper half of 2-double vector 66 vmovhpd xmm14, xmm15, qword ptr [rbp+r14*1+0x62a0] // vmovhpd is a scalar (1 double) move to the upper half of 2-double vector 66 vmovsd xmm15, qword ptr [rbp+r14*1+0x6130] // vmovsd is a scalar (1 double) 66 vmovhpd xmm15, xmm15, qword ptr [rbp+r14*1+0x61b0] 66 vinsertf128 ymm13, ymm13, xmm14, 0x1 // combine two 2-double vector into one 4-double vector 66 vmovsd xmm14, qword ptr [rbp+r14*1+0x6230] // vmovsd is a scalar (1 double) 66 vmovhpd xmm14, xmm14, qword ptr [rbp+r14*1+0x62b0] // vmovhpd is a scalar (1 double) move to the upper half of 2-double vector 64 add r14, 0x200 66 vinsertf128 ymm14, ymm15, xmm14, 0x1 // combine two 2-double vector into one 4-double vector 69 vsubpd ymm15, ymm9, ymm13 // now perform 4-wide vector operations 69 vsubpd ymm13, ymm10, ymm14 69 vmulpd ymm14, ymm15, ymm15 69 vmulpd ymm13, ymm13, ymm13 69 vaddpd ymm15, ymm14, ymm8 69 vaddpd ymm13, ymm15, ymm13 71 vcmppd ymm14, ymm13, ymm1, 0x1 72 vandpd ymm15, ymm0, ymm14 72 vaddpd ymm3, ymm15, ymm3 64 cmp r15, r11 64 jb 0x140001360 <Block 19> ----------------------------- Block 43: 97 lea r12, ptr [r14*8] 95 add r14, 0x4 97 lea r15, ptr [r12+rax*8] 100 vsubpd ymm12, ymm8, ymmword ptr [rbp+r15*1+0x759398] 100 vsubpd ymm14, ymm9, ymmword ptr [rbp+r15*1+0x7ce698] 100 vsubpd ymm15, ymm13, ymmword ptr [rbp+r15*1+0x843998] 100 vmulpd ymm12, ymm12, ymm12 100 vmulpd ymm14, ymm14, ymm14 100 vmulpd ymm15, ymm15, ymm15 100 vaddpd ymm12, ymm12, ymm14 100 vaddpd ymm12, ymm12, ymm15 102 vcmppd ymm14, ymm12, ymm1, 0x1 // TimP note non-branching merge 103 vandpd ymm15, ymm0, ymm14 103 vaddpd ymm3, ymm15, ymm3 95 cmp r14, r11 95 jb 0x1400016b0 <Block 43>
The above were gathered from VTune.
If you lack having VTune, then between the two loops insert a function call to something innocuous say SLEEPQQ(0).
You are not interested in performing the innocuous function. Your interest is in placing a break point into fully optimized code at a line number that has a single representation in the code .AND. is in close proximity of the code of interest. When you reach the break, using you debugger of choice, then select the Dissassembly window to look at the code. You will have to hunt a little because the above code does not show the loop setup and loop termination sections of the code.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Ok aweome, I went to my actual code and was getting some weird stuff. My vectorization varies if the variable being summed in the innermost loop has also been passed as an input argument to the subroutine. Wrote the following example:
program vec
use ifport
use modcell_en
implicit none
!--- convention array
double precision :: rnew(3,60000)
integer :: start(1000), end(1000)
!--- bs variables
double precision :: energy
double precision :: x1,y1,z1,x2,y2,z2,dr2
integer :: i,j,k,l,count
integer :: c1s,c1e,c2s,c2e
call srand(10)
!--- dont care about this. this is just initializing random positions
do i = 1,60000
rnew(1,i) = 10*rand(); rnew(2,i) = 10*rand(); rnew(3,i) = 10*rand()
part%x(i) = rnew(1,i); part%y(i) = rnew(2,i); part%z(i) = rnew(3,i)
end do
!--- dont car about this either. just assigning pointers to rnew and part
count = 0
do i = 1,1000
start(i) = count*60+1
end(i) = (count+1)*60
count = count + 1
enddo
call get_energy(energy)
stop
end program vec
module modcell_en
type r
double precision :: x(60000),y(60000),z(60000)
end type r
!--- global structure of array variable
type(r) :: part
integer :: start(60000),end(60000)
contains
subroutine get_energy(energy)
implicit none
double precision :: energy, pot
double precision :: x1,y1,z1
double precision :: x2,y2,z2
double precision :: dr,dr2,dri,dr2i,dr6i,dr12i
integer :: i,j,k,l
integer :: c1s,c1e,c2s,c2e
energy =0.0d0
!---main code I would like to port to MIC in array of structure format
do i= 1,1000
do j = 1,1000
c1s = start(i)
c1e = end(i)
c2s = start(j)
c2e = end(j)
do l=c2s,c2e
do k = c1s,c1e
x1 = part%x(k); y1 = part%y(k); z1 = part%z(k)
x2 = part%x(l); y2 = part%y(l); z2 = part%z(l)
dr2 = (x2-x1)**2+(y2-y1)**2+(z2-z1)**2
if(dr2.lt.2)then
dr = sqrt(dr2)
dri = 1.0d0/dr
dr2i = dri*dri
dr6i = dr2i**3
dr12i = dr6i**2
energy = energy + 4.0d0*(dr12i-dr6i)
endif
enddo
enddo
enddo
enddo
end subroutine get_energy
end module modcell_en
In this scenario when I vectorize with IFORT -align -c -xSSES -vec-report2 modcell_en.f90 vec.f90
innermost loop: not vectorized, existence of vector dependence
all other loops: not vectorized, not inner loop.
However, if I make a local variable be summed, say some dummy variable pot, all three inner loops vectorize!
if(dr2.lt.2)then
dr = sqrt(dr2)
dri = 1.0d0/dr
dr2i = dri*dri
dr6i = dr2i**3
dr12i = dr6i**2
pot = pot + 4.0d0*(dr12i-dr6i)
endif
Even weirder to me, if I then try to assign that dummy variable to the variable passed to the subroutine, only the inner loop vectorizes.
if(dr2.lt.2)then
dr = sqrt(dr2)
dri = 1.0d0/dr
dr2i = dri*dri
dr6i = dr2i**3
dr12i = dr6i**2
pot = pot + 4.0d0*(dr12i-dr6i)
endif
energy = pot
Any suggestions here? If I could get all three loops to vectorize, I would go crazy.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Why not use -align array32byte -msse4.1 if your target CPUs are SSE4? If your compiler is so old that it doesn't accept -align array32byte you should tell us the version. I have no idea what SSE5 would do. -align array64byte is important for MIC, and shouldn't hurt on host.
If your cases vary as to whether the sum reductions are vectorized, you would expect the vectorized version to be slightly more accurate.
You can't expect the compiler to simplify away the sqrt (which ought to help vectorization and certainly will help performance). It may try odd things with approximate inverse sqrt when the option -no-prec-sqrt is in effect (and that is the only way which isn't dead slow on MIC).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
The only important loop to vectorize is the inner most loop.
As a programming practice (and aid to compiler optimization), move your loop invariant code outside the invariant loop. IOW move the x2 = ... out of the do k loop. Although the compiler should do this, and does for most cases, it won't hurt you to do this. If you were doing this calculation by hand, wouldn't you lookup the values for x2, y2, z2 once per advance in l?
I too find it peculiar that changing the dummy variable energy for local variable pot changes the optimization.
On the newer compiler you have the
!DIR$ SIMD [clause[[,] clause] ...]
Your sketch code is using "if(dr2.lt.2) then". Use 2.0d0. The compiler, in order to maintain backwards consistency, may be required to use integer 2 and then convert to double. You will have the same issue if you use dr2 and 2.0 where dr2 is a double. (it is OK to use **2,...)
Tim's suggestion of changing the precision against performance is good, provided that you can deal with the loss in precision.
In looking at the inner loop you find that it performs a square of the distance calculation then performs a filter operation to select which pair of particles are to contribute to the final energy.
The question you need to ask (and answer) yourself: Do the preponderance of separations exclude to contributions of energy?
If the answer to this is No, meaning most of the particle paring contribute (execute the if(dr2.lt.2.0d0) section), then consider
if(dr2.gt.2.0d0) dr2 = 2.0d0 ! avoid overflow dr = sqrt(dr2) dri = 1.0d0/dr dr2i = dri*dri dr6i = dr2i**3 dr12i = dr6i**2 if(dr2.lt.2.0d0) energy = energy + 4.0d0*(dr12i-dr6i)
While the above is performing unnecessary calculations which are discarded, these calculations are occurring within different lanes of an 8-wide vector. These calculations are being performed at the same time and are essentially free (time wise). The above technique should experience speed-up when at least on average 2 out of 8 particle parings contribute to the energy.
If the pairing probability is less than 1 in 8, then consider converting the inner loop into a picking loop, where the picked items (those .lt. 2.0d0) are written to an output array. Then following the picking loop, a new second loop is executed on the sub-set (using vectors) to find the energy contribution.
Jim Dempsey
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
I should also mention, that if the picking loop strategy is used, that the entire iteration space over i and j my find clusters of near particle pairings, and conversely clusters of disparate particle pairings.
If (when) this is the situation, then consider having each thread generate a picking list in the i and j loops but not process the selected energies. At then end of the i/j loops, but inside the parallel region, have each thread publish the count of picks, have a barrier, then have each thread copy the picks to a consolidation array of picks (using accumulated pick counts of lesser threads as starting offset. Then finally perform a new omp do calculation of the energy contributions. In this manner, the high computation section (otherwise performed by the few threads hitting the clusters) is equally distributed amongst all the threads.
Jim Dempsey
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page
- « Previous
-
- 1
- 2
- Next »