packed store operation for vectorization on MIC

conor_p_ · ‎09-10-2014

Hey everyone, I have a loop structure that looks like the following.

do all atoms i 
     numneigh(i) = 0

     do all potential neighbors k do

          j = potential_neighbor(k)
          delx = x(j)-x(i); dely = y(j)-y(i); delz = z(j)-z(i)
          
         dr2 = delx**2 + dely**2 + delz**2

          if(dr2.lt.rcut)
                   numneigh(i) = numneigh(i)+1
                   neighbor(i)(numneigh(i)) = j
           endif
       end do
enddo

Now I am reading a paper discussing how we can implement this code efficiently on a MIC. Clearly the above loop will not auto vectorize due to the loop dependence in lines 12 and 13. They mention that an efficient way to vectorize appending to slits in this manner is to use a packed store. They further mention: "For a SIMD register packed with req values, the result of a comparison with Rc+Rs is a W bit mask, and a packed store write a a subset of indices to contiguous memory based upon the mask." However as a Chemical Engineering PhD, I don't know really know whats going on here. I ran a decent google search, and the info was a little above my head. Could anyone explain this concept further to me. And if possible, modify the above code for me with this procedure so that I have an example to look at?

TimP · ‎09-11-2014

I don't see what it may have to do with your pseudo-code, but ifort for MIC can compile Fortran PACK into native pack instructions, giving about twice the performance of C++ in my test. So, if PACK is appropriate for your usage, go ahead and use it.

conor_p_ · ‎09-11-2014

The point of the packed store is to get that loop to vectorize. Do you see another way to do this? Here is the full paragraph from the paper I am trying to decipher. Tim if this makes any sense to you, could you provide a pseudo code for me to look at?

"For a SIMD register packed with rsq values, the result of a comparison with rcut is a W-bit mask, and a packed store writes a subset of indices to contiguous memory based up this mask. KCI includes a packed store instruction, which we can emulate in other SIMD instructions sets; both 128-bit, and 256-bit AVX, we achieve it with a mask loop up, a single shuffle, and a vector store. We determine the number of neighbors appended to the list by counting the number of bits in the comparison mask:

TaylorIoTKidd · ‎10-23-2014

Conor,

We missed your post.

Do you still need an answer to your question?

Regards
--
Taylor

jimdempseyatthecove · ‎10-24-2014

We may be running into the danger of how to optimize the inner most portion of a ~~bad~~ inefficient algorithm. I do not think there is sufficient information to provide best guidance.

Presumptuously this is an n-body situation. It is undefined as to what "all potential neighbors" is. How did potential_neighbor() get filled in? Does the (dr2.lt.rcut) specify the next iteration potential_neighbor()? If this is an n-body problem wouldn't it make sense to save dr2 for use in subsequent computation?

These are just a few of the questions.

Other questions could be are the atoms periodically relocated such that they are clustered together in order to improve vectorization and locality?

Jim Dempsey

conor_p_ · ‎10-27-2014

Yes, you are correct. This is an n body simulation. I was trying to not bog you down in details, but it looks like I didn't provide enough information. There is a linked list type algorithm. potential_neighbor(i) would contain pointers to atoms that are within one of the 26 neighboring cells that atom i is in. I am looking for how to handle that conditional if statement, but If I need to provide the full code just let me know.

TaylorIoTKidd · ‎11-14-2014

Conor,

Is the paper authored by Intel? Please provide a reference to it.

Regards
--
Taylor

jimdempseyatthecove · ‎11-14-2014

Details....

Are all the atoms within each neighboring cell contained in each linked list?

Am I correct to assume that the purpose of the linked list(s) are to permit atoms to migrate from cell to cell?

If so, then it might be more beneficial to move the entire atom (or high compute portion of the atom) from cell to cell than to move the pointer. The moves only occur when an atom migrates a cell barrier. By doing this you may be enabling better vectorization. Note, you need not squish out the vacated atom positions, at least not at every instance of a cell migration. Just structure the vacated cells such that they produce null results (being careful to compute in a manner that avoids divide by zero), Note massless and/or chargeless atoms produce no force. Periodically you could squish out the null entries.

Jim Dempsey