vectorization on 64 byte windows for use on MIC

conor_p_ · ‎06-09-2014

so I currently have a code in fortran whose premise is shown below. Recently I have tried to export my code to a MIC, and obtained the segfault warning. Upon some research, I learned this was most likely due to my vectors not being aligned properly. My goal here is to always have the cl, loc_r, and cv variables aligned on 64 byte boundaries for use on a xeon phi coprocessor. However, I don't know how to get the allocate command to do this. Assuming all the vectors are aligned, where is the appropriate place in dosomecalc() to place the !dir$ vector alligned statement? I have read both the fortran array data intel document, and the data alignment to assist vectorization document. However neither of them appear to answer the scenario I am in. I am a chemical engineer and not a computer scientist, and its quite possible that I just was not able to understand the documents. Any help or advice would greatly be appreciated.

module mymod

implicit none

type cell_data
double precision, allocatable :: loc_r(:,:)
end type cell_data

type vlist
integer :: cnum
double precision :: minx,miny,minz
end type vlist

type list
type(cell_data), allocatable :: cl(:)
type(vlist), allocatable :: cv(:,:)
end type list

type(list) :: listvar

contains

subroutine alloc()


if(conditon1.eq.true)then
deallocate(listvar%cl)
allocate(listvar%cl(0:new_size)

deallocate(listvar%cv)
allocate(listvar%cv(0:new_size)

do i = 1,new_size
allocate(listvar%cl(i)%loc_r(3,someothersize)
enddo

elseif(condition2.eq.true)then
do i = 1,cur_size
deallocate(listvar%cl(i)%loc_r)
allocate(listvar%cl(i)%loc_r(3,somenewsize)
enddo

end if

end subroutine alloc


end module mymod


!--- this is another module
subroutine dosomecalc()
use mymod
implicit none

!--- loop over all cells
do i = 0,listvar%ncellT-1

do k=1,26

!---determine this cell
cell_neigh = listvar%cv(k,i)%cnum

if(cell_neigh.gt.i)then

!---minimum image
minx =listvar%cv(k,i)%min_x
miny =listvar%cv(k,i)%min_y
minz =listvar%cv(k,i)%min_z


!--- number of particles in cell i
cell1_num = listvar%cl(i)%num

!--- loop over particles in each cell

do j = 1, cell1_num


x1 = listvar%cl(i)%loc_r(1,j)
y1 = listvar%cl(i)%loc_r(2,j)
z1 = listvar%cl(i)%loc_r(3,j)

!--- number of particles in cell_neigh
cell2_num = listvar%cl(cell_neigh)%num


do l= 1, cell2_num


x2 = listvar%cl(cell_neigh)%loc_r(1,l)
y2 = listvar%cl(cell_neigh)%loc_r(2,l)
z2 = listvar%cl(cell_neigh)%loc_r(3,l)

!--- obtain displacements
!--- apply minimum image here
dx = x2-x1-minx
dy = y2-y1-miny
dz = z2-z1-minz

dr2 = dx*dx+dy*dy+dz*dz
if(dr2.lt.param%rcut2)then

dr2i = 1.0d0/dr2
dr6i = dr2i*dr2i*dr2i
dr12i = dr6i*dr6i
energy = energy + dr12i - dr6i
endif
enddo
enddo
endif
enddo
enddo

end subroutine dosomecalc

TimP · ‎06-10-2014

These alignment issues aren't standard Fortran, so the Intel compiler user guide (in documents directory of compiler installation) is the reference you need.

!dir$ vector aligned

(as a prelude to vectorizable data assignment) doesn't bring about alignment; on MIC it would likely ensure segfault when encountering data which are misaligned. Without alignment assertions, the compiler must build in alignment adjustment, so there seems room to doubt your suspicion that alignment is your problem (although it may be important for performance).

The feature in your quoted code which looks problematical is your declaration of type vlist which requires padding after the integer element to allow the following double precision to be aligned, even to an 8 byte boundary. As the docs tell you, an attribute align directive inside the type declaration block would be the only way to correct this, if in fact this is your problem. If the compiler didn't build in padding to an 8-byte boundary the way you have shown it, it may be a bug, but you have contributed to make that happen.

Read about !dir$ attributes align ..... in the compiler user guide . The attributes directive is placed in the procedure which declares the array.

There is also the compile line option -align array64byte which would relieve you of the need to declare alignment of individual data objects and make it easy to optimize for other targets, e.g. array32byte for corei7.

conor_p_ · ‎06-10-2014

Ok, thanks for that advice! Unfortunately, it still did not resolve my issue. I am still receiving the following error:

offload error: process on the device was terminated by signal 11

The loop in subroutine dosomecalc() is the one I am attempting to export using

!$dir$ offload begin target(mic:0)

!omp parallel do schedule(dynamic) reduction(+:energy) default(private) shared(param,listvar)

Also, when I tried !$dir$ offload begin target(mic:0) IN(param,listvar) I got an error saying variables with intent in cannot have allocatable arrays. However when I got rid of the IN statement, I no longer received that error. Sorry if I am now shifting from the topic that is the title of this post, let me know if I should just post a new question.

conor_p_ · ‎06-10-2014

Sorry, I should also note that I am using -align -align records -align array64byte -align rec32byte

Frances_R_Intel · ‎06-10-2014

I'm afraid I'm with Tim. I do not believe unaligned data is what caused you to segfault.

I saw your post on StackOverflow yesterday and wasn't quite sure how to respond. I notice that you also posted your problem in another forum issue here on the Intel site (https://software.intel.com/en-us/forums/topic/516307). If it is ok with you we can refer that issue to this one and just deal with the issue here.

Your Fortran code has a very C++ feel to it. (Of course, that might be because I am an old Fortran programmer from the time when Fortran was simpler.) Is C++ your preferred language? That may be part of the problem you are having. In the line:

allocate(listvar%cl(0:new_size)

you are allocating new_size+1 elements of unknown size. (Notice I said new_size+1 elements - in Fortran, array syntax is (first_element:last_element) rather than (first_element:count).) I believe you are thinking of cl as an array of pointers to objects of cell_type. But actually, it is an array of objects of cell_type which have unknown size. Perhaps this is valid Fortran - as I say, I am an old fossil when it comes to this new fangled Fortran - but I don't believe the compiler is going to do what you want it to here. Perhaps if you actually made cl an array of pointers to cell_data?

If you do decide to try making cl an array of pointers, you will want to be careful about your deallocations - before you deallocate the cl array, you will want to deallocate the cell_data being pointed to.

Of course, being the old fossil that I am, I would probably get rid of these structures and just use simple arrays. So much easier for the compiler to vectorize (and for us old fossils to read.)

conor_p_ · ‎06-10-2014

Its interesting you think C++ is my default. I barely have any experience coding in C++. The code runs fine in serial or openmp across cores. However when I switch to the MIC, that is when I segfault. I don't know if that helps you determine the error at all. I have never debugged on a MIC before, so I am at a loss.

conor_p_ · ‎06-10-2014

Is it possible that listvar is not being copied to the MIC? I still don't understand why it would complain that I am using a data type which contain allocatable arrays when I use IN(listvar), but does not complain when I do not use this statement. Perhaps this is indicitave of something?

Kevin_D_Intel · ‎06-11-2014

Where exactly did you place the offload/omp directives?

We do not support offloading of a derived-type with an allocatable component in the current release so that’s probably related to the seg-fault. Currently the derived-type must be bitwise-copyable without any allocatable component, so all fields must be numeric. We are adding support for this in our next release (currently in Beta) due out later this year; however, the support is only for a numeric allocatable component meaning it will not support nested allocations so what you have is not supported, but I don’t know if you cannot use offload for your case at all. I haven’t followed where you placed the offload/omp directives (on the inner-most loop?). Can you clarify that and try posting the actual code showing that using the “{code}” insert tool to help with readability.

Also, the data exclusive to the coprocessor is not limited in terms of its complexity. If you are able to recreate the data corresponding to your structures on the coprocessor then you could use your structures on the coprocessor exclusively.

conor_p_ · ‎06-11-2014

Oh man, that is potentially game ending news for me. Monte Carlo is such a pain to parallelize. Here is the code reposted!

!$dir$ offload begin target(mic:0)

!omp parallel do schedule(dynamic) reduction(+:energy) default(private) shared(param,listvar)

!--- loop over all cells

do i = 0,listvar%ncellT-1
do k=1,26
!---determine this cell
cell_neigh = listvar%cv(k,i)%cnum
if(cell_neigh.gt.i)then

!---minimum image

minx =listvar%cv(k,i)%min_x
miny =listvar%cv(k,i)%min_y
minz =listvar%cv(k,i)%min_z

!--- number of particles in cell i

cell1_num = listvar%cl(i)%num

!--- loop over particles in each cell

do j = 1, cell1_num

x1 = listvar%cl(i)%loc_r(1,j)
y1 = listvar%cl(i)%loc_r(2,j)
z1 = listvar%cl(i)%loc_r(3,j)

!--- number of particles in cell_neigh

cell2_num = listvar%cl(cell_neigh)%num

do l= 1, cell2_num

x2 = listvar%cl(cell_neigh)%loc_r(1,l)
y2 = listvar%cl(cell_neigh)%loc_r(2,l)
z2 = listvar%cl(cell_neigh)%loc_r(3,l)

!--- obtain displacements

!--- apply minimum image here

dx = x2-x1-minx
dy = y2-y1-miny
dz = z2-z1-minz

dr2 = dx*dx+dy*dy+dz*dz

if(dr2.lt.param%rcut2)then

dr2i = 1.0d0/dr2
dr6i = dr2i*dr2i*dr2i
dr12i = dr6i*dr6i
energy = energy + dr12i - dr6i


endif
enddo
enddo
endif
enddo
enddo

!$omp end parallel do

!dir$ end offload
end subroutine dosomecalc

jimdempseyatthecove · ‎06-11-2014

You might find better vectorization opportunities by changing from one 3D XYZ storage format to three 1D formats.

Logically this:

!--- loop over all cells
do i = 0,listvar%ncellT-1
  do k=1,26
    !---determine this cell
    cell_neigh = listvar%cv(k,i)%cnum
    if(cell_neigh.gt.i)then
      !---minimum image
      minx =listvar%cv(k,i)%min_x
      miny =listvar%cv(k,i)%min_y
      minz =listvar%cv(k,i)%min_z
      !--- number of particles in cell i
      cell1_num = listvar%cl(i)%num
      !--- loop over particles in each cell
      do j = 1, cell1_num
        x1 = listvar%cl(i)%loc_r_X(j)  ! independent 1D array of Xs
        y1 = listvar%cl(i)%loc_r_Y(j)  ! independent 1D array of Ys
        z1 = listvar%cl(i)%loc_r_Z(j)  ! independent 1D array of Zs
        !--- number of particles in cell_neigh  
        cell2_num = listvar%cl(cell_neigh)%num
        do l= 1, cell2_num
          x2 = listvar%cl(cell_neigh)%loc_r_X(l)
          y2 = listvar%cl(cell_neigh)%loc_r_Y(l)
          z2 = listvar%cl(cell_neigh)%loc_r_Z(l)
          !--- obtain displacements
          !--- apply minimum image here
          dx = x2-x1-minx
          dy = y2-y1-miny
          dz = z2-z1-minz
          dr2 = dx*dx+dy*dy+dz*dz
          if(dr2.lt.param%rcut2)then
            dr2i  = 1.0d0/dr2
            dr6i  = dr2i*dr2i*dr2i
            dr12i = dr6i*dr6i
            energy = energy + dr12i - dr6i
          endif
        enddo
      enddo         
    endif
  enddo
enddo

However, the conditional if test may interfere with vectorization. Consider changing

          if(dr2.lt.param%rcut2)then
            dr2i  = 1.0d0/dr2
            dr6i  = dr2i*dr2i*dr2i
            dr12i = dr6i*dr6i
            energy = energy + dr12i - dr6i
          endif

to

          dr2i  = 1.0d0/dr2
          dr6i  = dr2i*dr2i*dr2i
          dr12i = dr6i*dr6i - dr6i
          if(dr2.lt.param%rcut2) energy = energy + dr12i

The optimization could then perform a masked move. Although you may be "needlessly" executing 6 operations, this can be done as vector, which may reduce your loop trip count by a factor of 8 or 16.

Before you make the edit, verify that the compiler wasn't smart enough to figure this out for you.

Jim Dempsey

conor_p_ · ‎06-11-2014

If I took the allocatable arrays out of the data type and just have them declared as global allocatable array, can I pass them to the MIC then? Does the MIC only only static arrays to be passed to it?

Kevin_D_Intel · ‎06-11-2014

You can use global allocatable arrays. Arrays need not be only static. If you have not already, have a peek at the product samples under: C:\Program Files (x86)\Intel\Composer XE\Samples\en_US\Fortran\mic_samples.zip

jimdempseyatthecove · ‎06-11-2014

As Kevin stated earlier derived types (user defined types) with allocated components (IOW not POD/Plane Old Data) cannot be passed with an offload.

If these data objects live preponderantly inside the MIC, then perform the allocations inside the MIC. You would then need to construct a means to offload out these compound objects. This typically involves packing the contents into a contiguous blob, then transferring that. Not unlike what you would do to checkpoint the data to a file. In this case the "file" is a temporary array used to import/export the user defined type. It should be relatively easy to write the member functions to do this (at least easy after you have done it once).

Jim Dempsey