Software Archive
Read-only legacy content
Announcements
FPGA community forums and blogs on community.intel.com are migrating to the new Altera Community and are read-only. For urgent support needs during this transition, please visit the FPGA Design Resources page or contact an Altera Authorized Distributor.
17060 Discussions

vectorization on 64 byte windows for use on MIC

conor_p_
Beginner
1,388 Views

so I currently have a code in fortran whose premise is shown below. Recently I have tried to export my code to a MIC, and obtained the segfault warning. Upon some research, I learned this was most likely due to my vectors not being aligned properly. My goal here is to always have the cl, loc_r, and cv variables aligned on 64 byte boundaries for use on a xeon phi coprocessor. However, I don't know how to get the allocate command to do this. Assuming all the vectors are aligned, where is the appropriate place in dosomecalc() to place the !dir$ vector alligned statement? I have read both the fortran array data intel document, and the data alignment to assist vectorization document. However neither of them appear to answer the scenario I am in. I am a chemical engineer and not a computer scientist, and its quite possible that I just was not able to understand the documents. Any help or advice would greatly be appreciated.


    module mymod
      
      implicit none
    
      type cell_data
         double precision, allocatable :: loc_r(:,:)
      end type cell_data
    
      type vlist
         integer :: cnum
         double precision :: minx,miny,minz
      end type vlist
    
      type list
         type(cell_data), allocatable :: cl(:)
         type(vlist), allocatable :: cv(:,:)
      end type list
    
      type(list) :: listvar
    
      contains
    
        subroutine alloc()
          
    
          if(conditon1.eq.true)then
             deallocate(listvar%cl)
             allocate(listvar%cl(0:new_size)

             deallocate(listvar%cv)
             allocate(listvar%cv(0:new_size)
             
             do i = 1,new_size
                allocate(listvar%cl(i)%loc_r(3,someothersize)
             enddo
    
          elseif(condition2.eq.true)then
             do i = 1,cur_size
                deallocate(listvar%cl(i)%loc_r)
                allocate(listvar%cl(i)%loc_r(3,somenewsize)
             enddo
    
          end if
    
        end subroutine alloc
             
    
      end module mymod
    
    
      !--- this is another module
      subroutine dosomecalc()
        use mymod
        implicit none
    
        !--- loop over all cells
        do i = 0,listvar%ncellT-1
    
           do k=1,26
        
              !---determine this cell
              cell_neigh = listvar%cv(k,i)%cnum
              
              if(cell_neigh.gt.i)then
    
                 !---minimum image
                 minx =listvar%cv(k,i)%min_x
                 miny =listvar%cv(k,i)%min_y
                 minz =listvar%cv(k,i)%min_z
    
    
                 !--- number of particles in cell i
                 cell1_num = listvar%cl(i)%num
           
                 !--- loop over particles in each cell
             
                 do j = 1, cell1_num
                    
                                
                    x1 = listvar%cl(i)%loc_r(1,j)
                    y1 = listvar%cl(i)%loc_r(2,j)
                    z1 = listvar%cl(i)%loc_r(3,j)
                                                                                               
                    !--- number of particles in cell_neigh  
                    cell2_num = listvar%cl(cell_neigh)%num
                    
                   
                    do l= 1, cell2_num
                                     
    
                       x2 = listvar%cl(cell_neigh)%loc_r(1,l)
                       y2 = listvar%cl(cell_neigh)%loc_r(2,l)
                       z2 = listvar%cl(cell_neigh)%loc_r(3,l)
                       
                       !--- obtain displacements
                       !--- apply minimum image here
                       dx = x2-x1-minx
                       dy = y2-y1-miny
                       dz = z2-z1-minz
                       
           
                       dr2 = dx*dx+dy*dy+dz*dz
                       if(dr2.lt.param%rcut2)then
                                            
                          dr2i  = 1.0d0/dr2
                          dr6i  = dr2i*dr2i*dr2i
                          dr12i = dr6i*dr6i
                          energy = energy + dr12i - dr6i                 
                       endif
                    enddo
                 enddo         
              endif
           enddo
        enddo
    
      end subroutine dosomecalc

 

0 Kudos
12 Replies
TimP
Honored Contributor III
1,388 Views

These alignment issues aren't standard Fortran, so the Intel compiler user guide (in documents directory of compiler installation) is the reference you need.

!dir$ vector aligned

(as a prelude to vectorizable data assignment) doesn't bring about alignment; on MIC it would likely ensure segfault when encountering data which are misaligned.  Without alignment assertions, the compiler must build in alignment adjustment, so there seems room to doubt your suspicion that alignment is your problem (although it may be important for performance).

The feature in your quoted code which looks problematical is your declaration of type vlist which requires padding after the integer element to allow the following double precision to be aligned, even to an 8 byte boundary.  As the docs tell you, an attribute align directive inside the type declaration block would be the only way to correct this, if in fact this is your problem.  If the compiler didn't build in padding to an 8-byte boundary the way you have shown it, it may be a bug, but you have contributed to make that happen.

Read about !dir$ attributes align ..... in the compiler user guide .  The attributes directive is placed in the procedure which declares the array.

There is also the compile line option -align array64byte which would relieve you of the need to declare alignment of individual data objects and make it easy to optimize for other targets, e.g. array32byte for corei7.

0 Kudos
conor_p_
Beginner
1,388 Views

Ok, thanks for that advice! Unfortunately, it still did not resolve my issue. I am still receiving the following error:

offload error: process on the device was terminated by signal 11

The loop in subroutine dosomecalc() is the one I am attempting to export using

!$dir$ offload begin target(mic:0)

!omp parallel do schedule(dynamic) reduction(+:energy) default(private) shared(param,listvar)

Also, when I tried !$dir$ offload begin target(mic:0) IN(param,listvar) I got an error saying variables with intent in cannot have allocatable arrays. However when I got rid of the IN statement, I no longer received that error. Sorry if I am now shifting from the topic that is the title of this post,  let me know if I should just post a new question.

0 Kudos
conor_p_
Beginner
1,388 Views

Sorry, I should also note that I am using -align -align records  -align array64byte -align rec32byte

0 Kudos
Frances_R_Intel
Employee
1,388 Views

I'm afraid I'm with Tim. I do not believe unaligned data is what caused you to segfault.

I saw your post on StackOverflow yesterday and wasn't quite sure how to respond. I notice that you also posted your problem in another forum issue here on the Intel site (https://software.intel.com/en-us/forums/topic/516307). If it is ok with you we can refer that issue to this one and just deal with the issue here.

Your Fortran code has a very C++ feel to it. (Of course, that might be because I am an old Fortran programmer from the time when Fortran was simpler.) Is C++ your preferred language? That may be part of the problem you are having. In the line:

allocate(listvar%cl(0:new_size)

you are allocating new_size+1 elements of unknown size. (Notice I said new_size+1 elements - in Fortran, array syntax is (first_element:last_element) rather than (first_element:count).) I believe you are thinking of cl as an array of pointers to objects of cell_type. But actually, it is an array of objects of cell_type which have unknown size. Perhaps this is valid Fortran - as I say, I am an old fossil when it comes to this new fangled Fortran - but I don't believe the compiler is going to do what you want it to here. Perhaps if you actually made cl an array of pointers to cell_data?

If you do decide to try making cl an array of pointers, you will want to be careful about your deallocations - before you deallocate the cl array, you will want to deallocate the cell_data being pointed to.

Of course, being the old fossil that I am, I would probably get rid of these structures and just use simple arrays. So much easier for the compiler to vectorize (and for us old fossils to read.)

 

0 Kudos
conor_p_
Beginner
1,388 Views

Its interesting you think C++ is my default. I barely have any experience coding in C++. The code runs fine in serial or openmp across cores. However when I switch to the MIC, that is when I segfault. I don't know if that helps you determine the error at all. I have never debugged on a MIC before, so I am at a loss.

0 Kudos
conor_p_
Beginner
1,388 Views

Is it possible that listvar is not being copied to the MIC? I still don't understand why it would complain that I am using a data type which contain allocatable arrays when I use IN(listvar), but does not complain when I do not use this statement. Perhaps this is indicitave of something? 

0 Kudos
Kevin_D_Intel
Employee
1,388 Views

Where exactly did you place the offload/omp directives?

We do not support offloading of a derived-type with an allocatable component in the current release so that’s probably related to the seg-fault. Currently the derived-type must be bitwise-copyable without any allocatable component, so all fields must be numeric. We are adding support for this in our next release (currently in Beta) due out later this year; however, the support is only for a numeric allocatable component meaning it will not support nested allocations so what you have is not supported, but I don’t know if you cannot use offload for your case at all. I haven’t followed where you placed the offload/omp directives (on the inner-most loop?).  Can you clarify that and try posting the actual code showing that using the “{code}” insert tool to help with readability.

Also, the data exclusive to the coprocessor is not limited in terms of its complexity. If you are able to recreate the data corresponding to your structures on the coprocessor then you could use your structures on the coprocessor exclusively.

0 Kudos
conor_p_
Beginner
1,388 Views

Oh man, that is potentially game ending news for me. Monte  Carlo is such a pain to parallelize. Here is the code reposted!

!$dir$ offload begin target(mic:0)

!omp parallel do schedule(dynamic) reduction(+:energy) default(private) shared(param,listvar)

   !--- loop over all cells


        do i = 0,listvar%ncellT-1
           do k=1,26
              !---determine this cell
              cell_neigh = listvar%cv(k,i)%cnum
              if(cell_neigh.gt.i)then


                 !---minimum image

                 minx =listvar%cv(k,i)%min_x
                 miny =listvar%cv(k,i)%min_y
                 minz =listvar%cv(k,i)%min_z

   
                 !--- number of particles in cell i

                 cell1_num = listvar%cl(i)%num


                 !--- loop over particles in each cell
             
                 do j = 1, cell1_num
                               
                    x1 = listvar%cl(i)%loc_r(1,j)
                    y1 = listvar%cl(i)%loc_r(2,j)
                    z1 = listvar%cl(i)%loc_r(3,j)
                                                                                               
                    !--- number of particles in cell_neigh  

                    cell2_num = listvar%cl(cell_neigh)%num


                    do l= 1, cell2_num
                                   

                       x2 = listvar%cl(cell_neigh)%loc_r(1,l)
                       y2 = listvar%cl(cell_neigh)%loc_r(2,l)
                       z2 = listvar%cl(cell_neigh)%loc_r(3,l)
                       
                       !--- obtain displacements

                       !--- apply minimum image here

                       dx = x2-x1-minx
                       dy = y2-y1-miny
                       dz = z2-z1-minz
                      
                       dr2 = dx*dx+dy*dy+dz*dz

                       if(dr2.lt.param%rcut2)then
                                            
                          dr2i  = 1.0d0/dr2
                          dr6i  = dr2i*dr2i*dr2i
                          dr12i = dr6i*dr6i
                          energy = energy + dr12i - dr6i

                
                       endif
                    enddo
                 enddo         
              endif
           enddo
        enddo

      !$omp end parallel do

      !dir$ end offload
      end subroutine dosomecalc

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,388 Views

You might find better vectorization opportunities by changing from one 3D XYZ storage format to three 1D formats.

Logically this:

!--- loop over all cells
do i = 0,listvar%ncellT-1
  do k=1,26
    !---determine this cell
    cell_neigh = listvar%cv(k,i)%cnum
    if(cell_neigh.gt.i)then
      !---minimum image
      minx =listvar%cv(k,i)%min_x
      miny =listvar%cv(k,i)%min_y
      minz =listvar%cv(k,i)%min_z
      !--- number of particles in cell i
      cell1_num = listvar%cl(i)%num
      !--- loop over particles in each cell
      do j = 1, cell1_num
        x1 = listvar%cl(i)%loc_r_X(j)  ! independent 1D array of Xs
        y1 = listvar%cl(i)%loc_r_Y(j)  ! independent 1D array of Ys
        z1 = listvar%cl(i)%loc_r_Z(j)  ! independent 1D array of Zs
        !--- number of particles in cell_neigh  
        cell2_num = listvar%cl(cell_neigh)%num
        do l= 1, cell2_num
          x2 = listvar%cl(cell_neigh)%loc_r_X(l)
          y2 = listvar%cl(cell_neigh)%loc_r_Y(l)
          z2 = listvar%cl(cell_neigh)%loc_r_Z(l)
          !--- obtain displacements
          !--- apply minimum image here
          dx = x2-x1-minx
          dy = y2-y1-miny
          dz = z2-z1-minz
          dr2 = dx*dx+dy*dy+dz*dz
          if(dr2.lt.param%rcut2)then
            dr2i  = 1.0d0/dr2
            dr6i  = dr2i*dr2i*dr2i
            dr12i = dr6i*dr6i
            energy = energy + dr12i - dr6i
          endif
        enddo
      enddo         
    endif
  enddo
enddo

However, the conditional if test may interfere with vectorization. Consider changing

          if(dr2.lt.param%rcut2)then
            dr2i  = 1.0d0/dr2
            dr6i  = dr2i*dr2i*dr2i
            dr12i = dr6i*dr6i
            energy = energy + dr12i - dr6i
          endif

to

          dr2i  = 1.0d0/dr2
          dr6i  = dr2i*dr2i*dr2i
          dr12i = dr6i*dr6i - dr6i
          if(dr2.lt.param%rcut2) energy = energy + dr12i

The optimization could then perform a masked move. Although you may be "needlessly" executing 6 operations, this can be done as vector, which may reduce your loop trip count by a factor of 8 or 16.

Before you make the edit, verify that the compiler wasn't smart enough to figure this out for you.

Jim Dempsey

0 Kudos
conor_p_
Beginner
1,388 Views

If I took the allocatable arrays out of the data type and just have them declared as global allocatable array, can I pass them to the MIC then? Does the MIC only only static arrays to be passed to it?

0 Kudos
Kevin_D_Intel
Employee
1,388 Views

You can use global allocatable arrays. Arrays need not be only static. If you have not already, have a peek at the product samples under: C:\Program Files (x86)\Intel\Composer XE\Samples\en_US\Fortran\mic_samples.zip

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,388 Views

As Kevin stated earlier derived types (user defined types) with allocated components (IOW not POD/Plane Old Data) cannot be passed with an offload.

If these data objects live preponderantly inside the MIC, then perform the allocations inside the MIC. You would then need to construct a means to offload out these compound objects. This typically involves packing the contents into a contiguous blob, then transferring that. Not unlike what you would do to checkpoint the data to a file. In this case the "file" is a temporary array used to import/export the user defined type. It should be relatively easy to write the member functions to do this (at least easy after you have done it once).

Jim Dempsey

0 Kudos
Reply