<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic double precision vs. single precision and vector alignment on MIC in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/double-precision-vs-single-precision-and-vector-alignment-on-MIC/m-p/1008699#M33103</link>
    <description>&lt;P&gt;Hello everyone, I have a subroutine that serves as the force calculation in an n-body simulation shown below. There a couple of arrays that I will describe which will hopefully make the subroutine a little more clear.&lt;/P&gt;

&lt;P&gt;1. array position. it contains positions of particles and atom types. it is of length number of particles np&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;type aos
    real*4 :: x,y,z
    integer :: type
end type aos

type(aos) :: position(np) &lt;/PRE&gt;

&lt;P&gt;2. array neigh. it contains a list of pointers to particles that are within a certain distance of a given particle. it is of type integer.&lt;/P&gt;

&lt;P&gt;3. array nnstart and nnfinish. nnstart(i) and nnfinish(i) gives the starting and ending locations for information for particle i in array neigh. If I am particle i, my neighbors are contained in the indices between nnstart(i) and nnfinish(i) in neigh. these are also integers&lt;/P&gt;

&lt;P&gt;4. array q. array of real*4s. charges of given atom&lt;/P&gt;

&lt;P&gt;5. lj1,lj2,lj3,lj4. these are all arrays of real*4s. they provide parameters for the atomic interactions.&lt;/P&gt;

&lt;P&gt;6. array ff. this is the array that stores the forces for each particle. it is of the following type&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;type ff_data
     real*4 :: x,y,z
end type ff_data&lt;/PRE&gt;

&lt;P&gt;all these arrays I have as globals and are defined for the mic as follows&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!dir$ attributes offload:mic::position&lt;/PRE&gt;

&lt;P&gt;My code was compiled using the -align array64byte flag. However, the inclusion of the !dir$ vector aligned statement on line 43 to assist vectorization of the inner loop produces a seg fault. I am very curious as to why, despite being compiled using the align array64byte, my arrays are not aligned. Any suggestions?&lt;/P&gt;

&lt;P&gt;My second question is that I have ran this subroutine (shown below using single precision) using both full single precision and full double precision. In the single precision case, all array components are real*4 vs double precision. However, I only see about a 20% speed up. Now given that my bandwidth should double for SP vs. DP, and that my vector registers should be doing twice as many calculations, I expect a factor of two speed up. Is there something wrong with this reasoning, or is there something going on with this code?&lt;/P&gt;

&lt;P&gt;Finally, does anyone have any suggestions for my current nearest integer function on line 54? I was given it as a suggestion on this forum, but don't know if there are faster alternatives.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;subroutine lj_cut_coul_dsf_nonewton_SP(step,printflag)
    implicit none
    real*4 :: force,forcelj,forcecoul
    real*4 :: x1,y1,z1,x2,y2,z2
    real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i
    double precision ::  ffx,ffy,ffz
    real*4 :: fudge_factor
    real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
    real*4 :: boxdx,boxdy,boxdz
    real*4 :: scalelj, scalecoul
    integer :: i,j,l,step
    integer :: itype,jtype,neigh
    integer :: tid,num
    integer :: offset
    integer :: printflag
    integer :: T1,T2,clock_rate,clock_max

    call system_clock(T1,clock_rate,clock_max)

    !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&amp;amp;
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&amp;amp;
    !dir$ in(nlist: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ in(nnfirst: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ in(nnlast: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))

    !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(private),&amp;amp;
    !$omp&amp;amp; shared(box,ibox,rcut2,cut_coulsq,alpha,EWALD_P,MY_PIS,np,numAtomType),&amp;amp;
    !$omp&amp;amp; shared(A1,A2,A3,A4,A5),&amp;amp;
    !$omp&amp;amp; shared(e_shift,f_shift),&amp;amp;
    !$omp&amp;amp; shared(position,ff,nlist,nnfirst,nnlast,q,lj1,lj2,lj3,lj4)
    do i = 1 ,np
&amp;nbsp;      !---load data for particle i: positions, type, and charge
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z; itype = position(i)%type
       qtmp = q(i)

       ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0

       !dir$ vector aligned
       !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
&amp;nbsp;      !... loop over neighbors for particle i
       do j= nnfirst(i),nnlast(i)
          
&amp;nbsp;         !---load neighboring particle index
          neigh = nlist(j)

&amp;nbsp;         !---calculate distance between particles using a vectorizable nearest integer function
          dx = x1-position(neigh)%x; dy = y1-position(neigh)%y; dz = z1-position(neigh)%z
          boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
          boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
          boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
          boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)
          dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
          dr2 = dx*dx + dy*dy + dz*dz

          scalelj = 0.0d0
          if(dr2.lt.rcut2)scalelj = 1.0d0

          dr2i = 1.0d0/dr2
          dr6i = dr2i*dr2i*dr2i
          
          offset  = numAtomType*(itype-1)+jtype
          forcelj = scalelj*dr6i*(lj1(offset)*dr6i-lj2(offset))
          potential = potential + scalelj*dr6i*(dr6i*lj3(offset)-lj4(offset))      


          scalecoul = 0.0d0
          if(dr2.lt.cut_coulsq)scalecoul = 1.0d0
                 
          r = sqrt(dr2)
          prefactor =  scalecoul*qtmp*q(neigh)/r
          erfcd = exp(-alpha*alpha*r*r)
          t = 1.0 / (1.0 + EWALD_P*alpha*r)
          erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
          forcecoul = prefactor * (erfcc/r + 2.0*alpha/MY_PIS * erfcd +&amp;amp;
               r*f_shift) * r
          
          e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)             
          
          
          force   = (forcecoul+forcelj)*dr2i


          ffx = ffx + dx*force
          ffy = ffy + dy*force
          ffz = ffz + dz*force
          
       enddo
       ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
    enddo
    !$omp end parallel do 
    !dir$ end offload

    potential = 0.50d0*potential
    e_coul = 0.50d0*e_coul

    call system_clock(T2,clock_rate,clock_max)

    if(printflag.eq.1)then
       print*,'elapsed time w/o newton SP',real(T2-T1)/real(clock_rate)
       print*,'calculated potentials:',potential*reps, e_coul*reps
    endif
    
  end subroutine lj_cut_coul_dsf_nonewton_SP&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Thu, 12 Feb 2015 21:35:42 GMT</pubDate>
    <dc:creator>conor_p_</dc:creator>
    <dc:date>2015-02-12T21:35:42Z</dc:date>
    <item>
      <title>double precision vs. single precision and vector alignment on MIC</title>
      <link>https://community.intel.com/t5/Software-Archive/double-precision-vs-single-precision-and-vector-alignment-on-MIC/m-p/1008699#M33103</link>
      <description>&lt;P&gt;Hello everyone, I have a subroutine that serves as the force calculation in an n-body simulation shown below. There a couple of arrays that I will describe which will hopefully make the subroutine a little more clear.&lt;/P&gt;

&lt;P&gt;1. array position. it contains positions of particles and atom types. it is of length number of particles np&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;type aos
    real*4 :: x,y,z
    integer :: type
end type aos

type(aos) :: position(np) &lt;/PRE&gt;

&lt;P&gt;2. array neigh. it contains a list of pointers to particles that are within a certain distance of a given particle. it is of type integer.&lt;/P&gt;

&lt;P&gt;3. array nnstart and nnfinish. nnstart(i) and nnfinish(i) gives the starting and ending locations for information for particle i in array neigh. If I am particle i, my neighbors are contained in the indices between nnstart(i) and nnfinish(i) in neigh. these are also integers&lt;/P&gt;

&lt;P&gt;4. array q. array of real*4s. charges of given atom&lt;/P&gt;

&lt;P&gt;5. lj1,lj2,lj3,lj4. these are all arrays of real*4s. they provide parameters for the atomic interactions.&lt;/P&gt;

&lt;P&gt;6. array ff. this is the array that stores the forces for each particle. it is of the following type&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;type ff_data
     real*4 :: x,y,z
end type ff_data&lt;/PRE&gt;

&lt;P&gt;all these arrays I have as globals and are defined for the mic as follows&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;!dir$ attributes offload:mic::position&lt;/PRE&gt;

&lt;P&gt;My code was compiled using the -align array64byte flag. However, the inclusion of the !dir$ vector aligned statement on line 43 to assist vectorization of the inner loop produces a seg fault. I am very curious as to why, despite being compiled using the align array64byte, my arrays are not aligned. Any suggestions?&lt;/P&gt;

&lt;P&gt;My second question is that I have ran this subroutine (shown below using single precision) using both full single precision and full double precision. In the single precision case, all array components are real*4 vs double precision. However, I only see about a 20% speed up. Now given that my bandwidth should double for SP vs. DP, and that my vector registers should be doing twice as many calculations, I expect a factor of two speed up. Is there something wrong with this reasoning, or is there something going on with this code?&lt;/P&gt;

&lt;P&gt;Finally, does anyone have any suggestions for my current nearest integer function on line 54? I was given it as a suggestion on this forum, but don't know if there are faster alternatives.&lt;/P&gt;

&lt;PRE class="brush:fortran;"&gt;subroutine lj_cut_coul_dsf_nonewton_SP(step,printflag)
    implicit none
    real*4 :: force,forcelj,forcecoul
    real*4 :: x1,y1,z1,x2,y2,z2
    real*4 :: dx,dy,dz,dr,dr2,dr2i,dr6i,dr12i
    double precision ::  ffx,ffy,ffz
    real*4 :: fudge_factor
    real*4 :: qtmp,r,prefactor,erfcc,erfcd,t
    real*4 :: boxdx,boxdy,boxdz
    real*4 :: scalelj, scalecoul
    integer :: i,j,l,step
    integer :: itype,jtype,neigh
    integer :: tid,num
    integer :: offset
    integer :: printflag
    integer :: T1,T2,clock_rate,clock_max

    call system_clock(T1,clock_rate,clock_max)

    !dir$ offload begin target(mic:0) in(position: alloc_if(.false.),free_if(.false.)),&amp;amp;
    !dir$ inout(ff: alloc_if(.false.), free_if(.false.)),&amp;amp;
    !dir$ in(nlist: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ in(nnfirst: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ in(nnlast: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ in(q: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj1: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj2: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj3: alloc_if(.false.) free_if(.false.)),&amp;amp;
    !dir$ nocopy(lj4: alloc_if(.false.) free_if(.false.))

    !$omp parallel do schedule(dynamic) reduction(+:potential,e_coul,ffx,ffy,ffz) default(private),&amp;amp;
    !$omp&amp;amp; shared(box,ibox,rcut2,cut_coulsq,alpha,EWALD_P,MY_PIS,np,numAtomType),&amp;amp;
    !$omp&amp;amp; shared(A1,A2,A3,A4,A5),&amp;amp;
    !$omp&amp;amp; shared(e_shift,f_shift),&amp;amp;
    !$omp&amp;amp; shared(position,ff,nlist,nnfirst,nnlast,q,lj1,lj2,lj3,lj4)
    do i = 1 ,np
&amp;nbsp;      !---load data for particle i: positions, type, and charge
       x1 = position(i)%x; y1 = position(i)%y; z1 = position(i)%z; itype = position(i)%type
       qtmp = q(i)

       ffx = 0.0d0; ffy = 0.0d0; ffz = 0.0d0

       !dir$ vector aligned
       !dir$ simd reduction(+:potential,e_coul,ffx,ffy,ffz)
&amp;nbsp;      !... loop over neighbors for particle i
       do j= nnfirst(i),nnlast(i)
          
&amp;nbsp;         !---load neighboring particle index
          neigh = nlist(j)

&amp;nbsp;         !---calculate distance between particles using a vectorizable nearest integer function
          dx = x1-position(neigh)%x; dy = y1-position(neigh)%y; dz = z1-position(neigh)%z
          boxdx = dx*ibox; boxdy = dy*ibox; boxdz = dz*ibox
          boxdx = (boxdx+sign(1/(epsilon(boxdx)),boxdx)) -sign(1/epsilon(boxdx),dx)
          boxdy = (boxdy+sign(1/(epsilon(boxdy)),boxdy)) -sign(1/epsilon(boxdy),dy)
          boxdz = (boxdz+sign(1/(epsilon(boxdz)),boxdz)) -sign(1/epsilon(boxdz),dz)
          dx = dx-box*boxdx; dy = dy-box*boxdy; dz = dz-box*boxdz
          dr2 = dx*dx + dy*dy + dz*dz

          scalelj = 0.0d0
          if(dr2.lt.rcut2)scalelj = 1.0d0

          dr2i = 1.0d0/dr2
          dr6i = dr2i*dr2i*dr2i
          
          offset  = numAtomType*(itype-1)+jtype
          forcelj = scalelj*dr6i*(lj1(offset)*dr6i-lj2(offset))
          potential = potential + scalelj*dr6i*(dr6i*lj3(offset)-lj4(offset))      


          scalecoul = 0.0d0
          if(dr2.lt.cut_coulsq)scalecoul = 1.0d0
                 
          r = sqrt(dr2)
          prefactor =  scalecoul*qtmp*q(neigh)/r
          erfcd = exp(-alpha*alpha*r*r)
          t = 1.0 / (1.0 + EWALD_P*alpha*r)
          erfcc = t * (A1+t*(A2+t*(A3+t*(A4+t*A5)))) * erfcd
          forcecoul = prefactor * (erfcc/r + 2.0*alpha/MY_PIS * erfcd +&amp;amp;
               r*f_shift) * r
          
          e_coul = e_coul + prefactor*(erfcc-r*e_shift-dr2*f_shift)             
          
          
          force   = (forcecoul+forcelj)*dr2i


          ffx = ffx + dx*force
          ffy = ffy + dy*force
          ffz = ffz + dz*force
          
       enddo
       ff(i)%x = ffx; ff(i)%y = ffy; ff(i)%z = ffz
    enddo
    !$omp end parallel do 
    !dir$ end offload

    potential = 0.50d0*potential
    e_coul = 0.50d0*e_coul

    call system_clock(T2,clock_rate,clock_max)

    if(printflag.eq.1)then
       print*,'elapsed time w/o newton SP',real(T2-T1)/real(clock_rate)
       print*,'calculated potentials:',potential*reps, e_coul*reps
    endif
    
  end subroutine lj_cut_coul_dsf_nonewton_SP&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 12 Feb 2015 21:35:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/double-precision-vs-single-precision-and-vector-alignment-on-MIC/m-p/1008699#M33103</guid>
      <dc:creator>conor_p_</dc:creator>
      <dc:date>2015-02-12T21:35:42Z</dc:date>
    </item>
    <item>
      <title>This is a duplicate of https:</title>
      <link>https://community.intel.com/t5/Software-Archive/double-precision-vs-single-precision-and-vector-alignment-on-MIC/m-p/1008700#M33104</link>
      <description>&lt;P&gt;This is a duplicate of&amp;nbsp;https://software.intel.com/en-us/forums/topic/541463. Please post all follow-ups to that thread.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Feb 2015 08:30:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/double-precision-vs-single-precision-and-vector-alignment-on-MIC/m-p/1008700#M33104</guid>
      <dc:creator>Frances_R_Intel</dc:creator>
      <dc:date>2015-02-18T08:30:41Z</dc:date>
    </item>
  </channel>
</rss>

