<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Sorry for the late reply,  in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766169#M19</link>
    <description>Sorry for the late reply, and thank you both for your answers.</description>
    <pubDate>Mon, 17 Sep 2012 14:47:49 GMT</pubDate>
    <dc:creator>alzamos</dc:creator>
    <dc:date>2012-09-17T14:47:49Z</dc:date>
    <item>
      <title>Compiler decides to use SSE instead of AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766166#M16</link>
      <description>&lt;P&gt;Hello!&lt;/P&gt;
&lt;P&gt;I'm compiling some Fortran code with the -xAVX option (ifort version 12.1.0 20111011). Depending on whether the code is standalone or part of a subroutine, the compiler vectorizes it differently, and I was wondering why. The code is as follows:&lt;/P&gt;
&lt;P&gt;[fortran]program example&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; implicit none&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; double precision, dimension(4, 180000, 4) :: qold, q, res&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; double precision, dimension(4, 180000) :: adt&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; double precision, dimension(4) :: adti, del&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; integer :: i, j, diff, acc&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; diff = 180000&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; do j = 1, diff&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; adti(:) = 1.0 / adt(:,j) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;! line 10&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; do i = 1, 4&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; del(:) = adti(:) * res(:,j,i) &amp;nbsp; ! line 12&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; q(:,j,i) = qold(:,j,i) - del(:) ! line 13&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; res(:,j,i) = 0.0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;! line 14&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; end do&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; end do&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; acc = sum(q)&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; print *, acc&lt;/P&gt;
&lt;P&gt;end program [/fortran]&lt;/P&gt;
&lt;P&gt;Using -vec-report, the compiler says that the loops were indeed vectorized. Furthermore, looking at the assembly, all the relevant instructions use the ymm registers.&lt;/P&gt;
&lt;P&gt;However, if I wrap the code in a subroutine that takes qold, q, res, adt and diff as arguments and put it in a module, things go differently. This time -vec-report tells me that only lines 10 and 12 were vectorized, whereas for lines 13 and 14 vectorization is "possible but seems inefficient". If I force vectorization with the !DIR$ SIMD directive before lines 13 and 14, it then reports that the SIMD loop was vectorized. However, looking at the assembly, it seems to be using SSE instructions instead of AVX instructions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN style="line-height: normal;"&gt;For line 13,&amp;nbsp;&lt;/SPAN&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;SPAN style="line-height: normal;"&gt;vsubpd &amp;nbsp; &amp;nbsp;generated_module_mp_update_kernel_caller_$DEL.0.2(%rip), %xmm6, %xmm7&amp;nbsp;&lt;/SPAN&gt;&lt;/LI&gt;
&lt;LI&gt;vsubpd &amp;nbsp; &amp;nbsp;16+generated_module_mp_update_kernel_caller_$DEL.0.2(%rip), %xmm8, %xmm9&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;LI&gt;For line 14,
&lt;UL&gt;
&lt;LI&gt;vxorpd &amp;nbsp; &amp;nbsp; &amp;nbsp;%xmm0, %xmm0, %xmm0 &amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;vmovupd &amp;nbsp; %xmm0, (%rbx,%r9) &amp;nbsp;&lt;/LI&gt;
&lt;LI&gt;vmovupd &amp;nbsp; %xmm0, 16(%rbx,%r9) &amp;nbsp;&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;
&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;SPAN style="line-height: 16px;"&gt;Because it issues two similar instructions for each line, offset by 16 bytes (128 bits, the size of an SSE register), it seems to be executing two SSE vector instructions instead of one AVX instruction.&lt;/SPAN&gt;&lt;SPAN style="line-height: 16px;"&gt;&lt;BR /&gt;&lt;/SPAN&gt;Does anyone know why this happens and how to change it?&lt;/P&gt;
&lt;P&gt;Kind regards,&lt;/P&gt;
&lt;P&gt;Alex&lt;/P&gt;</description>
      <pubDate>Thu, 16 Aug 2012 17:06:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766166#M16</guid>
      <dc:creator>alzamos</dc:creator>
      <dc:date>2012-08-16T17:06:14Z</dc:date>
    </item>
    <item>
      <title>Compiler decides to use SSE instead of AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766167#M17</link>
      <description>Those are AVX-128 instructions: the v prefix (vsubpd, vmovupd) marks the VEX encoding, here operating on 128-bit xmm registers. With AVX-128, the compiler can rely on the hardware handling misaligned data well enough that there is no need to peel iterations for alignment. The compiler may judge this strategy superior when it lacks the information to make anything better than a default guess about loop count or alignment, since a very long loop may be required before AVX-256 with peeling is preferred. If you are interested, you could try loop count and alignment directives to see whether they influence the choice between AVX-128 and AVX-256.&lt;BR /&gt;If you are interested in further esoterica, you might try to understand why CEAN notation in icc uses AVX-128 more frequently (if that happens in your cases).&lt;BR /&gt;Perhaps needless to say, GNU compilers frequently prefer AVX-128 in order to avoid extra code to deal with misalignment.</description>
      <pubDate>Sat, 18 Aug 2012 23:58:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766167#M17</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2012-08-18T23:58:42Z</dc:date>
    </item>
    <item>
      <title>Compiler decides to use SSE instead of AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766168#M18</link>
      <description>Hi Alex,&lt;BR /&gt;What the compiler does depends on the information available to it, which in turn depends on how the arguments are passed, whether caller and callee are in the same source file, etc. The compiler has to make conservative choices (with respect to performance as well as correctness) if it doesn't have full information.&lt;BR /&gt;You haven't given us all the details. Could you please include the source version with the module subroutine that does not vectorize? That will allow us to figure out whether the compiler could do better, or whether there's a more effective way to write the code. I think it will be more useful than me constructing an example that doesn't match yours.&lt;BR /&gt;&lt;BR /&gt;Regards,&lt;BR /&gt;Martyn</description>
      <pubDate>Tue, 21 Aug 2012 22:52:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766168#M18</guid>
      <dc:creator>Martyn_C_Intel</dc:creator>
      <dc:date>2012-08-21T22:52:25Z</dc:date>
    </item>
    <item>
      <title>Sorry for the late reply,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766169#M19</link>
      <description>Sorry for the late reply, and thank you both for your answers.
&lt;P&gt;&lt;/P&gt;
@TimP: Thanks for pointing me to the loop count and alignment directives. I'll take a look at them now and let you know if they work!
@Martyn: I'm fiddling with some very ugly generated code, produced by a source-to-source compiler, to see how to change it so that ifort will vectorize it. Below are the subroutine in question and the subroutine that calls it (both within the same source file).
(BS is a macro defined at the beginning of the file, usually set to 4.)
[fortran]
subroutine update_kernel_caller(qold, q, res, adt, rms, diff)
    implicit none
    integer  :: diff
    ! SoA view: BS consecutive elements, diff/BS blocks, 4 components
    real(kind=8), dimension(BS,diff/BS,4) :: qold, q, res
    real(kind=8), dimension(BS,*)  :: adt
    real(kind=8), dimension(1)       :: rms
    real(kind=8), dimension(BS)    :: adti, del
    real(kind=8), dimension(BS)    :: acc
    integer :: i,j
    do j=1,diff/BS
        adti(:) = 1.0 / adt(:,j)
        do i=1,4
            del(:) = adti(:) * res(:,j,i)
            q(:,j,i) = qold(:,j,i) - del(:)
            res(:,j,i) = 0.0
            ! accumulate the sum of squared updates into rms(1)
            acc(:) = del(:) * del(:)
            rms(1) = rms(1) + sum(acc)
        end do
    end do
end subroutine
[/fortran]
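
For reference, here is a rough sketch of how the loop count and alignment directives TimP mentioned might be attached to this kernel. It is untested against ifort 12.1 and rests on assumptions that are not in the original code: the name update_kernel_hinted, the dummy argument n and the local scalar s are made up, BS is fixed at 4 as a parameter, the array assignments are rewritten as an explicit loop over k so the directives have a DO loop to attach to, and every array argument is assumed to start on a 32-byte boundary (a plain allocate does not promise that, and the adt slice only keeps it if sliceStart is a multiple of 4). ASSUME_ALIGNED may not be accepted by older compiler versions; in that case drop that line. Other compilers treat the !DIR$ lines as comments.
[fortran]
subroutine update_kernel_hinted(qold, q, res, adt, rms, n)
    ! Hypothetical variant of update_kernel_caller with BS fixed at 4.
    implicit none
    integer, parameter :: bs = 4
    integer :: n
    real(kind=8), dimension(bs,n/bs,4) :: qold, q, res
    real(kind=8), dimension(bs,*) :: adt
    real(kind=8), dimension(1) :: rms
    real(kind=8), dimension(bs) :: adti, del
    real(kind=8) :: s
    integer :: i, j, k
    ! Assumption: the caller guarantees 32-byte alignment of every array.
    !DIR$ ASSUME_ALIGNED qold:32, q:32, res:32, adt:32
    s = 0.0d0
    do j = 1, n/bs
        adti(:) = 1.0d0 / adt(:,j)
        do i = 1, 4
            ! Tell the vectorizer the trip count and that accesses are aligned.
            !DIR$ LOOP COUNT (4)
            !DIR$ VECTOR ALIGNED
            do k = 1, bs
                del(k) = adti(k) * res(k,j,i)
                q(k,j,i) = qold(k,j,i) - del(k)
                res(k,j,i) = 0.0d0
                ! Local scalar reduction instead of updating rms(1) directly.
                s = s + del(k) * del(k)
            end do
        end do
    end do
    rms(1) = rms(1) + s
end subroutine
[/fortran]
Comparing the -vec-report output and the assembly with and without the directives should show whether they influence the choice between xmm and ymm registers.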

This is called by:
(The opDats originally contain data in AoS form; the MKL routine converts it to SoA and stores the result in the topDats. The system_clock calls are just for benchmarking.)
[fortran]
subroutine update_kernel(opDat1,opDat2,opDat3,opDat4,opDat5,sliceStart,sliceEnd)
    implicit none
    real(kind=8), dimension(0:*) :: opDat1
    real(kind=8), dimension(0:*) :: opDat2
    real(kind=8), dimension(0:*) :: opDat3
    real(kind=8), dimension(0:*) :: opDat4
    real(kind=8), dimension(0:*) :: opDat5
    integer(kind=4) :: sliceStart
    integer(kind=4) :: sliceEnd
    integer(kind=4) :: i1, diff
    integer(8) :: time1, time2, count_rate, count_max 
    real(kind=8), allocatable, dimension(:) :: topDat1, topDat2, topDat3
    diff = sliceEnd - sliceStart
    call system_clock(time1, count_rate, count_max)
    allocate (topDat1(diff*4))
    allocate (topDat2(diff*4))
    allocate (topDat3(diff*4))
    ! transpose each AoS slice (diff x 4, row-major) into SoA (4 x diff)
    call mkl_domatcopy('r', 't', diff, 4, 1.d0, opDat1(4*sliceStart:4*sliceEnd-1), 4, topDat1, diff)
    call mkl_domatcopy('r', 't', diff, 4, 1.d0, opDat2(4*sliceStart:4*sliceEnd-1), 4, topDat2, diff)
    call mkl_domatcopy('r', 't', diff, 4, 1.d0, opDat3(4*sliceStart:4*sliceEnd-1), 4, topDat3, diff)
    call system_clock(time2, count_rate, count_max)
    print *, "### not counted", time2 - time1
    call update_kernel_caller(topDat1, topDat2, topDat3, opDat4(sliceStart:sliceStart+diff-1), opDat5, diff)
    call system_clock(time1, count_rate, count_max)
    ! transpose the SoA results (4 x diff) back into the AoS arrays
    call mkl_domatcopy('r', 't', 4, diff, 1.d0, topDat1, diff, opDat1(4*sliceStart:4*sliceEnd-1), 4)
    call mkl_domatcopy('r', 't', 4, diff, 1.d0, topDat2, diff, opDat2(4*sliceStart:4*sliceEnd-1), 4)
    call mkl_domatcopy('r', 't', 4, diff, 1.d0, topDat3, diff, opDat3(4*sliceStart:4*sliceEnd-1), 4)
    deallocate (topDat1) 
    deallocate (topDat2)
    deallocate (topDat3)
    call system_clock(time2, count_rate, count_max)
    print *, "### not counted", time2 - time1
end subroutine
[/fortran]
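
To make the layout change concrete, the following standalone sketch reproduces the AoS to SoA step with the reshape and transpose intrinsics instead of mkl_domatcopy, so it runs without MKL. The program name, the element count n = 3 and the test values are made up for illustration; it is not part of the generated code.
[fortran]
program aos_to_soa
    implicit none
    integer, parameter :: n = 3      ! hypothetical number of elements
    real(kind=8) :: aos(4*n)         ! 4 values per element, interleaved
    real(kind=8) :: soa(n*4)         ! all first values, then all second values, ...
    integer :: k
    ! Fill with 0, 1, 2, ... so the reordering is easy to follow.
    aos = (/ (real(k, kind=8), k = 0, 4*n - 1) /)
    ! View aos as a 4 x n column-major matrix, transpose it to n x 4,
    ! then flatten again. This matches the row-major 't' copy above.
    soa = reshape(transpose(reshape(aos, (/ 4, n /))), (/ n*4 /))
    ! soa now holds the values in the order 0 4 8 1 5 9 2 6 10 3 7 11
    print *, soa
end program
[/fortran]
After this step each of the 4 components occupies one contiguous block of n values, which is what lets update_kernel_caller view the buffer as (BS, diff/BS, 4).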

I apologize for the horrible spacing. In the forum editor it looks fine, but somehow it becomes double-spaced when posted. I'll see if I can do something about it.
Regards,
Alex</description>
      <pubDate>Mon, 17 Sep 2012 14:47:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Compiler-decides-to-use-SSE-instead-of-AVX/m-p/766169#M19</guid>
      <dc:creator>alzamos</dc:creator>
      <dc:date>2012-09-17T14:47:49Z</dc:date>
    </item>
  </channel>
</rss>

