<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Data alignment and vectorization question in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756514#M12001</link>
    <description>Have you allowed the compiler to generate vectorization reports? There may be useful information in those reports.&lt;BR /&gt;&lt;BR /&gt;&lt;I&gt;&amp;gt; Obviously the access pattern for all arrays is perfectly sequential.&lt;/I&gt;&lt;BR /&gt;&lt;BR /&gt;The vectorizer may have difficulty seeing that statement as applicable to x_in, when it sees the second index taking the values iy - 1, iy and iy + 1 in the loop.&lt;BR /&gt;&lt;BR /&gt;&lt;I&gt;&amp;gt; I assume it's an alignment issue.&lt;BR /&gt;&lt;BR /&gt;&lt;/I&gt;That is possible. I remember reading somewhere that later releases of CPUs were tweaked to make the penalty for unaligned access less significant. The penalty is quite high on IA64 (the CPU versions that I have tried, at least).&lt;BR /&gt;&lt;BR /&gt;Please state your OS version, the host and target CPUs, and the compiler flags in effect. I am curious about your questions.</description>
    <pubDate>Fri, 22 Jul 2011 11:07:33 GMT</pubDate>
    <dc:creator>mecej4</dc:creator>
    <dc:date>2011-07-22T11:07:33Z</dc:date>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756513#M12000</link>
      <description>The following code example (kernel of a Red-Black SOR solver) is partially vectorized by ifort 11.1.056. While the arithmetic is done with true SIMD instructions (mulpd, subpd), the loads are still single operand (movsd, movhpd).&lt;BR /&gt;&lt;BR /&gt;Obviously the access pattern for all arrays is perfectly sequential. That should allow vectorization for the loads too.&lt;BR /&gt;&lt;BR /&gt;What could be done to get fully vectorized execution? I assume it's an alignment issue. For the given example, even array copies may be acceptable, as the data will be reused many times (hundreds of iterations). Thanks for any insight into this matter.&lt;BR /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;PRE&gt;[fortran]subroutine rb_row_a(nx, ny, iy, sps, spe, x_in, x_out, a, b, omega, errff)

  implicit none

  integer, intent(in) :: nx
  integer, intent(in) :: ny
  integer, intent(in) :: iy
  integer, intent(in) :: sps
  integer, intent(in) :: spe
  real, dimension(    0:nx,0:ny), intent(in)    :: x_in
  real, dimension(    0:nx,0:ny), intent(inout) :: x_out
  real, dimension(2:5,0:nx,0:ny), intent(in)    :: a
  real, dimension(    0:nx,0:ny), intent(in)    :: b
  real,                           intent(in)    :: omega
  real,                           intent(inout) :: errff

  integer  :: ix
  real     :: c
  real     :: xr
  real     :: xx
  real,    dimension(sps:spe) :: errs

!dec$ ivdep
  do ix = sps, spe
    xx = x_out(ix,iy)
    xr = 1.0 / xx
    c = b(ix,iy) - xx                &amp;amp;
      - a(2,ix,iy) * x_in(ix  ,iy-1) &amp;amp;
      - a(3,ix,iy) * x_in(ix-1,iy  ) &amp;amp;
      - a(4,ix,iy) * x_in(ix  ,iy  ) &amp;amp;
      - a(5,ix,iy) * x_in(ix  ,iy+1)
    c = omega * c
    errs(ix) = abs(c * xr)
    x_out(ix,iy) = xx + c
  end do  
  errff=max(maxval(errs), errff)

  return
end
[/fortran]&lt;/PRE&gt;</description>
      <pubDate>Fri, 22 Jul 2011 09:52:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756513#M12000</guid>
      <dc:creator>mriedman</dc:creator>
      <dc:date>2011-07-22T09:52:10Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756514#M12001</link>
      <description>Have you allowed the compiler to generate vectorization reports? There may be useful information in those reports.&lt;BR /&gt;&lt;BR /&gt;&lt;I&gt;&amp;gt; Obviously the access pattern for all arrays is perfectly sequential.&lt;/I&gt;&lt;BR /&gt;&lt;BR /&gt;The vectorizer may have difficulty seeing that statement as applicable to x_in, when it sees the second index taking the values iy - 1, iy and iy + 1 in the loop.&lt;BR /&gt;&lt;BR /&gt;&lt;I&gt;&amp;gt; I assume it's an alignment issue.&lt;BR /&gt;&lt;BR /&gt;&lt;/I&gt;That is possible. I remember reading somewhere that later releases of CPUs were tweaked to make the penalty for unaligned access less significant. The penalty is quite high on IA64 (the CPU versions that I have tried, at least).&lt;BR /&gt;&lt;BR /&gt;Please state your OS version, the host and target CPUs, and the compiler flags in effect. I am curious about your questions.</description>
      <pubDate>Fri, 22 Jul 2011 11:07:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756514#M12001</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2011-07-22T11:07:33Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756515#M12002</link>
      <description>Host and target both run RH5 on Harpertown CPUs. Compiler flags and output:&lt;BR /&gt;&lt;BR /&gt;&amp;gt; ifort -openmp -r8 -O3 -axT -g -S -vec-report rb.f90&lt;BR /&gt;&lt;BR /&gt;rb.f90(32): (col. 5) remark: LOOP WAS VECTORIZED.&lt;BR /&gt;rb.f90(36): (col. 13) remark: PARTIAL LOOP WAS VECTORIZED.&lt;BR /&gt;&lt;BR /&gt;So the compiler does consider the loop vectorized even if the loads are not.&lt;BR /&gt;&lt;BR /&gt;The loop did not vectorize at all until I moved the max() operation out of the loop, which certainly costs extra memory accesses to the array errs. Apparently reductions generally inhibit vectorization.</description>
      <pubDate>Fri, 22 Jul 2011 11:25:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756515#M12002</guid>
      <dc:creator>mriedman</dc:creator>
      <dc:date>2011-07-22T11:25:54Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756516#M12003</link>
      <description>If you want code specifically targeted to HTN (Harpertown), you will need -xSSE4.1. You asked for code paths targeted to Woodcrest and pre-SSE2 CPUs, so the compiler is correct in splitting the unaligned vectorized loads, as the two 64-bit loads would be faster. In practice, I have found the split unaligned loads faster even on the current 6-core CPUs. You may see an advantage for movups on Nehalem or Istanbul, when cache locality is high.&lt;BR /&gt;Your chances of vectorizing the max() might be better if you didn't write in an explicit inversion; as you have promoted your data to double precision, vector divide would be faster anyway. You could experiment with !dir$ vector always, which overrides the compiler's cost analysis.&lt;BR /&gt;If you need the IVDEP for vectorization, you should check why that is so.</description>
      <pubDate>Fri, 22 Jul 2011 15:12:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756516#M12003</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-07-22T15:12:18Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756517#M12004</link>
      <description>Version 12.0 says "insufficient work" to vectorize the second loop, and I could not find a way to convince it to do so. An update later this year will vectorize it.</description>
      <pubDate>Fri, 22 Jul 2011 15:21:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756517#M12004</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2011-07-22T15:21:43Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756518#M12005</link>
      <description>Can you experiment with the following? (x_in needs the TARGET attribute for the pointer assignments to be legal.)&lt;BR /&gt;&lt;BR /&gt;real, pointer :: x_in_iy_m1(:), x_in_m1_iy(:), x_in_iy(:), x_in_iy_p1(:)&lt;BR /&gt;x_in_iy_m1 =&amp;gt; x_in(sps:spe,iy-1)&lt;BR /&gt;x_in_m1_iy =&amp;gt; x_in(sps-1:spe-1,iy)&lt;BR /&gt;x_in_iy =&amp;gt; x_in(sps:spe,iy)&lt;BR /&gt;x_in_iy_p1 =&amp;gt; x_in(sps:spe,iy+1)&lt;BR /&gt;!dec$ ivdep&lt;BR /&gt;do ix=1,spe-sps+1&lt;BR /&gt; xx = x_out(ix+sps-1, iy)&lt;BR /&gt;...&lt;BR /&gt; - a(2,ix+sps-1,iy) * x_in_iy_m1(ix) &amp;amp;&lt;BR /&gt; - a(3,ix+sps-1,iy) * x_in_m1_iy(ix) &amp;amp;&lt;BR /&gt; - a(4,ix+sps-1,iy) * x_in_iy(ix) &amp;amp;&lt;BR /&gt; - a(5,ix+sps-1,iy) * x_in_iy_p1(ix)&lt;BR /&gt;...&lt;BR /&gt; x_out(ix+sps-1,iy) = ...&lt;BR /&gt;&lt;BR /&gt;If you do not like using the pointers, then construct a subroutine that takes the array slice references as arguments (1D array slice references into the 2D array). There should be little overhead in constructing the array descriptor. Hopefully the slice is not copied, but verify this, as you cannot always tell whether IVF will make a copy of the array slice.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey</description>
      <pubDate>Fri, 22 Jul 2011 17:17:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756518#M12005</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-07-22T17:17:07Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756519#M12006</link>
      <description>It's the assignment to errff that is not vectorized here. The main loop is vectorized.</description>
      <pubDate>Fri, 22 Jul 2011 18:15:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756519#M12006</guid>
      <dc:creator>Steven_L_Intel1</dc:creator>
      <dc:date>2011-07-22T18:15:31Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756520#M12007</link>
      <description>Then (at least for the Release build) remove the local&lt;BR /&gt;&lt;BR /&gt; real, dimension(sps:spe) :: errs&lt;BR /&gt;&lt;BR /&gt;replace&lt;BR /&gt;&lt;BR /&gt;&lt;P&gt; errs(ix) = abs(c * xr)&lt;/P&gt;&lt;P&gt;with&lt;BR /&gt; errff=max(abs(c * xr), errff)&lt;BR /&gt;&lt;BR /&gt;and remove the&lt;BR /&gt; errff=max(maxval(errs), errff)&lt;BR /&gt;&lt;BR /&gt;If you need the errs array for debugging purposes, then make that code conditional so it is only built in debug mode.&lt;BR /&gt;&lt;BR /&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Sun, 24 Jul 2011 14:22:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756520#M12007</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2011-07-24T14:22:50Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756521#M12008</link>
      <description>&lt;PRE&gt;[fortran]...
real, dimension(0:nx,2:5,0:ny), intent(in)    :: a
...

      do ix = sps, spe
      !dir$ distribute point
        xx = x_out(ix,iy)
        c = b(ix,iy) - xx                &amp;amp;
          - a(ix,2,iy) * x_in(ix  ,iy-1) &amp;amp;
          - a(ix,3,iy) * x_in(ix-1,iy  ) &amp;amp;
          - a(ix,4,iy) * x_in(ix  ,iy  ) &amp;amp;
          - a(ix,5,iy) * x_in(ix  ,iy+1)
        c = omega * c
        errs(ix) = abs(c / xx)
        x_out(ix,iy) = xx + c
        errff = max(errs(ix), errff)
      end do
[/fortran]&lt;/PRE&gt;&lt;BR /&gt;If the original question was how to persuade ifort to vectorize without splitting the loop, it appears you need the DISTRIBUTE POINT directive. The single vectorized loop ought to perform better than the split loops.&lt;BR /&gt;When you see multiple PARTIAL LOOP VECTORIZED diagnostics (in the absence of DISTRIBUTE POINT to prevent splitting), in a case such as this where there is a partial vector loop for each array assignment, you can be assured that all partials are vectorized even without looking at the generated code.&lt;BR /&gt;If you wish to avoid scalar loads for the a() elements, you must swap the subscript order, as above, and compile for Nehalem (-xSSE4.2) (and hope that no Nehalem-only instructions creep in). Clearly, the compiler is aware that the split unaligned loads are faster on Harpertown. You could take advantage of that when running on Westmere by specifying the SSE4.1 (Harpertown) or SSE2 optimization.</description>
      <pubDate>Sun, 24 Jul 2011 16:48:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756521#M12008</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-07-24T16:48:41Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756522#M12009</link>
      <description>Thanks a lot, Tim and Jim,&lt;BR /&gt;&lt;BR /&gt;indeed I can merge the two loops and discard the intermediate &lt;EM&gt;errs&lt;/EM&gt; array if the explicit inversion is removed. That gets me cleaner and slightly faster (~5%) code. The divide and the max() operations are then vectorized as they should be.&lt;BR /&gt;The scalar loads remain even if I compile with -xSSE4.1; this option just changes the ordering of instructions, it does not use any new instructions. Assuming that there is no big penalty for scalar loads, this solution is good enough for now.&lt;BR /&gt;Neither the &lt;EM&gt;ivdep&lt;/EM&gt; directive nor the &lt;EM&gt;vector&lt;/EM&gt; directive is needed. I had to introduce &lt;EM&gt;ivdep&lt;/EM&gt; in the original code version, where the red and black data (x_in, x_out) are in the same array, just separated by an extra dimension.&lt;BR /&gt;&lt;BR /&gt;I'm going to make further tests with ifort 12 and AVX. Shouldn't vectorized loads be desirable with AVX? Does anyone have experience with this?</description>
      <pubDate>Mon, 25 Jul 2011 13:39:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756522#M12009</guid>
      <dc:creator>mriedman</dc:creator>
      <dc:date>2011-07-25T13:39:08Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756523#M12010</link>
      <description>I saw the same: the compiler chose not to use movups for unaligned loads under -xSSE4.1. As I said, the unaligned load is expected to take longer than the split scalar loads on an HTN (Harpertown) CPU (and, in my experience, on WSM (Westmere) also).&lt;BR /&gt;You can try the AVX options simply to generate the .s file, without attempting to run. As expected, the compiler uses avx-128 unaligned pairs of loads for my modified version of your source code (1 avx-128 unaligned store pair and 1 avx-256 aligned store). Only data which resides in L1 cache and works with avx-256 aligned loads could be expected to benefit from 256-bit loads on Sandy Bridge.</description>
      <pubDate>Mon, 25 Jul 2011 14:58:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756523#M12010</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-07-25T14:58:48Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756524#M12011</link>
      <description>The AVX assembly code looks really weird with the original dimensioning of array a. Lots of operations are needed to pack things into the 256-bit registers. But at least the loop is unrolled by 4 instead of 2, and the ymm registers are really used.&lt;BR /&gt;The assembly of your modified version with swapped dimensions looks much more straightforward.&lt;BR /&gt;&lt;BR /&gt;My final question is: what can a programmer do to get aligned instead of unaligned loads? How does the compiler distinguish data accesses and generate the different load types? Would it help to copy data into a static array? Or is this ultimately something that can't be controlled from Fortran source?</description>
      <pubDate>Mon, 25 Jul 2011 16:13:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756524#M12011</guid>
      <dc:creator>mriedman</dc:creator>
      <dc:date>2011-07-25T16:13:04Z</dc:date>
    </item>
    <item>
      <title>Data alignment and vectorization question</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756525#M12012</link>
      <description>As your code explicitly accesses array sections of origin differing by 1, at least one of those has to be misaligned. Since the alignment for stored arrays is more important, the compiler builds in a remainder loop to adjust one of the stored arrays for alignment, thus the use of avx-256 alignment for one array. Among the ways to allow the compiler to see the relative alignment of arrays would be COMMON, or, in principle, MODULE arrays.&lt;BR /&gt;In a situation where you use the same array repeatedly in an inner loop, and can copy it to an aligned stride 1 array in the outer loop, that can easily improve efficiency. If you use a local array, the compiler should attempt to make it aligned.</description>
      <pubDate>Mon, 25 Jul 2011 17:24:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/Data-alignment-and-vectorization-question/m-p/756525#M12012</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-07-25T17:24:03Z</dc:date>
    </item>
  </channel>
</rss>

