<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Padding does not help AVX in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Padding-does-not-help-AVX/m-p/923783#M3059</link>
    <description>&lt;P&gt;Hi all&lt;/P&gt;
&lt;P&gt;I have the following C function:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;void mass_ffc( double A[10][10], double x[3][2])&lt;BR /&gt;{&lt;BR /&gt; // Compute Jacobian of affine map from reference cell&lt;BR /&gt; const double J_00 = x[1][0] - x[0][0];&lt;BR /&gt;...&lt;BR /&gt; const double J_11 = x[2][1] - x[0][1];&lt;BR /&gt; &lt;BR /&gt; // Compute determinant of Jacobian&lt;BR /&gt; double detJ = J_00*J_11 - J_01*J_10;&lt;BR /&gt; const double det = fabs(detJ);&lt;BR /&gt; &lt;BR /&gt; // Array of quadrature weights.&lt;BR /&gt; const double W12[12] __attribute__((aligned(PADDING))) = { ....&amp;nbsp;};&lt;BR /&gt; &lt;BR /&gt; // Value of basis functions at quadrature points.&lt;BR /&gt; const double FE0[12][10] __attribute__((aligned(PADDING))) = \&lt;BR /&gt;{{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};&lt;/P&gt;
&lt;P&gt;for (int ip = 0; ip &amp;lt; 12; ip++) &amp;nbsp;{&lt;BR /&gt;&amp;nbsp; &amp;nbsp; double tmp = W12[ip]*det; &lt;BR /&gt;&amp;nbsp; &amp;nbsp; for (int j=0; j&amp;lt;10; ++j) &amp;nbsp;{&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; double tmp2 = FE0[ip][j]*tmp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; #pragma vector aligned&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; for (int k=0; k&amp;lt;10; ++k) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; A[j][k] += FE0[ip][k]*tmp2;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&amp;nbsp;// end loop over 'j'&lt;BR /&gt; } // end loop over 'ip'&lt;BR /&gt; &lt;BR /&gt;} // end function&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Compiling it with ICC 2013 (flags: -xAVX, -O3), I get a fairly expected result: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are performed with AVX instructions (mulpd, addpd). Then I padded the FE0 and A matrices to 12 elements and increased the k trip count to 12. The idea was that this way I would get a fully unrolled k loop carried out with just 3 "groups" of packed AVX instructions (mulpd, addpd), saving the time spent on peeling and, in general, on scalar instructions.&lt;/P&gt;
&lt;P&gt;Now the point is that if I compile the function with trip count 12, the compiler inserts a long sequence of movupd instructions both &lt;STRONG&gt;before&lt;/STRONG&gt; and &lt;STRONG&gt;after&lt;/STRONG&gt; the piece of assembly code representing the &lt;STRONG&gt;full unrolling of the loops over j and k&lt;/STRONG&gt;. These movupd instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, and then the function returns). For example:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;vmovupd 32(%r15), %ymm2&amp;nbsp;&lt;BR /&gt; vmovupd 96(%r15), %ymm14&amp;nbsp;&lt;BR /&gt; vmovupd %ymm15, 1280(%rsp)&amp;nbsp;&lt;BR /&gt; vmovupd 608(%r15), %ymm15&amp;nbsp;&lt;BR /&gt; vmovupd %ymm1, 1792(%rsp)&lt;BR /&gt; vmovupd %ymm2, 1824(%rsp)&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;# compilation of the loop nests&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;vmovupd 1760(%rsp), %ymm3&lt;BR /&gt; vmovupd %ymm15, 928(%r15)&lt;BR /&gt; vmovupd 1600(%rsp), %ymm15&lt;BR /&gt; vmovupd %ymm0, 544(%r15)&lt;BR /&gt; vmovupd %ymm1, 480(%r15)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Of course, you might ask: why care about such a mild (potential?) optimization in such a small function? Because the function is invoked millions of times.&lt;/P&gt;
&lt;P&gt;My questions are: what does that sequence of movupd instructions represent? And why is it inserted there with trip count 12?&lt;/P&gt;
&lt;P&gt;In the end, the version with trip count 10 runs faster than the one with trip count 12.&lt;/P&gt;
&lt;P&gt;By the way, if I increase the trip count to, say, 16, I don't get this weird behaviour.&lt;/P&gt;
&lt;P&gt;Thanks for considering my (long) request.&lt;/P&gt;
&lt;P&gt;Fabio&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Fri, 25 Jan 2013 09:54:35 GMT</pubDate>
    <dc:creator>FabioL_</dc:creator>
    <dc:date>2013-01-25T09:54:35Z</dc:date>
    <item>
      <title>Padding does not help AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Padding-does-not-help-AVX/m-p/923783#M3059</link>
      <description>&lt;P&gt;Hi all&lt;/P&gt;
&lt;P&gt;I have the following C function:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;void mass_ffc( double A[10][10], double x[3][2])&lt;BR /&gt;{&lt;BR /&gt; // Compute Jacobian of affine map from reference cell&lt;BR /&gt; const double J_00 = x[1][0] - x[0][0];&lt;BR /&gt;...&lt;BR /&gt; const double J_11 = x[2][1] - x[0][1];&lt;BR /&gt; &lt;BR /&gt; // Compute determinant of Jacobian&lt;BR /&gt; double detJ = J_00*J_11 - J_01*J_10;&lt;BR /&gt; const double det = fabs(detJ);&lt;BR /&gt; &lt;BR /&gt; // Array of quadrature weights.&lt;BR /&gt; const double W12[12] __attribute__((aligned(PADDING))) = { ....&amp;nbsp;};&lt;BR /&gt; &lt;BR /&gt; // Value of basis functions at quadrature points.&lt;BR /&gt; const double FE0[12][10] __attribute__((aligned(PADDING))) = \&lt;BR /&gt;{{0.0463079953908666, 0.440268993398561, 0.0463079953908666, 0.402250914961474, -0.201125457480737, -0.0145210435563256, -0.0145210435563258, ...0.283453533784293}};&lt;/P&gt;
&lt;P&gt;for (int ip = 0; ip &amp;lt; 12; ip++) &amp;nbsp;{&lt;BR /&gt;&amp;nbsp; &amp;nbsp; double tmp = W12[ip]*det; &lt;BR /&gt;&amp;nbsp; &amp;nbsp; for (int j=0; j&amp;lt;10; ++j) &amp;nbsp;{&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; double tmp2 = FE0[ip][j]*tmp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; #pragma vector aligned&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; for (int k=0; k&amp;lt;10; ++k) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; A[j][k] += FE0[ip][k]*tmp2;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&amp;nbsp;// end loop over 'j'&lt;BR /&gt; } // end loop over 'ip'&lt;BR /&gt; &lt;BR /&gt;} // end function&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Compiling it with ICC 2013 (flags: -xAVX, -O3), I get a fairly expected result: the innermost loop over k is fully unrolled, the first two iterations are peeled off, and the remaining 8 are performed with AVX instructions (mulpd, addpd). Then I padded the FE0 and A matrices to 12 elements and increased the k trip count to 12. The idea was that this way I would get a fully unrolled k loop carried out with just 3 "groups" of packed AVX instructions (mulpd, addpd), saving the time spent on peeling and, in general, on scalar instructions.&lt;/P&gt;
&lt;P&gt;Now the point is that if I compile the function with trip count 12, the compiler inserts a long sequence of movupd instructions both &lt;STRONG&gt;before&lt;/STRONG&gt; and &lt;STRONG&gt;after&lt;/STRONG&gt; the piece of assembly code representing the &lt;STRONG&gt;full unrolling of the loops over j and k&lt;/STRONG&gt;. These movupd instructions basically copy the elements of A to the stack (before) and from the stack back to A (after, and then the function returns). For example:&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;vmovupd 32(%r15), %ymm2&amp;nbsp;&lt;BR /&gt; vmovupd 96(%r15), %ymm14&amp;nbsp;&lt;BR /&gt; vmovupd %ymm15, 1280(%rsp)&amp;nbsp;&lt;BR /&gt; vmovupd 608(%r15), %ymm15&amp;nbsp;&lt;BR /&gt; vmovupd %ymm1, 1792(%rsp)&lt;BR /&gt; vmovupd %ymm2, 1824(%rsp)&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;# compilation of the loop nests&lt;/P&gt;
&lt;P&gt;...&lt;/P&gt;
&lt;P&gt;vmovupd 1760(%rsp), %ymm3&lt;BR /&gt; vmovupd %ymm15, 928(%r15)&lt;BR /&gt; vmovupd 1600(%rsp), %ymm15&lt;BR /&gt; vmovupd %ymm0, 544(%r15)&lt;BR /&gt; vmovupd %ymm1, 480(%r15)&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Of course, you might ask: why care about such a mild (potential?) optimization in such a small function? Because the function is invoked millions of times.&lt;/P&gt;
&lt;P&gt;My questions are: what does that sequence of movupd instructions represent? And why is it inserted there with trip count 12?&lt;/P&gt;
&lt;P&gt;In the end, the version with trip count 10 runs faster than the one with trip count 12.&lt;/P&gt;
&lt;P&gt;By the way, if I increase the trip count to, say, 16, I don't get this weird behaviour.&lt;/P&gt;
&lt;P&gt;Thanks for considering my (long) request.&lt;/P&gt;
&lt;P&gt;Fabio&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 25 Jan 2013 09:54:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Padding-does-not-help-AVX/m-p/923783#M3059</guid>
      <dc:creator>FabioL_</dc:creator>
      <dc:date>2013-01-25T09:54:35Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Padding-does-not-help-AVX/m-p/923784#M3060</link>
      <description>&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;My questions are: what that sequence of movupd instruction represents? And why is it inserted there with trip count &lt;STRONG&gt;12&lt;/STRONG&gt;? 
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;In the end, the version with trip count 10 goes faster than that with trip count &lt;STRONG&gt;12&lt;/STRONG&gt;.
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;By the way, if I increase the trip count to, let's say, &lt;STRONG&gt;16&lt;/STRONG&gt;, I don't get this weird behaviour.

Some optimization tricks go unexplained by software engineers; however, in a "battle" between the numbers &lt;STRONG&gt;12&lt;/STRONG&gt; and &lt;STRONG&gt;16&lt;/STRONG&gt;, Intel engineers prefer &lt;STRONG&gt;16&lt;/STRONG&gt;.

PS: Sorry for going off topic; here are two examples:

- with Intel C/C++ compiler &lt;STRONG&gt;sizeof( long double ) = 16&lt;/STRONG&gt; when option /Qlong-double is used

- SIMD structures, like:

typedef union __declspec(intrin_type) &lt;STRONG&gt;_CRT_ALIGN(16)&lt;/STRONG&gt; __m128 {
     float               m128_f32[4];
     unsigned __int64    m128_u64[2];
     __int8              m128_i8[16];
     __int16             m128_i16[8];
     __int32             m128_i32[4];
     __int64             m128_i64[2];
     unsigned __int8     m128_u8[16];
     unsigned __int16    m128_u16[8];
     unsigned __int32    m128_u32[4];
 } __m128;</description>
      <pubDate>Fri, 25 Jan 2013 14:10:21 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Padding-does-not-help-AVX/m-p/923784#M3060</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-01-25T14:10:21Z</dc:date>
    </item>
  </channel>
</rss>