<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic AVX sometimes slower than SSE in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820862#M1138</link>
    <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;.L_partially_aligned &lt;/P&gt;&lt;DIV id="_mcePaste"&gt;...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vmovups vec1[0:3], xmm0&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vinsertf128 $1, vec1[4:7], ymm0, ymm1 /* this block unrolled twice*/&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vaddps vec2[0:7], ymm1, ymm2&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vmovups ymm2, vec1[0:7]&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;....&lt;/DIV&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;if "vec1" isn't 32B aligned (which appears to be the case, since you do two 128-bit loads) it should be significantly faster to also split the final store into two 128-bit stores</description>
    <pubDate>Fri, 20 May 2011 23:16:37 GMT</pubDate>
    <dc:creator>bronxzv</dc:creator>
    <dc:date>2011-05-20T23:16:37Z</dc:date>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820856#M1132</link>
      <description>&lt;DIV&gt;&lt;DIV id="_mcePaste"&gt;Has anyone experienced a slow down by a factor of around 2 for certain functions that are converted from SSE to AVX-128?&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;My setup:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Intel Compiler icc V12.0.0.20101116&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Linux Kernel: 2.6.32-71.el6.x86_64&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;processor Intel Core i7-2600K CPU @ 3.4GHz&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Intel Speed Step *DISABLED*&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Affinity, locked to 1 core&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;Memory allocated 32 byte aligned&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;My compiler flags:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;SSE: -m64 -msse3 -axSSE3 -align&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;AVX: -m64 -xavx -align&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;I have compiled the following function:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;inline void vec_vec_add_overwrite( float *vec1, float *vec2, int n )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;{&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; long ii;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; for( ii = 0; ii &amp;lt; n; ii++ )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; {&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  vec1[ii] += vec2[ii];&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; }&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;}&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;My tests go along as follows:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;SetAffinity( core 0 )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;overhead = GetClockOverhead(NUMTESTS)&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;memset( clocks, 0, NUMTESTS *sizeof(clocks) )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;n = 5123 /*vector lengths*/&lt;/DIV&gt;&lt;DIV 
id="_mcePaste"&gt;for( i = 0 ; i &amp;lt; NUMTESTS; i++ )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;{&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vec1 = malloc( aligned 32, n length )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vec2 = malloc( aligned 32, n length )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; fill_with_random( vec1 )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; fill_with_random( vec2 )&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; _mm_clflush( vec1 );&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; _mm_clflush( vec2 );&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; _mm_fence();&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; before = ReadTSC() /* uses assembly CPUID call */&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vec_vec_add_overwrite( vec1, vec2, n );&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; clocks&lt;I&gt; = ReadTSC() - before;&lt;/I&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;}&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;RemoveOverhead(clocks, NUMTESTS, overhead)&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;print average(clocks[IGNORED_START_INDEX : END]) /* I THROW OUT A HANDFUL OF BEGINNING RESULTS TO REMOVE INITIAL TRANSIENTS */&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;The SSE version looks roughly like this (unix style assembly dest on right):&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; movss vec1, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; addss vec2, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; movss xmm1, vec1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;.L_aligned:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; movaps vec1, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; addps vec1, xmm1 &lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; movaps xmm1, vec1     /*this block unrolled twice*/&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV 
id="_mcePaste"&gt;.L_unaligned&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  movups vec1, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  movups vec2, xmm2 /*this block unrolled twice*/&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  addps vec2, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  movups xmm1, vec1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;.L_finishup:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  movss vec1, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  addss vec2, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  movss xmm1, vec1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ret&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; &lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; &lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;The AVX version looks roughly like this (unix style assembly dest on right):&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vmovss vec1, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vaddss vec2, xmm1, xmm2&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vmovss xmm2, vec1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;.L_partially_aligned&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  vmovups vec1[0:3], xmm0&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  vinsertf128 $1, vec1[4:7], ymm0, ymm1  /* this block unrolled twice*/&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  vaddps vec2[0:7], ymm1, ymm2&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  vmovups ymm2, vec1[0:7]&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ....&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; &lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;.L_finishup:&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vmovss vec1, xmm1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vaddss vec2, xmm1, xmm2&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vmovss xmm2, vec1&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;  ...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; vzeroupper&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt; 
ret&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Wed, 18 May 2011 19:22:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820856#M1132</guid>
      <dc:creator>Eric_Nuckols</dc:creator>
      <dc:date>2011-05-18T19:22:55Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820857#M1133</link>
      <description>Your C code is a little vague. Anyway, questions:&lt;OL&gt;&lt;LI&gt;You're benchmarking on a dataset size of 5123 Bytes?&lt;/LI&gt;&lt;LI&gt; You flush the first cacheline of that array before you start your measurement, why?&lt;/LI&gt;&lt;LI&gt;How do you really measure the overhead? My experience with the rdtscp call is to rather use a long-running test (&amp;gt;= 1ms). Overhead subtraction always gave funny numbers.&lt;/LI&gt;&lt;/OL&gt;And some answers:&lt;BR /&gt;&lt;OL&gt;&lt;LI&gt;Sandy-Bridge can do two 128-bit loads + one 128-bit store per cycle. Thus, with perfect ILP and unrolling, both loops (SSE and AVX) reach a 128-bit per cycle throughput.&lt;/LI&gt;&lt;LI&gt;You're doing one FLOP per two moves (load + store) per value. Even if your problem is contained in the L1 cache you're thus limited by the number of stores Sandy-Bridge can do. It could do four times more FLOP with AVX than your code could possibly reach.&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Thu, 19 May 2011 11:33:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820857#M1133</guid>
      <dc:creator>Matthias_Kretz</dc:creator>
      <dc:date>2011-05-19T11:33:57Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820858#M1134</link>
      <description>As Matthias pointed out, the upper limit for store performance of AVX on Sandy Bridge is the same as for SSE. That limit is approached only with nontemporal stores (which aren't applicable to your code), but the AVX compilation doesn't use nontemporal stores. I've asked for that to change, but it's not likely to change in the foreseeable future.&lt;BR /&gt;Since, as you found, your code is unrolled by 2, the maximum length of the scalar remainder loops has increased from 7 to 15. With the Intel compilers, AVX performance requires attention to alignment and to making the loop end, as well as begin, on a cache-line boundary. Other compilers may produce more efficient remainder loops.&lt;BR /&gt;You haven't communicated the alignment you assert to the compiler (e.g. by #pragma vector aligned). The use of an unaligned move for aligned data in itself gives you no performance penalty, but you could avoid the vinsertf128 step if you persuaded the compiler to specialize for 32-byte alignment.</description>
      <pubDate>Thu, 19 May 2011 13:36:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820858#M1134</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-19T13:36:40Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820859#M1135</link>
      <description>&lt;DIV&gt;@&lt;A class="basic" href="http://software.intel.com/en-us/profile/416997/"&gt;Matthias Kretz&lt;/A&gt;&lt;/DIV&gt;Answers/reasoning:&lt;DIV&gt;1. That particular number, 5123, was the length of the floating-point buffers. I chose smaller array lengths for a few reasons:&lt;/DIV&gt;&lt;DIV&gt;  a. I don't want to worry about any kind of pre-emption or OS-related noise in my results, so I want to get in and out quickly.&lt;/DIV&gt;&lt;DIV&gt;  b. I have a lot of other functions and variations of each of the functions that I test repeatedly in a test bed, and many times I just want to see if quick compiler options have any significance.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;2. I flush cache lines because I am trying to get apples-to-apples comparisons between my functions that are C code/compiler-generated assembly, hand-coded assembly, the MKL API, the IPP API, etc., and also I am comparing gcc and icc performance. Basically I am attempting to eliminate variables in the benchmark.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;3. I am using the ideas from Agner Fog's optimization guide and samples for removing overhead and measuring clock cycles. So it consists of something like this:&lt;/DIV&gt;&lt;DIV&gt;   loop&lt;/DIV&gt;&lt;DIV&gt;   {&lt;/DIV&gt;&lt;DIV&gt;    before = ReadTSC()&lt;/DIV&gt;&lt;DIV&gt;    after[i] = ReadTSC() - before&lt;/DIV&gt;&lt;DIV&gt;   }&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;   overhead = max of (after buffer)&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I used to get funny numbers, but that seemed to be corrected by following Mr. 
Fog's directions and disabling Speed Step and any other dynamic clocking features in the BIOS.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;@Tim:&lt;/DIV&gt;&lt;DIV&gt;I noticed that the compiler definitely did not do the same job of alignment tricks for AVX as it did for SSE.&lt;/DIV&gt;&lt;DIV&gt;I hand-coded an AVX version that had somewhat better alignment and was able to make it faster, but I was never able to approach the SSE speed. I am obviously still learning.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;@All:&lt;/DIV&gt;&lt;DIV&gt;In my tests I am seeing that, for operations that are heavy on streaming data and light on actually doing math on the data, the effort required to jump from SSE to AVX does not currently bring reasonable returns (i.e. the 2 loads but only 1 store per cycle, not to mention the burden of devising new alignment tricks).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Are we to expect the Intel pro compilers and the MKL/IPP libraries to change quickly in the near future to address better alignment algorithms, so that at the least the auto-generated AVX code doesn't drop below the SSE performance, or so that the compiler is sharp enough (without #pragma awesomeness enabled) to use SSE where it is as fast or faster?&lt;/DIV&gt;</description>
      <pubDate>Thu, 19 May 2011 16:35:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820859#M1135</guid>
      <dc:creator>Eric_Nuckols</dc:creator>
      <dc:date>2011-05-19T16:35:04Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820860#M1136</link>
      <description>I've already hinted that I expect the use of "#pragma vector aligned" to be necessary to take advantage of 32-byte alignment, along with measures to avoid spending more time in remainder loops when the loop count isn't a multiple of 16. I've heard of efforts to improve performance of the remainder loops, but no assurance that it will appear in the "near future."&lt;BR /&gt;As you've seen, the compiler drops back to 128-bit memory access when it doesn't know the alignment.&lt;BR /&gt;I haven't checked Sandy Bridge for an effect which is prominent on Westmere, where alignment at odd multiples of 8 bytes is handled much better by gnu compilers (by splitting memory access into 64-bit moves, similar to the way your code is split explicitly by the compiler into 128-bit moves). For double precision, this can produce an effect where icc -msse2 is faster than icc -xhost. From all I've heard about architecture presentations, the compiler team has been directed not to look for or optimize for this situation.</description>
      <pubDate>Thu, 19 May 2011 17:26:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820860#M1136</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2011-05-19T17:26:59Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820861#M1137</link>
      <description>I have noticed the #pragma vector aligned statement doesn't produce much faster code than without the statement.&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;The only significant difference that I can see is that without the pragma, the compiler uses the aliased xmm* regs, and with the pragma, it uses the full ymm* moves.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;It never generates separate loops for different alignment cases.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I guess I am just confused about why the performance wouldn't closely match that of SSE when all buffers are aligned, when the SSE is directly translated from movaps to vmovaps, and a vzeroupper is added before the ret.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;It doesn't seem to be related to remainder loops, because I have set up the length of the vector to be a multiple of 32 bytes, so I'm getting all of my work done in the primary loop. Additionally, there is less upfront logic in this version than in the SSE one, because there is only 1 big loop followed by the remainder loop.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;There are about 42 bytes worth of instructions in the loop. It's unrolled twice (16 bytes worth of data in, 8 bytes out).&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Is there a glaring error in my approach? I know that I am throughput limited on the stores, but that limitation should be the same regardless of SSE or AVX. The vzeroupper call is supposed to have no time penalty.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;Thanks for the comments and help thus far and for any further responses. If I'm doing everything right, I will stop beating on this and just fall back to the SSE for the time being.&lt;/DIV&gt;</description>
      <pubDate>Fri, 20 May 2011 22:55:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820861#M1137</guid>
      <dc:creator>Eric_Nuckols</dc:creator>
      <dc:date>2011-05-20T22:55:24Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820862#M1138</link>
      <description>&lt;BLOCKQUOTE&gt;&lt;P&gt;.L_partially_aligned &lt;/P&gt;&lt;DIV id="_mcePaste"&gt;...&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vmovups vec1[0:3], xmm0&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vinsertf128 $1, vec1[4:7], ymm0, ymm1 /* this block unrolled twice*/&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vaddps vec2[0:7], ymm1, ymm2&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;vmovups ymm2, vec1[0:7]&lt;/DIV&gt;&lt;DIV id="_mcePaste"&gt;....&lt;/DIV&gt;&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;if "vec1" isn't 32B aligned (which appears to be the case, since you do two 128-bit loads) it should be significantly faster to also split the final store into two 128-bit stores</description>
      <pubDate>Fri, 20 May 2011 23:16:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820862#M1138</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2011-05-20T23:16:37Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820863#M1139</link>
      <description>yeah, the vinsertf128 was auto-generated by the compiler, since the compiler doesn't do those 32-byte alignment optimizations without the #pragma vector aligned.&lt;DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;my arrays are allocated on 32-byte boundaries and I've made the lengths multiples of 32 bytes to avoid remainder loops.&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;I have also changed the code to:&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;.L_aligned:&lt;/DIV&gt;&lt;DIV&gt; vmovaps (vec1), ymm0&lt;/DIV&gt;&lt;DIV&gt; vaddps (vec2), ymm0, ymm1&lt;/DIV&gt;&lt;DIV&gt; vmovaps ymm1, (vec1)&lt;/DIV&gt;&lt;DIV&gt; vmovaps 32(vec1), ymm0&lt;/DIV&gt;&lt;DIV&gt; vaddps 32(vec2), ymm0, ymm1&lt;/DIV&gt;&lt;DIV&gt; vmovaps ymm1, 32(vec1)&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;and have seen only a slight improvement...&lt;/DIV&gt;&lt;DIV&gt;&lt;/DIV&gt;&lt;DIV&gt;on this particular box, my cycle count for SSE is ~2800, and my fastest AVX loop yet is ~3500&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Fri, 20 May 2011 23:30:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820863#M1139</guid>
      <dc:creator>Eric_Nuckols</dc:creator>
      <dc:date>2011-05-20T23:30:26Z</dc:date>
    </item>
    <item>
      <title>AVX sometimes slower than SSE</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820864#M1140</link>
      <description>&lt;DIV&gt;from my experience, the best speedup from SSE to AVX-256 for such code is at best 1.5x with a 100% L1D hit rate and something like 1.25x with a dataset fitting in L2; I'm not sure about the LLC, and there is obviously no speedup at all if you're RAM bandwidth bound&lt;BR /&gt;&lt;BR /&gt;you may be able to improve it slightly by grouping the adjacent moves like this:&lt;BR /&gt;&lt;BR /&gt;vmovaps (vec1), ymm0&lt;BR /&gt;vmovaps 32(vec1), ymm2&lt;BR /&gt;vaddps (vec2), ymm0, ymm1&lt;BR /&gt;vaddps 32(vec2), ymm2, ymm3&lt;BR /&gt;vmovaps ymm1, (vec1)&lt;BR /&gt;vmovaps ymm3, 32(vec1)&lt;BR /&gt;&lt;BR /&gt;&lt;/DIV&gt;</description>
      <pubDate>Sat, 21 May 2011 00:20:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-sometimes-slower-than-SSE/m-p/820864#M1140</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2011-05-21T00:20:11Z</dc:date>
    </item>
  </channel>
</rss>

