<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &gt;&gt;...I was expecting for the in Intel® Fortran Compiler</title>
    <link>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957620#M93897</link>
    <description>&amp;gt;&amp;gt;...I was expecting for the single precision an 8x peak performance improvement with AVX and a 4x improvement with SSE4.2
&amp;gt;&amp;gt;by vectorization. But the results I got didn't match my expectation. I observed 4x using SSE compared to the novec version but
&amp;gt;&amp;gt;only &amp;lt;5x speedup for AVX. Did I miss anything?

Your expectations could be valid, especially the ...&lt;STRONG&gt;8x peak performance improvement with AVX&lt;/STRONG&gt;..., but only when the calculations are done on 8 single-precision values at a time and there are no cache-related overheads. As soon as the data set grows, performance decreases because of the load and store operations.

In reality... We recently completed a set of &lt;STRONG&gt;SSE vs. AVX&lt;/STRONG&gt; tests on &lt;STRONG&gt;Sandy Bridge vs. Ivy Bridge&lt;/STRONG&gt;, and the performance improvement ranged between &lt;STRONG&gt;~3x&lt;/STRONG&gt; and &lt;STRONG&gt;~6x&lt;/STRONG&gt; ( for the &lt;STRONG&gt;sqrt&lt;/STRONG&gt; operation ); the code ( C/C++ ) was aggressively optimized by the Intel C++ compiler 13.0.0.089 ( Initial Release ).</description>
    <pubDate>Fri, 01 Mar 2013 03:11:00 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2013-03-01T03:11:00Z</dc:date>
    <item>
      <title>AVX vs. SSE4.2 performance on Sandybridge</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957619#M93896</link>
      <description>&lt;P&gt;I took the sample vectorization code matrix_vector_multiplication_f and modified it a little to use allocatable memory. Then I compiled the code using two sets of options: 1) /QxSSE4.2 and /QaxAVX; 2) /QxSSE4.2, and ran both on an E5-2690. I was expecting, for single precision, an 8x peak performance improvement with AVX and a 4x improvement with SSE4.2 from vectorization. But the results I got didn't match my expectation. I observed 4x using SSE compared to the novec version but only &amp;lt;5x speedup for AVX. Did I miss anything?&lt;/P&gt;
&lt;P&gt;I used Fortran compiler XE 13.0.1.119 and Visual Studio 2008 Version 9.0.30729.1 SP. The OS is Windows Server 2008 R2 Standard SP1. 32-byte alignment and ipo are applied. The baseline is compiled with -O1 and the vectorized versions are compiled with -O3. I also varied the number of columns of the array and noticed performance drops as the total data size reaches 32KB and 256KB. I guess that is due to L1 and L2 cache misses. Is that correct?&lt;/P&gt;</description>
      <pubDate>Thu, 28 Feb 2013 23:17:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957619#M93896</guid>
      <dc:creator>xman_hawkeye</dc:creator>
      <dc:date>2013-02-28T23:17:10Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I was expecting for the</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957620#M93897</link>
      <description>&amp;gt;&amp;gt;...I was expecting for the single precision an 8x peak performance improvement with AVX and a 4x improvement with SSE4.2
&amp;gt;&amp;gt;by vectorization. But the results I got didn't match my expectation. I observed 4x using SSE compared to the novec version but
&amp;gt;&amp;gt;only &amp;lt;5x speedup for AVX. Did I miss anything?

Your expectations could be valid, especially the ...&lt;STRONG&gt;8x peak performance improvement with AVX&lt;/STRONG&gt;..., but only when the calculations are done on 8 single-precision values at a time and there are no cache-related overheads. As soon as the data set grows, performance decreases because of the load and store operations.

In reality... We recently completed a set of &lt;STRONG&gt;SSE vs. AVX&lt;/STRONG&gt; tests on &lt;STRONG&gt;Sandy Bridge vs. Ivy Bridge&lt;/STRONG&gt;, and the performance improvement ranged between &lt;STRONG&gt;~3x&lt;/STRONG&gt; and &lt;STRONG&gt;~6x&lt;/STRONG&gt; ( for the &lt;STRONG&gt;sqrt&lt;/STRONG&gt; operation ); the code ( C/C++ ) was aggressively optimized by the Intel C++ compiler 13.0.0.089 ( Initial Release ).</description>
      <pubDate>Fri, 01 Mar 2013 03:11:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957620#M93897</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-03-01T03:11:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...I also varied the number</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957621#M93898</link>
      <description>&amp;gt;&amp;gt;...I also varied the number of columns of the array and noticed performance drops as the total data size reaches 32KB and
&amp;gt;&amp;gt;256KB. I guess that is due to L1 and L2 cache misses. Is that correct?

Possibly yes. Please verify the L1 and L2 cache sizes for your CPU in its datasheet ( the PDF document linked on the right side of the product page ) on Ark.intel.com.</description>
      <pubDate>Fri, 01 Mar 2013 14:02:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957621#M93898</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-03-01T14:02:32Z</dc:date>
    </item>
    <item>
      <title>Matrix multiplication is an</title>
      <link>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957622#M93899</link>
      <description>&lt;P&gt;Matrix multiplication is an ideal application for demonstrating AVX performance. It depends strongly on tiling for L1 locality, hence the renewed emphasis on performance libraries such as MKL.&lt;/P&gt;
&lt;P&gt;You may notice with -O3 compilation that Intel Fortran can perform an automatic unroll-and-jam transformation so as to reduce the number of data reads and writes, but it will not do so as aggressively as the MKL library code. In my experience, MKL should begin to show an advantage as early as the case of minimum dimension 32.&lt;/P&gt;</description>
      <pubDate>Fri, 01 Mar 2013 14:21:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Fortran-Compiler/AVX-vs-SSE4-2-performance-on-Sandybridge/m-p/957622#M93899</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-03-01T14:21:09Z</dc:date>
    </item>
  </channel>
</rss>

