<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Sergey, in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946744#M4055</link>
    <description>&lt;P&gt;Sergey,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp; I recompiled my software on a machine that had gcc 4.8.2 and even updated my compiler flags to reflect the following:&lt;/P&gt;

&lt;P&gt;-O3 -march=core-avx-i -mtune=core-avx-i&lt;/P&gt;

&lt;P&gt;I am, however, getting on average the exact same timing numbers as before...which to me is very odd. I can't help but think I am missing something trivial...&lt;/P&gt;

&lt;P&gt;Thanks again for your help in this matter.&lt;/P&gt;</description>
    <pubDate>Mon, 09 Dec 2013 21:57:42 GMT</pubDate>
    <dc:creator>James_S_7</dc:creator>
    <dc:date>2013-12-09T21:57:42Z</dc:date>
    <item>
      <title>AVX Optimizations and Performance: VisualStudio vs GCC</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946717#M4028</link>
      <description>&lt;P&gt;Greetings,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; I have recently written some code using AVX function calls to perform a convolution in my software. I have compiled and run this code on two platforms with the following compilation settings of note:&lt;/P&gt;
&lt;P&gt;1. &lt;STRONG&gt;Windows 7 w/ Visual Studio 2010 on a i7-2760QM&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; Optimization: Maximize Speed (/O2)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; Inline Function Expansion: Only __inline(/Ob1)&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; Enable Intrinsic Functions: No&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; Favor Size or Speed: Favor fast code (/Ot)&lt;/P&gt;
&lt;P&gt;2. &lt;STRONG&gt;Fedora Linux 15 w/ gcc 4.6 on a i7-3612QE&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; Flags: -O3 -mavx -m64 -march=corei7-avx -mtune=corei7-avx&lt;/P&gt;
&lt;P&gt;For my testing I ran the C implementation and the AVX implementation on both platforms and got the following timing results:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In Visual Studio:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;C Implementation: 30ms&lt;/P&gt;
&lt;P&gt;AVX Implementation: 5ms&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;In GCC:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;C Implementation: 9ms&lt;/P&gt;
&lt;P&gt;AVX Implementation: 57ms&lt;/P&gt;
&lt;P&gt;As you can tell, my AVX numbers on Linux are very large by comparison. My concern and reason for this post is that I may not have a proper understanding of using AVX and the settings to use it properly in both scenarios. For example, take my Visual Studio run. If I change the Enable Intrinsic Functions flag to &lt;STRONG&gt;Yes&lt;/STRONG&gt;, my AVX numbers go from 5ms to 59ms. Does that mean disallowing the compiler from optimizing with intrinsics and setting them manually in Visual Studio gives that much better results? Last I checked there is nothing similar in gcc. Could Microsoft really be that much more capable of a good compile than gcc in this case? Any ideas why my AVX numbers on gcc are that much larger? Any help is most appreciated. Cheers.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Oct 2013 01:46:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946717#M4028</guid>
      <dc:creator>James_S_7</dc:creator>
      <dc:date>2013-10-02T01:46:37Z</dc:date>
    </item>
    <item>
      <title>Sorry but I am confused.Did</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946718#M4029</link>
      <description>&lt;P&gt;Sorry, but I am confused. Did you use inline AVX assembly in your code or SIMD AVX intrinsics?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Oct 2013 13:10:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946718#M4029</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-02T13:10:43Z</dc:date>
    </item>
    <item>
      <title>My apologies for not being</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946719#M4030</link>
      <description>&lt;P&gt;My apologies for not being more specific. I used SIMD AVX intrinsics...more specifically the functions: _mm256_loadu_ps, _mm256_mul_ps, _mm256_add_ps, and _mm256_storeu_ps.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 02 Oct 2013 13:14:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946719#M4030</guid>
      <dc:creator>James_S_7</dc:creator>
      <dc:date>2013-10-02T13:14:18Z</dc:date>
    </item>
    <item>
      <title>First question I see that you</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946720#M4031</link>
      <description>&lt;P&gt;First question: I see that you are comparing compiled code on two different processor generations. How do you measure your code's performance?&lt;/P&gt;</description>
      <pubDate>Wed, 02 Oct 2013 17:41:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946720#M4031</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-02T17:41:32Z</dc:date>
    </item>
    <item>
      <title>I am measuring performance</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946721#M4032</link>
      <description>&lt;P&gt;I am measuring performance by timing the operation (the operation being the convolution of the data). So, I am using native libraries to grab a timestamp and determine the length in milliseconds. Yes, they are different generations, but I would presume the newer generation would give better numbers on AVX than the older one. This is why I am thinking something is wrong with the gcc version or with how I have set the optimization flags.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Oct 2013 18:01:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946721#M4032</guid>
      <dc:creator>James_S_7</dc:creator>
      <dc:date>2013-10-02T18:01:49Z</dc:date>
    </item>
    <item>
      <title>Have you looked at</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946722#M4033</link>
      <description>&lt;P&gt;Have you looked at the disassembled code as it was generated by those two compilers? Some of the intrinsics are not directly translated to a single machine code instruction, but I presume that you are doing convolution on digital data, so the intrinsics used should mainly be load, store, add and mul. Moreover, there are additional factors like memory and cache performance and the overall load of the system at the time of measurement.&lt;/P&gt;</description>
      <pubDate>Wed, 02 Oct 2013 19:25:38 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946722#M4033</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-02T19:25:38Z</dc:date>
    </item>
    <item>
      <title>There are also additional</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946723#M4034</link>
      <description>&lt;P&gt;There are also additional factors, like uncertainties related to the thread being swapped out in the middle of the code being measured. So basically, when the thread's execution is resumed, the wait time can also be included.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Oct 2013 19:43:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946723#M4034</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-03T19:43:08Z</dc:date>
    </item>
    <item>
      <title>iliypolak,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946724#M4035</link>
      <description>&lt;P&gt;iliypolak,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; Thank you very much for your responses. I retrieved the assembly code from gcc and Visual Studio for both the AVX and C implementations of what I am doing. The Visual Studio comparison was fairly clear: the AVX implementation showed the following assembly where my AVX calls were made:&lt;/P&gt;
&lt;P&gt;; Line 190&lt;BR /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;vmovups&amp;nbsp;&amp;nbsp; &amp;nbsp;ymm3, YMMWORD PTR [eax]&lt;BR /&gt;; Line 192&lt;BR /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;vmulps&amp;nbsp;&amp;nbsp; &amp;nbsp;ymm3, ymm3, YMMWORD PTR [ecx]&lt;BR /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;add&amp;nbsp;&amp;nbsp; &amp;nbsp;eax, edi&lt;BR /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;add&amp;nbsp;&amp;nbsp; &amp;nbsp;ecx, 32&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;; 00000020H&lt;BR /&gt;; Line 194&lt;BR /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;vaddps&amp;nbsp;&amp;nbsp; &amp;nbsp;ymm0, ymm3, ymm0&lt;/P&gt;
&lt;P&gt;The C implementation was much larger by comparison (I will not post it) and contained a plethora of moves, adds, and multiplies. Thus, it was clear that the Visual Studio compiler utilized the AVX intrinsics and reduced my code size considerably. The gcc assembly, however, was not as clear. The AVX version contains what I believe to be the AVX assembly, but it differs from what Visual Studio produced:&lt;/P&gt;
&lt;P&gt;vmulps&amp;nbsp;&amp;nbsp; &amp;nbsp;%ymm1, %ymm6, %ymm1&lt;/P&gt;
&lt;P&gt;vmulps&amp;nbsp;&amp;nbsp; &amp;nbsp;%ymm1, %ymm5, %ymm1&lt;/P&gt;
&lt;P&gt;etc., as this occurs 5 times over. I do notice that in Visual Studio the vmulps call referenced a pointer location with "YMMWORD PTR [ecx]", whereas gcc uses direct variables. The C implementation from gcc did not contain any of the AVX assembly; however, it was shorter in overall size than the AVX version.&lt;/P&gt;
&lt;P&gt;In regards to your second question, the code running on linux with gcc has its affinity set to avoid context switching if that is what you were referring to. Thanks again for all of your help.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Oct 2013 12:09:46 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946724#M4035</guid>
      <dc:creator>James_S_7</dc:creator>
      <dc:date>2013-10-04T12:09:46Z</dc:date>
    </item>
    <item>
      <title>VS implementation as seen in</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946725#M4036</link>
      <description>&lt;P&gt;The VS implementation, as seen in that assembly snippet, at line 190 loads (or dereferences a pointer to) the array, probably an input to your convolution function. Next, at line 192, there is a multiplication by the convolution coefficients, which is part of a loop not seen in the snippet, and two lines below there is pointer arithmetic. At line 194 there is a summation with a load onto the ymm0 register that is not shown in the snippet. The GCC implementation probably preloads the ymm registers and does the multiplication on registers directly.&lt;/P&gt;</description>
      <pubDate>Fri, 04 Oct 2013 13:43:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946725#M4036</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-04T13:43:25Z</dc:date>
    </item>
    <item>
      <title>Do you think that this ("GCC</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946726#M4037</link>
      <description>&lt;P&gt;Do you think that this ("GCC implementation probably preloads ymm registers and do multiplication on registers directly.") is the reason gcc is performing so much slower than its Visual Studio counterpart?&lt;/P&gt;</description>
      <pubDate>Mon, 07 Oct 2013 13:53:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946726#M4037</guid>
      <dc:creator>James_S_7</dc:creator>
      <dc:date>2013-10-07T13:53:40Z</dc:date>
    </item>
    <item>
      <title>Hi James</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946727#M4038</link>
      <description>&lt;P&gt;Hi James&lt;/P&gt;
&lt;P&gt;I cannot answer that because you did not upload a full disassembly of the GCC-generated code. But I suppose that the ymm register(s) must have been loaded with either the convolution function input or the convolution function coefficients. On Haswell, two loads can be performed in parallel. In the VS code you have a load of one data stream and a mul of that stream with another stream loaded from memory or cache; I think those two operations can be performed in parallel by using the physical registers of the register file. The last operation is dependent on the previous two operations.&lt;/P&gt;
      <pubDate>Mon, 07 Oct 2013 14:57:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946727#M4038</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-10-07T14:57:05Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;In Visual Studio:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946728#M4039</link>
      <description>&amp;gt;&amp;gt;In Visual Studio:
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;C Implementation: 30ms
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;AVX Implementation: 5ms
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;In GCC:
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;C Implementation: 9ms
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;AVX Implementation: 57ms

In essence, your results are very different from my results, which are based on performance evaluation of some linear algebra algorithms.

I would rate the three most widely used C++ compilers as follows:

1. Intel C++ compiler ( versions 12.x and 13.x )
2. GCC-like MinGW ( version 4.8.1 )
3. Microsoft C++ compiler ( VS 2010 )

Take into account that the &lt;STRONG&gt;core&lt;/STRONG&gt; parts of these linear algebra algorithms are &lt;STRONG&gt;individually&lt;/STRONG&gt; optimized for every C++ compiler in order to get the best possible performance, because every compiler uses different techniques to optimize code, to do vectorization, etc. Another thing is compiler options, and I've also tuned those as well as possible.</description>
      <pubDate>Fri, 22 Nov 2013 07:20:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946728#M4039</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-22T07:20:29Z</dc:date>
    </item>
    <item>
      <title>Did you rate compilers</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946729#M4040</link>
      <description>&lt;P&gt;Did you rate compilers according to&amp;nbsp;the &amp;nbsp;code optimization techniques?&lt;/P&gt;</description>
      <pubDate>Fri, 22 Nov 2013 14:56:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946729#M4040</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-22T14:56:05Z</dc:date>
    </item>
    <item>
      <title>The fastest execution is</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946730#M4041</link>
      <description>Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.</description>
      <pubDate>Sat, 23 Nov 2013 00:00:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946730#M4041</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-23T00:00:52Z</dc:date>
    </item>
    <item>
      <title>When comparing performance of</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946731#M4042</link>
      <description>&lt;P&gt;When comparing performance of AVX intrinsics against compiler's choice of AVX instructions, you must observe the recommendation that _mm256_loadu_ps must be used only on aligned data for Sandy Bridge.&amp;nbsp; Even on the newer generations, splitting unaligned loads, as the AVX compilation options do, will frequently run faster.&amp;nbsp; _mm256_storeu_ps requires aligned data for satisfactory performance on both Sandy and Ivy Bridge CPUs, so compilers will use peeling for alignment or split them to AVX-128 when permitted to do so.&lt;/P&gt;

&lt;P&gt;The CPU architects were aware of the tendency of VS2010 coders to use _mm256_loadu_ps and so put in a fix in Ivy Bridge to alleviate the penalty for unaligned data.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;VS2012 introduced a limited degree of auto-vectorization as an alternative to vectorization by intrinsics.&amp;nbsp; gcc 4.6 as well is a bit too old for use in evaluating AVX auto-vectorization.&lt;/P&gt;

&lt;P&gt;We never found out why so much emphasis was placed on reduced numbers of instructions with AVX when it was well known that this would produce little performance gain in many situations.&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 23 Nov 2013 12:42:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946731#M4042</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-11-23T12:42:10Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946732#M4043</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Faster execution is better than slower. For example, the older Intel C++ v12.x outperforms the latest MinGW v4.8.1 by ~10-15%.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp;Was the performance of Intel C++ version 12.x better than &amp;nbsp;MS VC++ compiler?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;I bet that Intel compiler writers expertise could outperform competing compilers mainly in the area of code optimization as a function of specific microarchitecture and code parallelization and vectorization.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Sat, 23 Nov 2013 16:41:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946732#M4043</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-23T16:41:25Z</dc:date>
    </item>
    <item>
      <title>You could make up a benchmark</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946733#M4044</link>
      <description>&lt;P&gt;You could make up a benchmark entirely within the range of situations where MSVC++ (VS2012 or 2013) auto-vectorizes, and find that compiler performing fully as well as the others.&lt;/P&gt;

&lt;P&gt;You could set ground rules, as many people do, where you enable aggressive optimizations on one compiler and not another.&lt;/P&gt;

&lt;P&gt;Any percentage performance rankings are highly dependent on benchmark content.&lt;/P&gt;

&lt;P&gt;You might perhaps set up a table of which compilers perform selected categories of optimizations, according to compilation flags.&lt;/P&gt;</description>
      <pubDate>Sat, 23 Nov 2013 17:53:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946733#M4044</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-11-23T17:53:18Z</dc:date>
    </item>
    <item>
      <title>When I will receive my</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946734#M4045</link>
      <description>&lt;P&gt;When I receive my Parallel Studio licence file, I plan to test the Intel, MSVC++ and MinGW compilers.&lt;/P&gt;

&lt;P&gt;Thanks for the interesting advice on how to perform such a test.&lt;/P&gt;</description>
      <pubDate>Sun, 24 Nov 2013 07:53:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946734#M4045</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-11-24T07:53:37Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Was the performance of</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946735#M4046</link>
      <description>&amp;gt;&amp;gt;...Was the performance of Intel C++ version 12.x better than  MS VC++ compiler?

Yes.</description>
      <pubDate>Mon, 25 Nov 2013 14:26:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946735#M4046</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-11-25T14:26:32Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946736#M4047</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;...Was the performance of Intel C++ version 12.x better than MS VC++ compiler?&lt;/P&gt;

&lt;P&gt;Yes.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;MSVC++ sometimes optimizes loop-carried data dependency recursions and switch statements better than ICL.&amp;nbsp;&lt;/P&gt;

&lt;P&gt;On the other side, in auto-vectorization (first implemented in VS2012, where ICL had it for well over a decade), the following optimizations seem to be missing in MSVC++:&lt;/P&gt;

&lt;P&gt;taking advantage of __RESTRICT to enable vectorization&lt;/P&gt;

&lt;P&gt;simd optimization of sum and inner_product reductions&lt;/P&gt;

&lt;P&gt;simd optimization based on assertions to overcome "protects exception"&lt;/P&gt;

&lt;P&gt;simd optimization of OpenMP for loops (some of these not introduced in ICL or gcc until this year)&lt;/P&gt;

&lt;P&gt;simd optimization of non-unitary strides&lt;/P&gt;

&lt;P&gt;vectorizable math functions&lt;/P&gt;

&lt;P&gt;simd optimization of STL transform()&lt;/P&gt;

&lt;P&gt;optimizations depending on non-overlapping array sections (for which ICL requires assertions, but gcc optimizes without assertion)&lt;/P&gt;

&lt;P&gt;simd optimizations depending on in-lining&lt;/P&gt;

&lt;P&gt;optimization based on "node splitting"&lt;/P&gt;

&lt;P&gt;optimization of std::max and min (g++ doesn't optimize these, although it seemingly could use gfortran machinery to do so)&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; g++ can optimize fmax/fmin when -ffinite-math-only is set (so why not std:max/min?)&lt;/P&gt;

&lt;P&gt;optimization based on data alignment assertion&lt;/P&gt;

&lt;P&gt;Of course, most of these optimizations are more relevant to floating point and parallelizable applications than to those for which MSVC++ is more directly targeted.&amp;nbsp; Even in the floating point applications, MSVC++ is likely to optimize at least 50% of vectorizable loops.&lt;/P&gt;</description>
      <pubDate>Mon, 25 Nov 2013 15:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-Optimizations-and-Performance-VisualStudio-vs-GCC/m-p/946736#M4047</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-11-25T15:05:00Z</dc:date>
    </item>
  </channel>
</rss>

