<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &amp;gt;&amp;gt;Arthur, You need to take in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987231#M4756</link>
    <description>&amp;gt;&amp;gt;Arthur, You need to take into account an overhead of calls for all AVX intrinsic functions ( unless these calls are inlined! ):
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;&amp;gt;&amp;gt; gettimeofday(&amp;amp;t0, NULL);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; for (int i=0; i&amp;lt;SIZE; i+=4)
&amp;gt;&amp;gt;&amp;gt;&amp;gt; {
&amp;gt;&amp;gt;&amp;gt;&amp;gt; ymm0 = _mm256_load_pd(a+i);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; ymm1 = _mm256_load_pd(b+i);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; ymm2 = _mm256_mul_pd(ymm0, ymm1);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; _mm256_stream_pd(c+i, ymm2);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; }
&amp;gt;&amp;gt;&amp;gt;&amp;gt; gettimeofday(&amp;amp;t1, NULL);
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;These calls are affecting performance and that is why the scalar version is faster.

I'm experiencing a similar problem and I see when intrinsic functions are &lt;STRONG&gt;Not&lt;/STRONG&gt; inlined performance is really affected ( slower by ~4 times! ).</description>
    <pubDate>Fri, 12 Apr 2013 01:23:00 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2013-04-12T01:23:00Z</dc:date>
    <item>
      <title>Storing data is bottleneck?</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987219#M4744</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;
&lt;P&gt;I'm writing some AVX example code like the one below:&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;double a[SIZE]__attribute__((aligned(32)));&lt;BR /&gt;&amp;nbsp; &amp;nbsp;double b[SIZE]__attribute__((aligned(32)));&lt;BR /&gt;&amp;nbsp; &amp;nbsp;double c[SIZE]__attribute__((aligned(32)));&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp;srand(time(NULL));&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp;for(int i=0; i&amp;lt;SIZE; i++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; a[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; b[i] = rand()/(double)RAND_MAX + rand()/(double)RAND_MAX * pow(10,-8);&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; __m256d ymm0, ymm1, ymm2;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&amp;nbsp; gettimeofday(&amp;amp;t0,NULL);&lt;BR /&gt;&amp;nbsp; for(int i=0; i&amp;lt;SIZE; i+=4) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ymm0 = _mm256_load_pd(a+i);&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ymm1 = _mm256_load_pd(b+i);&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ymm2 = _mm256_mul_pd(ymm0, ymm1);&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; _mm256_store_pd(c+i, ymm2);&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&lt;BR /&gt;&amp;nbsp; &amp;nbsp; gettimeofday(&amp;amp;t1,NULL);&lt;/P&gt;
&lt;P&gt;&amp;nbsp; &amp;nbsp; double time1;&lt;BR /&gt;&amp;nbsp; &amp;nbsp; time1 = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec)*1.0E-6;&lt;BR /&gt;&lt;BR /&gt;&amp;nbsp; &amp;nbsp;double sum = 0.0;&lt;BR /&gt;&amp;nbsp; &amp;nbsp;for(int i=0; i&amp;lt;SIZE; i++) {&lt;BR /&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; sum += c[i];&lt;BR /&gt;&amp;nbsp; &amp;nbsp; }&lt;/P&gt;
&lt;P&gt;The resulting time1 was 6.750000e-04 sec.&lt;BR /&gt;That is slower than the scalar version, which took around 5.0e-04 sec.&lt;/P&gt;
&lt;P&gt;Then I found that if I comment out the store (_mm256_store_pd(c+i, ymm2);), the code gets much faster (time1 drops to 1.9300e-04 sec).&lt;/P&gt;
&lt;P&gt;According to these results, I think that storing data from a ymm register to memory is the bottleneck... but is that right?&lt;BR /&gt;Is there a good way to store the data without increasing the execution time?&lt;/P&gt;
&lt;P&gt;(The actual code was attached.)&lt;BR /&gt;OS: Mac OS X 10.8.2&lt;BR /&gt;CPU: 2GHz Intel Core i7&lt;BR /&gt;Compiler: gcc 4.8&lt;BR /&gt;Compiler options: -mavx (AVX version only)&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Jan 2013 10:31:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987219#M4744</guid>
      <dc:creator>Arthur_U_</dc:creator>
      <dc:date>2013-01-09T10:31:45Z</dc:date>
    </item>
    <item>
      <title>As the 256-bit store is split</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987220#M4745</link>
      <description>As the 256-bit store is split by current hardware, it's easily possible that the store takes 50% of the time if you change it to nontemporal, 70% if you leave as is on account of read for ownership.  It would take more analysis to see if it's possible to explain why AVX intrinsics would slow it down.  The compiler might be expected to choose reasonable unrolling for C source, while icc doesn't unroll intrinsics (gcc will do so under -funroll-loops and associated options).
I assume you're not quoting the full compiler options, e.g. you must be using -O2 or -O3 (which implies auto-vectorization).</description>
      <pubDate>Wed, 09 Jan 2013 15:59:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987220#M4745</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-01-09T15:59:00Z</dc:date>
    </item>
    <item>
      <title>Thank you for your reply.</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987221#M4746</link>
      <description>Thank you for your reply.

At first, I tried to apply nontemporal storing by using _mm256_stream_pd() but there seemed to be no change in execution time. 
And then I tried to quote -O2 and -O3. That reduced execution time (around 3.7~5.0e-04(sec)) but scalar version still returns better results (around 3.4~4.2e-04(sec))..

What do you think about these results ?

Is gcc's auto-vectorization smarter than my intrinsic code?</description>
      <pubDate>Fri, 11 Jan 2013 10:45:34 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987221#M4746</guid>
      <dc:creator>Arthur_U_</dc:creator>
      <dc:date>2013-01-11T10:45:34Z</dc:date>
    </item>
    <item>
      <title>gcc 4.7 and newer do a good</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987222#M4747</link>
      <description>gcc 4.7 and newer do a good job with AVX auto-vectorization (invoked by -O3 or -O2 -ftree-vectorize) for simple cases, which this appears to be. gcc will often drop to AVX-128 if it doesn't see alignment, which is often a good decision for current platforms.  You'd want to examine the output code by -S or objdump -S, as well as the vectorizer report e.g. -ftree-vectorizer-verbose=2.  You'd also want to investigate gcc unrolling control, e.g. -funroll-loops --param max-unroll-times=4
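
For reference, a minimal sketch of the kind of plain C loop gcc is being asked to auto-vectorize here. The function name mul_arrays and the restrict qualifiers are assumptions for illustration and are not taken from the attached code; restrict only promises the compiler that the three arrays do not overlap.

<![CDATA[
```c
#include <stddef.h>

/* Hypothetical scalar version of the loop: c[i] = a[i] * b[i].
   The restrict qualifiers tell the compiler that the arrays do not alias,
   which removes one common obstacle to auto-vectorization. */
void mul_arrays(const double *restrict a, const double *restrict b,
                double *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}
```
]]>

Compiling such a function with the options mentioned above (for example -O3 -mavx, with -S to look at the generated code) should show whether the loop was vectorized.
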
Compilers are getting more complex with options to control loop optimization.</description>
      <pubDate>Fri, 11 Jan 2013 15:47:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987222#M4747</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-01-11T15:47:35Z</dc:date>
    </item>
    <item>
      <title>Hello,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987223#M4748</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;I have some ideas.&lt;/P&gt;
&lt;P&gt;First you might want to increase the size or repeat the test several times, to get more meaningful times. In my opinion times are too short for making predictions based on them.&lt;/P&gt;
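&lt;P&gt;A minimal sketch of that idea in C, assuming a hypothetical helper best_of_runs() and a kernel passed as a function pointer (neither name is from the original code): run the kernel several times and keep the fastest wall-clock time, so that a single sub-millisecond sample does not decide the comparison.&lt;/P&gt;
<![CDATA[
```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical kernel signature: computes c[i] = a[i] * b[i] for n elements. */
typedef void (*kernel_fn)(const double *a, const double *b, double *c, size_t n);

/* Plain scalar kernel, used here only as something to time. */
void mul_scalar(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

/* Elapsed seconds between two gettimeofday() samples, as in the original post. */
static double elapsed_sec(struct timeval t0, struct timeval t1)
{
    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6;
}

/* Run the kernel `runs` times and return the fastest time in seconds.
   Returns -1.0 if runs is not positive. */
double best_of_runs(kernel_fn kernel, const double *a, const double *b,
                    double *c, size_t n, int runs)
{
    double best = -1.0;
    for (int r = 0; r < runs; r++) {
        struct timeval t0, t1;
        gettimeofday(&t0, NULL);
        kernel(a, b, c, n);
        gettimeofday(&t1, NULL);
        double dt = elapsed_sec(t0, t1);
        if (r == 0 || dt < best)
            best = dt;
    }
    return best;
}
```
]]>
&lt;P&gt;The same helper could then be called once with the scalar kernel and once with an AVX kernel of the same signature, and the two best times compared.&lt;/P&gt;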
&lt;P&gt;Secondly, the Intel Intrinsics guide gave me a hint.&lt;/P&gt;
&lt;P&gt;_mm256_mul_pd has a latency of 5 cycles, which is significant for a loop this short. So you might want to try unrolling the loop yourself: do the loads and the multiply for the first iteration, then the same for the second iteration, and only then do the two stores. Or, even better, unroll 4 loop iterations. I think this should hide the latency and thus improve performance.&lt;/P&gt;</description>
      <pubDate>Sun, 13 Jan 2013 10:07:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987223#M4748</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-01-13T10:07:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;... I tried to quote -O2</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987224#M4749</link>
      <description>&amp;gt;&amp;gt;... I tried to quote -O2 and -O3. That reduced execution time (around 3.7~5.0e-04(sec)) but scalar version still returns
&amp;gt;&amp;gt;better results (around 3.4~4.2e-04(sec))..
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;What do you think about these results ?

Arthur, You need to take into account an &lt;STRONG&gt;overhead of calls&lt;/STRONG&gt; for all AVX intrinsic functions ( unless these calls are inlined! ):

&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;    gettimeofday(&amp;amp;t0, NULL);
&amp;gt;&amp;gt;    for (int i=0; i&amp;lt;SIZE; i+=4)
&amp;gt;&amp;gt;    {
&amp;gt;&amp;gt;        ymm0 = &lt;STRONG&gt;_mm256_load_pd&lt;/STRONG&gt;(a+i);
&amp;gt;&amp;gt;        ymm1 = &lt;STRONG&gt;_mm256_load_pd&lt;/STRONG&gt;(b+i);
&amp;gt;&amp;gt;        ymm2 = &lt;STRONG&gt;_mm256_mul_pd&lt;/STRONG&gt;(ymm0, ymm1);
&amp;gt;&amp;gt;        &lt;STRONG&gt;_mm256_stream_pd&lt;/STRONG&gt;(c+i, ymm2);
&amp;gt;&amp;gt;    }
&amp;gt;&amp;gt;    gettimeofday(&amp;amp;t1, NULL);

These calls are affecting performance and that is why the scalar version is faster.</description>
      <pubDate>Sun, 13 Jan 2013 23:45:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987224#M4749</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-01-13T23:45:00Z</dc:date>
    </item>
    <item>
      <title>I use a very simple "trick" (</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987225#M4750</link>
      <description>I use a very simple "trick" ( &lt;STRONG&gt;already suggested by Christian&lt;/STRONG&gt; ) to improve performance of some &lt;STRONG&gt;for&lt;/STRONG&gt; loops:

&lt;STRONG&gt;Instead of:&lt;/STRONG&gt;
...
double sum = 0.0L;
for( int i=0; i &amp;lt; SIZE; i++ )
{
        sum += c[i];
}
...

&lt;STRONG&gt;Use manual or #pragma directive based unrolling ( 4-in-1 or 8-in-1 ):&lt;/STRONG&gt;
...
double sum = 0.0L;
for( int i=0; i &amp;lt; SIZE; i+=4 )  // assumes SIZE is a multiple of 4
{
        sum += ( c[i] + c[i+1] + c[i+2] + c[i+3] );
}
...

Note: I added initialization to 0.0L of &lt;STRONG&gt;sum&lt;/STRONG&gt; variable.

[ EDITED ] Due to well known problems with arrow-left and arrow-right characters</description>
      <pubDate>Sun, 13 Jan 2013 23:54:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987225#M4750</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-01-13T23:54:00Z</dc:date>
    </item>
    <item>
      <title>Here is a summary...</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987226#M4751</link>
      <description>Here is a summary...

&amp;gt;&amp;gt;...Storing data is bottleneck?

No. It is an overhead of 400,000 calls to AVX intrinsic functions.</description>
      <pubDate>Mon, 14 Jan 2013 00:06:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987226#M4751</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-01-14T00:06:03Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987227#M4752</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Here is a summary...&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;...Storing data is bottleneck?&lt;/P&gt;
&lt;P&gt;No. It is an overhead of 400,000 calls to AVX intrinsic functions.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;The reciprocal throughput of the call instruction is 2 cycles per instruction, so multiplying the 4 function calls by the loop counter value (400000) gives a total of 3.2e6 cycles spent on function calls. That is a lot of wasted cycles.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Jan 2013 05:35:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987227#M4752</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-01-14T05:35:15Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987228#M4753</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;Here is a summary...&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;...Storing data is bottleneck?&lt;/P&gt;
&lt;P&gt;No. It is an overhead of 400,000 calls to AVX intrinsic functions.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I think this is interesting. Visual Studio 2010 inlines the intrinsics. Generally one might try using the option -Oi. If I am not mistaken, it tells the compiler to inline intrinsics generally. This should cut down the overhead.&lt;/P&gt;</description>
      <pubDate>Mon, 14 Jan 2013 19:58:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987228#M4753</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-01-14T19:58:30Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Visual Studio 2010</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987229#M4754</link>
      <description>&amp;gt;&amp;gt;...&lt;STRONG&gt;Visual Studio 2010 inlines the intrinsics&lt;/STRONG&gt;. Generally one might try using the option -Oi. If I am not mistaken,
&amp;gt;&amp;gt;it tells the compiler to inline intrinsics generally. This should cut down the overhead...

Please try to do your own verification in the VS debugger and let us know. Thanks in advance.

Best regards,
Sergey</description>
      <pubDate>Tue, 15 Jan 2013 06:08:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987229#M4754</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-01-15T06:08:14Z</dc:date>
    </item>
    <item>
      <title>I compiled the following code</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987230#M4755</link>
      <description>&lt;P&gt;I compiled the following code in VS 2010&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt; for (int i=0; i&amp;lt;SIZE; i+=4)&lt;BR /&gt; &amp;gt;&amp;gt; {&lt;BR /&gt; &amp;gt;&amp;gt; ymm0 = &lt;STRONG&gt;_mm256_load_pd&lt;/STRONG&gt;(a+i);&lt;BR /&gt; &amp;gt;&amp;gt; ymm1 = &lt;STRONG&gt;_mm256_load_pd&lt;/STRONG&gt;(b+i);&lt;BR /&gt; &amp;gt;&amp;gt; ymm2 = &lt;STRONG&gt;_mm256_mul_pd&lt;/STRONG&gt;(ymm0, ymm1);&lt;BR /&gt; &amp;gt;&amp;gt; &lt;STRONG&gt;_mm256_stream_pd&lt;/STRONG&gt;(c+i, ymm2);&lt;BR /&gt; &amp;gt;&amp;gt; }&lt;/P&gt;
&lt;P&gt;In the normal release build configuration, you directly get assembler instructions. I checked the disassembly (using a breakpoint).&lt;/P&gt;
&lt;P&gt;The thing is that VS 2010 in standard release config has option /Oi on. Removing this option generated the same code (for this loop).&lt;/P&gt;</description>
      <pubDate>Fri, 18 Jan 2013 11:49:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987230#M4755</guid>
      <dc:creator>Christian_M_2</dc:creator>
      <dc:date>2013-01-18T11:49:48Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;Arthur, You need to take</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987231#M4756</link>
      <description>&amp;gt;&amp;gt;Arthur, You need to take into account an overhead of calls for all AVX intrinsic functions ( unless these calls are inlined! ):
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;&amp;gt;&amp;gt;...
&amp;gt;&amp;gt;&amp;gt;&amp;gt; gettimeofday(&amp;amp;t0, NULL);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; for (int i=0; i&amp;lt;SIZE; i+=4)
&amp;gt;&amp;gt;&amp;gt;&amp;gt; {
&amp;gt;&amp;gt;&amp;gt;&amp;gt; ymm0 = _mm256_load_pd(a+i);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; ymm1 = _mm256_load_pd(b+i);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; ymm2 = _mm256_mul_pd(ymm0, ymm1);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; _mm256_stream_pd(c+i, ymm2);
&amp;gt;&amp;gt;&amp;gt;&amp;gt; }
&amp;gt;&amp;gt;&amp;gt;&amp;gt; gettimeofday(&amp;amp;t1, NULL);
&amp;gt;&amp;gt;
&amp;gt;&amp;gt;These calls are affecting performance and that is why the scalar version is faster.

I'm experiencing a similar problem and I see when intrinsic functions are &lt;STRONG&gt;Not&lt;/STRONG&gt; inlined performance is really affected ( slower by ~4 times! ).</description>
      <pubDate>Fri, 12 Apr 2013 01:23:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987231#M4756</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-04-12T01:23:00Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987232#M4757</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Quote:&lt;/STRONG&gt;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;&lt;EM&gt;Sergey Kostrov&lt;/EM&gt;wrote:
&lt;P&gt;Here is a summary...&lt;/P&gt;
&lt;P&gt;&amp;gt;&amp;gt;...Storing data is bottleneck?&lt;/P&gt;
&lt;P&gt;No. It is an overhead of 400,000 calls to AVX intrinsic functions.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;The reciprocal throughput of the call instruction is 2 cycles per instruction, so multiplying the 4 function calls by the loop counter value (400000) gives a total of 3.2e6 cycles spent on function calls. That is a lot of wasted cycles.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I have forgotten to add the overhead of ret instruction.&lt;/P&gt;</description>
      <pubDate>Fri, 12 Apr 2013 05:45:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987232#M4757</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2013-04-12T05:45:37Z</dc:date>
    </item>
    <item>
      <title>Hi Christian,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987233#M4758</link>
      <description>Hi Christian,

&amp;gt;&amp;gt;...From my experience I can only advise you to use VS2012. VS2010 produces sometimes code that has very poor performance...

The problem is applicable to Intel ( version 13 Update 2 ) and Microsoft C++ compilers with Visual Studio 2008. Also, if I don't use intrinsic functions and use /O2 and /fp:fast, or /O3 and /fp:fast=2 ( for Intel ) compiler options, then the code is significantly faster (!).</description>
      <pubDate>Fri, 12 Apr 2013 16:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987233#M4758</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-04-12T16:05:00Z</dc:date>
    </item>
    <item>
      <title>Quote:Sergey Kostrov wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987234#M4759</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I'm experiencing a similar problem and I see when intrinsic functions are &lt;STRONG&gt;Not&lt;/STRONG&gt; inlined performance is really affected ( slower by ~4 times! ).&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;
&lt;P&gt;oughhh! that's scary&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sat, 13 Apr 2013 10:28:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987234#M4759</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2013-04-13T10:28:48Z</dc:date>
    </item>
    <item>
      <title>/O2 /fp:fast /arch:SSE2|AVX</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987235#M4760</link>
      <description>&lt;P&gt;/O2 /fp:fast /arch:SSE2|AVX in VS2012 is roughly equivalent to ICL /O2 /fp:source /Qansi-alias /arch:...&amp;nbsp; These will auto-vectorize some of the simpler situations without requiring intrinsics. I hate to comment again about /fp:fast having different meanings among these compilers.&lt;/P&gt;
&lt;P&gt;ICL will auto-vectorize more situations with /O3 and /fp:fast, or with substitution of CEAN or pragmas.&amp;nbsp; CEAN inherently includes effect of /Qansi-alias and pragmas vector always and ivdep.&lt;/P&gt;
&lt;P&gt;ICL /Qansi-alias /Qcomplex-limited-range /arch:SSE4.1 is roughly equivalent to gcc -O3 -ffast-math -march=corei7.&lt;/P&gt;
&lt;P&gt;I could believe that /Oi- or /Od (or debug build mode) disable in-line expansion of intrinsics in one or more compilers, but I haven't studied this.&amp;nbsp; I'm not clear if that was what was meant in this thread.&lt;/P&gt;</description>
      <pubDate>Mon, 15 Apr 2013 17:07:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987235#M4760</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2013-04-15T17:07:50Z</dc:date>
    </item>
    <item>
      <title>I've just completed</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987236#M4761</link>
      <description>I've just completed additional investigation and here is my report ( it is applicable to &lt;STRONG&gt;Intel&lt;/STRONG&gt; and &lt;STRONG&gt;Microsoft&lt;/STRONG&gt; C++ compilers ):

1. Let's say there is some &lt;STRONG&gt;Algorithm&lt;/STRONG&gt;.

2. Two versions of the &lt;STRONG&gt;Algorithm&lt;/STRONG&gt; are implemented. That is, &lt;STRONG&gt;Without Intrinsic Functions&lt;/STRONG&gt; ( Pure C / Version 1 ) and &lt;STRONG&gt;With Intrinsic Functions&lt;/STRONG&gt; ( Version 2 ).

3. When all optimizations are &lt;STRONG&gt;Disabled&lt;/STRONG&gt;, for example in Debug configuration, Version 2 could outperform Version 1.

4. When &lt;STRONG&gt;/O2&lt;/STRONG&gt; or &lt;STRONG&gt;/O3&lt;/STRONG&gt; and &lt;STRONG&gt;/fp:fast&lt;/STRONG&gt; or &lt;STRONG&gt;/fp:fast=2&lt;/STRONG&gt; optimizations are &lt;STRONG&gt;Enabled&lt;/STRONG&gt;, in Release configuration, then Version 1 outperforms Version 2 ( ~3.5 times for &lt;STRONG&gt;Intel&lt;/STRONG&gt; and ~2 times for &lt;STRONG&gt;Microsoft&lt;/STRONG&gt; C++ compilers ).

5. I verified the generated *.asm files and I was able to see that both C++ compilers generated &lt;STRONG&gt;very efficient assembler code&lt;/STRONG&gt; for Version 1 of the &lt;STRONG&gt;Algorithm&lt;/STRONG&gt;, which means that application of intrinsic functions &lt;STRONG&gt;in some cases&lt;/STRONG&gt; doesn't help to improve performance (!).

6. I see that in my case I &lt;STRONG&gt;wasted time&lt;/STRONG&gt; on implementation and testing of some &lt;STRONG&gt;Algorithm&lt;/STRONG&gt; &lt;STRONG&gt;With Intrinsic Functions&lt;/STRONG&gt;.

7. Ideally, it always makes sense to implement two versions of some algorithm ( when a developer has time ) and compare performance of both versions when the most aggressive optimizations are Enabled.</description>
      <pubDate>Mon, 15 Apr 2013 23:52:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Storing-data-is-bottleneck/m-p/987236#M4761</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2013-04-15T23:52:00Z</dc:date>
    </item>
  </channel>
</rss>

