<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic I don't see where you in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029613#M5140</link>
    <description>&lt;P&gt;I don't see where you allocate memory for "total". Was this omitted from your code snippet?&lt;/P&gt;

&lt;P&gt;Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;&amp;nbsp;_mm256_store_ps(total,acc); &lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Alternatively, you can use a&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;__m256 total_m256&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;and then store your intermediate result to this variable:&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;_mm256_store_ps((float*)&amp;amp;total_m256,acc); &lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Kind regards&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Thomas&lt;/STRONG&gt;&lt;/P&gt;</description>
    <pubDate>Mon, 02 Jun 2014 08:20:08 GMT</pubDate>
    <dc:creator>Thomas_W_Intel</dc:creator>
    <dc:date>2014-06-02T08:20:08Z</dc:date>
    <item>
      <title>AVX _mm256_store_ps</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029611#M5138</link>
      <description>&lt;P&gt;Hi&lt;/P&gt;

&lt;P&gt;I want to run the following code using the AVX instruction set.&lt;/P&gt;

&lt;P&gt;It compiles without any problem, but it generates an error when I run it:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;./vec_avx.x&amp;nbsp;&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;"Segmentation fault (core dumped)"&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;Reviewing the code, the problem is in this instruction:&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;&amp;nbsp;&lt;STRONG&gt; _mm256_store_ps(&amp;amp;total,acc); //Error&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Could someone point out what the problem might be?&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;P.S.:&lt;/P&gt;

&lt;P&gt;I compile with the following command:&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;gcc -O3 vec_avx.c -mavx -o vec_avx.x&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;And the main code is as follows:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp; int i;&lt;BR /&gt;
	&amp;nbsp; float *total;&lt;BR /&gt;
	&amp;nbsp; __m256 v1, v2, v3, acc;&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_setzero_ps(); &amp;nbsp;// acc = |0|0|0|0|0|0|0|0|&lt;BR /&gt;
	&amp;nbsp; for (i=0; i&amp;lt;(ARRAY_SZ-8); i+=8){&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v1 &amp;nbsp;= _mm256_loadu_ps(a+i);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v2 &amp;nbsp;= _mm256_loadu_ps(b+i);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v3 &amp;nbsp;= _mm256_mul_ps(v1, v2);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; acc = _mm256_add_ps(acc, v3);&lt;BR /&gt;
	&amp;nbsp; }&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_hadd_ps(acc,acc);&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_hadd_ps(acc,acc);&lt;BR /&gt;
	&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; _mm256_store_ps(&amp;amp;total,acc); /////////////ERROR///////////////////////&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	&amp;nbsp; for (; i&amp;lt;ARRAY_SZ; i++)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; total += a[i] * b[i];&lt;BR /&gt;
	&amp;nbsp; return total;&lt;BR /&gt;
	}&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Sun, 01 Jun 2014 19:21:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029611#M5138</guid>
      <dc:creator>Anonymous18</dc:creator>
      <dc:date>2014-06-01T19:21:06Z</dc:date>
    </item>
    <item>
      <title>Probably float *total is not</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029612#M5139</link>
      <description>&lt;P&gt;Probably float *total is not aligned on a 32-byte boundary. Did you try zero-filling the total array?&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 05:18:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029612#M5139</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-06-02T05:18:25Z</dc:date>
    </item>
    <item>
      <title>I don't see where you</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029613#M5140</link>
      <description>&lt;P&gt;I don't see where you allocate memory for "total". Was this omitted from your code snippet?&lt;/P&gt;

&lt;P&gt;Apart from this, "total" is a pointer to float. You can therefore use it directly and shouldn't take its address:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;&amp;nbsp;_mm256_store_ps(total,acc); &lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Alternatively, you can use a&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;__m256 total_m256&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;and then store your intermediate result to this variable:&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;_mm256_store_ps((float*)&amp;amp;total_m256,acc); &lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Kind regards&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;Thomas&lt;/STRONG&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 08:20:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029613#M5140</guid>
      <dc:creator>Thomas_W_Intel</dc:creator>
      <dc:date>2014-06-02T08:20:08Z</dc:date>
    </item>
    <item>
      <title>If you would like to see</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029614#M5141</link>
      <description>&lt;P&gt;If you look at the disassembly, the float *total pointer will probably be declared but not initialized. IIRC, initialization could be done by loading &amp;amp;total[0] with the help of LEA REG,ADDR and filling it with 0.0, for example.&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 10:30:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029614#M5141</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-06-02T10:30:58Z</dc:date>
    </item>
    <item>
      <title>This is probably not</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029615#M5142</link>
      <description>&lt;P&gt;This is probably not an alignment issue, since _mm256_store_ps is probably translated to VMOVUPS, which works for both aligned and unaligned addresses.&lt;/P&gt;

&lt;P&gt;The problem is that your total should be of type "float" (instead of float *, because it is not a pointer, just a scalar value that holds the intermediate result of your AVX accumulation), and the _mm256_store_ps should be replaced with a store scalar instruction (i don't know if there is one) or something like:&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="kwd"&gt;_mm256_maskstore_ps(&amp;amp;total, _mm256_set&lt;/SPAN&gt;_epi32(0, 0, 0, 0, 0, 0, 0, ~0), &lt;SPAN class="kwd"&gt;acc);&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;This is not very efficient, but it saves you from the access violation (I'll let you find a more efficient way to write your algorithm).&lt;BR /&gt;
	&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 12:22:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029615#M5142</guid>
      <dc:creator>emmanuel_attia</dc:creator>
      <dc:date>2014-06-02T12:22:00Z</dc:date>
    </item>
    <item>
      <title>Quote:emmanuel.attia wrote</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029616#M5143</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;emmanuel.attia wrote:&lt;BR /&gt;should be replaced with a store scalar instruction (i don't know if there is one) &lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="name"&gt;_mm_store_ss&lt;/SPAN&gt; (&lt;SPAN class="param_type"&gt;float*&lt;/SPAN&gt; &lt;SPAN class="param_name"&gt;mem_addr&lt;/SPAN&gt;, &lt;SPAN class="param_type"&gt;__m128&lt;/SPAN&gt; &lt;SPAN class="param_name"&gt;a&lt;/SPAN&gt;) is the one to use IMO&lt;/P&gt;

&lt;P&gt;float total = 0.0f; &lt;SPAN class="name"&gt;_mm_store_ss(&lt;/SPAN&gt;&amp;amp;total,acc);&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 14:19:42 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029616#M5143</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-06-02T14:19:42Z</dc:date>
    </item>
    <item>
      <title>Hi Enmanuel. Thank you for</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029617#M5144</link>
      <description>&lt;P&gt;Hi Emmanuel. Thank you for your help.&lt;/P&gt;

&lt;P&gt;Here is my final algorithm:&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;{&lt;BR /&gt;
	&amp;nbsp; int i;&lt;BR /&gt;
	&amp;nbsp; float total;&lt;BR /&gt;
	&amp;nbsp; __m256 v1, v2, v3, acc;&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_setzero_ps(); &amp;nbsp;// acc = |0|0|0|0|0|0|0|0|&lt;BR /&gt;
	&amp;nbsp; for (i=0; i&amp;lt;(ARRAY_SZ-8); i+=8){&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v1 &amp;nbsp;= _mm256_loadu_ps(a+i);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v2 &amp;nbsp;= _mm256_loadu_ps(b+i);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v3 &amp;nbsp;= _mm256_mul_ps(v1, v2);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; acc = _mm256_add_ps(acc, v3);&lt;BR /&gt;
	&amp;nbsp; }&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_hadd_ps(acc,acc);&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_hadd_ps(acc,acc);&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&amp;nbsp;&lt;SPAN style="font-weight: 700;"&gt;_mm256_maskstore_ps(&amp;amp;total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; for (; i&amp;lt;ARRAY_SZ; i++)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; total += a[i] * b[i];&lt;/P&gt;

&lt;P&gt;&amp;nbsp; return total;&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;--------------------------------------------------------------------------------------------------------------------------------&lt;/P&gt;

&lt;P&gt;I had 2 extra inquiries:&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;1)&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;About the end result I show (&lt;STRONG&gt;1000016.000000&lt;/STRONG&gt;):&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;Array datatype &amp;nbsp;: float&lt;BR /&gt;
	# of runs &amp;nbsp; &amp;nbsp; &amp;nbsp; : 1000&lt;BR /&gt;
	Arrays size &amp;nbsp; &amp;nbsp; : 500000&lt;BR /&gt;
	Best Rate GB/s &amp;nbsp;: &amp;nbsp;19.93&lt;BR /&gt;
	Avg &amp;nbsp;Rate GB/s &amp;nbsp;: &amp;nbsp;18.85&lt;BR /&gt;
	Median Rate GB/s: &amp;nbsp;18.74&lt;BR /&gt;
	Avg time &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; 0.00&lt;BR /&gt;
	Min time &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; 0.00&lt;BR /&gt;
	Max time &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; 0.00&lt;BR /&gt;
	&lt;STRONG&gt;Product Result &amp;nbsp;: 1000016.000000&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;But the correct "&lt;STRONG&gt;Product Result&lt;/STRONG&gt;" should be &lt;STRONG&gt;2000000.000000&lt;/STRONG&gt;, since the vectors are initialized to the value 2 (I am attaching the result from running my algorithm without vectorization):&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;Kernel name &amp;nbsp; &amp;nbsp; : inner_prod&lt;BR /&gt;
	Array datatype &amp;nbsp;: float&lt;BR /&gt;
	# of runs &amp;nbsp; &amp;nbsp; &amp;nbsp; : 1000&lt;BR /&gt;
	Arrays size &amp;nbsp; &amp;nbsp; : 500000&lt;BR /&gt;
	Best Rate GB/s &amp;nbsp;: &amp;nbsp; 7.03&lt;BR /&gt;
	Avg &amp;nbsp;Rate GB/s &amp;nbsp;: &amp;nbsp; 6.68&lt;BR /&gt;
	Avg time &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; 0.00&lt;BR /&gt;
	Min time &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; 0.00&lt;BR /&gt;
	Max time &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;: &amp;nbsp; 0.00&lt;BR /&gt;
	&lt;STRONG&gt;Product Result &amp;nbsp;: 2000000.000000&lt;/STRONG&gt;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;-------------------------------------------------------------------------------------------------------------------------------------------------&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&lt;STRONG&gt;2)&amp;nbsp;&lt;/STRONG&gt;About a more efficient way to write:&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&amp;nbsp;&lt;SPAN style="font-weight: 700;"&gt;_mm256_maskstore_ps(&amp;amp;total, _mm256_set_epi32(0, 0, 0, 0, 0, 0, 0, ~0), acc);&lt;/SPAN&gt;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;Could it use a cast or convert intrinsic instead?&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;-------------------------------------------------------------------------------------------------------------------------------------------------&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;Thank you so much&lt;/P&gt;

&lt;P dir="ltr" style="line-height:1.15;margin-top:0pt;margin-bottom:0pt;"&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 14:31:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029617#M5144</guid>
      <dc:creator>Anonymous18</dc:creator>
      <dc:date>2014-06-02T14:31:56Z</dc:date>
    </item>
    <item>
      <title>Oh yes, right solution would</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029618#M5145</link>
      <description>&lt;P&gt;Oh yes, the right solution would be:&lt;/P&gt;

&lt;P&gt;_mm_store_ss(&amp;amp;total, _mm256_extractf128_si256(acc, 0));&lt;/P&gt;

&lt;P&gt;Thanks for the improvement&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 15:21:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029618#M5145</guid>
      <dc:creator>emmanuel_attia</dc:creator>
      <dc:date>2014-06-02T15:21:00Z</dc:date>
    </item>
    <item>
      <title>Hi Emanuel,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029619#M5146</link>
      <description>&lt;P&gt;Hi Emmanuel,&lt;/P&gt;

&lt;P&gt;I compile with:&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-weight: 700; font-size: 12px; line-height: 18px;"&gt;gcc -O3 vec_avx.c -mavx -o vec_avx.x&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;but with your instruction I now get the following error:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;vec_avx.c: In function ‘inner_prod_vec’:&lt;BR /&gt;
	vec_avx.c:104: error: incompatible type for argument 1 of ‘_mm256_extractf128_si256’&lt;BR /&gt;
	/usr/lib/gcc/x86_64-redhat-linux/4.4.7/include/avxintrin.h:484: note: expected ‘__m256i’ but argument is of type ‘__m256’&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;My final code is:&lt;/P&gt;

&lt;P&gt;DATATYPE inner_prod_vec(DATATYPE* a, DATATYPE* b)&lt;BR /&gt;
	{&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; int i;&lt;BR /&gt;
	&amp;nbsp; float total;&lt;BR /&gt;
	&amp;nbsp; __m256 v1, v2, v3, acc;&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_setzero_ps();&lt;BR /&gt;
	&amp;nbsp; for (i=0; i&amp;lt;(ARRAY_SZ-8); i+=8){&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v1 &amp;nbsp;= _mm256_loadu_ps(a+i);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v2 &amp;nbsp;= _mm256_loadu_ps(b+i);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; v3 &amp;nbsp;= _mm256_mul_ps(v1, v2);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; acc = _mm256_add_ps(acc, v3);&lt;BR /&gt;
	&amp;nbsp; }&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_hadd_ps(acc,acc);&lt;BR /&gt;
	&amp;nbsp; acc = _mm256_hadd_ps(acc,acc);&lt;BR /&gt;
	&amp;nbsp; &lt;STRONG&gt;_mm_store_ss(&amp;amp;total, _mm256_extractf128_si256(acc, 0)); //////ERROR/////////////////////////////////////////////////////////////////&lt;/STRONG&gt;&lt;BR /&gt;
	&amp;nbsp; for (; i&amp;lt;ARRAY_SZ; i++)&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; total += a[i] * b[i];&lt;BR /&gt;
	&amp;nbsp; return total;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;}&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 19:46:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029619#M5146</guid>
      <dc:creator>Anonymous18</dc:creator>
      <dc:date>2014-06-02T19:46:31Z</dc:date>
    </item>
    <item>
      <title>You should use _mm256</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029620#M5147</link>
      <description>&lt;P&gt;You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.&lt;/P&gt;

&lt;P&gt;There is also the possibility to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.&lt;/P&gt;
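
&lt;P&gt;A minimal sketch of how these intrinsics could be combined, assuming the accumulator "acc" from the code above. The helper name hsum256_ps is hypothetical and not from the thread. It relies on two points: _mm256_hadd_ps adds only within each 128-bit half, so the upper half has to be added to the lower half explicitly, and _mm_extract_ps returns the element's bit pattern as an int, so _mm_cvtss_f32 is used here to get the float:&lt;/P&gt;

&lt;PRE&gt;#include &amp;lt;immintrin.h&amp;gt;

/* Hypothetical helper: returns the sum of all eight floats in acc. */
static float hsum256_ps(__m256 acc)
{
    __m128 lo = _mm256_castps256_ps128(acc);   /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(acc, 1); /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);            /* four partial sums */
    s = _mm_hadd_ps(s, s);                     /* two partial sums */
    s = _mm_hadd_ps(s, s);                     /* total in every element */
    return _mm_cvtss_f32(s);                   /* read element 0 as a float */
}&lt;/PRE&gt;

&lt;P&gt;With such a helper, "total = hsum256_ps(acc);" would replace both _mm256_hadd_ps calls and the store before the scalar tail loop.&lt;/P&gt;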

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;</description>
      <pubDate>Mon, 02 Jun 2014 21:20:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029620#M5147</guid>
      <dc:creator>Thomas_W_Intel</dc:creator>
      <dc:date>2014-06-02T21:20:48Z</dc:date>
    </item>
    <item>
      <title>Quote:lex wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029621#M5148</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;lex wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;But the correct "Product Result" should be 2000000.000000, since the vectors are initialized to the value 2 (I am attaching the result from running my algorithm without vectorization)&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;it looks like there is a missing horizontal add in your code&lt;/P&gt;</description>
      <pubDate>Wed, 04 Jun 2014 18:11:51 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029621#M5148</guid>
      <dc:creator>bronxzv</dc:creator>
      <dc:date>2014-06-04T18:11:51Z</dc:date>
    </item>
    <item>
      <title>Quote:Thomas Willhalm (Intel)</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029622#M5149</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Thomas Willhalm (Intel) wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;You should use _mm256_extractf128_ps instead of _mm256_extractf128_si256. It takes __m256 as first argument instead of __m256i.&lt;/P&gt;

&lt;P&gt;There is also the possibility to use _mm_extract_ps to extract a float instead of storing it with _mm_store_ss.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Best regards&lt;/P&gt;

&lt;P&gt;Thomas&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Yes, sorry about my too quick answer.&lt;/P&gt;

&lt;P&gt;As for extract_ps vs store_ss, there is not much difference with a good compiler (like Intel C++), but sometimes extract is indeed handier (for instance, it doesn't force you to put a float variable on the stack when you only need it as a return value).&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jun 2014 16:49:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/AVX-mm256-store-ps/m-p/1029622#M5149</guid>
      <dc:creator>emmanuel_attia</dc:creator>
      <dc:date>2014-06-16T16:49:29Z</dc:date>
    </item>
  </channel>
</rss>

