<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Jim, in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062345#M54638</link>
    <description>&lt;P&gt;Hi Jim,&lt;/P&gt;

&lt;P&gt;Yes, that is what I mean. In each 4*4*4 section, the computation is just like the code given in the previous post. Some elements of c[] are used only once; others are used two or three times.&lt;/P&gt;

&lt;P&gt;Thanks!&lt;/P&gt;</description>
    <pubDate>Tue, 18 Apr 2017 14:04:36 GMT</pubDate>
    <dc:creator>zhen_j_</dc:creator>
    <dc:date>2017-04-18T14:04:36Z</dc:date>
    <item>
      <title>Efficiently use vector registers</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062323#M54616</link>
      <description>&lt;P style="margin-bottom: 1em; border: 0px; font-size: 15px; clear: both; color: rgb(36, 39, 41); font-family: Arial, 'Helvetica Neue', Helvetica, sans-serif; line-height: 19.5px;"&gt;I am considering vectorizing an application on Xeon Phi. The calculation part of the program looks like this (only part of the code is shown):&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
t[1] = (c[1]+c[17]+ c[33]) + (c[5]+c[21]+c[37]) +(c[9]+c[25]+c[41]);
t[2] = (c[2]+c[18]+ c[34]) + (c[6]+c[22]+c[38]) +(c[10]+c[26]+c[42]);
t[3] = (c[3]+c[19]+ c[35]) + (c[7]+c[23]+c[39]) +(c[11]+c[27]+c[43]);

t[4] = (c[4]+c[20]+ c[36]) - (c[8]+c[24]+c[40]) -(c[12]+c[28]+c[44]);
t[5] = (c[5]+c[21]+ c[37]) - (c[9]+c[25]+c[41]) -(c[13]+c[29]+c[45]);
t[6] = (c[6]+c[22]+ c[38]) - (c[10]+c[26]+c[42]) -(c[14]+c[30]+c[46]);
t[7] = (c[7]+c[23]+ c[39]) - (c[11]+c[27]+c[43]) -(c[15]+c[31]+c[47]);

t[8] = (c[16]-c[32]- c[48]) + (c[20]-c[36]-c[52]) +(c[24]-c[40]-c[56]);
t[9] = (c[17]-c[33]- c[49]) + (c[21]-c[37]-c[53]) +(c[25]-c[41]-c[57]);
t[10] = (c[18]-c[34]- c[50]) + (c[22]-c[38]-c[54]) +(c[26]-c[42]-c[58]);
t[11] = (c[19]-c[35]- c[51]) + (c[23]-c[39]-c[55]) +(c[27]-c[43]-c[59]);&lt;/PRE&gt;

&lt;P style="margin-bottom: 1em; border: 0px; font-size: 15px; clear: both; color: rgb(36, 39, 41); font-family: Arial, 'Helvetica Neue', Helvetica, sans-serif; line-height: 19.5px;"&gt;It loads data into an array c, adds or subtracts the elements of c, and finally stores the results in the array t. Each element of c, such as c[0], holds 16 floats: the data type of each element in c and t is __m512. The length of c is 64. The problem is that Xeon Phi has only 32 vector registers, so I cannot keep all 64 elements of c in registers. I need to load part of the data, calculate some results, and then load more. I may need to load some data several times in order to finish the whole calculation. I notice that some intermediate values in the code are used several times. The whole procedure is like a tree.&lt;/P&gt;

&lt;P style="margin-bottom: 1em; border: 0px; font-size: 15px; clear: both; color: rgb(36, 39, 41); font-family: Arial, 'Helvetica Neue', Helvetica, sans-serif; line-height: 19.5px;"&gt;So the question is: is there an algorithm that can efficiently decide when to load data, perform computation, and store the intermediate data?&lt;/P&gt;

&lt;P style="margin-bottom: 1em; border: 0px; font-size: 15px; clear: both; color: rgb(36, 39, 41); font-family: Arial, 'Helvetica Neue', Helvetica, sans-serif; line-height: 19.5px;"&gt;Another question: I need to interleave the computation and memory operations to achieve better performance. Any&amp;nbsp;suggestions?&lt;/P&gt;</description>
      <pubDate>Tue, 21 Mar 2017 16:46:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062323#M54616</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-03-21T16:46:37Z</dc:date>
    </item>
    <item>
      <title>Hi Zhen Jia,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062324#M54617</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Hi Zhen Jia,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I think that the most appropriate tool to provide you an answer for vectorisation is Vectorization Advisor, included in Intel Advisor XE.&lt;/P&gt;

&lt;P&gt;The following link provides videos and tutorials that will answer most of your questions related to vectorisation and the usage of Vectorization Advisor:&amp;nbsp;&lt;A href="https://software.intel.com/en-us/articles/vectorization-advisor-faq"&gt;https://software.intel.com/en-us/articles/vectorization-advisor-faq&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 01:47:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062324#M54617</guid>
      <dc:creator>gaston-hillar</dc:creator>
      <dc:date>2017-03-30T01:47:24Z</dc:date>
    </item>
    <item>
      <title>Thanks Gaston! I will try.</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062325#M54618</link>
      <description>&lt;P&gt;Thanks Gaston! I will try.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 01:50:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062325#M54618</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-03-30T01:50:50Z</dc:date>
    </item>
    <item>
      <title>Hi Zhen Jia,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062326#M54619</link>
      <description>&lt;P style="word-wrap: break-word; font-size: 12px;"&gt;Hi Zhen Jia,&lt;/P&gt;

&lt;DIV&gt;The following link provides a great video by Vadim Karpusenko and Andrey Vladimirov:&amp;nbsp;&lt;A href="https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation"&gt;https://software.intel.com/en-us/videos/episode-4-2-automatic-vectorization-and-array-notation&lt;/A&gt;&lt;/DIV&gt;

&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;

&lt;DIV&gt;They discuss the automatic vectorization feature of the compilers, where it can be used, and how to diagnose it. I believe this video will give you valuable information.&lt;/DIV&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:03:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062326#M54619</guid>
      <dc:creator>gaston-hillar</dc:creator>
      <dc:date>2017-03-30T02:03:00Z</dc:date>
    </item>
    <item>
      <title>Hi Zhen Jia,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062327#M54620</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Hi Zhen Jia,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;I have pointed you to tools and documentation that I believe will be useful for optimizing not only the&amp;nbsp;piece of code you provided but also future pieces of code. The tools rock, believe me. I've been using them for a long time, and once you start&amp;nbsp;using these tools, there is no way back. :)&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:05:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062327#M54620</guid>
      <dc:creator>gaston-hillar</dc:creator>
      <dc:date>2017-03-30T02:05:12Z</dc:date>
    </item>
    <item>
      <title>Great! Gaston,Thanks so much!</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062328#M54621</link>
      <description>&lt;P&gt;Great! Gaston, thanks so much!&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em; line-height: 1.5;"&gt;By tools, do you mean&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px; line-height: 16.3636px;"&gt;Vectorization Advisor?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:08:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062328#M54621</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-03-30T02:08:31Z</dc:date>
    </item>
    <item>
      <title>Hi Zhen Jia,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062329#M54622</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Hi Zhen Jia,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Yeah, I mean Vectorization Advisor. In fact, it is a product with many tools&amp;nbsp;included. Basically, you get advice on how to&amp;nbsp;vectorize, and it makes your life much easier with such a complex topic.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:13:27 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062329#M54622</guid>
      <dc:creator>gaston-hillar</dc:creator>
      <dc:date>2017-03-30T02:13:27Z</dc:date>
    </item>
    <item>
      <title>I see. I will use it.</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062330#M54623</link>
      <description>&lt;P&gt;I see. I will use it.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:15:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062330#M54623</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-03-30T02:15:08Z</dc:date>
    </item>
    <item>
      <title>BTW, I forgot to mention a</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062331#M54624</link>
      <description>&lt;P&gt;BTW, I forgot to mention a very important thing. You can download a free trial and check whether the tool is helpful for you. Here is the link:&amp;nbsp;&lt;A href="https://software.intel.com/en-us/intel-advisor-xe"&gt;https://software.intel.com/en-us/intel-advisor-xe&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:15:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062331#M54624</guid>
      <dc:creator>gaston-hillar</dc:creator>
      <dc:date>2017-03-30T02:15:22Z</dc:date>
    </item>
    <item>
      <title>Got it.</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062332#M54625</link>
      <description>&lt;P&gt;Got it.&lt;/P&gt;</description>
      <pubDate>Thu, 30 Mar 2017 02:20:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062332#M54625</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-03-30T02:20:17Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...So the question is that</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062333#M54626</link>
      <description>&amp;gt;&amp;gt;...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and
&amp;gt;&amp;gt;store the intermediate data?

A &lt;STRONG&gt;gather&lt;/STRONG&gt;-like approach is possible in your case because, to calculate:
...
t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);
...
indices &lt;STRONG&gt;0, 4, 8, 16, 20, 24, 32, 36, 40&lt;/STRONG&gt; need to be used to access elements of the array. I don't guarantee a performance boost, but there is no need to try to do everything in "one shot": a performance boost could be achieved by refactoring that code. For example, instead of one &lt;STRONG&gt;for-loop&lt;/STRONG&gt;, three independent &lt;STRONG&gt;for-loops&lt;/STRONG&gt; could be used.
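
A minimal sketch of that index pattern, assuming c is declared as __m512 c[64] as described in the first post: for i in 0..3, t[i] reads c[i + 4*a + 16*b] with a and b in {0, 1, 2}. Under that assumption each index selects a whole 64-byte vector at a fixed offset, so an ordinary vector load may be enough and a hardware gather instruction may not be needed. The function name t_input_indices is hypothetical and is used only for illustration.

```cpp
// Hypothetical helper: fills idx[0..8] with the c[] indices read by t[i],
// valid for i in 0..3. The pattern is i + 4*a + 16*b with a, b in {0, 1, 2}.
void t_input_indices(int i, int* idx) {
    int n = 0;
    for (int b = 0; b != 3; ++b) {        // steps of 16: the three terms of a group
        for (int a = 0; a != 3; ++a) {    // steps of 4: the three groups
            idx[n] = i + 4 * a + 16 * b;
            ++n;
        }
    }
}
```

For i = 0 this fills idx with 0, 4, 8, 16, 20, 24, 32, 36, 40, which matches the list above.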

Take a look at an article: &lt;STRONG&gt;What to do when Auto Vectorization fails?&lt;/STRONG&gt;
.
&lt;A href="https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails" target="_blank"&gt;https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails&lt;/A&gt;</description>
      <pubDate>Thu, 06 Apr 2017 18:03:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062333#M54626</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-04-06T18:03:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;For each element of c, like</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062334#M54627</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.&lt;/P&gt;

&lt;P&gt;The way I read this is you have&lt;/P&gt;

&lt;P&gt;__m512 c[64];&lt;BR /&gt;
	__m512 t[16];&lt;BR /&gt;
	...&lt;BR /&gt;
	for(int i=0; i&amp;lt;4; ++i)&lt;BR /&gt;
	&amp;nbsp; t[i] =&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&amp;amp;c[i]), _mm512_load_ps(&amp;amp;c[i+16])), _mm512_load_ps(&amp;amp;c[i+32])),&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&amp;amp;c[i+4]), _mm512_load_ps(&amp;amp;c[i+20])), _mm512_load_ps(&amp;amp;c[i+36]))),&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&amp;amp;c[i+8]), _mm512_load_ps(&amp;amp;c[i+24])), _mm512_load_ps(&amp;amp;c[i+40])));&lt;BR /&gt;
	for(int i=4; i&amp;lt;8; ++i)&lt;BR /&gt;
	...&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Thu, 06 Apr 2017 19:13:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062334#M54627</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-04-06T19:13:00Z</dc:date>
    </item>
    <item>
      <title>Yes, exactly. That is what I</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062335#M54628</link>
      <description>&lt;P&gt;Yes, exactly. That is what I want to say.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;For each element of c, like c[0], it includes 16 floats. The data type of each element in c and t is __m512.The length of c is 64.&lt;/P&gt;

&lt;P&gt;The way I read this is you have&lt;/P&gt;

&lt;P&gt;__m512 c[64];&lt;BR /&gt;
	__m512 t[16];&lt;BR /&gt;
	...&lt;BR /&gt;
	for(int i=0; i&amp;lt;4; ++i)&lt;BR /&gt;
	&amp;nbsp; t[i] =&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&amp;amp;c[i]), _mm512_load_ps(&amp;amp;c[i+16])), _mm512_load_ps(&amp;amp;c[i+32])),&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&amp;amp;c[i+4]), _mm512_load_ps(&amp;amp;c[i+20])), _mm512_load_ps(&amp;amp;c[i+36]))),&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; _mm512_add_ps(_mm512_add_ps(_mm512_load_ps(&amp;amp;c[i+8]), _mm512_load_ps(&amp;amp;c[i+24])), _mm512_load_ps(&amp;amp;c[i+40])));&lt;BR /&gt;
	for(int i=4; i&amp;lt;8; ++i)&lt;BR /&gt;
	...&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Apr 2017 02:43:33 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062335#M54628</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-04-10T02:43:33Z</dc:date>
    </item>
    <item>
      <title>Hi Sergey,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062336#M54629</link>
      <description>&lt;P&gt;Hi Sergey,&lt;/P&gt;

&lt;P&gt;Yes, that is the question. On Xeon Phi, gather and scatter have long latency and low throughput, so I think they are unlikely to help. The logic of the program is simple: just load, calculate, and store. There is no if-else. So will multiple loops help?&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Sergey Kostrov wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;...So the question is that is there any algorithm that can efficiently decide when to load data, perform computation and&lt;BR /&gt;
	&amp;gt;&amp;gt;store the intermediate data?&lt;/P&gt;

&lt;P&gt;A &lt;STRONG&gt;gather&lt;/STRONG&gt;-like approach is possible in your case and this is because to calculate:&lt;BR /&gt;
	...&lt;BR /&gt;
	t[0] = (c[0]+c[16]+ c[32]) + (c[4]+c[20]+c[36]) +(c[8]+c[24]+c[40]);&lt;BR /&gt;
	...&lt;BR /&gt;
	indices &lt;STRONG&gt;0, 4, 8, 16, 20, 24, 32, 36, 40&lt;/STRONG&gt; need to be used to access elements of the array. I don't guarantee you a performance boost but there is no need to try to do what you want to do in a "one-shoot". It means, a performance boost could be achieved by refactoring of that code. For example, instead of one &lt;STRONG&gt;for-loop&lt;/STRONG&gt; three independent &lt;STRONG&gt;for-loops&lt;/STRONG&gt; could be used.&lt;/P&gt;

&lt;P&gt;Take a look at an article: &lt;STRONG&gt;What to do when Auto Vectorization fails?&lt;/STRONG&gt;&lt;BR /&gt;
	.&lt;BR /&gt;
	&lt;A href="https://software.intel.com/en-us/articles/what-to-do-when-auto-vectorization-fails"&gt;https://software.intel.com/en-us/articles/what-to-do-when-auto-vectoriza...&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Apr 2017 02:49:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062336#M54629</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-04-10T02:49:55Z</dc:date>
    </item>
    <item>
      <title>zhen jia,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062337#M54630</link>
      <description>&lt;P&gt;zhen jia,&lt;/P&gt;

&lt;P&gt;You should be able to write operator overloads such that you can remove the _mm512_...'s&lt;BR /&gt;
	IOW to use the original expressions as in your first post.&lt;/P&gt;

&lt;P&gt;Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice.&amp;nbsp; If you are NOT&amp;nbsp;using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.&lt;/P&gt;
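
&lt;P&gt;A minimal sketch of the operator-overload idea, not a definitive implementation. Vec16, splat, compute_t0 and compute_t4 are hypothetical names, and Vec16 is a portable 16-float stand-in for __m512 so that the example runs on any machine. On Xeon Phi the struct would presumably hold a __m512 and the operator bodies would call _mm512_add_ps and _mm512_sub_ps.&lt;/P&gt;

```cpp
// Portable stand-in for __m512 (an assumption for illustration): 16 floats.
// On AVX-512 hardware the struct would hold a __m512 instead, and the
// operators below would call _mm512_add_ps and _mm512_sub_ps.
struct Vec16 {
    float lane[16];
};

// Lane-wise addition, playing the role of _mm512_add_ps.
inline Vec16 operator+(Vec16 a, Vec16 b) {
    Vec16 r;
    for (int k = 0; k != 16; ++k) r.lane[k] = a.lane[k] + b.lane[k];
    return r;
}

// Lane-wise subtraction, playing the role of _mm512_sub_ps.
inline Vec16 operator-(Vec16 a, Vec16 b) {
    Vec16 r;
    for (int k = 0; k != 16; ++k) r.lane[k] = a.lane[k] - b.lane[k];
    return r;
}

// Hypothetical helper: copy one value into all 16 lanes.
inline Vec16 splat(float x) {
    Vec16 r;
    for (int k = 0; k != 16; ++k) r.lane[k] = x;
    return r;
}

// t[0] written with the same expression as in the first post.
inline Vec16 compute_t0(const Vec16* c) {
    return (c[0] + c[16] + c[32]) + (c[4] + c[20] + c[36]) + (c[8] + c[24] + c[40]);
}

// t[4] written with the same expression as in the first post.
inline Vec16 compute_t4(const Vec16* c) {
    return (c[4] + c[20] + c[36]) - (c[8] + c[24] + c[40]) - (c[12] + c[28] + c[44]);
}
```

&lt;P&gt;The point of the design is that the arithmetic stays readable while the choice of instruction is confined to the two operator bodies.&lt;/P&gt;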

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Mon, 10 Apr 2017 12:03:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062337#M54630</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-04-10T12:03:23Z</dc:date>
    </item>
    <item>
      <title>Hi Jim Dempsey,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062338#M54631</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;SPAN style="font-size: 13.008px; line-height: 17.7382px;"&gt;Jim Dempsey,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="line-height: 17.7382px;"&gt;By&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px; line-height: 17.7382px;"&gt;using the source data (in place) that went into c, do you mean that I can give c a length of, say, 48 (not 64) and reuse some of the elements?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;zhen jia,&lt;/P&gt;

&lt;P&gt;You should be able to write operator overloads such that you can remove the _mm512_...'s&lt;BR /&gt;
	IOW to use the original expressions as in your first post.&lt;/P&gt;

&lt;P&gt;Not stated, you may want to look at how you placed data into c[64]. Some elements are used once, others only twice.&amp;nbsp; If you are NOT&amp;nbsp;using c elsewhere, you may be able to eliminate the array c by using the source data (in place) that went into c.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 10 Apr 2017 13:55:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062338#M54631</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-04-10T13:55:02Z</dc:date>
    </item>
    <item>
      <title>It might be clearest to the</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062339#M54632</link>
      <description>&lt;P&gt;It might be clearest to the readers here if you show the code that populates the array c.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 11 Apr 2017 14:09:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062339#M54632</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-04-11T14:09:56Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...So will multiple loops</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062340#M54633</link>
      <description>&amp;gt;&amp;gt;...So will multiple loops help?

You have the code, so I think you could easily try that. I agree with Jim's point that a complete test case would help everybody involved in this discussion.</description>
      <pubDate>Wed, 12 Apr 2017 19:24:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062340#M54633</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-04-12T19:24:19Z</dc:date>
    </item>
    <item>
      <title>HI Jim and Sergey,</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062341#M54634</link>
      <description>&lt;P&gt;Hi Jim and Sergey,&lt;/P&gt;

&lt;P&gt;The current computation part of the code looks like this. It relies on the compiler to&amp;nbsp;optimize performance. I am wondering whether there is another way to optimize the code further.&lt;/P&gt;

&lt;P&gt;#define SIMD_ADD &amp;nbsp; &amp;nbsp;_mm512_add_ps&lt;BR /&gt;
	#define SIMD_SUB &amp;nbsp; &amp;nbsp;_mm512_sub_ps&lt;/P&gt;

&lt;P&gt;#define SIMD_LOAD &amp;nbsp; _mm512_load_ps&lt;BR /&gt;
	#define SIMD_STORE &amp;nbsp;_mm512_store_ps&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[0] &amp;nbsp;=SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[0],c[16]), c[32]) , SIMD_ADD(SIMD_ADD(c[4],c[20]) ,c[36])) ,SIMD_ADD(SIMD_ADD(c[8] ,c[24]),c[40]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[1] &amp;nbsp;=SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[1],c[17]), c[33]) , SIMD_ADD(SIMD_ADD(c[5],c[21]) ,c[37])) ,SIMD_ADD(SIMD_ADD(c[9] ,c[25]),c[41]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[2] &amp;nbsp;=SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[2],c[18]), c[34]) , SIMD_ADD(SIMD_ADD(c[6],c[22]) ,c[38])) ,SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[3] &amp;nbsp;=SIMD_ADD(SIMD_ADD(SIMD_ADD(SIMD_ADD(c[3],c[19]), c[35]) , SIMD_ADD(SIMD_ADD(c[7],c[23]) ,c[39])) ,SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43]));&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[4] &amp;nbsp;=SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[4],c[20]), c[36]) , SIMD_ADD(SIMD_ADD(c[8],c[24]) ,c[40])), SIMD_ADD(SIMD_ADD(c[12],c[28]),c[44]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[5] &amp;nbsp;=SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[5],c[21]), c[37]) , SIMD_ADD(SIMD_ADD(c[9],c[25]) ,c[41])), SIMD_ADD(SIMD_ADD(c[13],c[29]),c[45]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[6] &amp;nbsp;=SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[6],c[22]), c[38]) , SIMD_ADD(SIMD_ADD(c[10],c[26]),c[42])), SIMD_ADD(SIMD_ADD(c[14],c[30]),c[46]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[7] &amp;nbsp;=SIMD_SUB(SIMD_SUB(SIMD_ADD(SIMD_ADD(c[7],c[23]), c[39]) , SIMD_ADD(SIMD_ADD(c[11],c[27]),c[43])), SIMD_ADD(SIMD_ADD(c[15],c[31]),c[47]));&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[8] &amp;nbsp;=SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[16],c[32]), c[48]) , SIMD_SUB(SIMD_SUB(c[20],c[36]),c[52])) ,SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[9] &amp;nbsp;=SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[17],c[33]), c[49]) , SIMD_SUB(SIMD_SUB(c[21],c[37]),c[53])) ,SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[10] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[18],c[34]), c[50]) , SIMD_SUB(SIMD_SUB(c[22],c[38]),c[54])) ,SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[11] =SIMD_ADD(SIMD_ADD(SIMD_SUB(SIMD_SUB(c[19],c[35]), c[51]) , SIMD_SUB(SIMD_SUB(c[23],c[39]),c[55])) ,SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59]));&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[12] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[20],c[36]), c[52]), SIMD_SUB(SIMD_SUB(c[24],c[40]),c[56])), SIMD_SUB(SIMD_SUB(c[28],c[44]),c[60]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[13] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[21],c[37]), c[53]), SIMD_SUB(SIMD_SUB(c[25],c[41]),c[57])), SIMD_SUB(SIMD_SUB(c[29],c[45]),c[61]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[14] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[22],c[38]), c[54]), SIMD_SUB(SIMD_SUB(c[26],c[42]),c[58])), SIMD_SUB(SIMD_SUB(c[30],c[46]),c[62]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; t[15] =SIMD_SUB(SIMD_SUB(SIMD_SUB(SIMD_SUB(c[23],c[39]), c[55]), SIMD_SUB(SIMD_SUB(c[27],c[43]),c[59])), SIMD_SUB(SIMD_SUB(c[31],c[47]),c[63]));&lt;/P&gt;

&lt;P&gt;m[0] = SIMD_ADD(t[0],SIMD_ADD(t[1],t[2]));&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[1] = SIMD_SUB(SIMD_SUB(t[1],t[2]),t[3]);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[2] = SIMD_ADD(SIMD_ADD(t[4],t[5]),t[6]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[3] = SIMD_SUB(SIMD_SUB(t[5],t[6]),t[7]);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[4] = SIMD_ADD(SIMD_ADD(t[8],t[9]),t[10]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[5] = SIMD_SUB(SIMD_SUB(t[9],t[10]),t[11]);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[6] = SIMD_ADD(SIMD_ADD(t[12],t[13]),t[14]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; m[7] = SIMD_SUB(SIMD_SUB(t[13],t[14]),t[15]);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[0]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[1]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[2]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + dd *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[3]);&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[4]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + i*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[5]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + j*SIMD_WIDTH, m[6]);&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;SIMD_STORE(data + (dd+1) *oW*oH*SIMD_WIDTH + (i+1)*ldo*SIMD_WIDTH + (j+1)*SIMD_WIDTH, m[7]);&lt;/P&gt;
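
&lt;P&gt;One possible way to exploit the reuse, sketched under assumptions: several of the column sums c[k]+c[k+16]+c[k+32] and c[k+16]-c[k+32]-c[k+48] appear in more than one t[] expression above, so they can be computed once per column and shared. The sketch uses plain float in place of __m512 so that it runs anywhere (each + or - would become one SIMD_ADD or SIMD_SUB), and compute_t_shared is a hypothetical name. Written this way, each c[] element needs to be loaded only once, 96 additions and subtractions replace the 128 in the listing above, and about eight partial sums need to be live at a time. Whether this is actually faster on Xeon Phi would need to be measured.&lt;/P&gt;

```cpp
// Sketch only: float stands in for __m512 (an assumption so the example is
// portable). compute_t_shared is a hypothetical name, not from the thread.
// c must hold 64 values and t must have room for 16 results.
void compute_t_shared(const float* c, float* t) {
    for (int i = 0; i != 4; ++i) {
        float s[4];  // s[a] = c[k] + c[k+16] + c[k+32], with k = i + 4*a
        float d[4];  // d[a] = c[k+16] - c[k+32] - c[k+48], with k = i + 4*a
        for (int a = 0; a != 4; ++a) {
            const int k = i + 4 * a;
            s[a] = c[k] + c[k + 16] + c[k + 32];
            d[a] = c[k + 16] - c[k + 32] - c[k + 48];
        }
        // Combine the shared partial sums in the same order as the listing.
        t[i]      = s[0] + s[1] + s[2];
        t[i + 4]  = s[1] - s[2] - s[3];
        t[i + 8]  = d[0] + d[1] + d[2];
        t[i + 12] = d[1] - d[2] - d[3];
    }
}
```

&lt;P&gt;The operations are grouped in the same order as in the listing, so the results should match it term for term; only the redundant recomputation is removed.&lt;/P&gt;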

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2017 01:38:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062341#M54634</guid>
      <dc:creator>zhen_j_</dc:creator>
      <dc:date>2017-04-14T01:38:00Z</dc:date>
    </item>
    <item>
      <title>You did not answer the</title>
      <link>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062342#M54635</link>
      <description>&lt;P&gt;You did not answer the question: &lt;EM&gt;How did the data get placed into the array c[]?&lt;/EM&gt;&lt;BR /&gt;
	And a related, equally important, question: &lt;EM&gt;How often is the data, or portions of data, in c[] reused without being updated?&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;The response may indicate if the array c is needed. IOW if you should use the data directly&amp;nbsp;in its prior locations.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 14 Apr 2017 12:04:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Efficiently-use-vector-registers/m-p/1062342#M54635</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-04-14T12:04:00Z</dc:date>
    </item>
  </channel>
</rss>

