<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic the variable I used for in Intel® ISA Extensions</title>
    <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010691#M4905</link>
    <description>&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 14.3999996185303px;"&gt;&amp;gt;&amp;gt;&amp;gt;the variable I used for assigning is a reference of image buffer&amp;gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 14.3999996185303px;"&gt;Maybe in your code there are some pointers or references which are dereferencing/referencing that image buffer?&lt;/SPAN&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 09 Dec 2014 20:46:00 GMT</pubDate>
    <dc:creator>Bernard</dc:creator>
    <dc:date>2014-12-09T20:46:00Z</dc:date>
    <item>
      <title>Huge time cost while assigning</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010682#M4896</link>
      <description>&lt;P&gt;Hello Guys:)&lt;/P&gt;

&lt;P&gt;It is very nice to have this forum. I'm new to the ISA extensions and would appreciate your insight :)&lt;/P&gt;

&lt;P&gt;My code snippet, which computes a convolution, is attached as a figure. Here is the issue that confuses me:&lt;/P&gt;

&lt;P&gt;A huge amount of time is spent when I assign the computed result to the image buffer. The computation with the extension instructions (lines 512~544) takes only about 7~8 ms, but the assignment (line 548) takes about 25~26 ms.&lt;/P&gt;

&lt;P&gt;What confuses me most is that assigning some other value to the image buffer, such as the loop control index (line 549) or another register (line 552), costs very little time. As soon as I assign the computed result (line 544) to the buffer, the time cost rises sharply. I tried several ways of doing the assignment (lines 547~552), and all of them cost a huge amount of time as well.&lt;/P&gt;

&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper" image-alt="QforIntel.PNG"&gt;&lt;img src="https://community.intel.com/t5/image/serverpage/image-id/10233iA308F7BB33DA3C51/image-size/large?v=v2&amp;amp;px=999&amp;amp;whitelist-exif-data=Orientation%2CResolution%2COriginalDefaultFinalSize%2CCopyright" role="button" title="QforIntel.PNG" alt="QforIntel.PNG" /&gt;&lt;/span&gt;&lt;/P&gt;

&lt;P&gt;My env. info:&lt;/P&gt;

&lt;P&gt;compiler:&amp;nbsp;icpc version 12.1.0 (gcc version 4.4.5 compatibility)&lt;/P&gt;

&lt;P&gt;OS:&amp;nbsp;Linux version 2.6.32-220.4.1.el6.x86_64&lt;/P&gt;

&lt;P&gt;If anything in the description is unclear, please let me know. Again, thanks a lot in advance!&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2014 02:12:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010682#M4896</guid>
      <dc:creator>Xinjue_Z_</dc:creator>
      <dc:date>2014-12-04T02:12:05Z</dc:date>
    </item>
    <item>
      <title>Do you have VTune profiler</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010683#M4897</link>
      <description>&lt;P&gt;Do you have VTune profiler installed?&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2014 09:34:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010683#M4897</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-12-04T09:34:14Z</dc:date>
    </item>
    <item>
      <title>Hello Xinjue Z.,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010684#M4898</link>
      <description>&lt;P&gt;Hello&amp;nbsp;&lt;A href="https://software.intel.com/en-us/user/1114242"&gt;Xinjue Z.&lt;/A&gt;,&lt;BR /&gt;
	&lt;BR /&gt;
	You could use either&lt;BR /&gt;
	*pTmp = _mm_extract_epi16(resI, 0);&lt;BR /&gt;
	&lt;BR /&gt;
	or even better&lt;BR /&gt;
	_mm_stream_si32((int *)pTmp,&amp;nbsp;_mm_extract_epi16(resI, 0));&lt;BR /&gt;
	&lt;BR /&gt;
	Don't forget that&amp;nbsp;_mm_stream_si32() stores 4 bytes, and call&amp;nbsp;_mm_mfence() at the end.&lt;/P&gt;</description>
      <pubDate>Thu, 04 Dec 2014 12:39:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010684#M4898</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-12-04T12:39:49Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010685#M4899</link>
      <description>&lt;P&gt;I would check the disassembly of lines 544 and 548 and post it here. Do you have any kind of store-forwarding stalls? Does your code operate on the same buffer?&lt;/P&gt;</description>
      <pubDate>Sat, 06 Dec 2014 09:09:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010685#M4899</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-12-06T09:09:48Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010686#M4900</link>
      <description>&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;

&lt;P&gt;Do you have VTune profiler installed?&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Hello&amp;nbsp;iliyapolak,&lt;/P&gt;

&lt;P&gt;No, I haven't bought the tool. I'll check whether there is a trial version. Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 08 Dec 2014 02:42:19 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010686#M4900</guid>
      <dc:creator>Xinjue_Z_</dc:creator>
      <dc:date>2014-12-08T02:42:19Z</dc:date>
    </item>
    <item>
      <title>Quote:Vladimir Sedach wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010687#M4901</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Vladimir Sedach wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hello&amp;nbsp;&lt;A href="https://software.intel.com/en-us/user/1114242"&gt;Xinjue Z.&lt;/A&gt;,&lt;/P&gt;

&lt;P&gt;You could use either&lt;BR /&gt;
	*pTmp = _mm_extract_epi16(resI, 0);&lt;/P&gt;

&lt;P&gt;or even better&lt;BR /&gt;
	_mm_stream_si32((int *)pTmp,&amp;nbsp;_mm_extract_epi16(resI, 0));&lt;/P&gt;

&lt;P&gt;Don't forget that&amp;nbsp;_mm_stream_si32() stores 4 bytes, and call&amp;nbsp;_mm_mfence() at the end.&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Hello&amp;nbsp;Vladimir Sedach,&lt;/P&gt;

&lt;P&gt;Thank you very much for the information. I think I left out some details and pointed you in the wrong direction, but I still learned a couple of new instructions :)&lt;/P&gt;

&lt;P&gt;My compiler has O3 optimization enabled. The assignments on lines 549 and 552 do not use the computed result, so it seems the computation (lines 512~544) was optimized away. Is this possible?&lt;/P&gt;

&lt;P&gt;This guess is based on another experiment: with the computation on lines 531~538 commented out, the time cost is about 19 ms.&lt;/P&gt;</description>
      <pubDate>Mon, 08 Dec 2014 03:03:57 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010687#M4901</guid>
      <dc:creator>Xinjue_Z_</dc:creator>
      <dc:date>2014-12-08T03:03:57Z</dc:date>
    </item>
    <item>
      <title>Quote:iliyapolak wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010688#M4902</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;iliyapolak wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;I would try to check disassembly of LOC #544 and LOC #548 and post it here. Do you have any kind of Store-Forwarding Stalls? Does your code operates on the same buffer?&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hello&amp;nbsp;&lt;SPAN style="line-height: 19.5120010375977px;"&gt;iliyapolak again:)&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Yes, I agree. Checking the disassembly will have some leads. I'll do so and come to you. For the Store-Forwarding Stalls, the variable I used for assigning is a reference of image buffer. Will the reference cause write &amp;amp; read issue? Anyway I'm on my way to check the disassembly.&lt;/P&gt;

&lt;P&gt;Again thanks a lot!&lt;/P&gt;</description>
      <pubDate>Mon, 08 Dec 2014 03:34:53 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010688#M4902</guid>
      <dc:creator>Xinjue_Z_</dc:creator>
      <dc:date>2014-12-08T03:34:53Z</dc:date>
    </item>
    <item>
      <title>Xinjue Z.,</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010689#M4903</link>
      <description>&lt;P&gt;Xinjue Z.,&lt;BR /&gt;
	&lt;BR /&gt;
	&amp;gt;&amp;nbsp;My compiler is enabled with O3 level optimization. The ways of assigning(line 549 and 552) didn't use computing results,&lt;BR /&gt;
	&amp;gt; then it seems the computing sets(line 512~544) were optimized. Is this possible?&lt;BR /&gt;
	&lt;BR /&gt;
	Not just possible, it is true: the compiler omits statements whose results are unused.&lt;BR /&gt;
	&lt;BR /&gt;
	To speed it up a bit more, you might use _mm_stream_si128(): combine eight 16-bit results in one __m128i register and "stream" it to memory.&lt;BR /&gt;
	&lt;BR /&gt;
	The huge time could be due either to cache issues or just to a timing miscalculation :)&lt;BR /&gt;
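	&lt;BR /&gt;
	A minimal sketch of the _mm_stream_si128() idea, under stated assumptions: compute_pixel() is a hypothetical stand-in for the per-pixel convolution result, the destination is 16-byte aligned, and the element count is a multiple of 8. None of these names come from the original snippet. Since only stores need ordering here, the sketch ends with _mm_sfence(); _mm_mfence(), as suggested earlier in the thread, would also work.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;
#include &lt;emmintrin.h&gt;  // SSE2: _mm_set_epi16, _mm_stream_si128 (also provides _mm_sfence)
#include &lt;cstddef&gt;
#include &lt;cstdint&gt;

// Hypothetical stand-in for one 16-bit convolution result.
static inline std::int16_t compute_pixel(std::size_t i) {
    return static_cast&lt;std::int16_t&gt;(i * 3 + 1);
}

// Assumes dst is 16-byte aligned and n is a multiple of 8.
void fill_streamed(std::int16_t* dst, std::size_t n) {
    for (std::size_t i = 0; i &lt; n; i += 8) {
        // _mm_set_epi16 takes the highest lane first, so lane 0 is the last argument.
        const __m128i v = _mm_set_epi16(
            compute_pixel(i + 7), compute_pixel(i + 6),
            compute_pixel(i + 5), compute_pixel(i + 4),
            compute_pixel(i + 3), compute_pixel(i + 2),
            compute_pixel(i + 1), compute_pixel(i));
        // One non-temporal 16-byte store replaces eight 2-byte stores.
        _mm_stream_si128(reinterpret_cast&lt;__m128i*&gt;(dst + i), v);
    }
    _mm_sfence();  // order the streaming stores before later stores
}
&lt;/PRE&gt;

&lt;P&gt;Whether this helps depends on the access pattern: streaming stores bypass the cache, so they tend to pay off only when the output is not read again soon.&lt;BR /&gt;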
	&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 08 Dec 2014 14:48:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010689#M4903</guid>
      <dc:creator>Vladimir_Sedach</dc:creator>
      <dc:date>2014-12-08T14:48:45Z</dc:date>
    </item>
    <item>
      <title>Quote:Xinjue Z. wrote:</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010690#M4904</link>
      <description>&lt;BLOCKQUOTE&gt;Xinjue Z. wrote:&lt;BR /&gt;

&lt;BLOCKQUOTE class="quote-msg quote-nest-1 odd"&gt;
	&lt;DIV class="quote-author"&gt;&lt;EM class="placeholder"&gt;iliyapolak&lt;/EM&gt; wrote:&lt;/DIV&gt;

	&lt;P&gt;Do you have VTune profiler installed?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Hello&amp;nbsp;iliyapolak,&lt;/P&gt;

&lt;P&gt;No. I didn't buy the tool, I'll check if there is trial version. Thanks!&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;You can download and use the trial version.&lt;/P&gt;</description>
      <pubDate>Tue, 09 Dec 2014 20:29:31 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010690#M4904</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-12-09T20:29:31Z</dc:date>
    </item>
    <item>
      <title>the variable I used for</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010691#M4905</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 14.3999996185303px;"&gt;&amp;gt;&amp;gt;&amp;gt;the variable I used for assigning is a reference of image buffer&amp;gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 14.3999996185303px;"&gt;Maybe in your code there are some pointers or references which are dereferencing/referencing that image buffer?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 09 Dec 2014 20:46:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010691#M4905</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2014-12-09T20:46:00Z</dc:date>
    </item>
    <item>
      <title>Once you figure out the</title>
      <link>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010692#M4906</link>
      <description>&lt;P&gt;Once you figure out the memory issue, you might consider tweaking the code a little:&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;
// remove
// __m128 res = _mm_setzero_ps();
...
// change
// Dot B and C
row0 = _mm_dp_ps(row0, rC, 0xf1);
row1 = _mm_dp_ps(row1, rC, 0xf2);
row2 = _mm_dp_ps(row2, rC, 0xf4);
__m128 res = _mm_add_ps(row0, row1);
row3 = _mm_dp_ps(row3, rC, 0xf8);
res = _mm_add_ps(res, row2);
res = _mm_add_ps(res, row3);

// Dot A and BC
...&lt;/PRE&gt;

&lt;P&gt;The above saves two instructions and attempts to overlap the adds with the multiply. You might get a few clock cycles back.&lt;/P&gt;
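
&lt;P&gt;To illustrate why each row uses a different mask, here is a minimal self-contained sketch. The helper name mat4_mul_vec and the row-major 4x4 layout are assumptions for illustration, not part of the original snippet. In _mm_dp_ps the high nibble of the mask selects which products are summed and the low nibble selects which output lane receives the sum, so 0xf1, 0xf2, 0xf4 and 0xf8 place the four dot products in lanes 0 to 3 and plain adds can merge them.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;
#include &lt;smmintrin.h&gt;  // SSE4.1: _mm_dp_ps

// Hypothetical helper: out[r] = dot(row r of m, c) for a row-major 4x4 matrix m.
#if defined(__GNUC__)
__attribute__((target("sse4.1")))  // lets GCC/Clang build this without -msse4.1
#endif
void mat4_mul_vec(const float* m, const float* c, float* out) {
    const __m128 rC = _mm_loadu_ps(c);
    // High nibble 0xf: multiply and sum all four lanes.
    // Low nibble: which output lane receives the sum (the others are zeroed).
    const __m128 row0 = _mm_dp_ps(_mm_loadu_ps(m + 0), rC, 0xf1);   // lane 0
    const __m128 row1 = _mm_dp_ps(_mm_loadu_ps(m + 4), rC, 0xf2);   // lane 1
    const __m128 row2 = _mm_dp_ps(_mm_loadu_ps(m + 8), rC, 0xf4);   // lane 2
    __m128 res = _mm_add_ps(row0, row1);
    const __m128 row3 = _mm_dp_ps(_mm_loadu_ps(m + 12), rC, 0xf8);  // lane 3
    res = _mm_add_ps(res, row2);
    res = _mm_add_ps(res, row3);
    _mm_storeu_ps(out, res);  // out = {dot0, dot1, dot2, dot3}
}
&lt;/PRE&gt;

&lt;P&gt;Because the four results land in disjoint lanes, no shuffles are needed to assemble the output vector. This sketch needs a CPU with SSE4.1.&lt;/P&gt;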

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 10 Dec 2014 13:57:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-ISA-Extensions/Huge-time-cost-while-assigning/m-p/1010692#M4906</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2014-12-10T13:57:09Z</dc:date>
    </item>
  </channel>
</rss>

