<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Several factors you haven't in Intel® Moderncode for Parallel Architectures</title>
    <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097782#M7280</link>
    <description>&lt;P&gt;Several factors you haven't addressed might enter into this comparison. It's quite likely that a "normal" dot product organization is the most efficient, particularly for larger problems with thread parallelism. Early implementations of Cilk(TM) Plus had a poor implementation of sum_reduction, and I still wouldn't bet on cilk_for when thread affinity is needed.&lt;/P&gt;</description>
    <pubDate>Wed, 20 Apr 2016 12:33:49 GMT</pubDate>
    <dc:creator>TimP</dc:creator>
    <dc:date>2016-04-20T12:33:49Z</dc:date>
    <item>
      <title>Question about performance of Intel cilk sample code</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097781#M7279</link>
      <description>&lt;P&gt;Dear all:&lt;/P&gt;

&lt;P&gt;When I looked at the Cilk sample code at "IntelSWTools\samples_2016\en\compiler_c\psxe\cilk.zip\matrix-multiply\matrix-multiply.cpp", I found these comments in the source code:&lt;/P&gt;

&lt;PRE style="font-family:NSimSun;font-size:13;color:black;background:white;"&gt;&amp;nbsp;&lt;SPAN style="color:green;"&gt;//&amp;nbsp;This&amp;nbsp;is&amp;nbsp;the&amp;nbsp;only&amp;nbsp;Intel(R)&amp;nbsp;Cilk(TM)&amp;nbsp;Plus&amp;nbsp;keyword&amp;nbsp;used&amp;nbsp;in&amp;nbsp;this&amp;nbsp;program&lt;/SPAN&gt;
		&lt;SPAN style="color:green;"&gt;//&amp;nbsp;Note&amp;nbsp;the&amp;nbsp;order&amp;nbsp;of&amp;nbsp;the&amp;nbsp;loops&amp;nbsp;and&amp;nbsp;the&amp;nbsp;code&amp;nbsp;motion&amp;nbsp;of&amp;nbsp;the&amp;nbsp;i&amp;nbsp;*&amp;nbsp;n&amp;nbsp;and&amp;nbsp;k&amp;nbsp;*&amp;nbsp;n&lt;/SPAN&gt;
		&lt;SPAN style="color:green;"&gt;//&amp;nbsp;computation.&amp;nbsp;This&amp;nbsp;gives&amp;nbsp;a&amp;nbsp;5-10&amp;nbsp;performance&amp;nbsp;improvment&amp;nbsp;over&amp;nbsp;exchanging&lt;/SPAN&gt;
		&lt;SPAN style="color:green;"&gt;//&amp;nbsp;the&amp;nbsp;j&amp;nbsp;and&amp;nbsp;k&amp;nbsp;loops.&lt;/SPAN&gt;
&lt;/PRE&gt;

&lt;P&gt;Why is that?&lt;/P&gt;

&lt;P&gt;I wrote some code without Cilk and exchanged the order of the j and k loops, but my test results are just the opposite: the function with the normal loop order performs better than the one with the exchanged order, roughly twice as fast.&lt;/P&gt;

&lt;P&gt;This confuses me. Can anybody help me understand why?&lt;/P&gt;

&lt;P&gt;Below is the code (without Cilk) that I wrote to test the influence of loop order:&lt;/P&gt;

&lt;PRE style="font-family:NSimSun;font-size:13;color:black;background:white;"&gt;&lt;SPAN style="color:blue;"&gt;void&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:#880000;"&gt;matrix_multiply_without_cilk_with_normal_loop_order&lt;/SPAN&gt;(&lt;SPAN style="color:blue;"&gt;double&lt;/SPAN&gt;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;A&lt;/SPAN&gt;,&amp;nbsp;&lt;SPAN style="color:blue;"&gt;double&lt;/SPAN&gt;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;B&lt;/SPAN&gt;,&amp;nbsp;&lt;SPAN style="color:blue;"&gt;double&lt;/SPAN&gt;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;C&lt;/SPAN&gt;,&amp;nbsp;&lt;SPAN style="color:blue;"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;)
{
	&lt;SPAN style="color:blue;"&gt;for&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;0;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;&amp;nbsp;&amp;lt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;&amp;nbsp;++&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;)&amp;nbsp;{
		&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;itn&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;&amp;nbsp;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;
		&lt;SPAN style="color:blue;"&gt;for&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;0;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;&amp;nbsp;&amp;lt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;&amp;nbsp;++&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;)&amp;nbsp;{
			&lt;SPAN style="color:blue;"&gt;for&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;0;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;&amp;nbsp;&amp;lt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;&amp;nbsp;++&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;)&amp;nbsp;{
				&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;ktn&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;&amp;nbsp;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;
				&lt;SPAN style="color:navy;"&gt;A&lt;/SPAN&gt;[&lt;SPAN style="color:navy;"&gt;itn&lt;/SPAN&gt;&amp;nbsp;+&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;]&amp;nbsp;+=&amp;nbsp;&lt;SPAN style="color:navy;"&gt;B&lt;/SPAN&gt;[&lt;SPAN style="color:navy;"&gt;itn&lt;/SPAN&gt;&amp;nbsp;+&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;]&amp;nbsp;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;C&lt;/SPAN&gt;[&lt;SPAN style="color:navy;"&gt;ktn&lt;/SPAN&gt;&amp;nbsp;+&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;];
			}
		}
	}
}
 
&lt;SPAN style="color:blue;"&gt;void&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:#880000;"&gt;matrix_multiply_without_cilk_with_exchanged_loop_order&lt;/SPAN&gt;(&lt;SPAN style="color:blue;"&gt;double&lt;/SPAN&gt;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;A&lt;/SPAN&gt;,&amp;nbsp;&lt;SPAN style="color:blue;"&gt;double&lt;/SPAN&gt;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;B&lt;/SPAN&gt;,&amp;nbsp;&lt;SPAN style="color:blue;"&gt;double&lt;/SPAN&gt;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;C&lt;/SPAN&gt;,&amp;nbsp;&lt;SPAN style="color:blue;"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;)
{
	&lt;SPAN style="color:blue;"&gt;for&lt;/SPAN&gt;(&lt;SPAN style="color:blue;"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;0;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;&amp;nbsp;&amp;lt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;&amp;nbsp;++&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;)&amp;nbsp;{
		&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;itn&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;&lt;SPAN style="color:navy;"&gt;i&lt;/SPAN&gt;&amp;nbsp;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;
		&lt;SPAN style="color:blue;"&gt;for&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN style="color:blue;"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;0;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;&amp;nbsp;&amp;lt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;&amp;nbsp;++&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;)&amp;nbsp;{
			&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;ktn&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;&amp;nbsp;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;
			&lt;SPAN style="color:blue;"&gt;for&lt;/SPAN&gt;&amp;nbsp;(&lt;SPAN style="color:blue;"&gt;unsigned&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:blue;"&gt;int&lt;/SPAN&gt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;&amp;nbsp;=&amp;nbsp;0;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;&amp;nbsp;&amp;lt;&amp;nbsp;&lt;SPAN style="color:navy;"&gt;n&lt;/SPAN&gt;;&amp;nbsp;++&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;)&amp;nbsp;{
				&lt;SPAN style="color:navy;"&gt;A&lt;/SPAN&gt;[&lt;SPAN style="color:navy;"&gt;itn&lt;/SPAN&gt;&amp;nbsp;+&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;]&amp;nbsp;+=&amp;nbsp;&lt;SPAN style="color:navy;"&gt;B&lt;/SPAN&gt;[&lt;SPAN style="color:navy;"&gt;itn&lt;/SPAN&gt;&amp;nbsp;+&amp;nbsp;&lt;SPAN style="color:navy;"&gt;k&lt;/SPAN&gt;]&amp;nbsp;*&amp;nbsp;&lt;SPAN style="color:navy;"&gt;C&lt;/SPAN&gt;[&lt;SPAN style="color:navy;"&gt;ktn&lt;/SPAN&gt;&amp;nbsp;+&amp;nbsp;&lt;SPAN style="color:navy;"&gt;j&lt;/SPAN&gt;];
			}
		}
	}
}&lt;/PRE&gt;</description>
      <pubDate>Wed, 20 Apr 2016 12:06:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097781#M7279</guid>
      <dc:creator>Raymond_S_</dc:creator>
      <dc:date>2016-04-20T12:06:13Z</dc:date>
    </item>
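    <!--
A minimal C sketch of the two loop orders compared in the question. Both functions use the same index type (size_t). The posted versions mix int and unsigned int counters, which is one possible confound when comparing them. In the i-k-j order, the inner j loop reads A and C with unit stride, and B[i*n+k] is loop-invariant. In the i-j-k order, the inner k loop reads C[k*n+j] with stride n. The names mm_ijk, mm_ikj and time_mm are hypothetical, and clock() gives only a rough single-run measurement. Real timings depend on the compiler, its options (which may interchange or vectorize these loops), and n.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* i-j-k order: A[i][j] += sum over k of B[i][k] * C[k][j].
   The inner k loop walks down column j of C (stride n). */
void mm_ijk(double *A, const double *B, const double *C, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        const size_t row = i * n;               /* i*n hoisted out of the inner loops */
        for (size_t j = 0; j < n; ++j) {
            double sum = A[row + j];            /* A[i][j] accumulated in a local */
            for (size_t k = 0; k < n; ++k)
                sum += B[row + k] * C[k * n + j];   /* C[k][j]: stride n */
            A[row + j] = sum;
        }
    }
}

/* i-k-j order (the question's "exchanged" order): same arithmetic,
   j loop innermost, so it walks along row i of A and row k of C (stride 1). */
void mm_ikj(double *A, const double *B, const double *C, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        const size_t row = i * n;
        for (size_t k = 0; k < n; ++k) {
            const size_t krow = k * n;          /* k*n hoisted out of the j loop */
            const double b = B[row + k];        /* B[i][k] is invariant in the j loop */
            for (size_t j = 0; j < n; ++j)
                A[row + j] += b * C[krow + j];  /* A[i][j], C[k][j]: stride 1 */
        }
    }
}

/* Rough CPU time in seconds for one call of f (single run, no warm-up). */
double time_mm(void (*f)(double *, const double *, const double *, size_t),
               double *A, const double *B, const double *C, size_t n)
{
    assert(f != NULL);
    clock_t start = clock();
    f(A, B, C, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A driver would allocate three n*n arrays and zero A before each run. It can then compare time_mm(mm_ijk, ...) with time_mm(mm_ikj, ...) over several values of n, including sizes whose matrices no longer fit in cache.
    -->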
    <item>
      <title>Several factors you haven't</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097782#M7280</link>
      <description>&lt;P&gt;Several factors you haven't addressed might enter into this comparison. It's quite likely that a "normal" dot product organization is the most efficient, particularly for larger problems with thread parallelism. Early implementations of Cilk(TM) Plus had a poor implementation of sum_reduction, and I still wouldn't bet on cilk_for when thread affinity is needed.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Apr 2016 12:33:49 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097782#M7280</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2016-04-20T12:33:49Z</dc:date>
    </item>
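    <!--
One reading of the "normal" dot product organization is the i-j-k order, in which each A[i][j] is a single dot product of row i of B and column j of C. A common way to keep that organization while avoiding stride-n access to C is to copy C into its transpose first. The inner k loop then reads both operands with unit stride. The transposition is an assumption added here and is not stated in the reply. The function name mm_dot_transposed is hypothetical. The inner loop is a sum reduction. Many compilers vectorize such a loop only when allowed to reorder floating-point additions, for example through an OpenMP simd reduction clause or relaxed floating-point options.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Dot-product (i-j-k) organization with a transposed copy of C:
   A[i][j] += dot(row i of B, row j of Ct), where Ct[j][k] == C[k][j].
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int mm_dot_transposed(double *A, const double *B, const double *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    if (n == 0)
        return 0;                               /* nothing to do */

    double *Ct = malloc(n * n * sizeof *Ct);
    if (Ct == NULL)
        return -1;

    for (size_t k = 0; k < n; ++k)              /* one-time O(n^2) transpose */
        for (size_t j = 0; j < n; ++j)
            Ct[j * n + k] = C[k * n + j];

    for (size_t i = 0; i < n; ++i) {
        const double *brow = B + i * n;         /* row i of B */
        for (size_t j = 0; j < n; ++j) {
            const double *ccol = Ct + j * n;    /* column j of C, now contiguous */
            double sum = 0.0;                   /* sum reduction over k */
            for (size_t k = 0; k < n; ++k)
                sum += brow[k] * ccol[k];       /* both operands: stride 1 */
            A[i * n + j] += sum;
        }
    }
    free(Ct);
    return 0;
}
```

The transpose costs O(n^2) extra work and n*n doubles of extra memory, which is small next to the O(n^3) multiply when n is large.
    -->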
    <item>
      <title>Please check out this article</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097783#M7281</link>
      <description>&lt;P&gt;Please check out this article: &lt;A href="https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-optimization-and-memory-part-1"&gt;https://software.intel.com/en-us/articles/putting-your-data-and-code-in-order-optimization-and-memory-part-1&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;It may help explain what's going on.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Apr 2016 16:48:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097783#M7281</guid>
      <dc:creator>MikeP_Intel</dc:creator>
      <dc:date>2016-04-20T16:48:50Z</dc:date>
    </item>
    <item>
      <title>What is the compiler version?</title>
      <link>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097784#M7282</link>
      <description>&lt;P&gt;What compiler version are you using? What are the optimization settings (and other compiler arguments)? How big is n?&lt;/P&gt;</description>
      <pubDate>Thu, 21 Apr 2016 05:53:04 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-Moderncode-for-Parallel/Question-about-performance-of-Intel-cilk-sample-code/m-p/1097784#M7282</guid>
      <dc:creator>Bradley_K_</dc:creator>
      <dc:date>2016-04-21T05:53:04Z</dc:date>
    </item>
  </channel>
</rss>

