<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Tim/illyapolak, in Software Tuning, Performance Optimization &amp; Platform Monitoring</title>
    <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061137#M5165</link>
    <description>&lt;P&gt;Hi Tim/illyapolak,&lt;/P&gt;

&lt;P&gt;Thanks a lot for your help.&lt;/P&gt;

&lt;P&gt;The Back-End Bound value is 0.785 when the dataCopy size is 1200, a little higher than for dataCopy = 600.&lt;/P&gt;

&lt;P&gt;Recently I tried OpenMP. With it enabled, the cycle count scales roughly linearly relative to dataCopy = 600: it is 359 for dataCopy = 1200, versus 187 for dataCopy = 600.&lt;/P&gt;

&lt;P&gt;Strangely, though, I get the report below for the OpenMP run. It looks like performance is still poor?&lt;/P&gt;

&lt;P&gt;CPI Rate: 1.162&lt;BR /&gt;
	Back-End Bound: 1.0&lt;BR /&gt;
	Memory Bandwidth: 0.56&lt;BR /&gt;
	Memory Latency: 0.322&lt;BR /&gt;
	Store Bound: 0.275&lt;BR /&gt;
	Cycles of 0 Ports Utilized: 0.27&lt;BR /&gt;
	Cycles of 1 Port Utilized: 0.168&lt;BR /&gt;
	Cycles of 2 Ports Utilized: 0.392&lt;BR /&gt;
	Cycles of 3 Ports Utilized: 0.224&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
    <pubDate>Tue, 07 Apr 2015 04:01:12 GMT</pubDate>
    <dc:creator>Wei_Z_Intel</dc:creator>
    <dc:date>2015-04-07T04:01:12Z</dc:date>
    <item>
      <title>an issue on performance optimization by Intel compiler</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061126#M5154</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I am learning to use Intel C++ Compiler XE 15.0 integrated with VS 2013, and I wrote the simple example below to look into its performance.&lt;/P&gt;

&lt;P&gt;void dataCopy(float *codeWord0Ptr, float *codeWord1Ptr, int numDataCopy, float *outputPtr)&lt;BR /&gt;
	{&lt;BR /&gt;
	&amp;nbsp; &amp;nbsp; float *outputPtr1 = &amp;amp;outputPtr[numDataCopy];&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__assume_aligned(codeWord0Ptr, 64);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__assume_aligned(codeWord1Ptr, 64);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__assume_aligned(outputPtr, 64);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;__assume_aligned(outputPtr1, 64);&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;#pragma ivdep&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;#pragma vector aligned&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;for (int idxData = 0; idxData &amp;lt; numDataCopy; idxData++)&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;{&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;outputPtr[idxData] = codeWord0Ptr[idxData];&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;&amp;nbsp;&amp;nbsp; &amp;nbsp;outputPtr1[idxData] = codeWord1Ptr[idxData];&lt;BR /&gt;
	&amp;nbsp;&amp;nbsp; &amp;nbsp;}&lt;BR /&gt;
	}&lt;/P&gt;

&lt;P&gt;I built in Release x64 mode and enabled the related optimization settings (AVX etc.) in the project properties.&lt;/P&gt;

&lt;P&gt;I also enabled the optimization report in the project properties, and it reports that the loop was vectorized.&lt;/P&gt;

&lt;P&gt;When I run it on my host PC (the CPU is an i5-3320M) and profile the function dataCopy, I see some odd behavior:&lt;BR /&gt;
	When numDataCopy = 300, it takes around 270 cycles, which looks reasonable.&lt;BR /&gt;
	When numDataCopy = 600, it takes around 530 cycles, which looks reasonable.&lt;BR /&gt;
	When numDataCopy = 800, it takes around 780 cycles, which looks reasonable too.&lt;BR /&gt;
	But when numDataCopy = 1200, it takes around 3100 cycles, around 6 times as many as for numDataCopy = 600.&lt;/P&gt;

&lt;P&gt;I tried using VTune to look into the reasons.&lt;/P&gt;

&lt;P&gt;When numDataCopy=1200, VTune gives the summary report below:&lt;BR /&gt;
	CPI rate: 0.933&lt;BR /&gt;
	L1 Bound: 0.264&lt;BR /&gt;
	Store Bound: 0.201&lt;BR /&gt;
	cycles of 0 Ports Utilized: 0.429&lt;BR /&gt;
	cycles of 1 Port Utilized: 0.265&lt;BR /&gt;
	cycles of 2 Ports Utilized: 0.107&lt;BR /&gt;
	cycles of 3 Ports Utilized: 0.159&lt;/P&gt;

&lt;P&gt;When numDataCopy=600, VTune gives the summary report below:&lt;BR /&gt;
	CPI rate: 0.348&lt;BR /&gt;
	Back-End Bound: 0.709&lt;BR /&gt;
	L1 Bound: 0&lt;BR /&gt;
	Store Bound: 0&lt;BR /&gt;
	cycles of 0 Ports Utilized: 0&lt;BR /&gt;
	cycles of 1 Port Utilized: 0.417&lt;BR /&gt;
	cycles of 2 Ports Utilized: 0.073&lt;BR /&gt;
	cycles of 3 Ports Utilized: 0.943&lt;/P&gt;

&lt;P&gt;It looks like when numDataCopy=1200 there are L1 Bound and Store Bound issues, port utilization is much lower, and the CPI rate increases a lot.&lt;/P&gt;

&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Can you tell me what the reason is for this?&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
      <pubDate>Thu, 26 Mar 2015 12:45:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061126#M5154</guid>
      <dc:creator>Wei_Z_Intel</dc:creator>
      <dc:date>2015-03-26T12:45:01Z</dc:date>
    </item>
    <item>
      <title>As you appear to incur</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061127#M5155</link>
      <description>&lt;P&gt;As you appear to incur performance issues when your data set spans multiple small pages, you may want to look into whether transparent huge pages or explicit prefetch could help.&lt;/P&gt;

&lt;P&gt;Are you seeing streaming stores e.g. in optreport?&lt;/P&gt;</description>
      <pubDate>Thu, 26 Mar 2015 14:10:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061127#M5155</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-03-26T14:10:00Z</dc:date>
    </item>
    <item>
      <title>Hi Tim,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061128#M5156</link>
      <description>&lt;P&gt;Hi Tim,&lt;/P&gt;

&lt;P&gt;Thanks a lot for the quick reply.&lt;/P&gt;

&lt;P&gt;What do you mean by streaming stores? I only see the report below, with no information on streaming stores. I declared the buffers as aligned with __assume_aligned, but it looks like the compiler still reports unaligned access.&lt;/P&gt;

&lt;P&gt;remark #15389: vectorization support: reference outputPtr has unaligned access&lt;BR /&gt;
	remark #15389: vectorization support: reference codeWord0Ptr has unaligned access&lt;BR /&gt;
	remark #15389: vectorization support: reference outputPtr1 has unaligned access&lt;BR /&gt;
	remark #15389: vectorization support: reference codeWord1Ptr has unaligned access&lt;BR /&gt;
	remark #15381: vectorization support: unaligned access used inside loop body&lt;BR /&gt;
	remark #15300: LOOP WAS VECTORIZED&lt;BR /&gt;
	remark #15448: unmasked aligned unit stride loads: 2&lt;BR /&gt;
	remark #15449: unmasked aligned unit stride stores: 2&lt;BR /&gt;
	remark #15475: --- begin vector loop cost summary ---&lt;BR /&gt;
	remark #15476: scalar loop cost: 13&lt;BR /&gt;
	remark #15477: vector loop cost: 1.500&lt;BR /&gt;
	remark #15478: estimated potential speedup: 8.660&lt;BR /&gt;
	remark #15479: lightweight vector operations: 6&lt;BR /&gt;
	remark #15488: --- end vector loop cost summary ---&lt;/P&gt;

&lt;P&gt;I tried #pragma prefetch previously and it did not work; I am not sure whether that is what you mean by explicit prefetch. What do you mean by transparent huge pages? Could you tell me how to use them?&lt;/P&gt;

&lt;P&gt;Thanks a lot&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
      <pubDate>Thu, 26 Mar 2015 15:37:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061128#M5156</guid>
      <dc:creator>Wei_Z_Intel</dc:creator>
      <dc:date>2015-03-26T15:37:26Z</dc:date>
    </item>
    <item>
      <title>my comment about huge pages</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061129#M5157</link>
      <description>&lt;P&gt;My comment about huge pages is more applicable to Linux.&lt;/P&gt;

&lt;P&gt;If the compiler option opt-streaming-stores is not taking effect, you might try #pragma vector aligned nontemporal.&lt;/P&gt;

&lt;P&gt;It's difficult to optimize software prefetch with either intrinsics or pragmas; it probably needs tinkering with unroll and prefetch distance.&lt;/P&gt;</description>
      <pubDate>Thu, 26 Mar 2015 16:03:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061129#M5157</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-03-26T16:03:16Z</dc:date>
    </item>
    <item>
      <title>If your code is accessing an</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061130#M5158</link>
      <description>&lt;P&gt;If your code accesses an array in a linear manner (put differently, if the array index calculation is linear), then software prefetching should be effective. As Tim said, you must find the exact prefetch distance.&lt;/P&gt;
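
&lt;P&gt;To make the idea of a prefetch distance concrete, here is a minimal sketch of a copy loop that requests data a fixed number of elements ahead of the element being copied. It assumes a GCC or Clang toolchain, since it uses the __builtin_prefetch builtin rather than the Intel #pragma prefetch; the function name copy_with_prefetch and the parameter dist are made up for illustration, and a useful value of dist has to be found by measurement.&lt;/P&gt;

```c
/* Copy n floats from src to dst, requesting a prefetch of the source
   element that lies `dist` elements ahead of the one being copied.
   Assumption: __builtin_prefetch is a GCC/Clang builtin; other compilers
   need their own intrinsic or pragma. copy_with_prefetch and dist are
   hypothetical names used only for this sketch. */
void copy_with_prefetch(float *dst, const float *src, int n, int dist)
{
    int i;
    for (i = 0; n > i; i++) {
        /* Prefetch only while the target address stays inside the array.
           Simplification: one request per cache line would be enough,
           this sketch issues one per element. */
        if (n - i > dist)
            __builtin_prefetch(src + i + dist, 0, 3); /* 0 = read, 3 = keep in all cache levels */
        dst[i] = src[i];
    }
}
```

&lt;P&gt;A call such as copy_with_prefetch(outputPtr, codeWord0Ptr, numDataCopy, 64) would ask for data four cache lines ahead, assuming 4-byte floats and 64-byte cache lines; whether that helps at all depends on the CPU and on its hardware prefetchers.&lt;/P&gt;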

&lt;P&gt;For streaming stores, you may want to read the following link:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://blogs.fau.de/hager/archives/2103" target="_blank"&gt;https://blogs.fau.de/hager/archives/2103&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 26 Mar 2015 18:14:30 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061130#M5158</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-03-26T18:14:30Z</dc:date>
    </item>
    <item>
      <title>Hi Tim/iliyapolak,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061131#M5159</link>
      <description>&lt;P&gt;Hi Tim/iliyapolak,&lt;/P&gt;

&lt;P&gt;I'm just back from other work and have read your suggestions. Where do I set opt-streaming-stores in the project settings? I can't find it. I tried #pragma vector aligned nontemporal; it masked the #pragma vector aligned, which made performance worse even for numDataCopy=600.&lt;/P&gt;

&lt;P&gt;What do you mean by prefetch distance?&lt;/P&gt;

&lt;P&gt;From the link &lt;A href="https://blogs.fau.de/hager/archives/2103"&gt;https://blogs.fau.de/hager/archives/2103&lt;/A&gt;, it looks like NT stores have a more obvious effect when N is smaller, so I should see the same effect with streaming stores for numDataCopy=1200.&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
      <pubDate>Mon, 30 Mar 2015 14:38:23 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061131#M5159</guid>
      <dc:creator>Wei_Z_Intel</dc:creator>
      <dc:date>2015-03-30T14:38:23Z</dc:date>
    </item>
    <item>
      <title>It might help if you would</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061132#M5160</link>
      <description>&lt;P&gt;It might help to check the compiler documentation. The default setting is /Qopt-streaming-stores:auto, meaning that the compiler decides whether to use nontemporal streaming stores according to the expected loop count and whether it can see multiple accesses. In your example, as there are 2 arrays stored, if the compiler doesn't heed your alignment assertions, it can use streaming stores for only one of them. You could set /Qopt-streaming-stores:always in your additional command line options, in which case the compiler will use streaming stores as much as possible (still subject to observing alignment assertions).&lt;/P&gt;

&lt;P&gt;If you are seeing worse performance with #pragma vector aligned nontemporal it means that your application is benefiting from keeping the stored arrays in cache, and probably that it is in fact observing alignment, as you could check in the compiler reports.&amp;nbsp; Also, if the compiler is seeing a reason for not using streaming stores with the auto setting, it is doing the right thing.&lt;/P&gt;

&lt;P&gt;When your report shows both aligned and unaligned loads for the same array, it leads to suspicion it is not observing the alignment assertions, but pragma vector aligned will require alignment (except possibly if you have set AVX code generation; if you didn't set this, or QxHost, why not?).&amp;nbsp; The important thing is that the accesses inside your vectorized loop are aligned.&lt;/P&gt;
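
&lt;P&gt;One way to sanity-check an alignment assertion on a pointer derived from an aligned base, such as outputPtr1 in the original post (which points numDataCopy floats past outputPtr), is to look at the byte offset between the two. The sketch below assumes 4-byte floats and an aligned base pointer; the helper name offset_keeps_alignment is hypothetical and not part of the original code.&lt;/P&gt;

```c
/* Returns 1 if (base + num_floats) is still align_bytes-aligned, given
   that base itself is align_bytes-aligned, and 0 otherwise.
   Hypothetical helper for illustration only. */
int offset_keeps_alignment(int num_floats, int align_bytes)
{
    /* Byte distance between the base pointer and the derived pointer. */
    unsigned long offset_bytes = (unsigned long)num_floats * sizeof(float);
    return offset_bytes % (unsigned long)align_bytes == 0;
}
```

&lt;P&gt;With 4-byte floats, offset_keeps_alignment(800, 64) and offset_keeps_alignment(1200, 64) return 1, while the same call with 300 or 600 returns 0, so for those sizes a 64-byte alignment assertion on the second output pointer would not hold.&lt;/P&gt;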

&lt;P&gt;If you look at the prefetch examples in &lt;A href="https://software.intel.com/en-us/node/511958" target="_blank"&gt;https://software.intel.com/en-us/node/511958&lt;/A&gt; you will see that you must specify an array element some distance (probably multiple cache lines) ahead of where your code is working.&amp;nbsp; It would do little good to prefetch in the currently active cache line.&amp;nbsp; At the other extreme, with a large prefetch distance, you could be accessing data beyond the end of your loop or data which can't remain long enough in cache for your loop to reach them.&amp;nbsp; As you are running on an out-of-order processor, it's guesswork as to the extent to which prefetches and data loads will get reordered.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Mar 2015 15:46:50 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061132#M5160</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-03-30T15:46:50Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;&gt;  What do you mean by</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061133#M5161</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;&amp;gt;&amp;gt;&amp;gt; &amp;nbsp;What do you mean by prefetch distance?&amp;gt;&amp;gt;&amp;gt;&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;I think that Tim explained this pretty well.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Mar 2015 15:59:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061133#M5161</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-03-30T15:59:22Z</dc:date>
    </item>
    <item>
      <title>Hi Tim/iliyapolak,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061134#M5162</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px; line-height: 18px;"&gt;Hi Tim/iliyapolak,&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Thanks a lot for your clear explanation, it helps a lot.&lt;/P&gt;

&lt;P&gt;I also tried adding /Qopt-streaming-stores to the command line options, but unfortunately it did not improve anything. By the way, with the Intel C++ compiler enabled in the VS environment, adding /Qopt-streaming-stores sometimes produces the compile errors below and sometimes does not. Is that expected?&lt;/P&gt;

&lt;P&gt;Error&amp;nbsp;&amp;nbsp; &amp;nbsp;5&amp;nbsp;&amp;nbsp; &amp;nbsp;error #10037: could not find 'llvm_com' &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	Error&amp;nbsp;&amp;nbsp; &amp;nbsp;6&amp;nbsp;&amp;nbsp; &amp;nbsp;error #10014: problem during multi-file optimization compilation (code -1) &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;BR /&gt;
	Error&amp;nbsp;&amp;nbsp; &amp;nbsp;7&amp;nbsp;&amp;nbsp; &amp;nbsp;error #10014: problem during multi-file optimization compilation (code -1) &amp;nbsp;&lt;/P&gt;

&lt;P&gt;I tried some prefetch distances as in the example below, but could not find a distance value that makes it work. I still need to look into it.&lt;/P&gt;

&lt;P&gt;#pragma prefetch codeWord0Ptr:1:600 // use _MM_HINT_T1, since it's floating data copy&lt;BR /&gt;
	#pragma prefetch codeWord1Ptr:1:600 // use _MM_HINT_T1, since it's floating data copy&lt;BR /&gt;
	#pragma prefetch outputPtr:1:600 // use _MM_HINT_T1, since it's floating data copy&lt;BR /&gt;
	#pragma prefetch outputPtr1:1:600 // use _MM_HINT_T1, since it's floating data copy&lt;/P&gt;

&lt;P&gt;In the VTune profiling below, cycles of 3 Ports Utilized is 0.159 for numDataCopy=1200, much lower than for numDataCopy=600. It looks like a port resource issue. Can we presume it is caused by the latency behind the L1 Bound/Store Bound issue?&lt;/P&gt;

&lt;P&gt;When numDataCopy=1200:&lt;BR /&gt;
	cycles of 1 Port Utilized: 0.265&lt;BR /&gt;
	cycles of 2 Ports Utilized: 0.107&lt;BR /&gt;
	cycles of 3 Ports Utilized: 0.159&lt;/P&gt;

&lt;P&gt;When numDataCopy=600:&lt;BR /&gt;
	cycles of 0 Ports Utilized: 0&lt;BR /&gt;
	cycles of 1 Port Utilized: 0.417&lt;BR /&gt;
	cycles of 2 Ports Utilized: 0.073&lt;BR /&gt;
	cycles of 3 Ports Utilized: 0.943&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2015 16:07:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061134#M5162</guid>
      <dc:creator>Wei_Z_Intel</dc:creator>
      <dc:date>2015-03-31T16:07:24Z</dc:date>
    </item>
    <item>
      <title>You would expect adding opt</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061135#M5163</link>
      <description>&lt;P&gt;You would expect adding opt-streaming-stores to the options to make a difference only for the case /Qopt-streaming-stores:always which ought to replicate your findings with #pragma vector aligned nontemporal.&amp;nbsp; I don't know what the compiler will do when you omit the argument to streaming-stores.&amp;nbsp; I've used streaming-stores:always along with profiling to find out where to add pragma vector nontemporal.&lt;/P&gt;

&lt;P&gt;In view of the apparent association of your performance issue with page crossing, DTLB events might be interesting for further confirmation.&amp;nbsp; A prefetch distance sufficient to deal with that might be excessive, but you could see whether it can affect the event counting.&lt;/P&gt;

&lt;P&gt;I don't know whether there is a way to look up whether the choice of prefetch hints should make a difference on your CPU model. Is that covered in the architecture manual? With a very large prefetch distance, your preference might be to fetch to the highest cache level.&lt;/P&gt;</description>
      <pubDate>Tue, 31 Mar 2015 16:37:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061135#M5163</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-03-31T16:37:29Z</dc:date>
    </item>
    <item>
      <title>@WEI</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061136#M5164</link>
      <description>&lt;P&gt;@WEI&lt;/P&gt;

&lt;P&gt;What is the Back-End Bound value when dataCopy size is 1200?&lt;/P&gt;</description>
      <pubDate>Fri, 03 Apr 2015 08:51:47 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061136#M5164</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-04-03T08:51:47Z</dc:date>
    </item>
    <item>
      <title>Hi Tim/illyapolak,</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061137#M5165</link>
      <description>&lt;P&gt;Hi Tim/illyapolak,&lt;/P&gt;

&lt;P&gt;Thanks a lot for your help.&lt;/P&gt;

&lt;P&gt;The Back-End Bound value is 0.785 when the dataCopy size is 1200, a little higher than for dataCopy = 600.&lt;/P&gt;

&lt;P&gt;Recently I tried OpenMP. With it enabled, the cycle count scales roughly linearly relative to dataCopy = 600: it is 359 for dataCopy = 1200, versus 187 for dataCopy = 600.&lt;/P&gt;

&lt;P&gt;Strangely, though, I get the report below for the OpenMP run. It looks like performance is still poor?&lt;/P&gt;

&lt;P&gt;CPI Rate: 1.162&lt;BR /&gt;
	Back-End Bound: 1.0&lt;BR /&gt;
	Memory Bandwidth: 0.56&lt;BR /&gt;
	Memory Latency: 0.322&lt;BR /&gt;
	Store Bound: 0.275&lt;BR /&gt;
	Cycles of 0 Ports Utilized: 0.27&lt;BR /&gt;
	Cycles of 1 Port Utilized: 0.168&lt;BR /&gt;
	Cycles of 2 Ports Utilized: 0.392&lt;BR /&gt;
	Cycles of 3 Ports Utilized: 0.224&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
      <pubDate>Tue, 07 Apr 2015 04:01:12 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061137#M5165</guid>
      <dc:creator>Wei_Z_Intel</dc:creator>
      <dc:date>2015-04-07T04:01:12Z</dc:date>
    </item>
    <item>
      <title>With OpenMP enabled there</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061138#M5166</link>
      <description>&lt;P&gt;With OpenMP enabled, some number of CPU cycles will be spent on thread creation and synchronization.&lt;/P&gt;</description>
      <pubDate>Tue, 07 Apr 2015 09:49:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061138#M5166</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-04-07T09:49:18Z</dc:date>
    </item>
    <item>
      <title>Thank you for the</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061139#M5167</link>
      <description>&lt;P&gt;Thank you for the illustration, iliyapolak&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;John&lt;/P&gt;</description>
      <pubDate>Wed, 08 Apr 2015 06:56:26 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061139#M5167</guid>
      <dc:creator>Wei_Z_Intel</dc:creator>
      <dc:date>2015-04-08T06:56:26Z</dc:date>
    </item>
    <item>
      <title>Btw, you can profile OpenMP</title>
      <link>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061140#M5168</link>
      <description>&lt;P&gt;Btw, you can profile OpenMP overhead with the help of VTune. You will see the activity of the master thread, thread creation, and thread execution time. Moreover, consider unrolling your copy loop by 2. Although a Haswell core can sustain 2 loads and 1 store per clock, with unrolling you will probably have the load uops decoded and placed in the waiting queue.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Apr 2015 08:26:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Tuning-Performance/an-issue-on-performance-optimization-by-Intel-compiler/m-p/1061140#M5168</guid>
      <dc:creator>Bernard</dc:creator>
      <dc:date>2015-04-08T08:26:41Z</dc:date>
    </item>
  </channel>
</rss>

