<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic &amp;gt;&amp;gt;...Our application is not in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107667#M70962</link>
    <description>&amp;gt;&amp;gt;...Our application is not memory bandwidth hungry and we've tested many different &lt;STRONG&gt;OMP_NUM_THREAD&lt;/STRONG&gt; configurations...

&lt;STRONG&gt;1&lt;/STRONG&gt;. Analyze how your &lt;STRONG&gt;OpenMP&lt;/STRONG&gt; threads are pinned to cores / processors.
&lt;STRONG&gt;2&lt;/STRONG&gt;. Execute the &lt;STRONG&gt;cpuinfo&lt;/STRONG&gt; utility. This is what the &lt;STRONG&gt;Cache sharing&lt;/STRONG&gt; part of the report looks like for &lt;STRONG&gt;Intel(R) Xeon Phi(TM)  7210&lt;/STRONG&gt;:
...
Processor name     : &lt;STRONG&gt;Intel(R) Xeon Phi(TM)  7210&lt;/STRONG&gt;
Packages (sockets) : 1
Cores              : 64
Processors (CPUs)  : 256
Cores per package  : 64
Threads per core   : 4
...
=====  &lt;STRONG&gt;Cache sharing&lt;/STRONG&gt;  =====
Cache	Size		Processors
L1	32  KB		(0,64,128,192)(1,65,129,193)(2,66,130,194)(3,67,131,195)(4,68,132,196)(5,69,133,197)(6,70,134,198)(7,71,135,199)(8,72,136,200)(9,73,137,201)(10,74,138,202)(11,75,139,203)(12,76,140,204)(13,77,141,205)(14,78,142,206)(15,79,143,207)(16,80,144,208)(17,81,145,209)(18,82,146,210)(19,83,147,211)(20,84,148,212)(21,85,149,213)(22,86,150,214)(23,87,151,215)(24,88,152,216)(25,89,153,217)(26,90,154,218)(27,91,155,219)(28,92,156,220)(29,93,157,221)(30,94,158,222)(31,95,159,223)(32,96,160,224)(33,97,161,225)(34,98,162,226)(35,99,163,227)(36,100,164,228)(37,101,165,229)(38,102,166,230)(39,103,167,231)(40,104,168,232)(41,105,169,233)(42,106,170,234)(43,107,171,235)(44,108,172,236)(45,109,173,237)(46,110,174,238)(47,111,175,239)(48,112,176,240)(49,113,177,241)(50,114,178,242)(51,115,179,243)(52,116,180,244)(53,117,181,245)(54,118,182,246)(55,119,183,247)(56,120,184,248)(57,121,185,249)(58,122,186,250)(59,123,187,251)(60,124,188,252)(61,125,189,253)(62,126,190,254)(63,127,191,255)

L2	1   MB		(0,1,64,65,128,129,192,193)(2,3,66,67,130,131,194,195)(4,5,68,69,132,133,196,197)(6,7,70,71,134,135,198,199)(8,9,72,73,136,137,200,201)(10,11,74,75,138,139,202,203)(12,13,76,77,140,141,204,205)(14,15,78,79,142,143,206,207)(16,17,80,81,144,145,208,209)(18,19,82,83,146,147,210,211)(20,21,84,85,148,149,212,213)(22,23,86,87,150,151,214,215)(24,25,88,89,152,153,216,217)(26,27,90,91,154,155,218,219)(28,29,92,93,156,157,220,221)(30,31,94,95,158,159,222,223)(32,33,96,97,160,161,224,225)(34,35,98,99,162,163,226,227)(36,37,100,101,164,165,228,229)(38,39,102,103,166,167,230,231)(40,41,104,105,168,169,232,233)(42,43,106,107,170,171,234,235)(44,45,108,109,172,173,236,237)(46,47,110,111,174,175,238,239)(48,49,112,113,176,177,240,241)(50,51,114,115,178,179,242,243)(52,53,116,117,180,181,244,245)(54,55,118,119,182,183,246,247)(56,57,120,121,184,185,248,249)(58,59,122,123,186,187,250,251)(60,61,124,125,188,189,252,253)(62,63,126,127,190,191,254,255)
...

&lt;STRONG&gt;3&lt;/STRONG&gt;. Best performance is achieved when &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; is set to &lt;STRONG&gt;scatter&lt;/STRONG&gt; or &lt;STRONG&gt;balanced&lt;/STRONG&gt; and &lt;STRONG&gt;OMP_NUM_THREADS&lt;/STRONG&gt; is set to &lt;STRONG&gt;64&lt;/STRONG&gt;.

I've marked processor numbers to demonstrate it:
...
=====  &lt;STRONG&gt;Cache sharing&lt;/STRONG&gt;  =====
Cache	Size		Processors
L1	32  KB		(&lt;STRONG&gt;**0**&lt;/STRONG&gt;,64,128,192)(&lt;STRONG&gt;**1**&lt;/STRONG&gt;,65,129,193)(&lt;STRONG&gt;**2**&lt;/STRONG&gt;,66,130,194)(&lt;STRONG&gt;**3**&lt;/STRONG&gt;,67,131,195)(&lt;STRONG&gt;**4**&lt;/STRONG&gt;,68,132,196)(&lt;STRONG&gt;**5**&lt;/STRONG&gt;,69,133,197)...
...
L2	1   MB		(&lt;STRONG&gt;**0,1**&lt;/STRONG&gt;,64,65,128,129,192,193)(&lt;STRONG&gt;**2,3**&lt;/STRONG&gt;,66,67,130,131,194,195)(&lt;STRONG&gt;**4,5**&lt;/STRONG&gt;,68,69,132,133,196,197)...
...</description>
    <pubDate>Fri, 03 Mar 2017 17:43:00 GMT</pubDate>
    <dc:creator>SergeyKostrov</dc:creator>
    <dc:date>2017-03-03T17:43:00Z</dc:date>
    <item>
      <title>KNC to KNL - 2x Slower Performance - Same Code</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107657#M70952</link>
      <description>&lt;P&gt;We have an application that's currently running great in native mode on the &lt;STRONG&gt;KNC&lt;/STRONG&gt; platform.&lt;/P&gt;

&lt;P&gt;We now have a &lt;STRONG&gt;KNL &lt;/STRONG&gt;system for R&amp;amp;D and have recompiled our native &lt;STRONG&gt;KNC &lt;/STRONG&gt;application for the &lt;STRONG&gt;KNL &lt;/STRONG&gt;platform. When testing this unmodified codebase, we're noticing a &lt;STRONG&gt;2x performance degradation&lt;/STRONG&gt; on &lt;STRONG&gt;KNL&lt;/STRONG&gt;. &lt;STRONG&gt;KNL &lt;/STRONG&gt;is set up in Quadrant cluster mode and Cache mode for memory.&lt;/P&gt;

&lt;P&gt;Our application is not memory bandwidth hungry and we've tested many different &lt;STRONG&gt;OMP_NUM_THREADS &lt;/STRONG&gt;configurations to no avail.&lt;BR /&gt;
	The main loop of the application uses OpenMP with a single critical section at the end. However, this runs very fast natively on the KNC platform.&lt;/P&gt;

&lt;P&gt;(Intel Compiler - &lt;STRONG&gt;icpc&lt;/STRONG&gt;)&lt;BR /&gt;
	&lt;STRONG style="font-size: 1em;"&gt;KNC &lt;/STRONG&gt;&lt;SPAN style="font-size: 1em;"&gt;Compiler flags =&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em;"&gt;-O3 -std=c++11 -openmp -mmic&lt;/SPAN&gt;&lt;BR /&gt;
	&lt;STRONG style="font-size: 1em;"&gt;KNL &lt;/STRONG&gt;&lt;SPAN style="font-size: 1em;"&gt;Compiler flags =&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 1em;"&gt;-O3 -std=c++11 -qopenmp -xMIC-AVX512 -fma -align -finline-functions&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 13.008px;"&gt;We've run the standard tests and we know we can do a better job vectorizing loops but we were expecting better performance out of the box with an application that is already running great on&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-weight: 700; font-size: 13.008px;"&gt;KNC&lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;What could be causing this? Is it a pure vectorization issue?&lt;/P&gt;</description>
      <pubDate>Tue, 28 Feb 2017 21:53:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107657#M70952</guid>
      <dc:creator>Eugene_G_</dc:creator>
      <dc:date>2017-02-28T21:53:29Z</dc:date>
    </item>
    <item>
      <title>It may be a vectorization</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107658#M70953</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;It may be a vectorization issue, though it is difficult to give any definite answer without looking at the code (maybe it is publicly available?). &lt;/SPAN&gt;&lt;SPAN style="font-size: 13.008px;"&gt;Did you try to check the compiler vectorization report or use Intel Advisor?&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;To rule out any platform configuration issues, you can use the micperf tool from the Intel Xeon Phi Processor Software Package (https://software.intel.com/en-us/articles/xeon-phi-software).&lt;/P&gt;

</description>
      <pubDate>Wed, 01 Mar 2017 16:42:55 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107658#M70953</guid>
      <dc:creator>Jan_Z_Intel</dc:creator>
      <dc:date>2017-03-01T16:42:55Z</dc:date>
    </item>
    <item>
      <title>Thanks for the reply. Yes, we</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107659#M70954</link>
      <description>&lt;P&gt;Thanks for the reply. Yes, we've been using the Intel Advisor tool, and its suggestions are sometimes hit or miss.&lt;BR /&gt;
	We ran &lt;STRONG&gt;micperf &lt;/STRONG&gt;against one node and everything looks tip-top in terms of performance. It's pretty impressive.&lt;/P&gt;

&lt;P&gt;As I stated previously, our application (proprietary) runs very well under &lt;STRONG&gt;KNC&lt;/STRONG&gt;. Unfortunately, we cannot share the codebase.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 17:05:29 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107659#M70954</guid>
      <dc:creator>Eugene_G_</dc:creator>
      <dc:date>2017-03-01T17:05:29Z</dc:date>
    </item>
    <item>
      <title>Debugging performance</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107660#M70955</link>
      <description>&lt;P&gt;Debugging performance problems is a balance of opportunism and systematic analysis.&lt;/P&gt;

&lt;P&gt;As a quick "opportunistic" check:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;If the code fits into the GDDR5 memory on KNC, then it should fit into MCDRAM in "Flat" mode on KNL.&amp;nbsp;&amp;nbsp; Testing your code on a system booted in Flat-Quadrant mode would eliminate uncertainties relating to using the MCDRAM as cache.&amp;nbsp;&amp;nbsp;&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;If the Flat mode test is not useful, then you need to start gathering data for systematic analyses.&amp;nbsp; Useful data typically includes:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Parallel scaling for each code using 1 thread per core, 2 threads per core, 3 threads per core, 4 threads per core.&lt;/LI&gt;
	&lt;LI&gt;Whole-program performance counter measurements where available.
		&lt;UL&gt;
			&lt;LI&gt;VTune is the easiest way to get these analyses, but "perf stat" may be available.&lt;/LI&gt;
			&lt;LI&gt;Repeat these for each core and thread count on each platform &amp;amp; compare the scaling.&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Sampling-based runtime profile comparisons.
		&lt;UL&gt;
			&lt;LI&gt;Historically this has been done with "gprof", but VTune provides a much more integrated approach.&lt;/LI&gt;
			&lt;LI&gt;For OpenMP codes, it is important to monitor OpenMP overheads (that typically indicate load imbalance).&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;Sampling-based performance-counter profile comparisons.
		&lt;UL&gt;
			&lt;LI&gt;VTune is the preferred approach here.&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
&lt;/UL&gt;
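
&lt;P&gt;As a minimal sketch of the parallel-scaling bullet above, the timings for each threads-per-core setting can be reduced to speedup and parallel efficiency relative to a baseline run. The timings below are made-up placeholders, not measurements from this thread, and the function name is only illustrative:&lt;/P&gt;

```python
def scaling_table(timings, baseline_threads):
    """Return (threads, speedup, efficiency) rows relative to a baseline run.

    timings maps total thread count to wall-clock seconds.
    """
    base_time = timings[baseline_threads]
    rows = []
    for threads in sorted(timings):
        speedup = base_time / timings[threads]
        # Efficiency is speedup divided by the increase in thread count.
        efficiency = speedup / (threads / baseline_threads)
        rows.append((threads, speedup, efficiency))
    return rows


# Hypothetical wall-clock times for 1, 2, 3 and 4 threads per core on 64 cores.
example_timings = {64: 100.0, 128: 80.0, 192: 90.0, 256: 120.0}

for threads, speedup, efficiency in scaling_table(example_timings, 64):
    print(f"{threads:4d} threads  speedup {speedup:5.2f}  efficiency {efficiency:5.2f}")
```

&lt;P&gt;Producing the same table on both KNC and KNL would make it easier to see at which thread count the two platforms start to diverge.&lt;/P&gt;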

</description>
      <pubDate>Wed, 01 Mar 2017 17:29:05 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107660#M70955</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-03-01T17:29:05Z</dc:date>
    </item>
    <item>
      <title>I am still concerned that the</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107661#M70956</link>
      <description>&lt;P&gt;I am still concerned that&amp;nbsp;"&lt;SPAN style="font-size: 12px;"&gt;we've tested many different&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-weight: 700; font-size: 12px;"&gt;OMP_NUM_THREADS&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px;"&gt;configurations" may not have achieved what you need.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;In particular, you very likely need to be using KMP_HW_SUBSET instead of, or as well as, OMP_NUM_THREADS. (Documentation of KMP_HW_SUBSET [complete with calling it KMP_HW_SUBSETS throughout :-(] is at&amp;nbsp;&lt;/SPAN&gt;&lt;A href="https://software.intel.com/en-us/node/694293" target="_blank"&gt;https://software.intel.com/en-us/node/694293&lt;/A&gt;.)&lt;/P&gt;
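
&lt;P&gt;A rough sketch of how such a sweep could be set up, assuming a 64-core part and the "cores,threads" form of the KMP_HW_SUBSET value (for example 64c,2t). The core count, the helper name and the "./app" placeholder are assumptions for illustration, not details from the original poster's system:&lt;/P&gt;

```python
import os


def sweep_environments(cores, max_threads_per_core):
    """Build one environment per threads-per-core setting."""
    environments = []
    for threads_per_core in range(1, max_threads_per_core + 1):
        env = dict(os.environ)
        # Restrict the runtime to the given cores and hardware threads per core.
        env["KMP_HW_SUBSET"] = f"{cores}c,{threads_per_core}t"
        env["OMP_NUM_THREADS"] = str(cores * threads_per_core)
        environments.append(env)
    return environments


for env in sweep_environments(cores=64, max_threads_per_core=4):
    print(env["KMP_HW_SUBSET"], env["OMP_NUM_THREADS"])
    # To launch a run: subprocess.run(["./app"], env=env, check=True)
    # ("./app" is a placeholder for the real binary).
```

&lt;P&gt;Timing one run per environment gives the 1, 2, 3 and 4 threads-per-core data points needed for a scaling comparison.&lt;/P&gt;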

&lt;P&gt;Also, it may be worth checking out &lt;A href="https://software.intel.com/en-us/blogs/2016/12/02/how-to-plot-openmp-scaling-results"&gt;my rant on plotting scaling results&lt;/A&gt; :-)&lt;/P&gt;

&lt;P&gt;Bear in mind that on KNC you needed two threads per core to achieve the maximum issue rate, whereas on KNL that is no longer true, so running one or two threads per core rather than four is more likely to perform well on KNL. Also remember that the replicated entity is a tile of two cores sharing an L2 cache, so locality can affect up to eight threads.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 17:31:02 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107661#M70956</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2017-03-01T17:31:02Z</dc:date>
    </item>
    <item>
      <title>For OpenMP codes, it is</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107662#M70957</link>
      <description>&lt;BLOCKQUOTE&gt;
	&lt;P&gt;For OpenMP codes, it is important to monitor OpenMP overheads (that typically indicate load imbalance).&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;

&lt;P&gt;VTune's (relatively new) OpenMP analyses now show these as load imbalance, attribute them to parallel regions, and show you what performance you could achieve if you could fix the problem. &lt;A href="https://software.intel.com/en-us/node/544172" target="_blank"&gt;https://software.intel.com/en-us/node/544172&lt;/A&gt; should help.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 17:34:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107662#M70957</guid>
      <dc:creator>James_C_Intel2</dc:creator>
      <dc:date>2017-03-01T17:34:56Z</dc:date>
    </item>
    <item>
      <title>Thank you for the suggestions</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107663#M70958</link>
      <description>&lt;P&gt;Thank you for the suggestions. I will go investigate and report back with my findings.&lt;/P&gt;</description>
      <pubDate>Wed, 01 Mar 2017 17:51:43 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107663#M70958</guid>
      <dc:creator>Eugene_G_</dc:creator>
      <dc:date>2017-03-01T17:51:43Z</dc:date>
    </item>
    <item>
      <title>Please also share the details</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107664#M70959</link>
      <description>&lt;P&gt;Please also share the details of your system software (OS with exact kernel version, and whether the Intel Xeon Phi Software Package is installed).&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2017 10:09:32 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107664#M70959</guid>
      <dc:creator>Jan_Z_Intel</dc:creator>
      <dc:date>2017-03-02T10:09:32Z</dc:date>
    </item>
    <item>
      <title>OS - CentOS Linux release 7.3</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107665#M70960</link>
      <description>&lt;P&gt;OS -&amp;nbsp;CentOS Linux release 7.3.1611 (Core)&lt;/P&gt;

&lt;P&gt;Kernel - 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;Intel Xeon Phi Software Package is installed (version -&amp;nbsp;&lt;/SPAN&gt;1.5.0)&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2017 16:33:18 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107665#M70960</guid>
      <dc:creator>Eugene_G_</dc:creator>
      <dc:date>2017-03-02T16:33:18Z</dc:date>
    </item>
    <item>
      <title>It looks perfect. I'm afraid</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107666#M70961</link>
      <description>&lt;P&gt;It looks perfect. I'm afraid that was the last 'fast&amp;amp;easy' check before diving into the more systematic approach suggested by John and Jim.&lt;/P&gt;</description>
      <pubDate>Thu, 02 Mar 2017 20:04:25 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107666#M70961</guid>
      <dc:creator>Jan_Z_Intel</dc:creator>
      <dc:date>2017-03-02T20:04:25Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...Our application is not</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107667#M70962</link>
      <description>&amp;gt;&amp;gt;...Our application is not memory bandwidth hungry and we've tested many different &lt;STRONG&gt;OMP_NUM_THREAD&lt;/STRONG&gt; configurations...

&lt;STRONG&gt;1&lt;/STRONG&gt;. Analyze how your &lt;STRONG&gt;OpenMP&lt;/STRONG&gt; threads are pinned to cores / processors.
&lt;STRONG&gt;2&lt;/STRONG&gt;. Execute the &lt;STRONG&gt;cpuinfo&lt;/STRONG&gt; utility. This is what the &lt;STRONG&gt;Cache sharing&lt;/STRONG&gt; part of the report looks like for &lt;STRONG&gt;Intel(R) Xeon Phi(TM)  7210&lt;/STRONG&gt;:
...
Processor name     : &lt;STRONG&gt;Intel(R) Xeon Phi(TM)  7210&lt;/STRONG&gt;
Packages (sockets) : 1
Cores              : 64
Processors (CPUs)  : 256
Cores per package  : 64
Threads per core   : 4
...
=====  &lt;STRONG&gt;Cache sharing&lt;/STRONG&gt;  =====
Cache	Size		Processors
L1	32  KB		(0,64,128,192)(1,65,129,193)(2,66,130,194)(3,67,131,195)(4,68,132,196)(5,69,133,197)(6,70,134,198)(7,71,135,199)(8,72,136,200)(9,73,137,201)(10,74,138,202)(11,75,139,203)(12,76,140,204)(13,77,141,205)(14,78,142,206)(15,79,143,207)(16,80,144,208)(17,81,145,209)(18,82,146,210)(19,83,147,211)(20,84,148,212)(21,85,149,213)(22,86,150,214)(23,87,151,215)(24,88,152,216)(25,89,153,217)(26,90,154,218)(27,91,155,219)(28,92,156,220)(29,93,157,221)(30,94,158,222)(31,95,159,223)(32,96,160,224)(33,97,161,225)(34,98,162,226)(35,99,163,227)(36,100,164,228)(37,101,165,229)(38,102,166,230)(39,103,167,231)(40,104,168,232)(41,105,169,233)(42,106,170,234)(43,107,171,235)(44,108,172,236)(45,109,173,237)(46,110,174,238)(47,111,175,239)(48,112,176,240)(49,113,177,241)(50,114,178,242)(51,115,179,243)(52,116,180,244)(53,117,181,245)(54,118,182,246)(55,119,183,247)(56,120,184,248)(57,121,185,249)(58,122,186,250)(59,123,187,251)(60,124,188,252)(61,125,189,253)(62,126,190,254)(63,127,191,255)

L2	1   MB		(0,1,64,65,128,129,192,193)(2,3,66,67,130,131,194,195)(4,5,68,69,132,133,196,197)(6,7,70,71,134,135,198,199)(8,9,72,73,136,137,200,201)(10,11,74,75,138,139,202,203)(12,13,76,77,140,141,204,205)(14,15,78,79,142,143,206,207)(16,17,80,81,144,145,208,209)(18,19,82,83,146,147,210,211)(20,21,84,85,148,149,212,213)(22,23,86,87,150,151,214,215)(24,25,88,89,152,153,216,217)(26,27,90,91,154,155,218,219)(28,29,92,93,156,157,220,221)(30,31,94,95,158,159,222,223)(32,33,96,97,160,161,224,225)(34,35,98,99,162,163,226,227)(36,37,100,101,164,165,228,229)(38,39,102,103,166,167,230,231)(40,41,104,105,168,169,232,233)(42,43,106,107,170,171,234,235)(44,45,108,109,172,173,236,237)(46,47,110,111,174,175,238,239)(48,49,112,113,176,177,240,241)(50,51,114,115,178,179,242,243)(52,53,116,117,180,181,244,245)(54,55,118,119,182,183,246,247)(56,57,120,121,184,185,248,249)(58,59,122,123,186,187,250,251)(60,61,124,125,188,189,252,253)(62,63,126,127,190,191,254,255)
...

&lt;STRONG&gt;3&lt;/STRONG&gt;. Best performance is achieved when &lt;STRONG&gt;KMP_AFFINITY&lt;/STRONG&gt; is set to &lt;STRONG&gt;scatter&lt;/STRONG&gt; or &lt;STRONG&gt;balanced&lt;/STRONG&gt; and &lt;STRONG&gt;OMP_NUM_THREADS&lt;/STRONG&gt; is set to &lt;STRONG&gt;64&lt;/STRONG&gt;.
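
A small sketch of how the Cache sharing lines from the report above can be read programmatically: each L1 group is one core, and each L2 group is a tile of two cores. It assumes the goal is one thread per core (64 threads on this part); the helper name is illustrative and the input strings are shortened copies of the report lines:

```python
import re


def parse_groups(processors_field):
    """Turn a cpuinfo field such as '(0,64)(1,65)' into [[0, 64], [1, 65]]."""
    return [[int(number) for number in group.split(",")]
            for group in re.findall(r"\(([\d,]+)\)", processors_field)]


# Shortened copies of the L1 and L2 lines from the report above.
l1_groups = parse_groups("(0,64,128,192)(1,65,129,193)(2,66,130,194)(3,67,131,195)")
l2_groups = parse_groups("(0,1,64,65,128,129,192,193)(2,3,66,67,130,131,194,195)")

# Each L1 group is one core; its first entry is the first hardware thread.
one_per_core = [group[0] for group in l1_groups]
print(one_per_core)

# Cores whose first hardware thread appears in the same L2 group share a tile.
for tile in l2_groups:
    print([cpu for cpu in one_per_core if cpu in tile])
```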

I've marked processor numbers to demonstrate it:
...
=====  &lt;STRONG&gt;Cache sharing&lt;/STRONG&gt;  =====
Cache	Size		Processors
L1	32  KB		(&lt;STRONG&gt;**0**&lt;/STRONG&gt;,64,128,192)(&lt;STRONG&gt;**1**&lt;/STRONG&gt;,65,129,193)(&lt;STRONG&gt;**2**&lt;/STRONG&gt;,66,130,194)(&lt;STRONG&gt;**3**&lt;/STRONG&gt;,67,131,195)(&lt;STRONG&gt;**4**&lt;/STRONG&gt;,68,132,196)(&lt;STRONG&gt;**5**&lt;/STRONG&gt;,69,133,197)...
...
L2	1   MB		(&lt;STRONG&gt;**0,1**&lt;/STRONG&gt;,64,65,128,129,192,193)(&lt;STRONG&gt;**2,3**&lt;/STRONG&gt;,66,67,130,131,194,195)(&lt;STRONG&gt;**4,5**&lt;/STRONG&gt;,68,69,132,133,196,197)...
...</description>
      <pubDate>Fri, 03 Mar 2017 17:43:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107667#M70962</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-03T17:43:00Z</dc:date>
    </item>
    <item>
      <title>After a lot of digging I have</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107668#M70963</link>
      <description>&lt;P&gt;After a lot of digging I have narrowed in on what seems to be the problem. My application has one line that multiplies 24 floating-point numbers. From what I have seen, it seems like the magnitude of the numbers can dramatically hurt the runtime. One of my numbers has a much lower magnitude than the rest. When I exclude this number from my calculation I get about a 30x speed up. I am playing with the different compiler options but none of them seem to help.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Any insight would be very helpful. Thanks. &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 07 Mar 2017 00:14:41 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107668#M70963</guid>
      <dc:creator>Eugene_G_</dc:creator>
      <dc:date>2017-03-07T00:14:41Z</dc:date>
    </item>
    <item>
      <title>Is this number stored as a</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107669#M70964</link>
      <description>&lt;P&gt;Is this number stored as a subnormal (denormal)?&lt;/P&gt;

&lt;P&gt;See:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/forums/intel-c-compiler/topic/611390"&gt;https://software.intel.com/en-us/forums/intel-c-compiler/topic/611390&lt;/A&gt;&lt;BR /&gt;
	&lt;A href="https://software.intel.com/pt-br/node/680305"&gt;https://software.intel.com/pt-br/node/680305&lt;/A&gt;&lt;/P&gt;

&lt;P&gt;Edit:&lt;/P&gt;

&lt;P&gt;&lt;A href="https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/705927"&gt;https://software.intel.com/en-us/forums/intel-fortran-compiler-for-linux-and-mac-os-x/topic/705927&lt;/A&gt;&lt;/P&gt;
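
&lt;P&gt;As a rough way to check the question above, the sketch below tests whether a value would be subnormal when stored as an IEEE 754 single-precision float. Single precision is an assumption here (the thread does not say which type the application uses), and the function name and sample values are only illustrative:&lt;/P&gt;

```python
import struct


def is_subnormal_f32(value):
    """Return True if value is subnormal when stored as a 32-bit float."""
    bits = struct.unpack("=I", struct.pack("=f", value))[0]
    exponent = (bits // 2**23) % 256  # 8 exponent bits, sign bit dropped
    mantissa = bits % 2**23           # low 23 bits
    # Subnormal: exponent field is zero but the mantissa is not.
    return exponent == 0 and mantissa != 0


# A product of 24 small factors can easily drop into the subnormal range.
product = 0.02 ** 24
print(product, is_subnormal_f32(product))
print(1.0, is_subnormal_f32(1.0))
```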

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 07 Mar 2017 18:45:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107669#M70964</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-03-07T18:45:00Z</dc:date>
    </item>
    <item>
      <title>It looks like some FPU</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107670#M70965</link>
      <description>It looks like some FPU exceptions are affecting your processing.

&amp;gt;&amp;gt;...From what I have seen it seems like the magnitude of the numbers can dramatically hurt the runtime. One of my numbers has
&amp;gt;&amp;gt;a much lower magnitude than the rest. When I exclude this number from my calculation I get about a 30x speed up...

Q1: Could you post a couple of FP numbers ( a good one and a bad one ) to demonstrate their ranges?

Q2: Did you try to turn on 'Flush Denormal Results to Zero' ( -ftz ) compiler option?</description>
      <pubDate>Wed, 08 Mar 2017 17:10:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107670#M70965</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-08T17:10:58Z</dc:date>
    </item>
    <item>
      <title>Jim and Sergey, </title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107671#M70966</link>
      <description>&lt;P&gt;Jim and Sergey,&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Turns out I was chasing my tail with that last post. I didn't realize at the time that the speed up from removing the value came from downstream. I am going back to the drawing board and will report back with what I find.&lt;/P&gt;

&lt;P&gt;Thanks, I am learning a lot from everybody's suggestions.&lt;/P&gt;</description>
      <pubDate>Wed, 08 Mar 2017 17:27:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107671#M70966</guid>
      <dc:creator>Eugene_G_</dc:creator>
      <dc:date>2017-03-08T17:27:07Z</dc:date>
    </item>
    <item>
      <title>In my experience the "-O3"</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107672#M70967</link>
      <description>&lt;P&gt;In my experience the "-O3" flag hurts performance for KNL systems. Use "-Os" instead and see how it performs. The bottleneck in the KNL micro-architecture is in instruction decoding (or so Agner Fog's manual claims, and it seems to be true), so you want to minimize the size of the binary in terms of instructions instead of keeping things in registers at the cost of more instructions (greater binary size).&lt;/P&gt;

&lt;P&gt;As for the affinity business, try logging into a node and running htop. See how well things are distributed and what the pattern of threads ramping up and closing down looks like. For fun and giggles, try running your program as "perf stat -d &amp;lt;Your program&amp;gt;" and post the stats on that. That usually helps diagnosis.&lt;/P&gt;

&lt;P&gt;&lt;SPAN style="font-size: 1em;"&gt;Just general suggestions. This might or might not help your situation.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Update: if you try "-Os" and you don't care about the IEEE standard for floating-point arithmetic and a few other technical things, make sure to add the "-ffast-math" flag for better performance, at the cost of ignoring the IEEE standard and possibly a lot of nightmares. I can't remember if "-O3" turns on this flag automatically (this is true for either GCC or ICC, I can't remember which).&lt;/P&gt;

&lt;P&gt;Cheers.&lt;/P&gt;</description>
      <pubDate>Sun, 12 Mar 2017 07:24:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107672#M70967</guid>
      <dc:creator>Chronus_Taizen</dc:creator>
      <dc:date>2017-03-12T07:24:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...or so Agner Fogs' manual</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107673#M70968</link>
      <description>&amp;gt;&amp;gt;...or so &lt;STRONG&gt;Agner Fogs&lt;/STRONG&gt;' manual claims and it seems to be true..

It is not a good recommendation to believe in somebody's claims without verifying in a set of &lt;STRONG&gt;real&lt;/STRONG&gt; tests outcomes of using &lt;STRONG&gt;-O3&lt;/STRONG&gt; and &lt;STRONG&gt;-Os&lt;/STRONG&gt; options.

Also, I didn't have any performance issues with the &lt;STRONG&gt;-O3&lt;/STRONG&gt; option on a &lt;STRONG&gt;KNL&lt;/STRONG&gt; system, but I will verify whether the &lt;STRONG&gt;-Os&lt;/STRONG&gt; option improves performance.</description>
      <pubDate>Wed, 15 Mar 2017 17:22:56 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107673#M70968</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-15T17:22:56Z</dc:date>
    </item>
    <item>
      <title>Here are results of a very</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107674#M70969</link>
      <description>Here are results of a very simple verification for matrix multiplication using &lt;STRONG&gt;MKL cblas_sgemm&lt;/STRONG&gt; and &lt;STRONG&gt;Classic Matrix Multiplication algorithm&lt;/STRONG&gt; ( CMMA / transposed based ):

&lt;STRONG&gt;[ Test  with -O3 option ]&lt;/STRONG&gt;

[guest@... WorkTest]$ icpc &lt;STRONG&gt;-O3&lt;/STRONG&gt; -xMIC-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 test13.c -o test13.out
[guest@... WorkTest]$ 
[guest@... WorkTest]$ ./test13.out
Matrix A[ 16384 x 16384 ]
Matrix B[ 16384 x 16384 ]
Matrix C[ 16384 x 16384 ]
Number of OpenMP threads:  64
	&lt;STRONG&gt;MKL&lt;/STRONG&gt;  - Completed in: &lt;STRONG&gt;6.6331376&lt;/STRONG&gt; seconds
	&lt;STRONG&gt;CMMA&lt;/STRONG&gt; - Completed in: &lt;STRONG&gt;99.3659613&lt;/STRONG&gt; seconds
[guest@... WorkTest]$ 
[guest@... WorkTest]$ ls -l
total 232
-rw-r--r-- 1 guest guest  10812 Mar 15 11:21 test13.c
-rwxrwxr-x 1 guest guest &lt;STRONG&gt;210979&lt;/STRONG&gt; Mar 15 11:21 test13.out

&lt;STRONG&gt;[ Test  with -Os option ]&lt;/STRONG&gt;

[guest@... WorkTest]$ icpc &lt;STRONG&gt;-Os&lt;/STRONG&gt; -xMIC-AVX512 -qopenmp -mkl -fp-model fast=2 -fma -unroll=4 test13.c -o test13.out
[guest@... WorkTest]$ 
[guest@... WorkTest]$ ./test13.out
Matrix A[ 16384 x 16384 ]
Matrix B[ 16384 x 16384 ]
Matrix C[ 16384 x 16384 ]
Number of OpenMP threads:  64
	&lt;STRONG&gt;MKL&lt;/STRONG&gt;  - Completed in: &lt;STRONG&gt;6.6278768&lt;/STRONG&gt; seconds
	&lt;STRONG&gt;CMMA&lt;/STRONG&gt; - Completed in: &lt;STRONG&gt;90.3714654&lt;/STRONG&gt; seconds
[guest@... WorkTest]$ 
[guest@... WorkTest]$ ls -l
total 224
-rw-r--r-- 1 guest guest  10812 Mar 15 11:21 test13.c
-rwxrwxr-x 1 guest guest &lt;STRONG&gt;202685&lt;/STRONG&gt; Mar 15 11:27 test13.out
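
As a quick check of the arithmetic, the CMMA timings and binary sizes from the two runs above can be compared as relative reductions. The helper name is illustrative; the numbers are copied from the output listed above:

```python
def percent_reduction(before, after):
    """Relative reduction of 'after' with respect to 'before', in percent."""
    return (before - after) / before * 100.0


# CMMA wall-clock times and binary sizes copied from the two runs above.
time_o3, time_os = 99.3659613, 90.3714654
size_o3, size_os = 210979, 202685

print(f"CMMA time reduction: {percent_reduction(time_o3, time_os):.1f}%")
print(f"Binary size reduction: {percent_reduction(size_o3, size_os):.1f}%")
```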

&lt;STRONG&gt;[ Conclusion ]&lt;/STRONG&gt;

With option -Os processing completed ~9% faster than with option -O3. So, it is faster but not by 2x!... Please do your own verifications if interested.</description>
      <pubDate>Wed, 15 Mar 2017 18:54:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107674#M70969</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-15T18:54:48Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;...so you want to minimize</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107675#M70970</link>
      <description>&amp;gt;&amp;gt;...so you want to minimize the size of the binary...

&lt;STRONG&gt;[ Test with -O3 option - Binary Size ]&lt;/STRONG&gt;
...
-rwxrwxr-x 1 guest guest &lt;STRONG&gt;210979&lt;/STRONG&gt; Mar 15 11:21 test13.out
...

&lt;STRONG&gt;[ Test with -Os option - Binary Size ]&lt;/STRONG&gt;
...
-rwxrwxr-x 1 guest guest &lt;STRONG&gt;202685&lt;/STRONG&gt; Mar 15 11:27 test13.out
...

With option &lt;STRONG&gt;-Os&lt;/STRONG&gt; the binary size is only ~&lt;STRONG&gt;3.9&lt;/STRONG&gt;% smaller and, as I've already mentioned, processing was completed ~&lt;STRONG&gt;9&lt;/STRONG&gt;% faster.</description>
      <pubDate>Wed, 15 Mar 2017 19:05:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107675#M70970</guid>
      <dc:creator>SergeyKostrov</dc:creator>
      <dc:date>2017-03-15T19:05:00Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt;It is not a good</title>
      <link>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107676#M70971</link>
      <description>&lt;P&gt;&lt;SPAN style="font-size: 12px;"&gt;&amp;gt;&amp;gt;It is not a good recommendation to believe in somebody's claims without verifying in a set of&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-weight: 700; font-size: 12px;"&gt;real&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px;"&gt;&amp;nbsp;tests outcomes of using&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-weight: 700; font-size: 12px;"&gt;-O3&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px;"&gt;&amp;nbsp;and&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN style="font-weight: 700; font-size: 12px;"&gt;-Os&lt;/SPAN&gt;&lt;SPAN style="font-size: 12px;"&gt;&amp;nbsp;options.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;I agree. At no point did I claim Agner made suggestions about optimization flags, he did NOT. He did, however, comment that the decoding part of the KNL Micro-architecture is the bottleneck. Reading that gave me a reason, perhaps the reason, why compiling with -Os instead of -O3 had been making my performance slightly, but noticeably, better.&lt;/P&gt;

&lt;P&gt;&amp;gt;&amp;gt;&lt;SPAN style="font-size: 12px;"&gt;With option -Os processing completed ~9% faster than with option -O3. So, it is faster but not by 2x!... Please do your own verifications if interested.&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;Again, no such claim was made. I did not claim that -Os would improve performance by 2x, or that it would improve performance at all. Just that testing it is worth the time for the potential slight, but noticeable, gain.&lt;/P&gt;

&lt;P&gt;Having said all that, thank you for giving us real tangible numbers for both the speed and the binary size. I appreciate your work; which, in this instance, happens to validate my views.&lt;/P&gt;</description>
      <pubDate>Thu, 16 Mar 2017 04:25:17 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/KNC-to-KNL-2x-Slower-Performance-Same-Code/m-p/1107676#M70971</guid>
      <dc:creator>Chronus_Taizen</dc:creator>
      <dc:date>2017-03-16T04:25:17Z</dc:date>
    </item>
  </channel>
</rss>

