<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic numactl in cache mode in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142853#M78339</link>
    <description>Forum thread in the Intel Software Archive: numactl in cache mode.</description>
    <pubDate>Fri, 23 Feb 2018 04:46:07 GMT</pubDate>
    <dc:creator>h__y</dc:creator>
    <dc:date>2018-02-23T04:46:07Z</dc:date>
    <item>
      <title>numactl in cache mode</title>
      <link>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142850#M78336</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;Quick numactl question.&amp;nbsp; Do I need to use numactl to run in SNC-4/SNC-2?&amp;nbsp; From the examples I've seen, if you are in flat memory mode you can bind your allocations to the fast (MCDRAM) NUMA nodes with numactl, like this:&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;numactl -m 4,5,6,7&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;
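
&lt;P&gt;(For concreteness: in flat/SNC-4 mode, where NUMA nodes 4-7 are the MCDRAM nodes, the full invocation would be something like "numactl -m 4,5,6,7 ./myapp", with "./myapp" standing in for the actual program.)&lt;/P&gt;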

&lt;P&gt;But if I am in cache mode, there is no fast-memory option.&amp;nbsp; So SNC-4/SNC-2 in cache mode is equivalent to quadrant and hemisphere, no?&lt;/P&gt;

&lt;P&gt;thanks for any clarification,&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;yah&lt;/P&gt;</description>
      <pubDate>Thu, 22 Feb 2018 14:04:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142850#M78336</guid>
      <dc:creator>h__y</dc:creator>
      <dc:date>2018-02-22T14:04:54Z</dc:date>
    </item>
    <item>
      <title>Hi,</title>
      <link>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142851#M78337</link>
      <description>&lt;P&gt;Hi,&lt;BR /&gt;
	you are right that in cache mode there is no "fast memory" visible to the operating system.&lt;BR /&gt;
	However, the SNC modes expose the cluster arrangement as NUMA nodes, while quadrant and hemisphere don't.&lt;/P&gt;

&lt;P&gt;Cache mode + SNC should be used the same way multi-socket Xeons are, with only sockets or groups of CPUs exposed as NUMA nodes.&lt;/P&gt;

&lt;P&gt;Hope it helps.&lt;/P&gt;
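
&lt;P&gt;For instance, you can confirm what the OS sees with "numactl --hardware": in cache/SNC-4 mode it should report four NUMA nodes, each with a share of the cores and of the DDR memory, and no separate MCDRAM-only nodes.&lt;/P&gt;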

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 22 Feb 2018 14:42:39 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142851#M78337</guid>
      <dc:creator>Sebastian_S_Intel</dc:creator>
      <dc:date>2018-02-22T14:42:39Z</dc:date>
    </item>
    <item>
      <title>On a KNL systems in Cached</title>
      <link>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142852#M78338</link>
      <description>&lt;P&gt;On a KNL system in Cached-SNC-2 or Cached-SNC-4 mode, you would use "numactl" in exactly the same way (and for exactly the same reasons) as on a 2-socket or 4-socket Xeon server -- the MCDRAM cache is invisible in these cases.&lt;/P&gt;

&lt;P&gt;Just as on any other multi-socket system, the default local memory allocation policy means that binding each of your processes to a specific NUMA node is usually all you need to do to maintain process-data affinity.&lt;/P&gt;
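
&lt;P&gt;As a sketch (with "./myapp" standing in for the real program), launching one copy of the application on each NUMA node would look like:&lt;/P&gt;

&lt;P&gt;numactl --cpunodebind=0 ./myapp&lt;/P&gt;

&lt;P&gt;and so on for nodes 1-3; adding "--membind=0" would make the memory binding strict rather than relying on the default local-allocation policy.&lt;/P&gt;</description>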
      <pubDate>Thu, 22 Feb 2018 22:55:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142852#M78338</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-02-22T22:55:22Z</dc:date>
    </item>
    <item>
      <title>Thanks for the reply. For our</title>
      <link>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142853#M78339</link>
      <description>&lt;P&gt;Thanks for the reply. For our specific situation, an OpenMP application running on a KNL node with 68 cores in cache/SNC-4 mode, what is the appropriate numactl invocation to get optimal performance?&lt;/P&gt;

&lt;P&gt;i.e. is this best?:&lt;/P&gt;

&lt;P&gt;numactl -l --cpunodebind=0,1,2,3 myapp&lt;/P&gt;

&lt;P&gt;or is this better?:&lt;/P&gt;

&lt;P&gt;numactl -l --interleave=0,1,2,3 myapp&lt;/P&gt;

&lt;P&gt;Do you guys have any suggestions? Thanks.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;yah&lt;/P&gt;</description>
      <pubDate>Fri, 23 Feb 2018 04:46:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142853#M78339</guid>
      <dc:creator>h__y</dc:creator>
      <dc:date>2018-02-23T04:46:07Z</dc:date>
    </item>
    <item>
      <title>For OpenMP jobs, you need</title>
      <link>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142854#M78340</link>
      <description>&lt;P&gt;For OpenMP jobs, you need a separate binding for each set of threads, and this can't be done with numactl (numactl sets a single policy for the whole process).&lt;/P&gt;

&lt;P&gt;All OpenMP implementations support thread binding using environment variables.&lt;/P&gt;

&lt;P&gt;For the 68-core KNL in SNC4 mode, the configuration is&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;NUMA node 0 has 18 cores (9 tiles)&lt;/LI&gt;
	&lt;LI&gt;NUMA node 1 has 18 cores (9 tiles)&lt;/LI&gt;
	&lt;LI&gt;NUMA node 2 has 16 cores (8 tiles)&lt;/LI&gt;
	&lt;LI&gt;NUMA node 3 has 16 cores (8 tiles)&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;This uneven layout can be challenging to deal with in OpenMP, depending on what you are trying to do.&amp;nbsp; Any core-level OpenMP binding will guarantee process/memory affinity in this case.&amp;nbsp; If you are using all 68 cores, the memory traffic will be higher on nodes 0 and 1 than on nodes 2 and 3, which may be an issue.&amp;nbsp; If you want to use 64 cores, I would use "lscpu" to determine which logical processors are mapped to each node, then build an explicit processor list for KMP_AFFINITY to put 16 threads on each of the four NUMA nodes.&lt;/P&gt;
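
&lt;P&gt;As a sketch (the actual logical-CPU numbers must come from "lscpu" or "numactl --hardware" on your system; the list below is a placeholder), the Intel OpenMP runtime could be configured like this:&lt;/P&gt;

&lt;P&gt;export OMP_NUM_THREADS=64&lt;BR /&gt;
export KMP_AFFINITY=granularity=core,proclist=[...],explicit&lt;/P&gt;

&lt;P&gt;where proclist names one logical processor per chosen core, 16 from each of the four NUMA nodes.&lt;/P&gt;</description>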
      <pubDate>Mon, 26 Feb 2018 17:14:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/numactl-in-cache-mode/m-p/1142854#M78340</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2018-02-26T17:14:37Z</dc:date>
    </item>
  </channel>
</rss>

