<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic The KMP_... are OpenMP in Software Archive</title>
    <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070899#M58172</link>
    <description>&lt;P&gt;The KMP_... variables are OpenMP affinity settings. I_MPI_PIN_DOMAIN can be used for process (rank) pinning. See &lt;A href="https://software.intel.com/en-us/articles/mpi-and-process-pinning-on-xeon-phi"&gt;this article&lt;/A&gt; for information on MPI process affinity pinning. The general technique is to specify how and where each process (MPI rank) is placed (in other words, how each rank's affinity is mapped). Each rank's OpenMP thread pool is then restricted to that subset of logical processors.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
    <pubDate>Wed, 18 Jan 2017 13:51:40 GMT</pubDate>
    <dc:creator>jimdempseyatthecove</dc:creator>
    <dc:date>2017-01-18T13:51:40Z</dc:date>
    <item>
      <title>Knights Landing(KNL) Thread Affinity and Managing Hyper Threading</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070896#M58169</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;

&lt;P&gt;I need to run my program under different KNL configurations. I enabled Hyper-Threading and wanted to test different numbers of threads per core by setting the environment variables &lt;EM&gt;KMP_HW_SUBSET&lt;/EM&gt; and &lt;EM&gt;KMP_AFFINITY&lt;/EM&gt; (ref:&amp;nbsp;&lt;A href="https://software.intel.com/en-us/node/680054"&gt;https://software.intel.com/en-us/node/680054&lt;/A&gt;). But when I run the program, it prints this warning:&lt;/P&gt;

&lt;P&gt;"&lt;STRONG&gt;OMP: Warning #245: KMP_HW_SUBSET ignored: non-uniform topology.&lt;/STRONG&gt;"&lt;/P&gt;

&lt;P&gt;What is the reason? The reference also clearly states in a NOTE that&lt;/P&gt;

&lt;P&gt;"&lt;STRONG&gt;On Intel® Xeon® Phi™ coprocessors, the default affinity type is &lt;SAMP class="codeph"&gt;scatter&lt;/SAMP&gt;, so &lt;SPAN class="keyword"&gt;KMP_HW_SUBSET&lt;/SPAN&gt; works by default on this platform.&lt;/STRONG&gt;"&lt;/P&gt;

&lt;P&gt;but the warning I got suggests otherwise.&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;How can I resolve this, and how else can I manage 1/2/3/4 threads per core on KNL? Will the output of the "&lt;STRONG&gt;lscpu&lt;/STRONG&gt;" command and of &lt;STRONG&gt;KMP_AFFINITY=verbose&lt;/STRONG&gt; differ for each threads-per-core combination?&lt;/EM&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 16 Jan 2017 09:16:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070896#M58169</guid>
      <dc:creator>Rakesh_M_</dc:creator>
      <dc:date>2017-01-16T09:16:14Z</dc:date>
    </item>
    <item>
      <title>The easiest way, assuming 64</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070897#M58170</link>
      <description>&lt;P&gt;The easiest way, assuming a 64-core system, is to use KMP_AFFINITY=scatter (or the OMP_... equivalent), then select 64, 128, 192 or 256 threads.&lt;/P&gt;
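
&lt;P&gt;As a rough illustration of where those thread counts come from, here is a minimal Python sketch. The helper name thread_counts is hypothetical, and it assumes up to 4 hardware threads per core, as on KNL.&lt;/P&gt;

```python
def thread_counts(cores, max_threads_per_core=4):
    """Return the total thread count for 1..max_threads_per_core threads per core."""
    return [cores * t for t in range(1, max_threads_per_core + 1)]


if __name__ == "__main__":
    # Print candidate OMP_NUM_THREADS values for a 64-core part.
    for tpc, n in enumerate(thread_counts(64), start=1):
        print(f"{tpc} thread(s)/core: OMP_NUM_THREADS={n}")
```

&lt;P&gt;Each value would be used as OMP_NUM_THREADS; with scatter placement the threads should be spread across all cores before any core receives an additional thread.&lt;/P&gt;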

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Tue, 17 Jan 2017 19:02:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070897#M58170</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-01-17T19:02:37Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070898#M58171</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;The easiest way, assuming a 64-core system, is to use KMP_AFFINITY=scatter (or the OMP_... equivalent), then select 64, 128, 192 or 256 threads.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hi Sir,&lt;/P&gt;

&lt;P&gt;Thanks for the reply,&lt;/P&gt;

&lt;P&gt;The Xeon Phi we are using is a &lt;STRONG&gt;68-core machine&lt;/STRONG&gt;, which effectively allows &lt;STRONG&gt;272 threads&lt;/STRONG&gt; with Hyper-Threading enabled.&lt;/P&gt;

&lt;P&gt;The memory mode is &lt;STRONG&gt;Cache Mode&lt;/STRONG&gt; with &lt;STRONG&gt;SNC4 &amp;amp; Hyper-Threading enabled.&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;To check the maximum thread count and the number of processors, I created a parallel region and queried the values under the default settings using&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;omp_get_num_procs()&lt;/STRONG&gt;&lt;/EM&gt; and &lt;EM&gt;&lt;STRONG&gt;omp_get_max_threads()&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;Here is the command I used:&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;$ KMP_AFFINITY=scatter KMP_HW_SUBSET=68c,2t mpirun -n 8 -env I_MPI_DEBUG=5 ./ex4&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;This should effectively &lt;EM&gt;&lt;STRONG&gt;map 8 MPI ranks to 64 cores x 2 threads/core = 128 threads&lt;/STRONG&gt;&lt;/EM&gt;, i.e. each rank would get &lt;STRONG&gt;16 threads&lt;/STRONG&gt;. But the OpenMP runtime calls mentioned above return &lt;EM&gt;&lt;STRONG&gt;34 threads per MPI rank&lt;/STRONG&gt;&lt;/EM&gt;, which is what a rank gets when all 272 threads are considered, i.e. &lt;STRONG&gt;KMP_HW_SUBSET=68c,4t&lt;/STRONG&gt;. The result is the same with &lt;STRONG&gt;KMP_HW_SUBSET=68c,1t&lt;/STRONG&gt; and &lt;STRONG&gt;KMP_HW_SUBSET=68c,3t&lt;/STRONG&gt;.&lt;/P&gt;
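
&lt;P&gt;The arithmetic behind these numbers can be checked with a small Python sketch. The function name threads_per_rank is made up for illustration, and it assumes the logical processors are divided evenly among the ranks.&lt;/P&gt;

```python
def threads_per_rank(cores, threads_per_core, ranks):
    """Logical processors available to each MPI rank, assuming an even split."""
    total = cores * threads_per_core
    if total % ranks != 0:
        raise ValueError("logical processors do not divide evenly among ranks")
    return total // ranks


if __name__ == "__main__":
    print(threads_per_rank(68, 4, 8))  # all 272 logical processors in use
    print(threads_per_rank(68, 2, 8))  # a 68-core, 2 threads/core subset
    print(threads_per_rank(64, 2, 8))  # a 64-core, 2 threads/core subset
```

&lt;P&gt;Under that assumption, the reported 34 per rank matches 68 cores x 4 threads / 8 ranks, which is consistent with KMP_HW_SUBSET being ignored. A 68-core, 2-thread subset would give 17 per rank, and 16 per rank corresponds to a 64-core subset.&lt;/P&gt;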

&lt;P&gt;In addition, given the "ignored...." warning in the verbose output, the KMP environment variables used above are not having the intended effect.&lt;/P&gt;

&lt;P&gt;Why is it assuming 272 threads? Is there another way to do this?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2017 09:54:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070898#M58171</guid>
      <dc:creator>Rakesh_M_</dc:creator>
      <dc:date>2017-01-18T09:54:45Z</dc:date>
    </item>
    <item>
      <title>The KMP_... are OpenMP</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070899#M58172</link>
      <description>&lt;P&gt;The KMP_... variables are OpenMP affinity settings. I_MPI_PIN_DOMAIN can be used for process (rank) pinning. See &lt;A href="https://software.intel.com/en-us/articles/mpi-and-process-pinning-on-xeon-phi"&gt;this article&lt;/A&gt; for information on MPI process affinity pinning. The general technique is to specify how and where each process (MPI rank) is placed (in other words, how each rank's affinity is mapped). Each rank's OpenMP thread pool is then restricted to that subset of logical processors.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2017 13:51:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070899#M58172</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-01-18T13:51:40Z</dc:date>
    </item>
    <item>
      <title>It seems you wish each</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070900#M58173</link>
      <description>&lt;P&gt;It seems you want each OpenMP instance to use 8 cores, so if you use KMP_HW_SUBSET you would specify 8 cores with an independent offset for each MPI rank. &amp;nbsp;I don't see that you want a pattern different from what MPI should use in the absence of KMP_HW_SUBSET when you set OMP_NUM_THREADS.&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2017 14:16:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070900#M58173</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2017-01-18T14:16:00Z</dc:date>
    </item>
    <item>
      <title>In my experience the KMP_HW</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070901#M58174</link>
      <description>&lt;P&gt;In my experience the KMP_HW_SUBSET works fine if you are not in SNC4 mode.&amp;nbsp; SNC4 mode is a bit unusual on the 68-core parts, since two of the "nodes" have 18 cores and the other two "nodes" have 16 cores -- that must be the "non-uniform topology" that the OpenMP runtime warned about....&lt;/P&gt;

&lt;P&gt;Fortunately you don't need KMP_HW_SUBSET in this case.&amp;nbsp;&amp;nbsp; Intel MPI sets up reasonable binding domains for each MPI rank by default in most cases.&amp;nbsp; SNC4 mode does complicate this by providing non-uniform "nodes", so it looks like you will need to set up explicit processor binding lists using the MPI (not the OpenMP) environment variables.&amp;nbsp;&amp;nbsp; This is described in Section 3.2 of the Intel MPI Developer Reference Manuals.&amp;nbsp; (The section number is the same in the Intel MPI 5.1 and Intel MPI 2017 manuals.)&lt;/P&gt;

&lt;P&gt;I have not done this myself, but I think you will want to launch a script that looks at the MPI rank number and sets up a different explicit processor list using the I_MPI_PIN_PROCESSOR_LIST environment variable.&amp;nbsp;&amp;nbsp; For nodes 0 and 1 this will include 16 of the 18 cores, while for nodes 2 and 3 this will include all 16 of the 16 cores.&lt;/P&gt;
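
&lt;P&gt;A minimal Python sketch of the per-rank core lists described above follows. It is only an illustration: the function name rank_core_lists is hypothetical, and it assumes core IDs are numbered contiguously node by node, which may not match the real SNC4 numbering (check with lscpu or numactl -H). How the lists are handed to Intel MPI, for example from a per-rank wrapper script, is not shown.&lt;/P&gt;

```python
def rank_core_lists(node_core_counts, ranks_per_node, cores_per_rank):
    """Assign each rank a block of cores that stays inside one SNC4 node.

    ASSUMPTION: core IDs are numbered contiguously node by node
    (node 0 first). Verify the real numbering before relying on this.
    """
    lists = []
    first_core = 0
    needed = ranks_per_node * cores_per_rank
    for count in node_core_counts:
        # True when the node has fewer cores than the ranks require.
        if min(needed, count) != needed:
            raise ValueError("node has too few cores for the requested ranks")
        for r in range(ranks_per_node):
            start = first_core + r * cores_per_rank
            lists.append(list(range(start, start + cores_per_rank)))
        first_core += count  # skips any unused cores on this node
    return lists


if __name__ == "__main__":
    # 68-core part in SNC4: two nodes with 18 cores, two with 16.
    for rank, cores in enumerate(rank_core_lists([18, 18, 16, 16], 2, 8)):
        print(f"rank {rank}: " + ",".join(str(c) for c in cores))
```

&lt;P&gt;With 2 ranks of 8 cores on each node, the two 18-core nodes each leave 2 cores unused and the two 16-core nodes are fully used, matching the 16-of-18 and 16-of-16 split above.&lt;/P&gt;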

&lt;P&gt;In my testing I have not seen enough performance benefit from SNC4 mode to justify the irritation of figuring out how to use it (especially since I prefer "Flat" mode and have to deal with per-rank numactl commands as well), but your test sequence is clearly the right approach for you to decide whether this is also true for your workload(s).&lt;/P&gt;</description>
      <pubDate>Wed, 18 Jan 2017 15:44:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070901#M58174</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2017-01-18T15:44:37Z</dc:date>
    </item>
    <item>
      <title>Quote:Mccalpin, John wrote:</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070902#M58175</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Mccalpin, John wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;In my experience the KMP_HW_SUBSET works fine if you are not in SNC4 mode.&amp;nbsp; SNC4 mode is a bit unusual on the 68-core parts, since two of the "nodes" have 18 cores and the other two "nodes" have 16 cores -- that must be the "non-uniform topology" that the OpenMP runtime warned about....&lt;/P&gt;

&lt;P&gt;Fortunately you don't need KMP_HW_SUBSET in this case.&amp;nbsp;&amp;nbsp; Intel MPI sets up reasonable binding domains for each MPI rank by default in most cases.&amp;nbsp; SNC4 mode does complicate this by providing non-uniform "nodes", so it looks like you will need to set up explicit processor binding lists using the MPI (not the OpenMP) environment variables.&amp;nbsp;&amp;nbsp; This is described in Section 3.2 of the Intel MPI Developer Reference Manuals.&amp;nbsp; (The section number is the same in the Intel MPI 5.1 and Intel MPI 2017 manuals.)&lt;/P&gt;

&lt;P&gt;I have not done this myself, but I think you will want to launch a script that looks at the MPI rank number and sets up a different explicit processor list using the I_MPI_PIN_PROCESSOR_LIST environment variable.&amp;nbsp;&amp;nbsp; For nodes 0 and 1 this will include 16 of the 18 cores, while for nodes 2 and 3 this will include all 16 of the 16 cores.&lt;/P&gt;

&lt;P&gt;In my testing I have not seen enough performance benefit from SNC4 mode to justify the irritation of figuring out how to use it (especially since I prefer "Flat" mode and have to deal with per-rank numactl commands as well), but your test sequence is clearly the right approach for you to decide whether this is also true for your workload(s).&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hello sir,&lt;/P&gt;

&lt;P&gt;Thanks for the explanation; that was helpful. I went with explicit processor pinning for &lt;STRONG&gt;2 threads/core&lt;/STRONG&gt; and &lt;STRONG&gt;1 thread/core&lt;/STRONG&gt; and used &lt;STRONG&gt;8 OpenMP threads&lt;/STRONG&gt;. As there are still more configurations to test, maybe I can try KMP_HW_SUBSET and see its effect.&lt;/P&gt;

&lt;P&gt;Basically, &lt;STRONG&gt;1 thread/core [after enabling Hyper-Threading]&lt;/STRONG&gt; should be &lt;EM&gt;&lt;STRONG&gt;equivalent&lt;/STRONG&gt;&lt;/EM&gt; to running the same program with &lt;STRONG&gt;Hyper-Threading disabled&lt;/STRONG&gt;. With my program, I observed that with Hyper-Threading it took a bit more time (in my case, almost 4 min.) than without Hyper-Threading (we saw this behavior earlier with&amp;nbsp;KNC as well and continued without Hyper-Threading), and with more threads per core it takes even longer.&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;What kinds of programs benefit from Hyper-Threading? In your tests, did you find any application that benefits from it?&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;A question regarding &lt;EM&gt;&lt;STRONG&gt;Ivy Bridge vs. KNL&lt;/STRONG&gt;&lt;/EM&gt;:&lt;/P&gt;

&lt;P&gt;Take a cluster of 3 Ivy Bridge nodes with the same configuration (16 cores, 2.6GHz each) and a KNL machine (72 cores, 1.4GHz), with a parallel application (MPI + OpenMP) running on both with the same number of ranks and the same number of OpenMP threads.&lt;/P&gt;

&lt;P&gt;&lt;EM&gt;&lt;STRONG&gt;How will the speed and execution time of the program be affected by the 2.4GHz vs. 1.4GHz clock speeds?&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;Thank you&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jan 2017 06:41:01 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070902#M58175</guid>
      <dc:creator>Rakesh_M_</dc:creator>
      <dc:date>2017-01-20T06:41:01Z</dc:date>
    </item>
    <item>
      <title>Quote:jimdempseyatthecove</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070903#M58176</link>
      <description>&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;jimdempseyatthecove wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;The KMP_... variables are OpenMP affinity settings. I_MPI_PIN_DOMAIN can be used for process (rank) pinning. See &lt;A href="https://software.intel.com/en-us/articles/mpi-and-process-pinning-on-xeon-phi"&gt;this article&lt;/A&gt; for information on MPI process affinity pinning. The general technique is to specify how and where each process (MPI rank) is placed (in other words, how each rank's affinity is mapped). Each rank's OpenMP thread pool is then restricted to that subset of logical processors.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;BLOCKQUOTE&gt;Tim P. wrote:&lt;BR /&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;It seems you want each OpenMP instance to use 8 cores, so if you use KMP_HW_SUBSET you would specify 8 cores with an independent offset for each MPI rank. &amp;nbsp;I don't see that you want a pattern different from what MPI should use in the absence of KMP_HW_SUBSET when you set OMP_NUM_THREADS.&lt;/P&gt;

&lt;P&gt;&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;&lt;/P&gt;

&lt;P&gt;Hello Sir,&lt;/P&gt;

&lt;P&gt;Thanks for the reply. I pinned the processes with I_MPI_PIN_DOMAIN, and the threads were pinned to those processors. In SNC4 mode, with default MPI pinning, some of the ranks use cores from two nodes. &lt;EM&gt;&lt;STRONG&gt;For example,&lt;/STRONG&gt;&lt;/EM&gt; in my case R0 and R1 are pinned to Node0 CPUs, but R2 is pinned to some Node0 and some Node1 CPUs. I want to run only 2 ranks on each node and place them near MCDRAM so that accessing data takes less time.&lt;/P&gt;

&lt;P&gt;I need some help, maybe a graphical representation, on the order in which core IDs or CPU IDs are assigned on the physical KNL layout: are IDs assigned &lt;EM&gt;&lt;STRONG&gt;tile-wise&lt;/STRONG&gt;&lt;/EM&gt; from &lt;EM&gt;&lt;STRONG&gt;top-left towards right / down&lt;/STRONG&gt;&lt;/EM&gt; or from &lt;EM&gt;&lt;STRONG&gt;bottom-left towards right / up? &lt;/STRONG&gt;&lt;/EM&gt;With that I could place each MPI rank on cores that are as close as possible to the other ranks' cores and to MCDRAM.&lt;/P&gt;

&lt;P&gt;Thank You&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jan 2017 07:13:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070903#M58176</guid>
      <dc:creator>Rakesh_M_</dc:creator>
      <dc:date>2017-01-20T07:13:45Z</dc:date>
    </item>
    <item>
      <title>&gt;&gt; and with my program I</title>
      <link>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070904#M58177</link>
      <description>&lt;P&gt;&amp;gt;&amp;gt;&amp;nbsp;&lt;EM&gt;and with my program I observed that with hyper threading it took a bit more time( in my case, almost 4 min.) than with no Hyper threading&lt;/EM&gt;&lt;/P&gt;

&lt;P&gt;That behavior usually occurs with a memory bandwidth limited application. I suggest you investigate the data layout and memory access patterns to reduce memory (non-cached) accesses. Example: changing from Array of Structures to Structure of Arrays, or linked list to array.&lt;/P&gt;
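
&lt;P&gt;To make the Array of Structures versus Structure of Arrays point concrete, here is a small Python sketch that uses ctypes only to inspect C-style memory layout. The type names ParticleAoS and ParticlesSoA and the four double fields are invented for the illustration and are not taken from the program under discussion.&lt;/P&gt;

```python
import ctypes

N = 1024  # number of elements (arbitrary for the illustration)


class ParticleAoS(ctypes.Structure):
    """One record per particle; an array of these is an Array of Structures."""
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double),
                ("z", ctypes.c_double), ("mass", ctypes.c_double)]


class ParticlesSoA(ctypes.Structure):
    """One contiguous array per field: a Structure of Arrays."""
    _fields_ = [("x", ctypes.c_double * N), ("y", ctypes.c_double * N),
                ("z", ctypes.c_double * N), ("mass", ctypes.c_double * N)]


def x_stride_aos():
    """Bytes between consecutive x values in the AoS layout."""
    aos = (ParticleAoS * N)()
    return ctypes.addressof(aos[1]) - ctypes.addressof(aos[0])


def x_stride_soa():
    """Bytes between consecutive x values in the SoA layout."""
    return ParticlesSoA.x.size // N


if __name__ == "__main__":
    print("AoS stride between x values:", x_stride_aos(), "bytes")
    print("SoA stride between x values:", x_stride_soa(), "bytes")
```

&lt;P&gt;In this hypothetical layout, a loop that reads only x finds 2 useful values per 64-byte cache line in the AoS form and 8 in the SoA form, so the SoA form moves less unused data through the memory system.&lt;/P&gt;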

&lt;P&gt;Another cause of this symptom is tuning for 1 thread per core and then running with multiple threads per core, which induces excess cache evictions.&lt;/P&gt;

&lt;P&gt;Jim Dempsey&lt;/P&gt;</description>
      <pubDate>Fri, 20 Jan 2017 13:34:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Software-Archive/Knights-Landing-KNL-Thread-Affinity-and-Managing-Hyper-Threading/m-p/1070904#M58177</guid>
      <dc:creator>jimdempseyatthecove</dc:creator>
      <dc:date>2017-01-20T13:34:06Z</dc:date>
    </item>
  </channel>
</rss>

