Knights Landing (KNL) Thread Affinity and Managing Hyper-Threading

Rakesh_M_ · ‎01-16-2017

Hi,

I need to run my program under different KNL configurations. To do that, I enabled Hyper-Threading and wanted to test different numbers of threads per core by setting the environment variables KMP_HW_SUBSET and KMP_AFFINITY (ref: https://software.intel.com/en-us/node/680054 ). But when I run the program, it prints this warning:

"OMP: Warning #245: KMP_HW_SUBSET ignored: non-uniform topology."

What is the reason? The reference also states clearly, in a NOTE, that

"On Intel® Xeon® Phi™ coprocessors, the default affinity type is scatter, so KMP_HW_SUBSET works by default on this platform."

but the warning I got says otherwise.

How can I resolve this? How else can I manage 1/2/3/4 threads per core on KNL? Will the output of the "lscpu" command and of KMP_AFFINITY=verbose differ for each threads-per-core combination?

jimdempseyatthecove · ‎01-17-2017

The easiest way, assuming a 64-core system, is to use KMP_AFFINITY=scatter (or the OMP_... equivalent), then select 64, 128, 192 or 256 threads.

Jim Dempsey

Rakesh_M_ · ‎01-18-2017

jimdempseyatthecove wrote:

The easiest way, assuming a 64-core system, is to use KMP_AFFINITY=scatter (or the OMP_... equivalent), then select 64, 128, 192 or 256 threads.

Jim Dempsey

Hi Sir,

Thanks for the reply.

The Xeon Phi we are using is a 68-core machine, which allows 272 threads with Hyper-Threading enabled.

The memory mode is Cache mode, with SNC4 clustering and Hyper-Threading enabled.

To check the maximum thread count and the number of processors, I created a parallel region and queried the values under the default settings using

omp_get_num_procs() and omp_get_max_threads()

Here is the command I used:
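As a cross-check that does not depend on the OpenMP runtime, the affinity mask of a process can be read directly from the operating system. The following is a minimal Python sketch, assuming a Linux system; the helper name visible_cpus is hypothetical, and the fallback branch only reports the machine-wide CPU count when no affinity information is available.

```python
import os


def visible_cpus():
    """Return the sorted logical CPU ids this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set reflects any pinning applied to this process.
        return sorted(os.sched_getaffinity(0))
    # Other platforms: no affinity query, fall back to the total count.
    return list(range(os.cpu_count() or 1))


if __name__ == "__main__":
    cpus = visible_cpus()
    print(len(cpus), "logical CPUs visible:", cpus)
```

If the count printed for a pinned process is larger than expected, the pinning is probably not being applied where it was intended.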

$ KMP_AFFINITY=scatter KMP_HW_SUBSET=68c,2t mpirun -n 8 -env I_MPI_DEBUG=5 ./ex4

This should map 8 MPI ranks onto 64 cores x 2 threads/core = 128 threads, i.e. each rank would get 16 threads. But the OpenMP runtime calls mentioned above return 34 threads per MPI rank, which is what a rank gets when all 272 threads are counted, i.e. as if KMP_HW_SUBSET=68c,4t had been set. The result is the same with KMP_HW_SUBSET=68c,1t and KMP_HW_SUBSET=68c,3t.

In addition, given the "ignored...." warning in the verbose output, the KMP environment variables used above are not having the intended effect.

Why is it assuming 272 threads? Is there any other way to do this?
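The arithmetic behind these numbers can be checked with a short Python sketch. The function name omp_threads_per_rank is hypothetical, and the sketch assumes the logical CPUs are divided evenly among the ranks.

```python
def omp_threads_per_rank(cores, threads_per_core, ranks):
    """Logical CPUs available to each rank when they are split evenly."""
    logical_cpus = cores * threads_per_core
    return logical_cpus // ranks


print(omp_threads_per_rank(64, 2, 8))  # 16: the estimate in the post
print(omp_threads_per_rank(68, 2, 8))  # 17: what 68c,2t would give
print(omp_threads_per_rank(68, 4, 8))  # 34: the value actually observed
```

The observed 34 matches the case where all four hardware threads of all 68 cores are visible, which is consistent with KMP_HW_SUBSET having been ignored.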

jimdempseyatthecove · ‎01-18-2017

The KMP_... variables are OpenMP affinity settings. I_MPI_PIN_DOMAIN can be used for process (rank) pinning. See this for information relating to MPI process affinity pinning. The general technique is to specify how and where each process (MPI rank) is to be placed (in other words, how each rank's affinity is mapped). Each rank's OpenMP thread pool is then restricted to that subset of logical processors.
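Within a rank's set of cores, one way to control the number of threads per core without KMP_HW_SUBSET is an explicit KMP_AFFINITY processor list. The Python sketch below builds such a string. It rests on an assumption that must be verified on the actual machine: that hardware thread k of core c has logical CPU id c + k * 68. The function name explicit_affinity is hypothetical.

```python
def explicit_affinity(cores, threads_per_core, total_cores=68):
    """Build a KMP_AFFINITY value using the first N hardware threads per core.

    ASSUMPTION: hardware thread k of core c is logical CPU c + k * total_cores.
    """
    if not 1 <= threads_per_core <= 4:
        raise ValueError("KNL cores have 1 to 4 hardware threads")
    procs = [
        core + k * total_cores
        for core in cores
        for k in range(threads_per_core)
    ]
    proclist = ",".join(str(p) for p in procs)
    return "granularity=fine,proclist=[" + proclist + "],explicit"


# Two threads on each of cores 0 and 1.
print(explicit_affinity([0, 1], 2))
```

OMP_NUM_THREADS would then be set to the number of entries in the list, so that each thread gets its own logical processor.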

Jim Dempsey

TimP · ‎01-18-2017

It seems you wish each instance of OpenMP to use 8 cores, so if you use KMP_HW_SUBSET you would specify 8 cores with an independent offset for each MPI rank. I don't see that you want a different pattern from what MPI should use in the absence of KMP_HW_SUBSET when you set OMP_NUM_THREADS.

McCalpinJohn · ‎01-18-2017

In my experience the KMP_HW_SUBSET works fine if you are not in SNC4 mode. SNC4 mode is a bit unusual on the 68-core parts, since two of the "nodes" have 18 cores and the other two "nodes" have 16 cores -- that must be the "non-uniform topology" that the OpenMP runtime warned about....

Fortunately you don't need KMP_HW_SUBSET in this case. Intel MPI sets up reasonable binding domains for each MPI rank by default in most cases. SNC4 mode does complicate this by providing non-uniform "nodes", so it looks like you will need to set up explicit processor binding lists using the MPI (not the OpenMP) environment variables. This is described in Section 3.2 of the Intel MPI Developer Reference Manuals. (The section number is the same in the Intel MPI 5.1 and Intel MPI 2017 manuals.)

I have not done this myself, but I think you will want to launch a script that looks at the MPI rank number and sets up a different explicit processor list using the I_MPI_PIN_PROCESSOR_LIST environment variable. For nodes 0 and 1 this will include 16 of the 18 cores, while for nodes 2 and 3 this will include all 16 of the 16 cores.
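The per-rank assignment described above can be sketched in Python. The core numbering in SNC4_NODE_CORES is an assumption for illustration only and should be replaced with what lscpu or numactl reports; all function names are hypothetical. The sketch emits each rank's cores both as a comma-separated list and as a hexadecimal mask. Which Intel MPI variable accepts which format should be checked against the Developer Reference.

```python
# ASSUMED SNC4 layout for a 68-core part: two nodes with 18 cores and two
# with 16. The core ids are illustrative; check them on the real machine.
SNC4_NODE_CORES = {
    0: list(range(0, 18)),
    1: list(range(18, 36)),
    2: list(range(36, 52)),
    3: list(range(52, 68)),
}


def cores_for_ranks(node_cores, ranks_per_node=2, cores_per_rank=8):
    """Give each rank a block of cores that stays inside one node."""
    needed = ranks_per_node * cores_per_rank
    per_rank = []
    for node in sorted(node_cores):
        cores = node_cores[node]
        if len(cores) < needed:
            raise ValueError("node %d has too few cores" % node)
        for r in range(ranks_per_node):
            start = r * cores_per_rank
            per_rank.append(cores[start:start + cores_per_rank])
    return per_rank


def as_list(cpus):
    """Comma-separated processor list."""
    return ",".join(str(c) for c in cpus)


def as_hex_mask(cpus):
    """Hexadecimal bit mask with one bit set per logical CPU id."""
    mask = 0
    for c in cpus:
        mask |= 1 << c
    return hex(mask)


if __name__ == "__main__":
    for rank, cpus in enumerate(cores_for_ranks(SNC4_NODE_CORES)):
        print(rank, as_list(cpus), as_hex_mask(cpus))
```

With two ranks of 8 cores per node, the two 18-core nodes each leave 2 cores idle, which is the "16 of the 18 cores" arrangement described in the post.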

In my testing I have not seen enough performance benefit from SNC4 mode to justify the irritation of figuring out how to use it (especially since I prefer "Flat" mode and have to deal with per-rank numactl commands as well), but your test sequence is clearly the right approach for you to decide whether this is also true for your workload(s).

Rakesh_M_ · ‎01-19-2017

Mccalpin, John wrote:

In my experience the KMP_HW_SUBSET works fine if you are not in SNC4 mode. SNC4 mode is a bit unusual on the 68-core parts, since two of the "nodes" have 18 cores and the other two "nodes" have 16 cores -- that must be the "non-uniform topology" that the OpenMP runtime warned about....

Fortunately you don't need KMP_HW_SUBSET in this case. Intel MPI sets up reasonable binding domains for each MPI rank by default in most cases. SNC4 mode does complicate this by providing non-uniform "nodes", so it looks like you will need to set up explicit processor binding lists using the MPI (not the OpenMP) environment variables. This is described in Section 3.2 of the Intel MPI Developer Reference Manuals. (The section number is the same in the Intel MPI 5.1 and Intel MPI 2017 manuals.)

I have not done this myself, but I think you will want to launch a script that looks at the MPI rank number and sets up a different explicit processor list using the I_MPI_PIN_PROCESSOR_LIST environment variable. For nodes 0 and 1 this will include 16 of the 18 cores, while for nodes 2 and 3 this will include all 16 of the 16 cores.

In my testing I have not seen enough performance benefit from SNC4 mode to justify the irritation of figuring out how to use it (especially since I prefer "Flat" mode and have to deal with per-rank numactl commands as well), but your test sequence is clearly the right approach for you to decide whether this is also true for your workload(s).

Hello sir,

Thanks for the explanation; it was helpful. I went with explicit processor pinning for 2 threads/core and 1 thread/core, using 8 OpenMP threads. As there are still more configurations to test, maybe I can still try KMP_HW_SUBSET and see its effect.

Running 1 thread/core with Hyper-Threading enabled should be equivalent to running the same program with Hyper-Threading disabled. With my program, however, I observed that the run with Hyper-Threading took a bit more time (in my case, almost 4 min.) than the run without it (we saw this behavior earlier on KNC as well and continued without Hyper-Threading), and with more threads per core it takes even longer.

What kinds of programs benefit from Hyper-Threading? In your tests, did you find any application that benefits from it?

A question regarding Ivy Bridge vs. KNL:

Take a cluster of 3 Ivy Bridge nodes with identical configuration (16 cores, 2.6GHz each) and a KNL machine (72 cores, 1.4GHz), with a parallel application (MPI + OpenMP) running on both with the same number of ranks and the same number of OpenMP threads.

How will the speed and execution time of the program be affected by the 2.4GHz and 1.4GHz clock speeds?

Thank you

Rakesh_M_ · ‎01-19-2017

jimdempseyatthecove wrote:

The KMP_... are OpenMP affinity settings. I_MPI_PIN_DOMAIN can be used for process(rank) pinning. See this for information relating to the MPI process affinity pinning. The general technique is to specify how/where each process (MPI rank) is to be placed (IOW how each process(rank) affinity is mapped. Then each process(rank) OpenMP thread pool is restricted to that subset of logical processors.

Jim Dempsey

Tim P. wrote:

It seems you wish each instance of openmp to use 8 cores, so if you use hw_subset you would specify 8 cores with an independent offset for each mpi rank. I don't see that you want a different pattern from what mpi should use in the absence of hw_subset when you set omp_num_threads.

Hello Sir,

Thanks for the reply. I pinned the processes with I_MPI_PIN_DOMAIN, and the threads were pinned to those processors. In SNC4 mode with default MPI pinning, some of the ranks use cores from two nodes. For example, in my case R0 and R1 are pinned to Node0 CPUs, but R2 is pinned to some Node0 CPUs and some Node1 CPUs. I want to run only 2 ranks on each node and place them near MCDRAM so that data access takes less time.

I need some help, perhaps a graphical representation, with how core IDs or CPU IDs are assigned on the physical KNL layout. For instance, are IDs assigned tile-wise from the top left going right and down, or from the bottom left going right and up? With that I could place MPI ranks on cores that are as close as possible to the other ranks' cores and to MCDRAM.
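Which logical CPU ids belong to which NUMA node can at least be read from the "NUMA nodeN CPU(s):" lines that lscpu prints, although this does not reveal where a tile sits on the die. Below is a minimal Python sketch of such a parser; the function names are hypothetical and SAMPLE is made-up text in lscpu's format, not output from a KNL.

```python
import re


def expand_cpu_list(spec):
    """Expand a list such as '0-3,8-11' into individual CPU ids."""
    cpus = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def parse_numa_nodes(lscpu_text):
    """Map NUMA node number to the logical CPU ids listed for it."""
    nodes = {}
    for line in lscpu_text.splitlines():
        match = re.match(r"\s*NUMA node(\d+) CPU\(s\):\s*(\S*)", line)
        if match:
            nodes[int(match.group(1))] = expand_cpu_list(match.group(2))
    return nodes


# Made-up sample in lscpu's format, for illustration only.
SAMPLE = """\
NUMA node(s):          2
NUMA node0 CPU(s):     0-3,8-11
NUMA node1 CPU(s):     4-7,12-15
"""

if __name__ == "__main__":
    for node, cpus in sorted(parse_numa_nodes(SAMPLE).items()):
        print("node", node, "->", cpus)
```

On a real system, the text would come from running lscpu, and the resulting map could be used to keep each rank's cores inside a single node.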

Thank You

jimdempseyatthecove · ‎01-20-2017

>> and with my program I observed that with hyper threading it took a bit more time( in my case, almost 4 min.) than with no Hyper threading

That behavior usually occurs with a memory bandwidth limited application. I suggest you investigate the data layout and memory access patterns to reduce memory (non-cached) accesses. Example: changing from Array of Structures to Structure of Arrays, or linked list to array.
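The Array of Structures to Structure of Arrays change can be illustrated with a short Python sketch. The record fields (x, y, mass) and the function names are hypothetical, and Python only shows the layout: the bandwidth and vectorization benefit applies to compiled code, where a loop over one field then reads contiguous memory.

```python
from array import array


def aos_to_soa(particles):
    """Convert a list of (x, y, mass) records into one array per field."""
    return {
        "x": array("d", (p[0] for p in particles)),
        "y": array("d", (p[1] for p in particles)),
        "mass": array("d", (p[2] for p in particles)),
    }


def total_mass(soa):
    """Touch only the 'mass' field, which is now stored contiguously."""
    return sum(soa["mass"])


if __name__ == "__main__":
    aos = [(0.0, 1.0, 2.0), (3.0, 4.0, 5.0)]  # array of structures
    soa = aos_to_soa(aos)                      # structure of arrays
    print(list(soa["x"]), total_mass(soa))
```

In the AoS form, summing the masses would pull x and y through the cache as well; in the SoA form, only the mass array is streamed.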

Another cause of this symptom is tuning for 1 thread per core and then running with multiple threads per core (which induces excess cache evictions).

Jim Dempsey