Software Archive
Read-only legacy content

OMP thread placement on KNL SNC4

Mike_B_2
Beginner

We're seeing an unexpected (Intel) OMP thread placement in SNC4 mode.

The thread placement is different on each KNL node. While that by itself may be OK, we're seeing entire numa nodes not being used at all, while other numa nodes get double or triple the number of threads.

A nominal aprun launch is 8 OMP threads under a single MPI rank on a KNL node in SNC4 mode. All of the nodes tested result in different core placements based on process id. Using the results of numactl -H (which lists what I'm calling process ids), it appears numa nodes are being skipped. If we add KMP_AFFINITY verbose, we can see the process id to core id mapping, and we observe that the threads are consistently placed in order of sequential core id. However, since the core id to process id mapping is seemingly random, we end up with threads clumped onto some numa nodes, leaving other entire numa nodes empty.
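For reference, the runs look roughly like this (a sketch; the -d depth flag and a.out are placeholders for our actual launch):

export OMP_NUM_THREADS=8
export KMP_AFFINITY=verbose
aprun -n 1 -d 8 a.out     # 1 MPI rank, 8 OMP threads; verbose prints the thread to proc id binding
aprun -n 1 numactl -H     # NUMA topology (proc ids per numa node) as seen on the compute node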

How do we ensure threads are distributed onto all numa regions? We would also like control over the assignment so that threads are sequential within a numa node (close?) but still spread across all numa nodes, say 2 threads per numa node. The default assignment, where some numa nodes are left completely empty, is not what we expect in SNC4 mode.
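The kind of control we're after would look something like this explicit-placement sketch (the proc ids below are placeholders; on a real node we'd pick two from each numa node as reported by numactl -H):

export OMP_NUM_THREADS=8
# explicit + proclist pins threads in the listed order; verbose confirms the result
export KMP_AFFINITY='verbose,granularity=fine,proclist=[0,1,18,19,36,37,52,53],explicit'
aprun -n 1 -d 8 a.out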

It seems that either the OMP runtime is placing the threads incorrectly, or the output of numactl is incorrect with respect to which process ids belong to which numa nodes.

thanks,

Mike

 

SergeyKostrov
Valued Contributor II
Jim, we're off the topic...
Mike_B_2
Beginner

Jim,

Just a follow-up question on the use of ":" to set different options per rank. In SNC4 mode, if we want to set numactl --preferred=1, and we're using 4 MPI ranks per KNL node with one rank per numa node, would something like the following work?

aprun -n 1 -S 1 numactl --preferred=1 a.out : -n 1 -S 1 numactl --preferred=1 a.out : -n 1 -S 1 numactl --preferred=1 a.out : -n 1 -S 1 numactl --preferred=1 a.out

But what if we have 8k KNLs, each running 4 MPI ranks per node, all needing --preferred set... is there a more compact way than 32k ":" clauses?
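One compact approach we're considering is a small wrapper script in place of the colon clauses (a sketch only: the script name is made up, and it assumes ALPS exports ALPS_APP_PE as the global rank and that aprun -N 4 -S 1 places consecutive ranks on the same node, one per quadrant):

#!/bin/bash
# pick_numa.sh (hypothetical): choose a per-rank preferred NUMA node, then exec the real binary
lrank=$(( ${ALPS_APP_PE:-0} % 4 ))   # local rank 0-3 on this node, assuming 4 ranks per node
pref=$(( lrank + 4 ))                # prefer the matching MCDRAM node (4-7 in SNC4 flat)
exec numactl --preferred=$pref "$@"

aprun -n 32768 -N 4 -S 1 ./pick_numa.sh a.out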

 

Mike_B_2
Beginner

 

Same idea, but with a better numa node preference for each rank:

aprun -n 1 -S 1 numactl --preferred=4 a.out : -n 1 -S 1 numactl --preferred=5 a.out : -n 1 -S 1 numactl --preferred=6 a.out : -n 1 -S 1 numactl --preferred=7 a.out

 

Lawrence_M_Intel
Employee

Does Cray aprun allow the ':' syntax? I thought that was MPICH-based MPIs only.

Intel MPI has a memory policy; search for I_MPI_HBW_POLICY: https://software.intel.com/en-us/node/528817

I remember seeing some documentation of a similar environment variable for Cray MPI, but I don't know where their documentation lives. Someone at your center should be able to help.

If you are using MCDRAM only, you can use numactl -m 4,5,6,7. Unfortunately there's no way to express the preferred policy this way.

A while ago I wrote a program that reads the CPU affinity mask and outputs the "closest" numa domains; that can be used in a script to generate the values for numactl -p. I can try to dig that up if you want. Some people thought it was too complicated, so I never tried to push it.
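Not that program, but a rough shell sketch of the same idea (assuming numactl --show is available on the node; its cpubind line lists the NUMA nodes whose CPUs overlap the current affinity mask):

# print those NUMA nodes, comma-separated, e.g. for use with numactl -m
numactl --show | awk '/^cpubind:/ { out=""; for (i=2; i<=NF; i++) out = out (i>2 ? "," : "") $i; print out }'
# mapping to the nearest MCDRAM node for -p would additionally need the distance table from numactl -H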

-- Larry

Gregg_S_Intel
Employee

John,

Here is how to set MPI+OpenMP affinity for flat SNC4 mode.  The new I_MPI_PIN_DOMAIN=numa feature was added with this situation in mind.  The MPI domains exactly match the NUMA domains.

mpirun -n 4 -env I_MPI_PIN_DOMAIN numa a.out

 

[0] MPI startup(): Rank    Pid      Node name  Pin cpu

[0] MPI startup(): 0       28501    knl     {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,204,205,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221}     

[0] MPI startup(): 1       28504    knl {18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239}                                                                             

[0] MPI startup(): 2       28507    knl {36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255}             

[0] MPI startup(): 3       28510    knl {52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,188,189,190,191,192,193,194,195,196,197,198,199,200,201,202,203,256,257,258,259,260,261,262,263,264,265,266,267,268,269,270,271}

 

NUMA node0 CPU(s):     0-17,68-85,136-153,204-221

NUMA node1 CPU(s):     18-35,86-103,154-171,222-239

NUMA node2 CPU(s):     36-51,104-119,172-187,240-255

NUMA node3 CPU(s):     52-67,120-135,188-203,256-271
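To place the OpenMP threads inside each of those domains as well, something along these lines should work (a sketch; 2 threads per rank and KMP_AFFINITY=compact are example choices, not the only ones):

mpirun -n 4 -genv I_MPI_PIN_DOMAIN numa -genv OMP_NUM_THREADS 2 -genv KMP_AFFINITY compact,verbose a.out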

 

Gregg Skinner

Gregg_S_Intel
Employee

Mike,

For Intel MPI, you can replace all the colon ugliness with a single environment variable: I_MPI_HBW_POLICY=hbw_preferred.
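For example (a sketch combining it with the domain pinning above; adjust to your launch):

export I_MPI_HBW_POLICY=hbw_preferred
mpirun -n 4 -genv I_MPI_PIN_DOMAIN numa a.out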

Gregg

 

Gregg_S_Intel
Employee

Mike B. wrote:

The attached spreadsheet shows the results from running the above command on 3 different SNC4/flat nodes.

Columns are node id, process id, core id (as mapped to that process id by KMP_AFFINITY verbose), hyper thread id (identified by verbose, always 0 in these cases), and numa node (as identified by numactl -H for that process id). Highlighted rows are the threads used by these runs; the rest is just the complete verbose output with the numa node added for each process id. This view is sorted by core id, which shows why the threads were assigned in this order. Originally each node id set was sorted by process id, which showed the assignment by process id, but re-sorting it as it is now helped clarify what was happening with the thread to core id assignment.

Mike, do these systems have a recent BIOS and all patches from XPPSL? What you're seeing is an issue identified internally last year. It was resolved with some help from the BIOS team.

Gregg Skinner

 
