Software Archive

OMP thread placement on KNL SNC4

Mike_B_2
Beginner
6,907 Views

We're seeing an unexpected (Intel) OMP thread placement in SNC4 mode.

The thread placement is different on each KNL node.  While that by itself may be OK, we're seeing entire numa nodes not being used and some numa nodes getting double or triple the number of threads.

A nominal aprun launch is 8 OMP threads under a single MPI rank on a KNL node, in SNC4 mode.  All of the nodes tested result in different core placements based on process id.  Using the results of numactl -H (which lists what I'm calling process ids), it appears numa nodes are being skipped.  If we add KMP_AFFINITY verbose, we can see the process id to core id mapping, and we consistently observe the threads being placed in order of sequential core id.  However, since the core id to process id mapping is seemingly random, we end up with threads clumped onto some numa nodes, leaving other entire numa nodes empty.

How do we ensure threads are distributed onto all numa regions?  We'd also like control over keeping the thread assignment sequential within a numa node (close?), while still using all numa nodes, say 2 threads per numa node.  The default thread assignment, where some numa nodes are left empty, is not what we expect in SNC4 mode.

It seems either the OMP runtime is placing the threads incorrectly, or the output of numactl is incorrect with respect to which process ids are in which numa nodes.

thanks,

Mike

 

0 Kudos
48 Replies
SergeyKostrov
Valued Contributor II
1,868 Views
>> Now, is this layout (distributed threads per numa region) possible without the proclist, in order to make it easier for users?
>> So they don't have to check the numa node proc list, and have it as an env setting.

I think yes. For example, a USER_PROC_LIST environment variable could be initialized at start up ( of course, for every user it will be different ):

export USER_PROC_LIST=0,1,16,17,48,49,32,33

and then it needs to be used like:

mpirun -host [hostname] -np 1 -env OMP_NUM_THREADS=8 -env KMP_AFFINITY=granularity=fine,proclist=[$USER_PROC_LIST],explicit ./[apptorun]

One more thing: the USER_PROC_LIST environment variable could also be initialized from the output of the 'numactl --hardware' utility.
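As an illustration only (a sketch; the exact "node N cpus:" output format of numactl --hardware and the choice of 2 CPUs per node are assumptions on my part), the variable could be derived automatically like this:

# Hypothetical helper: build a proc list from the first 2 CPUs of every
# NUMA node that actually owns CPUs, using the "node N cpus: ..." lines
# printed by 'numactl --hardware'.
USER_PROC_LIST=$(numactl --hardware | awk '
    /^node [0-9]+ cpus:/ && NF >= 5 {      # skip NUMA nodes with no CPUs
        printf "%s%s,%s", sep, $4, $5      # first two CPU ids of this node
        sep = ","
    }
    END { print "" }')
export USER_PROC_LIST

mpirun -host [hostname] -np 1 -env OMP_NUM_THREADS=8 -env KMP_AFFINITY=granularity=fine,proclist=[$USER_PROC_LIST],explicit ./[apptorun]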
0 Kudos
Lawrence_M_Intel
Employee
1,868 Views

OK, Mike, sorry, I didn't read all the comments. I see that you tried the document. The document does assume that ranks don't span SNC domains. This isn't something you'd do on a multi-socket system, and I don't see why you want to do it on a KNL in SNC mode.

If you really want one rank to span SNC4 domains, then right now proclist is the only option. I still think this is a bad idea; you should run at least one rank per SNC node, and then everything should be fine.

There are plans to add NUMA awareness to the OpenMP RTL for things like this, but I don't know the schedule.

0 Kudos
Mike_B_2
Beginner
1,868 Views

Larry,

If I run with one rank per numa node, then things are well behaved by default.  Yay!  The associated OMP threads are placed on the respective numa nodes, as expected.

Sergey, while your env method works, it really just moves where the list is created.  We might have to wait for a new OpenMP run time library, but it would be good if thread placement could be controlled per numa node with KMP or OMP settings, in order to get a "good" distribution regardless of the number of threads requested.  In other words, the affinity placement would use some aprun/mpirun args and env vars to determine the thread affinity.  But based on your previous suggestion, at least we have an explicit way to get what we're after.

 

Mike

 

0 Kudos
SergeyKostrov
Valued Contributor II
1,868 Views
>> We might have to wait for a new OpenMP run time library, but it would be good if thread placement could be controlled per numa node...

In the case of direct threads-to-processing-units bindings, call omp_get_thread_num, then get a raw thread ID, and then any thread affinity management can be done.
0 Kudos
Andrey_C_Intel1
Employee
1,868 Views

Mike,

Andrey, do you still want the output from the verbose and KMP_SETTINGS=1?

No, thank you; I realized the same info is contained in the .xls file you attached.

Regarding future libraries: as a first step we implemented KMP_HW_SUBSET to be aware of NUMA. So it will be possible to flexibly pick resources for a particular run, and the work should be well balanced once you use ALL chosen resources.  But there can still be problems if you use only part of the chosen resources, because the library will still sort cores linearly. For example, given the core numbering you provided for one of the systems, if we choose 2 cores per node (8 cores in total) and try to run only 4 threads, then the library can place the threads per node as 2-1-1-0, creating an imbalance.
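As an illustration only (the exact syntax is an assumption on my part; check the documentation of the release that actually ships the NUMA-aware KMP_HW_SUBSET), picking 2 cores on each of the 4 NUMA nodes and running one thread per core might look like:

# Hypothetical sketch, assuming KMP_HW_SUBSET accepts a numa-node level ("n")
# in addition to cores ("c") and threads ("t"):
export KMP_HW_SUBSET=4n,2c,1t     # 4 numa nodes x 2 cores x 1 thread = 8 places
export OMP_NUM_THREADS=8          # use ALL chosen resources to stay balanced
aprun -n 1 ./appToRun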

We will try to add NUMA awareness to the KMP_AFFINITY and OMP_PLACES functionality later; that should help more usage scenarios. But unfortunately this will not be available in the nearest release of the compiler.

The best suggestion for now is to use MPI ranks that limit the underlying OpenMP library to a single NUMA node in each rank, as you and Larry already mentioned.

Regards,
Andrey
 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,868 Views

Andrey,

Can you pass on to the OpenMP developers the need for better integration with MPI? By this I mean: if you have an SMP (multi-socket) system or a KNL in SNC2 or SNC4, an MPI user would wish (need) to distribute MPI ranks by nodes (and node sets), while needing OpenMP to thread within the subset of threads specified by the MPI spawner (mpiexec, mpirun, aprun, ...). There is, or used to be, an insufficiently documented "respect" keyword for KMP_AFFINITY. The following is not specifically what Mike asked for:

export KMP_AFFINITY=respect,granularity=core
export OMP_NUM_THREADS=8
mpiexec.hydra -n 4 ./appToRun

Each of the 4 ranks is restricted to one node (of SNC4), with each rank running 8 OpenMP threads on different cores, restricted to the subset of threads of the process (rank) specified by the MPI thread spawner (mpiexec.hydra in this case).

Jim Dempsey

0 Kudos
James_C_Intel2
Employee
1,868 Views

Jim

Can you pass on to the OpenMP developers the need for better integration with MPI? By this I mean: if you have an SMP (multi-socket) system or a KNL in SNC2 or SNC4, an MPI user would wish (need) to distribute MPI ranks by nodes (and node sets), while needing OpenMP to thread within the subset of threads specified by the MPI spawner (mpiexec, mpirun, aprun, ...). There is, or used to be, an insufficiently documented "respect" keyword for KMP_AFFINITY.

The Intel (and LLVM) OpenMP runtime always respects the incoming affinity mask unless you explicitly tell it not to (by using the "norespect" keyword in KMP_AFFINITY). So there is no need for any change. The OpenMP runtime tries to be well behaved, and it believes what it's told (via the incoming affinity mask).

However, that does leave us at the mercy of the outer scheduler. If it sets some weird affinity mask, the OpenMP runtime will respect that and you'll get perverse results. (Note, also, that some systems intercept clone calls and bind threads themselves, which can also make things "interesting", and there's nothing we can do about that.)
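As a sketch of what "respecting the incoming mask" means in practice (the core id range 0-15 is just an assumed range for one SNC4 domain on a particular system, and appToRun is a placeholder):

# Hand the process a restricted mask, e.g. via taskset; the OpenMP runtime
# inherits it and places its threads only inside that mask. The "verbose"
# keyword prints the mask the runtime actually saw.
export OMP_NUM_THREADS=8
export KMP_AFFINITY=verbose,granularity=fine,compact
taskset -c 0-15 ./appToRun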

 

0 Kudos
Lawrence_M_Intel
Employee
1,868 Views

Expanding on Jim Cownie's comment, the whole point of the document attached above is to give you the right incantations to the job launcher/MPI implementation so that each rank gets the desired number of cores (I'm ignoring crazy people who want to run >1 rank per core, though my document does address that), with an affinity mask that includes all the HW threads in those cores. Then you can use the existing OpenMP controls to further divide or restrict that mask to provide, e.g., one thread per core, or some nested OpenMP affinity that suits your application.
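For example (a sketch, not taken from the attached document; the rank/core counts are made up, and the standard OMP_PLACES/OMP_PROC_BIND variables do the subdivision):

# Assume the launcher already gave each rank a mask covering its 16 cores
# (all HW threads included). The OpenMP controls then subdivide that mask
# to one thread per core:
export OMP_NUM_THREADS=16
export OMP_PLACES=cores        # one place per core inside the inherited mask
export OMP_PROC_BIND=spread    # spread the threads across those places
mpirun -np 4 ./appToRun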

I'm working on a JavaScript GUI to let you enter a few parameters like # cores/rank and # threads/core and generate the appropriate launch lines and environment variables. If I get ambitious I'll try to draw some pretty pictures à la lstopo.

 

 

0 Kudos
jimdempseyatthecove
Honored Contributor III
1,868 Views

>>I'm working on a javascript GUI to let you enter a few parameters like # cores/rank and #thread/core and generate the appropriate launch lines and environment variables. If I get ambitious I'll try to draw some pretty pictures ala lstopo.

Great. Let us know when it is available (hopefully it gets included in future distributions).

Jim Dempsey

0 Kudos
Mike_B_2
Beginner
1,868 Views

If I understand Jim Dempsey's comments correctly, I'll echo Jim Cownie's comments.  By default (no explicit KMP settings), using one MPI rank per numa node, we're getting well-behaved and expected behaviour, where each of the OMP threads in that rank is placed on that rank's numa node.  It was just in the case of a single rank (maybe crazy) per node that the threads weren't well placed.  proclist is the current hammer to remedy that.

Mike

 

0 Kudos
McCalpinJohn
Honored Contributor III
1,868 Views

At TACC, we have found that Intel MPI almost always does "the right thing" for binding Hybrid MPI/OpenMP jobs with default settings.

(There was a short period last summer when Intel MPI bound all threads to a single core when running a single MPI task per node, but that was fixed quickly.)

We don't do much work with SNC4 mode.  The performance benefit is modest, and it is a bit of a pain to properly handle the numactl memory bindings for "Flat" mode in this case.   I am certainly not surprised that splitting the 68 cores of the Xeon Phi 7250 processors into two 18-core nodes plus two 16-core nodes leads to some unexpected behaviors.
 

0 Kudos
SergeyKostrov
Valued Contributor II
1,868 Views
>>...We might have to wait for a new OpenMP run time library, but it would be good if thread placement could be controlled per numa node with KMP or OMP settings...

Things are very slow at the OpenMP "task force", and you know that it took more than 4 versions for them to realize that affinity control is needed ( very! ) on multicore systems. My point of view is that OpenMP versions 1, 2 and 3 were designed to easily parallelize processing; these versions weren't designed to achieve the highest possible performance.

Unfortunately, many years ago the OpenMP "task force" didn't take into account that process and thread affinity control already existed in OSs. UNIX and Windows NT have had that functionality from the very beginning for multiprocessor systems, and later it was adapted to multicore systems with shared cache lines. Intel realized the problem quickly, and that is why KMP_AFFINITY was introduced many years ago. But it broke compatibility and is available only with Intel's OpenMP runtime libraries.

In version 4.x the OpenMP "task force" decided to introduce two new OpenMP environment variables, OMP_PROC_BIND and OMP_PLACES, instead of adapting and extending Intel's KMP_AFFINITY concept into a new and powerful OMP_AFFINITY environment variable which would handle all possible cases ( like yours for NUMA, etc ). So compatibility with KMP_AFFINITY is broken and another mess is created!

I personally was challenged by all these OpenMP thread problems some time in 2013. In September 2014 I completed a small R&D effort and found a way of getting native thread IDs ( handles ) of an OS from an OpenMP thread number ( returned from the omp_get_thread_num function / see my Post # 25 ). It is a compact and very portable solution ( used in C/C++ codes since March 2015 ) that doesn't use the KMP_AFFINITY, OMP_PROC_BIND and OMP_PLACES environment variables. I also know that Jim followed almost the same path and has his own affinity control in the QuickThreads library, but I don't know if Jim's solution is for Pthreads or for OpenMP threads.
0 Kudos
SergeyKostrov
Valued Contributor II
1,868 Views
>>...a bit of a pain to properly handle the numactl memory bindings for "Flat" mode in this case...

John, it is not clear what you're talking about. Did you mean MCDRAM, or something else?
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,868 Views

QuickThread used (uses) pthread on Linux, and Windows native threads on Windows. An application at run time can construct thread teams by use of a token specifying proximity placement. Virtually any type of placement can be used:

All threads
All threads within same NUMA node of thread of invocation (as well as within various NUMA distances)
All threads within LLC (typically L3 or same socket) of thread of invocation
All threads within L2
All threads within L1 (IOW all sibling threads of core)
To self (for processing later)
One thread for each NUMA node (as well as within various NUMA distances)
One thread per LLC
One thread per L2
One thread per L1 (one per core)
One thread per L2 within LLC (IOW one each per L2 within socket)
One thread per L1 within LLC
One thread per L1 within L2
... (additional specialized grouping)

Jim Dempsey

0 Kudos
McCalpinJohn
Honored Contributor III
1,868 Views

Sergey asked:

John, It is not clear what you're talking about. Did you mean MCDRAM, or something else?

The issue that I was referring to was the use of MCDRAM as memory (not as cache), which is usually referred to as "Flat" mode (rather than as "cache" mode) on Xeon Phi x200.

When "Flat" mode is combined with SNC4 mode, each KNL processor reports that there are 8 NUMA "nodes".   Nodes 0,1,2,3 have cores and DDR4 memory, while nodes 4,5,6,7 have no cores but do each have 1/4 of the MCDRAM memory.  

  • The MCDRAM memory reported as NUMA node 4 is the part that is physically adjacent to the cores in NUMA node 0.
  • The MCDRAM memory reported as NUMA node 5 is the part that is physically adjacent to the cores in NUMA node 1.
  • etc.

From a locality perspective, SNC4 mode is ideal for an MPI application configured to use 4 MPI tasks per KNL.  The inconvenient part is setting up the bindings to force each MPI task to use "local" MCDRAM memory.  The overall procedure typically requires that the MPI launcher start a script (rather than directly launching the MPI task), and that the script query a number of environment variables, compute some new variables, and then launch the local copy of the MPI executable.  The steps are something like:

  • MPI launcher launches script on the target nodes, with 4 copies of the script assigned to each node.
    • Each script gets its global MPI rank number from an environment variable.
    • Each script computes MPI-rank modulo 4 to get its "local" MPI rank number.
      • This will be used directly to determine the NUMA node where the processes for the corresponding MPI task will be run.
    • Each script adds 4 to the "local" MPI rank number.
      • This will be used as the NUMA node number for the memory binding.
    • Each script uses "numactl --membind=<local MPI rank number + 4> --cpunodebind=<local MPI rank number>" to launch the MPI executable.

This is not too hard, but it is a pain to ensure that the MPI process/thread bindings are actually consistent with the assumptions above, and it is a big pain to generalize the logic so that the scripts will work correctly in the "normal" mode, the "SNC2" mode, and the "SNC4" mode.   All of this is made even more complex by the need (in some environments) to deal with the different environment variables and different default thread binding schemes used by different MPI stacks.
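A minimal sketch of such a wrapper, assuming SNC4 + Flat mode with 4 ranks per node and assuming the MPI stack exports the global rank in PMI_RANK (the variable name differs between MPI stacks, which is exactly the generalization pain mentioned above):

#!/bin/bash
# run_one_rank.sh - hypothetical per-rank wrapper, launched as:
#   mpirun -np <4 * nodes> ./run_one_rank.sh ./appToRun
GLOBAL_RANK=${PMI_RANK:?global rank variable not set (name varies by MPI stack)}
LOCAL_RANK=$(( GLOBAL_RANK % 4 ))   # which SNC4 quadrant on this node
CPU_NODE=$LOCAL_RANK                # NUMA nodes 0-3 hold the cores + DDR4
MEM_NODE=$(( LOCAL_RANK + 4 ))      # NUMA nodes 4-7 hold the local MCDRAM
exec numactl --cpunodebind=$CPU_NODE --membind=$MEM_NODE "$@"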

 

 

0 Kudos
SergeyKostrov
Valued Contributor II
1,868 Views
>>When "Flat" mode is combined with SNC4 mode, each KNL processor reports that there are 8 NUMA "nodes". Nodes 0,1,2,3 have >>cores and DDR4 memory, while nodes 4,5,6,7 have no cores but do each have 1/4 of the MCDRAM memory. Thanks, John for a detailed explanation. During last a couple of weeks I've been doing some tests to understand how KNL modes ( Cluster and MCDRAM ) work. As of today I have some performance data for these configurations: MCDRAM = Hybrid 50-50 - Cluster = All2All - done MCDRAM = Hybrid 50-50 - Cluster = SNC-2 - done MCDRAM = Hybrid 50-50 - Cluster = SNC-4 - done MCDRAM = Hybrid 50-50 - Cluster = Hemisphere - done MCDRAM = Hybrid 50-50 - Cluster = Quadrant - done A set of another five configurations will be done during next two weeks: MCDRAM = Flat - Cluster = All2All - not done yet MCDRAM = Flat - Cluster = SNC-2 - not done yet MCDRAM = Flat - Cluster = SNC-4 - not done yet MCDRAM = Flat - Cluster = Hemisphere - not done yet MCDRAM = Flat - Cluster = Quadrant - not done yet I'm not sure if it makes sense to do tests for MCDRAM "Cache" mode. What do you think? I've detected already a problem and I will describe it in a new thread as soon as all tests are completed.
0 Kudos
McCalpinJohn
Honored Contributor III
1,868 Views

The majority of the KNL nodes at TACC are configured in "cache" mode with "quadrant" interleaving.   With the addition of an Intel kernel patch to sort the page tables, this works reasonably well and avoids the need for any numactl memory binding.

0 Kudos
SergeyKostrov
Valued Contributor II
1,868 Views
>>...QuickThread used (uses) pthread on Linux, and Windows native threads on Windows...

This is what I wanted to confirm. Thanks, Jim. I've been talking about setting OpenMP thread affinities by using a "trick": OpenMP thread num -> Native thread handle -> Set a thread Affinity using an OS kernel API. The "trick" doesn't use KMP_... or OMP_... environment variables.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,868 Views

>>I've been talking about setting OpenMP thread affinities by using a "trick": OpenMP thread num -> Native thread handle -> Set a thread Affinity using an OS kernel API. The "trick" doesn't use KMP_... or OMP_... environment variables.

You could just as well use OpenMP (base level) thread num -> OS kernel logical processor(s)/HW threads. This works well for most situations. You can do your own compact or scatter or other variations, as well as bind a thread to multiple OS kernel logical processor(s)/HW threads. For example, on Xeon w/HT:

Read the process affinity map (do not assume the process is bound to all the HW threads)
bind OpenMP thread (team member) number 0 to the first two bits in the process affinity bitmap
bind OpenMP thread number 1 to the second two bits in the process affinity bitmap
...

The above, though, is naïve in making assumptions about the process affinity map. (QuickThread uses CPUID/CPUIDEX to read APIC and/or APIC2 to remove those assumptions.)


Jim Dempsey

0 Kudos
SergeyKostrov
Valued Contributor II
1,814 Views
>>...You could just as well use OpenMP (base level) thread num -> OS kernel logical processor(s)/HW threads....

Nobody is going to re-implement what was done in March 2015. It is working and KMP_AFFINITY modes ( scatter, balanced or compact ) are already supported.
0 Kudos
jimdempseyatthecove
Honored Contributor III
1,814 Views

>>Nobody is going to re-implement what was done in March 2015. It is working and KMP_AFFINITY modes ( scatter, balanced or compact ) are already supported.

Sergey,

That is too inclusive a statement. There are some specialized cases where one size does not "fit" all. For example, using OpenMP environment variables and library calls, how does one permit one or more of the threads to float (not pinned) while pinning the other threads? While this is not a usual case, it certainly contradicts "Nobody".

Jim Dempsey

0 Kudos