We're seeing an unexpected (Intel) OMP thread placement in SNC4 mode.
The thread placement is different on each KNL node. While that by itself may be OK, we're seeing entire NUMA nodes going unused while some NUMA nodes get double or triple the number of threads.
A nominal aprun launch is 8 OMP threads under a single MPI rank on a KNL node in SNC4 mode. Every node tested produces a different core placement based on process id. Using the output of numactl -H (which lists what I'm calling process ids), it appears NUMA nodes are being skipped. If we add KMP_AFFINITY verbose, we can see the process id to core id mapping, and we consistently observe the threads being placed in order of sequential core id. However, since the core id to process id mapping is seemingly random, threads end up clumped onto some NUMA nodes, leaving other NUMA nodes entirely empty.
How do we ensure threads are distributed across all NUMA regions? We would also like control over the thread assignment (close?) so that threads are sequential within a NUMA node but still land on all NUMA nodes, say 2 threads per NUMA node. The default assignment, which leaves some NUMA nodes empty, is not what we expect in SNC4 mode.
It seems that either the OMP runtime is placing the threads incorrectly, or the output of numactl is incorrect with respect to which process ids belong to which NUMA nodes.
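For reference, the diagnostics described above amount to something like the following (the aprun flags are site-specific and shown only as an illustration):

```shell
# Inspect the NUMA topology and the OpenMP runtime's placement decisions
# on one KNL node in SNC4 mode.
numactl -H                    # list NUMA nodes and the CPU ids each contains
export OMP_NUM_THREADS=8
export KMP_AFFINITY=verbose   # print the OS-proc -> core/thread map at startup
aprun -n 1 -d 8 ./app         # 1 MPI rank, 8 threads (launcher flags vary by site)
```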
thanks,
Mike
OK, Mike, sorry, I didn't read all the comments. I see that you tried the document. The document does assume that ranks don't span SNC domains. That isn't something you'd do on a multi-socket system, so I don't see why you'd want to do it on a KNL in SNC mode.
If you really want one rank to span SNC4 domains, then right now proclist is the only option. I still think this is a bad idea; you should run at least one rank per SNC node, and then everything should be fine.
There are plans to add NUMA awareness to the OpenMP RTL for things like this, but I don't know the schedule.
Larry,
If I run with one rank per NUMA node, then things are well behaved by default. Yay! The associated OMP threads are placed on their respective NUMA nodes, as expected.
Sergey, while your env method works, it really just moves where the placement is created. We might have to wait for a new OpenMP runtime library, but it would be good if thread placement could be controlled per NUMA node with KMP or OMP settings, in order to get a "good" distribution regardless of the number of threads requested. In other words, the affinity placement would use some aprun/mpirun args and env vars to determine the thread affinity. But based on your previous suggestion, at least we have an explicit way to get what we're after.
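For the record, the explicit proclist workaround looks something like this; the CPU ids are hypothetical and would have to be read from numactl -H on each node (which is exactly the inconvenience described above):

```shell
# Pin 8 threads, 2 per SNC4 NUMA node, using an explicit OS-proc list.
# The ids 0,1,18,19,36,37,52,53 are assumptions for illustration; take the
# real ids for each NUMA node from `numactl -H` on the node in question.
export OMP_NUM_THREADS=8
export KMP_AFFINITY="granularity=fine,proclist=[0,1,18,19,36,37,52,53],explicit"
```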
Mike
Mike,
> Andrey, do you still want the output from the verbose and KMP_SETTINGS=1?
No, thank you, I realized the same info is contained in the .xls file you attached.
Regarding future libraries: as a first step we implemented KMP_HW_SUBSET to be NUMA-aware, so it will be possible to flexibly pick resources for a particular run, and the work should be well balanced as long as you use ALL of the chosen resources. There can still be problems if you use only part of the chosen resources, because the library will still sort cores linearly. For example, given the core numbering you provided for one of your systems, if we choose 2 cores per node (8 cores in total) and try to run only 4 threads, the library can place the threads per node as 2-1-1-0, creating an imbalance.
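A sketch of what such a NUMA-aware KMP_HW_SUBSET selection might look like; the `4n` unit for NUMA nodes is an assumption about the future syntax Andrey describes, not released behavior:

```shell
# Hypothetical NUMA-aware hardware subset: 4 NUMA nodes x 2 cores x 1 thread.
# The "n" unit is an assumed spelling of the NUMA-aware extension.
export KMP_HW_SUBSET=4n,2c,1t
export OMP_NUM_THREADS=8   # use ALL 8 chosen cores; fewer risks the 2-1-1-0 imbalance
```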
We will try to add NUMA awareness to the KMP_AFFINITY and OMP_PLACES functionality later, which should help more usage scenarios. Unfortunately, this will not be available in the nearest release of the compiler.
The best suggestion for now is to use MPI ranks that limit the underlying OpenMP library to a single NUMA node per rank, as you and Larry already mentioned.
Regards,
Andrey
Andrey,
Can you pass on to the OpenMP developers the need for better integration with MPI? By this I mean: on an SMP (multi-socket) system, or a KNL in SNC2 or SNC4 mode, an MPI user would wish (need) to distribute MPI ranks by nodes (and node sets), while OpenMP threading occurs within the subset of threads specified by the MPI spawner (mpiexec, mpirun, aprun, ...). There is, or used to be, an insufficiently documented "respect" keyword for KMP_AFFINITY. The following is not specifically what Mike asked for:
export KMP_AFFINITY=respect,granularity=core
export OMP_NUM_THREADS=8
mpiexec.hydra -n 4 ./appToRun
Each of the 4 ranks is restricted to one node (of SNC4), with each rank running 8 OpenMP threads on different cores, restricted to the subset of threads of the process (rank) specified by the MPI thread spawner (mpiexec.hydra in this case).
Jim Dempsey
Jim
> Can you pass on to the OpenMP developers the need for better integration with MPI. By this I mean, if you have an SMP (multi-socket) or KNL in SNC2 or SNC4, that an MPI user would wish (need) to distribute MPI ranks by nodes (and node sets), while needing to OpenMP thread within the subset of threads specified by the MPI spawner (mpiexec, mpirun, aprun, ...). There is, or used to be, and is insufficiently documented "respect" keyword for KMP_AFFINITY.
The Intel (and LLVM) OpenMP runtime always respects the incoming affinity mask unless you explicitly tell it not to (by using the "norespect" keyword in KMP_AFFINITY). So there is no need for any change. The OpenMP runtime tries to be well behaved, and it believes what it's told (via the incoming affinity mask).
However, that does make us at the mercy of the outer scheduler. If it sets some weird affinity mask, the OpenMP runtime will respect that and you'll get perverse results. (Note, also, that some systems intercept clone calls and bind threads themselves, which can also make things "interesting", and there's nothing we can do about it).
Expanding on Jim Cownie's comment, the whole point of the document attached above is to give you the right incantations to the job launcher/MPI implementation so that each rank gets the desired number of cores (I'm ignoring crazy people who want to run >1 rank per core, though my document does address that) with an affinity mask that includes all the threads in the core. Then you can use the existing OpenMP controls to further divide or restrict that mask to provide, e.g., one thread per core, or some nested OpenMP affinity that suits your application.
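As a sketch, dividing the per-rank mask with the standard OpenMP controls might look like this (the values are illustrative; the real cores-per-rank number comes from the launch line):

```shell
# Within the affinity mask the launcher hands each rank, make one place per
# core and spread the threads across those places (one thread per core).
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=8   # illustrative: one thread per core for an 8-core rank
```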
I'm working on a JavaScript GUI to let you enter a few parameters, like # cores/rank and # threads/core, and generate the appropriate launch lines and environment variables. If I get ambitious I'll try to draw some pretty pictures à la lstopo.
>>I'm working on a javascript GUI to let you enter a few parameters like # cores/rank and #thread/core and generate the appropriate launch lines and environment variables. If I get ambitious I'll try to draw some pretty pictures ala lstopo.
Great. Let us know when it is available (hopefully it gets included in future distributions).
Jim Dempsey
If I understand Jim Dempsey's comments correctly, I'll echo Jim Cownie's comments. By default (no explicit KMP settings), using one MPI rank per NUMA node, we're getting well-behaved and expected behaviour, where each of the OMP threads in that rank is placed on that rank's NUMA node. It was just for the case of a single rank (maybe crazy) per node that the threads weren't well placed. proclist is the current hammer to remedy that.
Mike
At TACC, we have found that Intel MPI almost always does "the right thing" for binding Hybrid MPI/OpenMP jobs with default settings.
(There was a short period last summer when Intel MPI bound all threads to a single core when running a single MPI task per node, but that was fixed quickly.)
We don't do much work with SNC4 mode. The performance benefit is modest, and it is a bit of a pain to properly handle the numactl memory bindings for "Flat" mode in this case. I am certainly not surprised that splitting the 68 cores of the Xeon Phi 7250 processors into two 18-core nodes plus two 16-core nodes would lead to some unexpected behaviors.
QuickThread used (uses) pthread on Linux, and Windows native threads on Windows. An application at run time can construct thread teams by use of a token specifying proximity placement. Virtually any type of placement can be used:
All threads
All threads within same NUMA node of thread of invocation (as well as within various NUMA distances)
All threads within LLC (typically L3 or same socket) of thread of invocation
All threads within L2
All threads within L1 (IOW all sibling threads of core)
To self (for processing later)
One thread for each NUMA node (as well as within various NUMA distances)
One thread per LLC
One thread per L2
One thread per L1 (one per core)
One thread per L2 within LLC (IOW one each per L2 within socket)
One thread per L1 within LLC
One thread per L1 within L2
... (additional specialized grouping)
Jim Dempsey
Sergey asked:
> John, it is not clear what you're talking about. Did you mean MCDRAM, or something else?
The issue that I was referring to was the use of MCDRAM as memory (not as cache), which is usually referred to as "Flat" mode (rather than as "cache" mode) on Xeon Phi x200.
When "Flat" mode is combined with SNC4 mode, each KNL processor reports that there are 8 NUMA "nodes". Nodes 0,1,2,3 have cores and DDR4 memory, while nodes 4,5,6,7 have no cores but do each have 1/4 of the MCDRAM memory.
- The MCDRAM memory reported as NUMA node 4 is the part that is physically adjacent to the cores in NUMA node 0.
- The MCDRAM memory reported as NUMA node 5 is the part that is physically adjacent to the cores in NUMA node 1.
- etc.
From a locality perspective, SNC4 mode is ideal for an MPI application configured to use 4 MPI tasks per KNL. The inconvenient part is setting up the bindings to force each MPI task to use "local" MCDRAM memory. The overall procedure typically requires that the MPI launcher starts a script (rather than directly launching the MPI task), and having the script query a number of environment variables, compute some new variables, and then launch the local copy of the MPI executable. The steps are something like:
- MPI launcher launches script on the target nodes, with 4 copies of the script assigned to each node.
- Each script gets its global MPI rank number from an environment variable.
- Each script computes MPI-rank modulo 4 to get its "local" MPI rank number.
- This will be used directly to determine the NUMA node where the processes for the corresponding MPI task will be run.
- Each script adds 4 to the "local" MPI rank number.
- This will be used as the NUMA node number for the memory binding.
- Each script uses "numactl --membind=<local MPI rank number + 4> --cpunodebind=<local MPI rank number>" to launch the MPI executable. (Note: --cpunodebind takes a NUMA node number; --physcpubind would instead require listing the individual CPU ids in that node.)
This is not too hard, but it is a pain to ensure that the MPI process/thread bindings are actually consistent with the assumptions above, and it is a big pain to generalize the logic so that the scripts work correctly in the "normal" mode, the "SNC2" mode, and the "SNC4" mode. All of this is made even more complex by the need (in some environments) to deal with the different environment variables and different default thread-binding schemes used by different MPI stacks.
The majority of the KNL nodes at TACC are configured in "cache" mode with "quadrant" interleaving. With the addition of an Intel kernel patch to sort the page tables, this works reasonably well and avoids the need for any numactl memory binding.
>>I've been talking about setting OpenMP thread affinities by using a "trick": OpenMP thread num -> Native thread handle -> Set a thread Affinity using an OS kernel API. The "trick" doesn't use KMP_... or OMP_... environment variables.
You could just as well use OpenMP (base level) thread num -> OS kernel logical processor(s)/HW threads. This works well for most situations. You can do your own compact or scatter or other variations, as well as bind a thread to multiple OS kernel logical processor(s)/HW threads. For example, on Xeon w/HT:
Read the process affinity map (do not assume the process is bound to all the HW threads)
bind OpenMP thread (team member) number 0 to the first two bits in the process affinity bitmap
bind OpenMP thread number 1 to the second two bits in the process affinity bitmap
...
The above, though, is naïve in making assumptions about the process affinity map. (QuickThread uses CPUID/CPUIDEX to read the APIC and/or x2APIC ids to remove those assumptions.)
Jim Dempsey
>>Nobody is going to re-implement what was done in March 2015. It is working and KMP_AFFINITY modes ( scatter, balanced or compact ) are already supported.
Sergey,
That is too inclusive a statement. There are some specialized cases where one size does not fit all. For example, using OpenMP environment variables and library calls, how does one let one or more of the threads float (not pinned) while pinning the other threads? While this is not a usual case, it certainly contradicts "Nobody".
Jim Dempsey