We're seeing an unexpected (Intel) OMP thread placement in SNC4 mode.
The thread placement is different on each KNL node. While that by itself may be OK, we're seeing entire numa nodes go unused, while some numa nodes get double or triple the number of threads.
A nominal aprun launch is 8 OMP threads under a single MPI rank on a KNL node in SNC4 mode. Every node tested produces a different core placement based on process id. Using the output of numactl -H (which lists what I'm calling process ids), it appears numa nodes are being skipped. With KMP_AFFINITY verbose we can see the process id to core id mapping, and we observe that the threads are consistently placed in order of sequential core id. However, because the core id to process id mapping is seemingly random, threads end up clumped onto some numa nodes, leaving other numa nodes entirely empty.
How do we ensure threads are distributed onto all numa regions? We'd also like control to keep the thread assignment close, i.e. sequential within a numa node, but still spread across all numa nodes (say 2 threads per numa node). The default assignment, where some numa nodes are left empty, is not what we expect in SNC4 mode.
It seems either the OMP runtime is placing the threads incorrectly, or the output of numactl is incorrect wrt which process ids are in the numa nodes.
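To make the pattern concrete, here is the tally we keep doing by hand, as a small script. The placement table below is a hypothetical sample in the shape we extract from KMP_AFFINITY verbose plus numactl -H, not data from a real node:

```shell
# Hypothetical sample mapping (thread_id core_id numa_node), in the
# shape extracted from KMP_AFFINITY=verbose output plus numactl -H.
cat <<'EOF' > placement.txt
0 0 0
1 1 0
2 18 1
3 19 1
4 36 1
5 37 1
6 52 3
7 53 3
EOF
# Tally threads per numa node; an empty node shows up as a 0 count.
for n in 0 1 2 3; do
  count=$(awk -v n="$n" '$3 == n { c++ } END { print c+0 }' placement.txt)
  echo "numa node $n: $count threads"
done
```

With this sample, node 2 comes out empty and node 1 gets 4 threads, which is exactly the kind of clumping we see.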
Hey Larry, sure.
Intel 17.0.1, MPICH 7.4.3 (Cray)
Below is the output from 2 nodes. We first noted the difference in thread placement on cores, and then realized that in some cases numa nodes were being skipped. KMP_AFFINITY verbose showed that thread assignment was being made in sequential order of the next core id, which I think is as expected. However, due to the core-id to proc-id mapping, the selected proc ids do not end up covering all the numa nodes.
2 nodes 193, 195, both snc4/flat
193 uses all 4 numa nodes, 195 skips numa node 2, and wraps back to numa node 0
I've included the numactl view of the 2 nodes, they look the same to me.
The attached spreadsheet shows the results from running the above command on 3 different SNC4/flat nodes.
Columns are: node id; process id; core id (mapped to that process id, as identified by KMP_AFFINITY verbose); hyper thread id (also from verbose, always 0 in these cases); and numa node (as identified by numactl -H for that process id). Highlighted rows are the threads used by these runs; the rest is the complete verbose output with numa nodes added by process id. This view is sorted by core id, which shows why the threads were assigned in this order. Originally each node id set was sorted by process id, which showed assignment by process id, but re-sorting as it is now made it easier to understand the thread to core id assignment.
Still waiting on a node allocation to try hwloc-dump-hwdata.
OMP_NUM_THREADS was definitely set, but just to 8. Do we also need the 4 to drive the numa nodes correctly?
I did not have OMP_NESTED set. (but can try that)
Is that required in this situation, even if we're not nesting threads?
yes, we tried both cores and threads with OMP_PLACES, no joy
I have not tried I_MPI_DEBUG, but I'm assuming that's an Intel MPI variable. Since we're using Cray MPICH, I'm assuming MPI_DEBUG=5 would be ok. I can give that a shot once I can get onto some KNL SNC4 nodes.
Jim, no, but there is only one MPI rank in this problem.
Do you mean a separate OMP_PLACES entry for each of the numa nodes? What would that look like?
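My best guess at what such entries might look like is below; the OS proc ids are hypothetical placeholders, and the real ones would have to come from numactl -H on the target node:

```shell
# Guess at per-numa-node places: one explicit place per numa node,
# two OS procs each (proc ids below are made up for illustration),
# with spread binding to land 2 of the 8 threads in each place.
export OMP_PLACES="{0,1},{18,19},{36,37},{52,53}"
export OMP_PROC_BIND=spread
export OMP_NUM_THREADS=8
```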
Sergey, I don't seem to have a hwloc-dump* on either the front-end or on the KNL compute node. I see several other hwloc- variants, but no hwloc-dump
Is there some other way to get this info or run hwloc?
>>but there is only one MPI rank in this problem
If you are only running 1 rank, then why run the application as an MPI application???
If running MPI with 1 rank .OR. non-MPI, then set the environment variable prior to issuing the command line (either to directly run the program or issuing mpiexec/mpirun). See: https://software.intel.com/en-us/node/522691#AFFINITY_TYPES
Note, when the KNL is configured as SNC4 or SNC2, although you have only 1 package, interpret the linked documentation as if you have as many packages as you have nodes. For 0 or 1 ranks, consider using KMP_AFFINITY=scatter (then the 8 or whatever number of threads would be distributed equally).
If you are running multiple ranks see: https://software.intel.com/en-us/node/528776 to see how you can use the ":" to specify different environment variables for each rank. IOW when configured as SNC4 and running 4 ranks, use different placement environment variables (on each side of the :'s)
Use the second long command-line syntax to set different argument sets for different MPI program runs. For example, the following command executes two different binaries with different argument sets:
$ mpiexec.hydra -f <hostfile> -env <VAR1> <VAL1> -n 2 ./a.out : \
      -env <VAR2> <VAL2> -n 2 ./b.out
Note, in your case, you would run 1 process/rank on each node, same program name, but different -env environment variable settings.
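A hypothetical concretization for one SNC4 node follows; the program name, proclist ids, and proclist-style affinity settings are all placeholders for illustration (real proc ids come from numactl -H on that node):

```shell
# Sketch only: 4 ranks, one per numa node, each given a different
# explicit proclist via -env. All ids below are illustrative
# placeholders, not real node data.
mpiexec.hydra -env KMP_AFFINITY "proclist=[0,1],explicit"   -n 1 ./a.out : \
              -env KMP_AFFINITY "proclist=[18,19],explicit" -n 1 ./a.out : \
              -env KMP_AFFINITY "proclist=[36,37],explicit" -n 1 ./a.out : \
              -env KMP_AFFINITY "proclist=[52,53],explicit" -n 1 ./a.out
```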
The affinity test code is an MPI+OMP program. We just use one rank, when tests are focused on OMP thread placement. btw, all of this testing is only run on one node at time.
tt-login1 1036% setenv KMP_AFFINITY scatter
tt-login1 1037% aprun -n 1 -d 8 -j 1 -cc none -L 274 ./xthi_knl.intel
Hello from rank 0, thread 0, on nid00274. (core affinity = 0)
Hello from rank 0, thread 1, on nid00274. (core affinity = 1)
Hello from rank 0, thread 4, on nid00274. (core affinity = 36)
Hello from rank 0, thread 5, on nid00274. (core affinity = 37)
Hello from rank 0, thread 2, on nid00274. (core affinity = 18)
Hello from rank 0, thread 3, on nid00274. (core affinity = 19)
Hello from rank 0, thread 7, on nid00274. (core affinity = 21)
Hello from rank 0, thread 6, on nid00274. (core affinity = 20)
No effect from using scatter. In this run, 4 threads were placed on numa node 1, and no threads ended up on numa node 3.
Oh, this is Cray.
See if the attached affinity.docx helps.
aprun -n 1 -cc depth -j 4 -d 8
Will give you 1 rank with 2 cores (8 threads). Then set KMP_HW_SUBSET=1T to get 1 thread per core.
aprun -n 4 -N 1 -cc depth -j 4 -d 8 -S 4
should give you 4 ranks with 2 cores each, 1 rank per numa node
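The single-rank recipe with the environment settings spelled out might look like this (a sketch; xthi_knl.intel stands in for whatever binary you run):

```shell
# Sketch of the single-rank recipe: aprun hands the rank a mask
# covering 2 cores x 4 hw threads, then KMP_HW_SUBSET=1T restricts the
# OpenMP runtime to 1 hw thread per core, i.e. 2 usable threads.
export KMP_HW_SUBSET=1T
export OMP_NUM_THREADS=2
export KMP_AFFINITY=verbose     # optional: confirm the final pinning
aprun -n 1 -cc depth -j 4 -d 8 ./xthi_knl.intel
```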
I did a little testing with aprun in snc4, but it takes forever to get a node on theta, so I didn't do much.
The OpenMP runtime in compiler 17 update 1 (the version you are using) is unfortunately not aware of the NUMA nodes, so it tries to use cores in a linear manner, as you mentioned. More functionality will be available in upcoming versions of the Intel compiler, starting with 17 update 2, to be released soon.
I don't see a general solution to your problem from the OpenMP runtime side currently, besides explicit affinity binding, which has to be different on different systems and thus does not look like a viable solution. As a workaround for the three particular systems mentioned in the attached .xls table, it is possible to use KMP_HW_SUBSET=32c@18, asking the library to skip the first 18 cores, after which the core numbering becomes regular for the next 32 cores on all three systems. But this workaround may not work on other systems with yet different core numbering.
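For those three nodes only, the workaround spelled out as a command sequence might look like this (a sketch; the binary name is the test code used earlier in the thread):

```shell
# Workaround valid only for the three surveyed nodes: skip the first
# 18 cores, then let the runtime use the next 32, which are numbered
# regularly on those systems. Not portable to other core numberings.
export KMP_HW_SUBSET=32c@18
export KMP_AFFINITY=scatter,verbose
aprun -n 1 -d 8 ./xthi_knl.intel
```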
So my only hope for now is that Larry's suggestion would work for you. In future compiler releases it will be possible to specify nodes and tiles to be used by the runtime, first via KMP_HW_SUBSET environment variable.
Larry, I looked at sections 3.2 and 6. I tried the example for aprun in section 6. I'm still getting empty numa nodes--no threads assigned.
tt-login1 1027% aprun -n 1 -cc none -j 4 -d 32 -L 276 ./xthi_knl.intel | sort -n -k 4
Hello from rank 0, thread 0, on nid00276. (core affinity = 0) numa node 0
Hello from rank 0, thread 1, on nid00276. (core affinity = 156) 1
Hello from rank 0, thread 2, on nid00276. (core affinity = 57) 3
Hello from rank 0, thread 3, on nid00276. (core affinity = 195) 3
Hello from rank 0, thread 4, on nid00276. (core affinity = 8) 0
Hello from rank 0, thread 5, on nid00276. (core affinity = 146) 0
Hello from rank 0, thread 6, on nid00276. (core affinity = 31) 1
Hello from rank 0, thread 7, on nid00276. (core affinity = 169) 1
I'm focused on a case with a single MPI rank, 8 OMP threads, one thread per core, single KNL node, running in SNC4 mode. Attempting to get even distribution of threads in all 4 numa regions/nodes.
Andrey, do you still want the output from the verbose and KMP_SETTINGS=1?
Your latest comment, indicates that the OMP runtime in 17.0.1 is not aware of numa nodes. We are seeking a generic solution that will work on any KNL node. As you point out the core mapping can be different on each KNL, so a hardcoded core list isn't appropriate--although it would of course allow us to run on a specific node.
If either of you had other tests that you think might work, I'm happy to run those.
Yes, that's exactly the layout that I was after. For some reason when I used the proclist earlier, I thought it assigned by core id, and not the proc id, which made it different for each KNL (due to missing tiles).
I've repeated this with aprun on 2 different KNL nodes, and it works the same as what you show. Great.
Now, is this layout (threads distributed across the numa regions) possible without the proclist, to make things easier for users? So they don't have to look up the numa node proc list and carry it around in an env setting.
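In the meantime, the look-up we're doing by hand could be scripted. Here is a sketch; the numactl.out contents are a made-up stand-in for real numactl -H output, and on real KNLs the per-node cpu lists differ node to node:

```shell
# Sketch of automating the proclist: parse numactl -H style output and
# take the first 2 OS proc ids listed for each numa node. The sample
# below is fabricated for illustration only.
cat <<'EOF' > numactl.out
node 0 cpus: 0 1 2 3 4 5 6 7 8
node 1 cpus: 18 19 20 21 22 23 24 25
node 2 cpus: 36 37 38 39 40 41 42 43 44
node 3 cpus: 52 53 54 55 56 57 58 59
EOF
procs=$(awk '/^node [0-9]+ cpus:/ { printf "%s,%s,", $4, $5 }' numactl.out)
procs=${procs%,}                      # drop the trailing comma
export KMP_AFFINITY="proclist=[$procs],explicit"
echo "$KMP_AFFINITY"
```

On a real node one would pipe `numactl -H` straight into the awk instead of the sample file.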
When you use the proclist you are completely overriding what aprun is doing and you need a different proclist for every MPI rank on a node.
Is my document TL;DR? The whole point is to get aprun to give you a CPU affinity mask including all the threads on all the cores in your MPI rank and then use OMP environment variables to subset that for nesting or different OpenMP thread layout.
It is true that with my document you can't run a single OpenMP program with 4 cores/2 threads per core with 1 core per numa node. But you really don't want to do that. You should run at least one MPI rank per numa node. Note also that on the 68-core part, two of the numa nodes have 9 tiles (18 cores) and two have 8 tiles (16 cores).