Quick numactl question. Do I need to use numactl to run in SNC-4/SNC-2 mode? From the examples I've seen, if you are in flat memory mode you can pin allocations to the fast memory with numactl, like this:
numactl -m 4,5,6,7
But if I am in cache mode, there is no fast-memory option. So SNC-4/SNC-2 in cache mode is equivalent to quadrant and hemisphere, no?
Thanks for any clarification.
You are right that in cache mode there is no "fast memory" visible to the operating system.
However, the SNC modes still expose the on-die NUMA arrangement, while quadrant and hemisphere do not.
In cache mode + SNC, use numactl the same way you would on a multi-socket Xeon, where only the sockets (groups of CPUs) are exposed as NUMA nodes.
Hope it helps.
On a KNL system in cache/SNC-2 or cache/SNC-4 mode, you would use "numactl" in exactly the same way (and for exactly the same reasons) as on a 2-socket or 4-socket Xeon server -- the MCDRAM cache is invisible in these cases.
Just as on any other multi-socket system, the default local memory allocation policy means that binding each of your processes to a specific NUMA node is usually all you need to do to maintain process-data affinity.
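To make this concrete, here is a minimal sketch of node-local binding for a multi-process run, one process per NUMA node. The node list and the application name ./myapp are placeholders; the real topology should be read from "numactl --hardware" on your own machine.

```shell
# Hypothetical sketch: one process per NUMA node, with both CPUs and
# memory bound to that node. NODES and APP are illustrative.
NODES="0 1 2 3"
APP=./myapp   # placeholder application binary

for node in $NODES; do
    # --cpunodebind pins the process's threads to that node's cores;
    # --membind restricts its allocations to that node's memory.
    cmd="numactl --cpunodebind=$node --membind=$node $APP"
    echo "$cmd"   # replace echo with eval (and add '&'/wait) to actually launch
done
```

With both flags set, the default local-allocation policy is redundant but harmless; --membind makes the memory placement explicit rather than best-effort.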
Thanks for the reply. For our specific situation, an OpenMP application running on a 68-core KNL node in cache/SNC-4 mode, what is the appropriate numactl invocation to get optimal performance?
Or is this better:
numactl -l --interleave=0,1,2,3 myapp
Do you have any suggestions? Thanks.
For OpenMP jobs, each thread (or set of threads) needs its own binding, and that can't be done with numactl, which applies a single policy to the whole process.
All major OpenMP implementations support thread binding via environment variables.
For the 68-core KNL in SNC-4 mode, the configuration is:
- NUMA node 0 has 18 cores (9 tiles)
- NUMA node 1 has 18 cores (9 tiles)
- NUMA node 2 has 16 cores (8 tiles)
- NUMA node 3 has 16 cores (8 tiles)
This uneven layout can be challenging to deal with in OpenMP, depending on what you are trying to do. Any core-level OpenMP binding will guarantee process/memory affinity in this case.

If you use all 68 cores, memory traffic will be higher on nodes 0 and 1 than on nodes 2 and 3, which may be an issue. If you want to use 64 cores instead, I would use "lscpu" to determine which logical processors map to each node, then build an explicit processor list for KMP_AFFINITY that puts 16 threads on each of the four NUMA nodes.
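A sketch of that explicit-list approach follows. The processor ranges here are purely illustrative: they assume cores are numbered consecutively within each node (node 0 = cores 0-17, node 1 = 18-35, node 2 = 36-51, node 3 = 52-67) and ignore the extra hardware threads per core; the real mapping must come from "lscpu" or "numactl --hardware".

```shell
# Hypothetical 64-thread layout: 16 threads per NUMA node, skipping
# 2 cores each on nodes 0 and 1 (which have 18 cores apiece).
# Ranges are illustrative -- verify them with lscpu on your machine.
export OMP_NUM_THREADS=64
export KMP_AFFINITY="granularity=core,explicit,proclist=[0-15,18-33,36-51,52-67]"
```

Balancing 16 threads per node this way evens out the memory traffic across the four nodes, at the cost of leaving four cores idle.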