When running code on our KNL nodes at LRZ (64 cores) with OpenMP, using "KMP_AFFINITY=verbose" together with one of the pinning settings (either compact or scatter), the core IDs printed in the output do not look right and even exceed the physical core count when multiple threads per core are used, e.g.:
OMP: Info #171: KMP_AFFINITY: OS proc 62 maps to package 0 core 72 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 126 maps to package 0 core 72 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 190 maps to package 0 core 72 thread 2
OMP: Info #171: KMP_AFFINITY: OS proc 254 maps to package 0 core 72 thread 3
OMP: Info #171: KMP_AFFINITY: OS proc 63 maps to package 0 core 73 thread 0
OMP: Info #171: KMP_AFFINITY: OS proc 127 maps to package 0 core 73 thread 1
OMP: Info #171: KMP_AFFINITY: OS proc 191 maps to package 0 core 73 thread 2
OMP: Info #171: KMP_AFFINITY: OS proc 255 maps to package 0 core 73 thread 3
This makes thread affinity very opaque, and comparing pinning strategies becomes quite hard.
From what I can tell, the APIC IDs of the logical CPUs repeat at some point (I found this out by manually reading the IDs from CPUID calls).
I have attached a small C++ test example, the job script I used, and an example output file. You can compile the code with:
mpiicpc -qopenmp test.cpp -o test
Any help is appreciated. Respectfully yours, Momme Allalen
The "core" numbers used by KMP_AFFINITY's "verbose" option are not contiguous on many Intel processors. They are derived from the x2APIC ID that can be obtained from the CPUID instruction. On some processors (e.g., Haswell), the "missing" core numbers appear to correspond to cores that are only present on larger-die versions, but this is not always the case. The important thing is that each "core" number used by KMP_AFFINITY corresponds to a different physical core, which is the main point of thread/process binding.
Note that there is no published mapping from the "core" numbers used by KMP_AFFINITY to the physical layout of the cores on the die. On most Intel processors (Haswell/Broadwell/Skylake Xeon, but not KNL), the same set of "core" numbers is used even when different subsets of the cores are disabled. If I recall correctly, KNL does not add this extra level of mapping: the "core" numbers used on KNL processors are always in the range 0..75, with the missing values corresponding to the disabled cores on that particular die. There is still no published "map" showing where on the die each "core" number is located, but at least on KNL it does not look like you have to invert another level of remapping to compute it.
I hope to be talking about this mapping for Skylake Xeon and KNL processors at the next meeting of the Intel Xeon Phi User's Group (IXPUG) in Bologna, Italy, March 5-7 (https://www.ixpug.org/events/spring2018).
What you are seeing is also how the OS reports the identity of the logical CPUs in the hardware (if you look at /proc/cpuinfo you will see this non-contiguous enumeration). There is therefore nothing sensible for the OpenMP runtime to do differently: if it used some other, contiguous, enumeration, that would be even more confusing, since you would then have two different names for the same entity.
The critical things to be careful about when comparing pinning strategies are the number of threads you use, how many threads you place on each physical core, and how those threads are distributed over the cores. If you explore that space you should be able to find the good points.
Note that none of that requires a contiguous core enumeration...
The article "How to Plot OpenMP Scaling Results" may also help.
Thank you John and thank you James for providing all this information. I will also try the environment settings above to see whether they help speed up the code.
Thanks for your help.