As I couldn't think of any other place where I should post this information to get someone to fix it:
XPPSL version 1.5.1 introduced a bug in its bundled hwloc. This results in wrong process binding with slurm when the KNL is configured in SNC4 + Flat. In particular, when running e.g. 4 processes per KNL, the first two processes are bound correctly. The third, however, is bound to hwthread id #2 of all 64 cores, and the fourth to hwthread id #3 of all cores. Thus processes 3 and 4 each run one thread on every core instead of 4 threads on only 16 cores. SNC4 + Cache is not affected.
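To make the difference concrete, here is a small sketch that models the binding described above as sets of (core, hwthread) pairs. It assumes a 64-core KNL with 4 hwthreads per core and zero-based process ranks. The function names are made up for illustration; nothing here calls slurm or hwloc.

```python
CORES = 64             # cores per KNL, as in the report
THREADS_PER_CORE = 4   # hwthreads per core
PROCS = 4              # processes per KNL
CORES_PER_PROC = CORES // PROCS


def expected_binding(rank):
    """All hwthreads of a contiguous block of 16 cores."""
    first = rank * CORES_PER_PROC
    return {(core, thread)
            for core in range(first, first + CORES_PER_PROC)
            for thread in range(THREADS_PER_CORE)}


def observed_binding(rank):
    """Binding as reported: ranks 0 and 1 are fine, while ranks 2 and 3
    get hwthread #2 and #3 of every core."""
    if rank < 2:
        return expected_binding(rank)
    return {(core, rank) for core in range(CORES)}


def summarize(binding):
    """Return (number of distinct cores, hwthreads used per core)."""
    cores = {core for core, _ in binding}
    return len(cores), len(binding) // len(cores)


for rank in range(PROCS):
    print(rank, summarize(expected_binding(rank)), summarize(observed_binding(rank)))
```

Both variants hand each process 64 hwthreads, so simply counting the CPUs in the mask does not reveal the problem. Only the distribution over cores differs: (16, 4) expected versus (64, 1) observed for the third and fourth process.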
This bug has not been fixed in 1.5.2. I actually wanted to dig into the sources to find the exact bug. However, as it seems that Intel still prefers to hide any changes to open source software as well as possible, I gave up and am just informing you this way. A public github page or a simple bugtracker would also be useful for users to submit bugs. If such a thing actually exists (and I do not have to register any product to get access), please tell me where I can find it.
I'm using a very naive slurm (vanilla 17.02) setup; you can essentially reproduce it with the default configuration. Only slurm is interacting with hwloc; I do not interact with it directly.
Anyway, I had a closer look at both the xppsl modified hwloc and slurm affinity source code:
One can argue whether the issue is actually caused by a previous bug in xppsl hwloc (up to 1.5.0) that slurm circumvented in its source and that has been fixed in 1.5.1, causing slurm to malfunction because it assumes xppsl hwloc to be broken. Or one can argue that xppsl 1.5.1 hwloc violates the open/closed principle (OCP), potentially causing issues in any software relying on the previous behavior.
In particular, the behavior of some functions (see the diff of include/hwloc/helper.h, 1.5.0 -> 1.5.1) has been changed to ignore objects with empty CPU sets. Slurm indirectly relies on at least one of these functions (see src/slurmd/common/xcpuinfo.c) and assumes that it does not ignore objects with empty CPU sets. Instead, slurm checks the indices for empty CPU sets in an earlier step and ignores those objects later.
These two different approaches to ignore empty CPU sets used in conjunction cause wrong behavior.
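The interaction can be illustrated with a toy model. This is only a sketch of the general mechanism: the object list, both lookup functions and the caller are hypothetical stand-ins, not the actual hwloc or slurm code.

```python
# Toy topology: each entry is the CPU list of one object. Empty lists stand
# in for objects without CPUs. The layout is made up for illustration.
OBJECTS = [[0, 1], [], [2, 3], [], [4, 5]]


def get_obj_old(index):
    """Old-style lookup: every object counts, including empty ones."""
    return OBJECTS[index] if index < len(OBJECTS) else None


def get_obj_new(index):
    """New-style lookup: objects with empty CPU sets are skipped silently."""
    non_empty = [obj for obj in OBJECTS if obj]
    return non_empty[index] if index < len(non_empty) else None


def caller(get_obj):
    """Caller that filters empty objects itself, assuming the old lookup."""
    # Earlier step: remember the indices of the objects that have CPUs.
    usable = [i for i, obj in enumerate(OBJECTS) if obj]
    # Later step: fetch the objects by those remembered indices.
    return [get_obj(i) for i in usable]


print(caller(get_obj_old))  # [[0, 1], [2, 3], [4, 5]]
print(caller(get_obj_new))  # [[0, 1], [4, 5], None]
```

Each filter is fine on its own. Combined, the remembered indices refer to the unfiltered numbering while the lookup uses the filtered one, so every index after the first empty object points at the wrong object or at nothing.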
At least those are my assumptions after investigating both source codes; I have not yet written a minimal working example to confirm them.
However, I'm very confident these assumptions are correct.
It turned out that all my assumptions were correct. See the attachments for a proposed slurm patch that works with both the old and the new hwloc behavior (by re-checking for empty CPU sets). I will try to get that patch or a better solution upstream.
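As a rough sketch of the re-checking idea (this is not the attached patch, and all names are hypothetical): a caller that tests every returned object for an empty CPU set itself, instead of relying on remembered indices, gets the same result from a lookup that skips empty objects and from one that does not.

```python
# Toy topology, made up for illustration; empty lists are objects without CPUs.
OBJECTS = [[0, 1], [], [2, 3], [], [4, 5]]


def get_obj_old(index):
    """Lookup that counts every object, including empty ones."""
    return OBJECTS[index] if index < len(OBJECTS) else None


def get_obj_new(index):
    """Lookup that skips objects with empty CPU sets."""
    non_empty = [obj for obj in OBJECTS if obj]
    return non_empty[index] if index < len(non_empty) else None


def collect_cpusets(get_obj, max_index):
    """Walk all indices and re-check each result for an empty CPU set."""
    result = []
    for i in range(max_index):
        obj = get_obj(i)
        if not obj:  # missing object or empty CPU set: skip in both cases
            continue
        result.append(obj)
    return result


print(collect_cpusets(get_obj_old, len(OBJECTS)))  # [[0, 1], [2, 3], [4, 5]]
print(collect_cpusets(get_obj_new, len(OBJECTS)))  # [[0, 1], [2, 3], [4, 5]]
```

The redundant check costs almost nothing and removes the dependency on which of the two behaviors the installed hwloc implements.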