MKL library scans available cores - touching nodes it absolutely shouldn't

VladP · ‎05-25-2021

I'm running the following with NumPy and MKL 2019. The environment is:

"KMP_AFFINITY=verbose" MKL_NUM_THREADS=1 MKL_DOMAIN_NUM_THREADS="MKL_BLAS=1" MKL_DYNAMIC=FALSE OMP_DYNAMIC=FALSE OMP_NUM_THREADS=1 MKL_VERBOSE=1

strace -e trace=sched_setaffinity taskset -c 2-3 $(which python) -c 'import numpy as np; a = np.random.normal(size=(1000,1000)); np.dot(a, a)'
sched_setaffinity(0, 16, {c, 0})        = 0
Numpy + Intel(R) MKL: THREADING LAYER: (null)
Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
Numpy + Intel(R) MKL: preloading libiomp5.so runtime
sched_setaffinity(0, 16, 0)             = -1 EFAULT (Bad address)
OMP: Info #211: KMP_AFFINITY: decoding x2APIC ids.
sched_setaffinity(0, 16, {4, 0})        = 0
sched_setaffinity(0, 16, {8, 0})        = 0
sched_setaffinity(0, 16, {c, 0})        = 0
OMP: Info #209: KMP_AFFINITY: Affinity capable, using global cpuid leaf 11 info
OMP: Info #154: KMP_AFFINITY: Initial OS proc set respected: {2,3}
OMP: Info #156: KMP_AFFINITY: 2 available OS procs
OMP: Info #157: KMP_AFFINITY: Uniform topology
OMP: Info #179: KMP_AFFINITY: 1 packages x 2 cores/pkg x 1 threads/core (2 total cores)
OMP: Info #249: KMP_AFFINITY: pid 3521664 tid 3521664 thread 0 bound to OS proc set {2,3}
sched_setaffinity(0, 16, {c, 0})        = 0
sched_setaffinity(0, 16, {c, 0})        = 0
sched_setaffinity(0, 16, {1, 0})        = 0
sched_setaffinity(0, 16, {2, 0})        = 0
sched_setaffinity(0, 16, {c, 0})        = 0
MKL_VERBOSE Intel(R) MKL 2019.0 Update 5 Product build 20190808 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions (Intel(R) AVX) enabled processors, Lnx 3.00GHz lp64 intel_thread
MKL_VERBOSE DGEMM(N,N,1000,1000,1000,0x7ffd61ee7270,0x7faf55733010,1000,0x7faf55733010,1000,0x7ffd61ee7278,0x7faf54f91010,1000) 92.50ms CNR:OFF Dyn:0 FastMM:1 TID:0  NThr:1
+++ exited with 0 +++

The problem is that MKL does not seem to be respecting the original taskset (only cores 2 and 3). After declaring that it does, it also probes cores 0 and 1 (with masks 1 and 2, respectively).
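
For reference, the hex masks in the trace can be decoded into core numbers with a few lines of Python. This is only a minimal sketch: the helper names `decode_mask` and `parse_line` are made up for illustration, and it assumes each word strace prints inside the braces is one 64-bit chunk of the CPU mask, lowest word first (which matches the 16-byte mask size and the "{2,3}" that OpenMP reports for {c, 0}).

```python
import re

WORD_BITS = 64  # assumption: each printed word is one 64-bit unsigned long

def decode_mask(words):
    """Turn hex mask words (lowest word first) into a list of CPU numbers."""
    cpus = []
    for index, word in enumerate(words):
        value = int(word.strip(), 16)
        for bit in range(WORD_BITS):
            if (value >> bit) & 1:
                cpus.append(index * WORD_BITS + bit)
    return cpus

def parse_line(line):
    """Extract the CPU list from one strace line, or None if it has no {..} mask."""
    match = re.search(r"sched_setaffinity\(\d+, \d+, \{([0-9a-f, ]+)\}\)", line)
    if match is None:
        return None
    return decode_mask(match.group(1).split(","))

for mask in ("{c, 0}", "{4, 0}", "{8, 0}", "{1, 0}", "{2, 0}"):
    print(mask, "->", parse_line(f"sched_setaffinity(0, 16, {mask}) = 0"))
```

Under that assumption, {c, 0} is cores 2 and 3, {4, 0} is core 2, {8, 0} is core 3, and {1, 0} and {2, 0} are cores 0 and 1.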

Is this a big deal? In most circumstances, probably not. But what if a realtime process is running on core 0 or core 1? If the calling process has regular priority, it never gets scheduled once it is pinned to that core, so it hangs.

Please advise.

Thank you,

Vlad

RahulV_intel · ‎05-26-2021

Hi,

Thanks for reporting your issue. We will try this at our end and get back to you with an update.

Regards,

Rahul

Ruqiu_C_Intel · ‎06-04-2021

Hi Vlad,

The issue might be in Python itself or in OpenMP. MKL has no threading mechanism of its own; it relies on OpenMP. MKL reported that it was given only 1 thread (MKL_BLAS=1), so it would use only 1 thread. Also, the MKL verbose info is printed before the MKL kernels execute.

If you remove np.dot and keep everything else, you will see that the sched_setaffinity(0, 16, [0]) = 0 and sched_setaffinity(0, 16, [1]) = 0 calls are still there. So we can confirm that the issue is not related to MKL. You can try it on your side.

Here are my logs:

# strace -e trace=sched_setaffinity taskset -c 2-3 $(which python) -c "import numpy as np; a = np.random.normal(size=(1000,1000)); "
sched_setaffinity(0, 16, [2, 3]) = 0
mkl-service + Intel(R) MKL: THREADING LAYER: (null)
mkl-service + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
mkl-service + Intel(R) MKL: preloading libiomp5.so runtime
OMP: Info #155: KMP_AFFINITY: Initial OS proc set respected: 2,3
OMP: Info #216: KMP_AFFINITY: decoding x2APIC ids.
sched_setaffinity(0, 16, [2]) = 0
sched_setaffinity(0, 16, [3]) = 0
sched_setaffinity(0, 16, [2, 3]) = 0
OMP: Info #157: KMP_AFFINITY: 2 available OS procs
OMP: Info #158: KMP_AFFINITY: Uniform topology
OMP: Info #287: KMP_AFFINITY: topology layer "LL cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L3 cache" is equivalent to "socket".
OMP: Info #287: KMP_AFFINITY: topology layer "L2 cache" is equivalent to "core".
OMP: Info #287: KMP_AFFINITY: topology layer "L1 cache" is equivalent to "core".
OMP: Info #192: KMP_AFFINITY: 1 socket x 2 cores/socket x 1 thread/core (2 total cores)
OMP: Info #218: KMP_AFFINITY: OS proc to physical thread map:
OMP: Info #172: KMP_AFFINITY: OS proc 2 maps to socket 0 core 2 thread 0
OMP: Info #172: KMP_AFFINITY: OS proc 3 maps to socket 0 core 3 thread 0
OMP: Info #254: KMP_AFFINITY: pid 76728 tid 76728 thread 0 bound to OS proc set 2,3
sched_setaffinity(0, 16, [2, 3]) = 0
sched_setaffinity(0, 16, [0]) = 0
sched_setaffinity(0, 16, [1]) = 0
sched_setaffinity(0, 16, [2, 3]) = 0
MKL_VERBOSE Intel(R) MKL 2021.0 Update 2 Product build 20210312 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Intel(R) Deep Learning Boost (Intel(R) DL Boost), Lnx 2.10GHz lp64 intel_thread
MKL_VERBOSE SDOT(2,0x556a2f4621b0,1,0x556a2f4621b0,1) 1.86ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:1
+++ exited with 0 +++
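
A trace like this can also be checked mechanically for calls that leave the allowed set. The following is a minimal sketch, not an Intel tool: the function name `find_escapes` and the regular expression are illustrative assumptions, and it only handles the bracketed CPU-list format that strace prints in the log above.

```python
import re

# Matches e.g. "sched_setaffinity(0, 16, [2, 3]) = 0" and captures "2, 3".
PATTERN = re.compile(r"sched_setaffinity\(\d+, \d+, \[([\d, ]*)\]\)")

def find_escapes(trace_lines, allowed):
    """Return the CPU lists of sched_setaffinity calls that include a CPU outside `allowed`."""
    allowed = set(allowed)
    escapes = []
    for line in trace_lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # not a sched_setaffinity line in the expected format
        cpus = {int(tok) for tok in match.group(1).split(",") if tok.strip()}
        if not cpus <= allowed:
            escapes.append(sorted(cpus))
    return escapes

trace = [
    "sched_setaffinity(0, 16, [2, 3]) = 0",
    "sched_setaffinity(0, 16, [2]) = 0",
    "sched_setaffinity(0, 16, [3]) = 0",
    "OMP: Info #157: KMP_AFFINITY: 2 available OS procs",
    "sched_setaffinity(0, 16, [0]) = 0",
    "sched_setaffinity(0, 16, [1]) = 0",
    "sched_setaffinity(0, 16, [2, 3]) = 0",
]
print(find_escapes(trace, allowed={2, 3}))  # prints [[0], [1]]
```

Run against the lines above with the taskset cores {2, 3}, it flags only the two calls that pin to cores 0 and 1.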

Thanks,

Ruqiu

Ruqiu_C_Intel · ‎06-30-2021

Since we didn't hear back from you, we are closing this thread for now. If you require any additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.