Re:MKL library scans available cores, disregards existing cpu affinity

EddyF · ‎10-22-2021

Hi, this is about the same issue as this earlier thread: https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-library-scans-available-cores-touching-nodes-it-absolutely/m-p/1284598

Since the earlier thread was never resolved, I'm opening a new thread (as suggested in that thread).

Here is what I'm running to reproduce the problem:

conda create -n mkl -c intel mkl-service
conda activate mkl
taskset -c 2,4,5 strace -e trace=sched_setaffinity python -c 'import mkl; mkl.get_num_stripes()'

Output:

sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [4])            = 0
sched_setaffinity(0, 8, [5])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [0])            = 0
sched_setaffinity(0, 8, [1])            = 0
sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
+++ exited with 0 +++

As we can see there are two phases happening here:

First it scans through cpus 2, 4, and 5
Later, it scans through cpus 0, 1, and 2

It seems that if the initial affinity set contains N cpus, then the second phase above always scans through cpus 0 through N-1, regardless of which cpus were actually in the affinity set. This looks like very strange and clearly buggy behavior.
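To make the suspected pattern concrete, here is a small Python sketch. MKL's source is not public, so this is only a guess at the kind of logic that would produce the trace above (iterating over positions 0..N-1 instead of over the cpu ids in the mask). The function names are made up for illustration and are not MKL functions.

```python
def scan_by_cpu_id(affinity):
    """Pin to each cpu that is actually in the mask (what phase one appears to do)."""
    return [[cpu] for cpu in sorted(affinity)]


def scan_by_position(affinity):
    """Pin to 0..N-1, where N is the mask size (what phase two appears to do)."""
    return [[i] for i in range(len(affinity))]


mask = {2, 4, 5}  # the set passed to taskset -c in the repro
print(scan_by_cpu_id(mask))    # [[2], [4], [5]]
print(scan_by_position(mask))  # [[0], [1], [2]]
```

The second function reproduces the [0], [1], [2] sequence seen in the trace, which is why the behavior looks like an index being used where a cpu id was intended.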

Using gdb, I was able to figure out that all of these sched_setaffinity calls are happening inside of a function called mkl_serv_get_num_stripes. Furthermore, the "first phase" (where we scan through the correct cpus) is happening inside of a sub-call to omp_get_num_procs; the "second phase" (which is buggy) happens inside of mkl_serv_get_num_stripes itself.

What can be done to fix this?

RahulV_intel · ‎10-25-2021

Hi,

Thanks for posting on the MKL forum. I could reproduce your output with oneAPI 2021.4, but I would further need to check with the team internally whether MKL is causing this behavior.

Regards,

Rahul

Ruqiu_C_Intel · ‎10-28-2021

Hi,

As the earlier thread mentioned, the issue might be in Python itself or OpenMP rather than MKL.

You can test without MKL involved, and the issue still exists. For more details, please refer to the earlier thread: https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-library-scans-available-cores-to...

Thanks,

Ruqiu

EddyF · ‎10-28-2021

Hi, thanks for the response! Unfortunately I'm quite sure that the issue is coming from inside of MKL. I have a simpler reproducible example than the earlier thread, which makes it clearer that MKL is the problem.

This is how I create my python environment and install MKL:

conda create -n mkl -c intel mkl-service
conda activate mkl

As you can see, I'm only installing the mkl-service package (along with its dependencies), from the Intel conda channel to make sure it's the latest official version.

If I run the following then the issue appears:

Input:

taskset -c 2,4,5 strace -e trace=sched_setaffinity python -c 'import mkl; mkl.get_num_stripes()'

Output:

sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [4])            = 0
sched_setaffinity(0, 8, [5])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [0])            = 0
sched_setaffinity(0, 8, [1])            = 0
sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
+++ exited with 0 +++

However, if I skip calling mkl.get_num_stripes(), then the issue does not appear:

Input:

taskset -c 2,4,5 strace -e trace=sched_setaffinity python -c 'import mkl'

Output:

+++ exited with 0 +++

This shows that the sched_setaffinity calls are happening inside of mkl.get_num_stripes().
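A trace like the ones above can also be checked mechanically: parse the strace lines and flag every call whose mask contains a cpu outside the taskset list. This is a minimal sketch, assuming strace prints masks in the bracketed form shown above (ranges or other formats are not handled); the helper names are hypothetical.

```python
import re

# Matches lines such as: sched_setaffinity(0, 8, [2, 4, 5])      = 0
LINE_RE = re.compile(r"sched_setaffinity\(\d+,\s*\d+,\s*\[([^\]]*)\]\)")


def parse_masks(strace_text):
    """Return the cpu list of every sched_setaffinity call, in order."""
    masks = []
    for line in strace_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            tokens = match.group(1).replace(",", " ").split()
            masks.append([int(tok) for tok in tokens])
    return masks


def out_of_set(masks, allowed):
    """Return the masks that touch at least one cpu outside `allowed`."""
    allowed = set(allowed)
    return [mask for mask in masks if not set(mask) <= allowed]


TRACE = """\
sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [4])            = 0
sched_setaffinity(0, 8, [5])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [0])            = 0
sched_setaffinity(0, 8, [1])            = 0
sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
+++ exited with 0 +++
"""

print(out_of_set(parse_masks(TRACE), {2, 4, 5}))  # [[0], [1]]
```

Note that the [2] in the second phase is not flagged, because cpu 2 happens to be in the allowed set; only the calls for cpus 0 and 1 fall outside it.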

As I was saying at the start of this thread, I have already used gdb to investigate further, and I figured out exactly where sched_setaffinity was being called from:

The first few calls (scanning through 2, 4, 5) happen inside of an OpenMP function called omp_get_num_procs, which is called from an MKL function called mkl_serv_get_num_stripes
The remaining calls (scanning through 0, 1, 2, which is erroneous) happen directly inside of mkl_serv_get_num_stripes

I have also spent some time stepping through the execution of mkl_serv_get_num_stripes in gdb (one assembly instruction at a time) and I could see exactly where it was triggering the erroneous sched_setaffinity syscalls. Are you familiar with this function? I guess it is an internal MKL function and its source code is not publicly available. It would be great if someone within Intel who has access to the source code and knows how it is built could have a closer look and confirm if these observations make sense.

Ruqiu_C_Intel · ‎11-01-2021

Hi Eddy,

Thanks for the information! We are investigating it internally and will let you know once there is any update.

Ruqiu_C_Intel · ‎11-22-2021

Hi,

Have you tried setting MKL_DYNAMIC=false? Its default value is true, which lets oneMKL adjust the number of threads to get the best performance. Switching off MKL_DYNAMIC lets the user set the number of threads they want. For more details, please check the MKL documentation here:

https://www.intel.com/content/www/us/en/develop/documentation/onemkl-windows-developer-guide/top/managing-performance-and-memory/improving-performance-with-threading/using-additional-threading-control/mkl-dynamic.html

Regards,

Ruqiu

EddyF · ‎11-22-2021

I just tried it but it doesn't seem to affect anything and the end result is the same. Using the same minimal reproducible example from my earlier posts:

Input:

MKL_DYNAMIC=false taskset -c 2,4,5 strace -e trace=sched_setaffinity python -c 'import mkl; mkl.get_num_stripes()'

Output:

sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [4])            = 0
sched_setaffinity(0, 8, [5])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
sched_setaffinity(0, 8, [0])            = 0
sched_setaffinity(0, 8, [1])            = 0
sched_setaffinity(0, 8, [2])            = 0
sched_setaffinity(0, 8, [2, 4, 5])      = 0
+++ exited with 0 +++

I also tried MKL_DYNAMIC=FALSE and the result is exactly the same. @Ruqiu_C_Intel is it unexpected that this didn't work? Does it work for you? Perhaps there are some additional environment variables that I need to set? I have tried many combinations and have not yet been able to find anything that works for skipping the faulty sched_setaffinity logic.
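For trying several environment-variable settings against the same reproducer, a small harness can help. This is a rough sketch, assuming taskset, strace, and the mkl conda environment are available on the machine; the helper names are hypothetical, and only the two MKL_DYNAMIC values mentioned above are listed.

```python
import os
import shlex
import subprocess

# The reproducer from this thread, as an argument list (no shell involved).
REPRO = ["taskset", "-c", "2,4,5", "strace", "-e", "trace=sched_setaffinity",
         "python", "-c", "import mkl; mkl.get_num_stripes()"]

# Settings to try; extend this list with other variables as needed.
VARIANTS = [{"MKL_DYNAMIC": "false"}, {"MKL_DYNAMIC": "FALSE"}]


def format_command(overrides, argv=REPRO):
    """Render a variant as a copy-pasteable shell command line."""
    prefix = [f"{key}={shlex.quote(value)}" for key, value in overrides.items()]
    return " ".join(prefix + [shlex.quote(arg) for arg in argv])


def run_variant(overrides, argv=REPRO):
    """Run the reproducer with extra environment variables; return the strace text."""
    env = dict(os.environ)
    env.update(overrides)
    proc = subprocess.run(argv, env=env, capture_output=True, text=True, check=False)
    return proc.stderr  # strace writes its trace to stderr by default


for overrides in VARIANTS:
    print(format_command(overrides))
```

Calling run_variant(overrides) for each entry and comparing the returned traces shows quickly whether any setting changes the sequence of sched_setaffinity calls.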

Ruqiu_C_Intel · ‎12-21-2021

Hi Eddy,

Thanks for your patience. We are still investigating the issue.

As for why the issue does not appear when you skip calling mkl.get_num_stripes(): in that case no MKL function gets called. In fact, the observed issue is only triggered when a global thread-control function is called.

Regards,

Ruqiu

EddyF · ‎01-04-2022

Thanks for the update! Yes, I agree, the issue only gets triggered when a certain global thread-control function is called (and a lot of MKL functions cause that thread-control function to be called, so the issue ends up happening with almost any MKL function). I'll be on the lookout for further updates here.

Ruqiu_C_Intel · ‎03-01-2022

The fix will be available in our next version. Thank you for your patience.

EddyF · ‎03-11-2022

Great! Thank you!