Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

How to set affinity of threads spawned by MKL?

styc
Beginner
3,323 Views
I have a program which invokes MKL from within an OpenMP parallel region. It sets $MKL_DYNAMIC and $MKL_NUM_THREADS so that MKL will exploit nested parallelism, and calls MKL to work on different sets of data from different OpenMP threads. Is it possible to set the affinity mask of threads spawned by MKL from a specific function call?
7 Replies
TimP
Honored Contributor III
3,323 Views
You may be able to set the environment variable KMP_AFFINITY or GOMP_AFFINITY prior to the parallel region. I don't think this will be effective when MKL_DYNAMIC is set. If these are your questions, it would be good to have an answer from the library experts.
I'm wondering why I don't find documentation on KMP_AFFINITY=physical, which appears to be the favored setting for HyperThreading.
styc
Beginner
3,323 Views
Quoting - tim18
You may be able to set the environment variable KMP_AFFINITY or GOMP_AFFINITY prior to the parallel region. I don't think this will be effective when MKL_DYNAMIC is set. If these are your questions, it would be good to have an answer from the library experts.
I'm wondering why I don't find documentation on KMP_AFFINITY=physical, which appears to be the favored setting for HyperThreading.
My program sets MKL_DYNAMIC to FALSE. KMP_AFFINITY is basically something I try to avoid because it doesn't seem to work on AMD machines. What I hope to see is that threads executing a call to MKL will inherit the affinity mask of the calling OpenMP thread, or can have their affinity masks specified (perhaps through some sched_setaffinity magic?).
TimP
Honored Contributor III
3,323 Views
Quoting - styc
My program sets MKL_DYNAMIC to FALSE. KMP_AFFINITY is basically something I try to avoid because it doesn't seem to work on AMD machines. What I hope to see is that threads executing a call to MKL will inherit the affinity mask of the calling OpenMP thread, or can have their affinity masks specified (perhaps through some sched_setaffinity magic?).
OK, then MKL_DYNAMIC should not be interfering. When I set KMP_AFFINITY=compact,0,verbose with the 10.1 compiler on a recent AMD machine, it gives me the non-support message, but tells me it is setting affinity as if there are 8 single core CPUs. This is effectively the same as taskset -c 0-7, as far as I can see. I don't see any reasonable behavior other than for the same affinity mask to persist in the nested OpenMP. According to the doc, sched_setaffinity() would be the mechanism used for KMP_AFFINITY, so what you see by sched_getaffinity() should be what MKL is using under OMP_NESTED, subject to its own determination of how many additional threads to use.
I agree with your implication that failing to support affinity mask in a similar way on Intel and AMD platforms would be a serious deficiency.
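For anyone who wants to check what sched_getaffinity() reports, as suggested above, here is a minimal Linux-only sketch. The helper names allowed_cpu_count and print_allowed_cpus are made up for illustration and are not part of MKL or the OpenMP runtime. It assumes glibc with _GNU_SOURCE and a machine with no more CPUs than CPU_SETSIZE.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Hypothetical helper: number of CPUs the calling thread may run on,
   or -1 if the mask could not be read. */
int allowed_cpu_count(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    /* pid 0 means "the calling thread" */
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0)
        return -1;
    return CPU_COUNT(&mask);
}

/* Hypothetical helper: print the CPU ids in the calling thread's mask. */
void print_allowed_cpus(const char *label)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_getaffinity");
        return;
    }
    printf("%s:", label);
    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            printf(" %d", cpu);
    printf("\n");
}
```

Calling print_allowed_cpus() from the outer OpenMP thread just before the MKL call shows the mask that thread holds. It does not by itself show what the threads created inside MKL end up with.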
Dmitry_B_Intel
Employee
3,323 Views

Hello,

The MKL User's Guide has a section with examples on setting the affinity mask by means of the operating system. The section should be named something like "Managing Performance and Memory>Tips and Techniques to Improve Performance>Managing Multi-Core Performance". Keep in mind that the affinity mask is a per-thread attribute (on Linux, at least), so it should be set after the top-level OpenMP threads have been created.

Hope this helps
Thanks
Dima
styc
Beginner
3,323 Views

I tried that, but it did not quite work. I pinned an OpenMP thread to a core (other threads were simply put to wait on a "#pragma omp barrier"), then called DGEMM from it and expected all MKL threads to get stuffed onto one core. But it seemed that MKL did not honor the affinity mask I set: the threads were spread over all cores. Of course this looks crazy. But given that, I really don't know what to do so that on a dual-socket quad-core machine, I can have one (physical) processor handle one DGEMM call and the other processor handle another call from inside the same parallel region.
Dmitry_B_Intel
Employee
3,324 Views

Hi styc,

The instructions in the MKL User's Guide seem to be incomplete. The code snippet there is apparently missing correct thread identification: instead of getpid() one should use syscall(SYS_gettid). Another issue is that the OpenMP layer applies affinity in terms of OpenMP threads, while they are dynamically mapped to OS threads. This issue can be worked around by setting the envvar KMP_AFFINITY=disabled (see Thread Affinity Interface) - this may have performance implications though, I don't know.

In summary, could you try this function for binding current thread to cpus?

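#define _GNU_SOURCE
#include <sched.h>        // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity
#include <sys/syscall.h>  // SYS_gettid
#include <unistd.h>       // syscall

// Handle up to 32 cpus
void bind_me_to(unsigned cpumask)
{
    cpu_set_t mask;
    pid_t tid = syscall(SYS_gettid);  // kernel thread id, not the process id
    int cpuid;

    CPU_ZERO(&mask);
    for (cpuid = 0; cpuid < 32; cpuid++)
    {
        if (cpumask & (1u << cpuid))
            CPU_SET(cpuid, &mask);
    }
    sched_setaffinity(tid, sizeof(mask), &mask);
}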

This function is assumed to be called in the following setup, if I understood you correctly (ensure envvars OMP_DYNAMIC=false and MKL_DYNAMIC=false to allow MKL to thread in nested parallel regions):

#pragma omp parallel default(shared) num_threads(2)
{
    int omp_tid = omp_get_thread_num();
    omp_set_nested(1); // nested parallel regions should be enabled
    if (omp_tid == 0)
    {
        bind_me_to(0x0f); // four threads on one socket
        omp_set_num_threads(4);
        do_dgemm();
    }
    if (omp_tid == 1)
    {
        bind_me_to(0xf0); // four threads on another socket
        omp_set_num_threads(4);
        do_fft();
    }
}
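
Since sched_setaffinity() can fail silently when its return value is ignored, a checked variant may help when debugging a setup like the one above. This is only a sketch under stated assumptions: the name bind_me_to_checked is hypothetical, it is Linux/glibc-only, and it uses pid 0 (which the kernel treats as the calling thread) instead of syscall(SYS_gettid).

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical checked variant of bind_me_to(): binds the calling thread
   to the CPUs in cpumask (bits 0..31) and reads the mask back.
   Returns 0 on success, -1 on failure. */
int bind_me_to_checked(unsigned cpumask)
{
    cpu_set_t want, got;
    int cpuid;

    CPU_ZERO(&want);
    for (cpuid = 0; cpuid < 32; cpuid++)
        if (cpumask & (1u << cpuid))
            CPU_SET(cpuid, &want);

    /* Fails (EINVAL) if the mask contains no CPU this thread may use. */
    if (sched_setaffinity(0, sizeof(want), &want) != 0)
        return -1;

    CPU_ZERO(&got);
    if (sched_getaffinity(0, sizeof(got), &got) != 0)
        return -1;

    /* The kernel may narrow the request (e.g. under cpusets), so accept
       any non-empty result that is a subset of what was asked for. */
    for (cpuid = 0; cpuid < CPU_SETSIZE; cpuid++)
        if (CPU_ISSET(cpuid, &got) && !CPU_ISSET(cpuid, &want))
            return -1;
    return CPU_COUNT(&got) > 0 ? 0 : -1;
}
```

A return value of -1 for a mask such as 0xf0 would indicate that none of CPUs 4-7 are available to the thread, for example because the process was started under taskset or a restricted cpuset.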

I hope this will help
Thanks
Dima

styc
Beginner
3,323 Views

It seems that the key is "KMP_AFFINITY=disabled". The program now works as I expected. Thanks for your response!