New Contributor I

Intel MKL (2019-2021) no longer threads internally when using MPI


Attached is a test case that exhibits a slowdown we have been observing since moving our codes off Intel 2018.  It occurs only with MKL and MPI together; we do not observe it when using MKL without MPI.

In Intel 2018 with MPI, BLAS/LAPACK calls into MKL would thread internally and we got good performance.  Starting with 2019, when using MPI, the MKL calls no longer seem to thread internally.

In the attached test case, we perform two loops: one that we thread ourselves with OpenMP and one that we do not.  Within each loop, we call dgemm (other functions exhibit the issue as well).  With Intel 2018, both loops perform similarly, and for the non-threaded loop we can see from the cpu usage (in top) that MKL is threading the BLAS call internally.  From Intel 2019 onward, however, the non-threaded loop shows no threading in top, and its execution time is much slower than that of the loop we thread explicitly.
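
For reference, the test case has roughly the following shape (a minimal sketch only; the matrix size, loop count, and array initialization here are assumptions, not the values from the attached test_blas.F90):

```fortran
program test_blas_sketch
  implicit none
  integer, parameter :: n = 500, nloop = 64   ! assumed sizes, not the real ones
  real(8), allocatable :: a(:,:), b(:,:), c(:,:), cloc(:,:)
  integer :: i

  allocate(a(n,n), b(n,n), c(n,n))
  a = 1.0d0
  b = 2.0d0

  ! Loop 1: threaded by us.  Each OpenMP thread issues its own dgemm,
  ! so every individual call runs on a single thread.
  !$omp parallel private(cloc)
  allocate(cloc(n,n))
  !$omp do
  do i = 1, nloop
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, cloc, n)
  end do
  !$omp end do
  deallocate(cloc)
  !$omp end parallel

  ! Loop 2: serial loop.  Performance here depends entirely on MKL
  ! threading each dgemm call internally -- this is the loop that
  ! slows down from Intel 2019 onward.
  do i = 1, nloop
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  end do
end program test_blas_sketch
```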

Here are timings from our Linux cluster using 4 MPI processes per node across 4 physical nodes, with 16 cores per process.  Our compile line is

mpiifort test_blas.F90  -traceback -O2 -fpp -qopenmp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl


MKL VERSION                    2018.0.03  2019.0.4  2020.0.4  2021.1
TIME(s) for Non-Threaded:           1.35      16.1      16.1    16.1
TIME(s) for Threaded:               1.35      1.45      1.35    1.30

Why did the threading behavior change from 2019 onward?  Is there a setting in Intel 2019-2021 that recovers the 2018 threading behavior?  If not, can MKL's internal threading under MPI be turned back on in a future release?  This is a critical issue for the performance of our code on clusters.



New Contributor I

So, after more investigation, I tried setting the environment variable MKL_NUM_THREADS=16 (the number of cores per node on our cluster), and the 2019-2021 timings return to the 2018 timings.
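
The workaround amounts to exporting the variable before the MPI launch.  This is a job-script fragment, not a verified recipe: the launcher and its flags depend on your MPI setup, and -genv is the Intel MPI way of propagating a variable to every rank.

```shell
# Give each rank 16 MKL threads (matching the cores available per process).
export MKL_NUM_THREADS=16

# With Intel MPI, -genv forwards the variable to all ranks explicitly.
mpirun -np 4 -genv MKL_NUM_THREADS 16 ./test_blas
```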

MKL VERSION                    2018.0.03  2019.0.4  2020.0.4  2021.1
TIME(s) for Non-Threaded:           1.25      1.25      1.25    1.25
TIME(s) for Threaded:               1.30      1.30      1.30    1.30

Alternatively, I can call mkl_set_num_threads(16) at the start of my program, and the timings for all MKL versions are then similar, at around 1.3 seconds.
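
In code, the workaround is a one-line call before the first BLAS call (a sketch; here I derive the count from omp_get_max_threads() rather than hard-coding 16, and link against MKL and OpenMP as in the compile line above):

```fortran
program mkl_threads_fix
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer, external :: mkl_get_max_threads

  ! Hand MKL the OpenMP thread count explicitly, before any BLAS/LAPACK call.
  call mkl_set_num_threads(omp_get_max_threads())
  print *, 'mkl_get_max_threads() now returns ', mkl_get_max_threads()
end program mkl_threads_fix
```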

However, whether or not I set the MKL_NUM_THREADS environment variable, a call to mkl_get_max_threads within the program returns 16.  So it seems that with Intel 2019-2021, MKL defaults to one thread unless you set MKL_NUM_THREADS (or call mkl_set_num_threads) explicitly.  This seems a very strange default, as the 2018 behavior of using all available threads by default seems much more desirable.

Also, when I do not set the MKL thread count explicitly, why does mkl_get_max_threads return 16 while MKL uses only one thread internally (for 2019-2021)?  This does not make sense.  Shouldn't mkl_get_max_threads return the number of threads MKL will actually use internally (except when called from an already-threaded region)?

Could we get the default behavior for MKL running with MPI returned to that of Intel 2018 in a future release (unless there is a good reason why this was changed)?





New Contributor I

I was slightly mistaken in my last post.  By default, if you do not set the number of MKL threads explicitly, a call to mkl_get_max_threads() in Intel 2019-2021 reports only one thread, even though omp_get_max_threads() returns 16.  In Intel 2018, mkl_get_max_threads() and omp_get_max_threads() return the same value.
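
A minimal reproducer of the discrepancy (a sketch; link against MKL and OpenMP as in the compile line above):

```fortran
program thread_report
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer, external :: mkl_get_max_threads

  ! Under MPI with Intel 2019-2021 and no MKL_NUM_THREADS set, the two
  ! calls below report different values (e.g. 16 vs 1); under 2018 they match.
  print *, 'omp_get_max_threads() = ', omp_get_max_threads()
  print *, 'mkl_get_max_threads() = ', mkl_get_max_threads()
end program thread_report
```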

So mkl_get_max_threads() is not reporting anything inconsistent with the observed behavior after all.

However, it would be nice if MKL used all available threads by default in Intel 2019-2021, as it did in 2018.



Hi John,

We are transferring your query to the internal team, as they can better explain this change in behavior.