Intel® oneAPI Math Kernel Library

Intel MKL (2019-2021) no longer threads internally when using MPI

John_Young
New Contributor I

Hi,

Attached is a test case that exhibits a slowdown we have been observing since moving our codes off Intel 2018. The slowdown only occurs when MKL is used together with MPI; we do not observe it when using MKL without MPI.

In Intel 2018 with MPI, BLAS/LAPACK calls into MKL would thread internally and we got good performance. Starting with 2019, when using MPI, the MKL calls no longer seem to thread internally.

In the attached test case, we run two loops: one that we thread ourselves with OpenMP and one that we do not thread. Within each loop, we call dgemm (other functions also exhibit the issue). With Intel 2018, both loops perform similarly, and for the non-threaded loop we can see (via the CPU usage reported by top) that MKL is threading the BLAS call internally. From Intel 2019 onward, however, the non-threaded loop shows no threading in top, and its execution time is much slower than the loop we thread explicitly. The structure of the two loops is sketched below.
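For reference, the test is structured roughly as follows (a minimal sketch, not the attached test_blas.F90 itself; the matrix size, loop count, and timing calls are illustrative assumptions):

program test_blas_sketch
  use mpi
  use omp_lib
  implicit none
  integer, parameter :: n = 256, nloops = 64   ! illustrative sizes
  real(8), allocatable :: a(:,:), b(:,:), c(:,:)
  real(8) :: t0
  integer :: i, ierr

  call MPI_Init(ierr)   ! the slowdown only appears once MPI is in use
  allocate(a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  ! Loop 1: not threaded by us -- we rely on MKL threading each dgemm internally
  t0 = MPI_Wtime()
  do i = 1, nloops
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  end do
  print *, 'loop relying on MKL threading (s):', MPI_Wtime() - t0

  ! Loop 2: threaded explicitly with OpenMP -- each dgemm then runs on one thread
  t0 = MPI_Wtime()
  !$omp parallel do private(c)
  do i = 1, nloops
     call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
  end do
  !$omp end parallel do
  print *, 'loop threaded explicitly (s):     ', MPI_Wtime() - t0

  call MPI_Finalize(ierr)
end program test_blas_sketch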

Here are timings from our Linux cluster using 4 MPI processes bound to 4 physical nodes (4 processes per node), with 16 cores per process. Our compile line is

mpiifort test_blas.F90  -traceback -O2 -fpp -qopenmp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl
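(Note that -lmkl_intel_thread and -liomp5 link the OpenMP-threaded layer of MKL, so MKL is expected to thread internally.)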

 

MKL version:                                2018.0.3   2019.0.4   2020.0.4   2021.1
Time (s), loop threaded explicitly:           1.35       1.45       1.35       1.30
Time (s), loop relying on MKL threading:      1.35       16.1       16.1       16.1

Why did the threading behavior change from 2019 onward? Is there a setting in Intel 2019-2021 that recovers the threading behavior of 2018? If not, can MKL's internal threading under MPI be turned back on in a future release? This is a critical issue for the performance of our code on clusters.

Thanks,

John

John_Young
New Contributor I

Hi Gennady,

 

Attached is the test we use on our cluster. The run.sh script shows how we start the test on our Linux cluster. We still observe the problem in Intel 2021.3. The output we see is in the screen_*.txt files, for single-node runs where the node has 16 cores and we assign 4 MPI processes to the node (so each MPI process should use 4 cores).

 

You can see in the Intel 2018 data that all three loops (calling dgemm) take about 4 seconds. For Intel 2019 through 2021.3, however, the explicitly threaded loop takes about 4 seconds while the first loop, which relies on MKL threading, takes 16 seconds. Then, after an explicit call to mkl_set_num_threads (sketched just below), the loop relying on MKL threading drops back to 4 seconds.
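For clarity, the only change between the slow and fast MKL-threaded runs is a call of the form (the thread count of 4 is an assumption matching the 4 cores per MPI process in these runs):

call mkl_set_num_threads(4)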

 

 

Gennady_F_Intel
Moderator

Yes, the MKL behavior changed: when MKL routines are called from within MPI processes, the call is now executed sequentially by default.

 

You can manage the number of OpenMP threads by explicitly disabling MKL's dynamic adjustment of the number of OpenMP threads and setting the desired number of threads:

 

mkl_set_dynamic(0);

mkl_set_num_threads( omp_get_max_threads());
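For a Fortran code such as the attached test case, the equivalent calls would be (a sketch; mkl_set_dynamic and mkl_set_num_threads are MKL's Fortran service routines, and omp_get_max_threads comes from omp_lib):

use omp_lib
! ... after MPI_Init and before the first MKL call:
call mkl_set_dynamic(0)
call mkl_set_num_threads(omp_get_max_threads())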

 

For example, I added the dgemm call to the original example you shared and compiled the code against oneMKL 2021.

 

CPU: 2 sockets * 20 threads = 40 threads total.
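(The MKL_VERBOSE lines below are produced by setting the environment variable MKL_VERBOSE=1 before running.)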

 

$ mpirun -n 1 ./2021.x

MPI world size=1

After MPI_Init():

omp_max_num_threads=40

mkl_max_num_threads=1

MKL_VERBOSE oneMKL 2021.0 Update 3 Product build 20210617 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.40GHz lp64 intel_thread

MKL_VERBOSE DGEMM(N,N,256,256,256,0x7ffeeab409c0,0x2afd55def080,256,0x2afd55e70080,256,0x7ffeeab409c8,0x2afd55ef1080,256) 34.56ms CNR:OFF Dyn:0 FastMM:1 TID:0 NThr:40

 

Or with 4 MPI processes:

$ mpirun -n 4 ./2021.x

.....

omp_max_num_threads=10

....

mkl_max_num_threads=1

...

MKL_VERBOSE oneMKL 2021.0 Update 3 Product build 20210617 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.40GHz lp64 intel_thread

MKL_VERBOSE DGEMM(N,N,256,256,256,0x7ffeeaa6f700,0x2b12ef7ab080,256,0x2b12ef82c080,256,0x7ffeeaa6f708,0x2b12ef8ad080,256) 2.61ms CNR:OFF Dyn:0 FastMM:1 TID:0 NThr:10
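Note the NThr field in the MKL_VERBOSE lines: after the two calls above, each MPI rank threads the DGEMM across all of its available OpenMP threads (NThr:40 with 1 rank, NThr:10 per rank with 4 ranks) instead of running it sequentially.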


Gennady_F_Intel
Moderator

This issue has been resolved and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only. 


