Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Intel MKL (2019-2021) no longer threads internally when using MPI

John_Young
New Contributor I
4,278 Views

Hi,

Attached is a test case that exhibits a slowdown we have been observing in our codes since moving off Intel 2018.  It only occurs when MKL is used together with MPI; we do not observe it when using MKL without MPI.

In Intel 2018 with MPI, BLAS/LAPACK calls into MKL would thread internally and we got good performance.  Starting with 2019, when using MPI, the MKL calls no longer seem to thread internally.

In the attached test case we run two loops: one we thread ourselves with OpenMP and one we leave unthreaded.  Within each loop we call dgemm (other functions also exhibit the issue).  With Intel 2018, both loops perform similarly, and for the non-threaded loop we can see from the CPU usage (in top) that MKL is threading the BLAS call internally.  From Intel 2019 onward, however, the non-threaded loop shows no threading in top, and its execution time is much slower than that of the loop we thread explicitly.
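For reference, the structure looks roughly like the following minimal C++ sketch (the attached test case is Fortran and calls dgemm directly; the matrix size, iteration count, and the use of the CBLAS interface here are placeholders for illustration, and the MPI setup is omitted):

#include <mkl.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1000, iters = 20;              // placeholder sizes
    std::vector<double> a(n * n, 1.0), b(n * n, 1.0);

    // Loop 1: threaded by us with OpenMP. Each dgemm call should then run
    // single-threaded inside MKL, since MKL normally detects the enclosing
    // parallel region.
    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (int i = 0; i < iters; ++i) {
        std::vector<double> c(n * n, 0.0);       // per-iteration output, avoids races
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
    }
    std::printf("Threaded loop:     %.2f s\n", omp_get_wtime() - t0);

    // Loop 2: not threaded by us. Here we expect MKL to thread each dgemm
    // internally, as it does in the 2018 release.
    double t1 = omp_get_wtime();
    for (int i = 0; i < iters; ++i) {
        std::vector<double> c(n * n, 0.0);
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                    1.0, a.data(), n, b.data(), n, 0.0, c.data(), n);
    }
    std::printf("Non-threaded loop: %.2f s\n", omp_get_wtime() - t1);
    return 0;
}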

Here are timings from our Linux cluster using 4 MPI processes on 4 physical nodes (one process per node), with 16 cores per process.  Our compile line is

mpiifort test_blas.F90  -traceback -O2 -fpp -qopenmp -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -lmkl_blacs_intelmpi_lp64 -liomp5 -lpthread -lm -ldl

 

MKL VERSION                  2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Threaded:           1.35        1.45       1.35      1.30
TIME(s) for Non-Threaded:       1.35        16.1       16.1      16.1

Why did the threading behavior change from 2019 onward? Is there a setting in Intel 2019-2021 that recovers the 2018 threading behavior?  If not, can internal MKL threading with MPI be turned back on in a future release? This is a critical issue for the performance of our code on clusters.

Thanks,

John

John_Young
New Contributor I
3,853 Views

After more investigation, I tried setting the environment variable MKL_NUM_THREADS=16 (the number of cores on each cluster node), and the 2019-2021 timings return to the 2018 timings.

MKL VERSION                  2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:       1.25        1.25       1.25      1.25
TIME(s) for Threaded:           1.30        1.30       1.30      1.30

Alternatively, I can call mkl_set_num_threads(16) at the start of my program, and the timings for all MKL versions are then similar, around 1.3 seconds.
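For reference, a minimal sketch of where this call goes (C++ here for brevity; our actual test is Fortran, and the hard-coded 16 simply matches the cores available to each MPI rank, so it is illustrative only):

#include <mkl.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Workaround for 2019-2021: force MKL's internal thread count explicitly,
    // since the default appears to be 1 when the program is built with MPI.
    mkl_set_num_threads(16);   // 16 = cores per MPI rank on our cluster (illustrative)

    // ... dgemm calls made from serial regions now thread internally again ...

    MPI_Finalize();
    return 0;
}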

However, whether or not I set the MKL_NUM_THREADS environment variable, a call to mkl_get_max_threads within the program returns 16.  So it seems that with Intel 2019-2021, MKL defaults to one thread unless you set MKL_NUM_THREADS (or call mkl_set_num_threads) explicitly.  That is a very strange default; the 2018 behavior of using all available threads seems much more desirable.

Also, when I do not set the MKL thread count explicitly, why does mkl_get_max_threads return 16 while MKL uses only one thread internally (for 2019-2021)?  This does not make sense.  Shouldn't mkl_get_max_threads return the number of threads that MKL will actually use internally (except when already inside a threaded region)?

Could the default behavior of MKL with MPI be restored to that of Intel 2018 in a future release (unless there is a good reason for the change)?

Thanks,

John

 

 

John_Young
New Contributor I
3,840 Views

I was slightly mistaken in my last post.  If you do not set the number of MKL threads explicitly, a call to mkl_get_max_threads() in Intel 2019-2021 reports only one thread, even though omp_get_max_threads() returns 16.  In Intel 2018, mkl_get_max_threads() and omp_get_max_threads() return the same value.

So mkl_get_max_threads() is at least consistent with the observed behavior.

However, it would be nice if MKL used all the threads by default in Intel 2019-2021.

 

PrasanthD_intel
Moderator
3,822 Views

Hi John,


We are transferring your query to the internal team, as they can better explain the change in behavior.


Regards

Prasanth


Khang_N_Intel
Employee
3,631 Views

Hi John,

I built and linked the code based on the instruction you provided.

I don't have a cluster system to run on, so I just ran it on one system with one rank (rank 0). Here is the result:


Threaded: 0.616 on rank 0

Non-Threaded: 3.950 on rank 0


I used the latest version of oneMKL, 2021.2.0


It seems like the threaded version is better than the non-threaded version.


MRajesh_intel
Moderator
3,581 Views

Hi,


Can you please provide an update regarding the issue?


Regards

Rajesh.


John_Young
New Contributor I
3,566 Views

Hi Rajesh and Khang,

 

We are still observing the issue for Intel 2019-2021. 

 

@Khang:  I apologize, I missed your last reply above.  The timings you show actually demonstrate the problem.  The threaded loop is threaded explicitly, so the MKL BLAS calls inside it should be single-threaded.  The non-threaded loop executes sequentially, but the MKL BLAS calls inside it should thread internally.  If things were working properly, the threaded and non-threaded times would therefore be similar.  Your result indicates that MKL is not threading internally when called from a non-threaded region.  In Intel MKL 2018 this worked properly.  Note that this only happens when the code is built with MPI.

 

Here are some timings I ran today (code attached).  In this case, I ran the threaded code twice (A and B below).  Before the second (B) run of the threaded code, I explicitly called mkl_set_num_threads(16) to force MKL to use 16 threads.

 

BUILT WITH MPI, run with 4 MPI processes (16 cores per process); time in seconds

MKL VERSION                  2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:       1.34        1.45       1.54      1.50
TIME(s) for Threaded A:         1.33        16.1       16.1      16.2
TIME(s) for Threaded B:         1.31         1.3        1.4       1.5

 

Next, I built the code without MPI (removing all MPI calls) and ran it on a node with 16 cores.  The timings are

 

BUILT WITHOUT MPI

 

TIME(s) for Threaded After calling set_mkl_num_threads explicitly

MKL VERSION                  2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Non-Threaded:       1.44        1.48       1.44      1.50
TIME(s) for Threaded A:         1.33        1.30       1.37      1.34
TIME(s) for Threaded B:         1.31        1.28       1.31      1.29

 

When built without MPI, both threaded runs (A and B) give the same good time. This indicates that, without MPI, Intel MKL sets the number of threads properly, so MKL calls are threaded when made from a serial region. However, when built with MPI, Intel MKL 2019-2021 somehow does not detect the number of threads properly, and the user is forced to set it explicitly to get threaded performance.  Since Intel MKL 2018 gives good results in all cases, I believe something got broken starting with Intel 2019.

 

Thanks,

John
John_Young
New Contributor I
3,566 Views

Note that the two lines:

 

       BUILT WITHOUT MPI

      TIME(s) for Threaded After calling set_mkl_num_threads explicitly

 

should just read

 

      BUILT WITHOUT MPI

John_Young
New Contributor I
3,564 Views

Sorry, I messed up the tables. The threaded and non-threaded labels are reversed.  They should read:

 

BUILT WITH MPI, run with 4 MPI processes (16 cores per process); time in seconds

MKL VERSION                     2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Threaded:              1.34        1.45       1.54      1.50
TIME(s) for Non-Threaded A:        1.33        16.1       16.1      16.2
TIME(s) for Non-Threaded B:        1.31         1.3        1.4       1.5

BUILT WITHOUT MPI (16 cores)

MKL VERSION                     2018.0.03   2019.0.4   2020.0.4   2021.1
TIME(s) for Threaded:              1.44        1.48       1.44      1.50
TIME(s) for Non-Threaded A:        1.33        1.30       1.37      1.34
TIME(s) for Non-Threaded B:        1.31        1.28       1.31      1.29

fanselm
New Contributor I
3,512 Views

Hi John,

As another Intel MPI / MKL user I just want to report that I have observed the same behavior - so it's not just you.

The problem I have found is that if one has not called an MKL function before the call to MPI_Init(), then MPI_Init() will set the MKL maximum thread count to 1, even when the program is not launched through mpiexec but just started in serial.

Here's a little C++ program that reproduces the problem:

#include <mkl.h>
#include <mpi.h>
#include <iostream>
#include <cassert>

int main()
{
    // Calling mkl_get_max_threads() before initializing MPI leads to
    // max threads equal to number of physical cores. If not, max threads
    // will be 1 when we initialize MPI.
    //std::cout << "Before MPI_Init():\n";
    //std::cout << "mkl_max_num_threads=" << mkl_get_max_threads() << "\n";
    int mpi_argc = 0;
    char** mpi_argv = nullptr;
    assert(MPI_Init(&mpi_argc, &mpi_argv) == MPI_SUCCESS);
    int world_size = 0;
    assert(MPI_Comm_size(MPI_COMM_WORLD, &world_size) == MPI_SUCCESS);
    std::cout << "MPI world size=" << world_size << "\n";
    std::cout << "After MPI_Init():\n";
    std::cout << "mkl_max_num_threads=" << mkl_get_max_threads() << "\n";

    assert(MPI_Finalize() == MPI_SUCCESS);
}

When I compile and run with earlier versions of MKL/MPI I get (on my 4 core laptop, no mpiexec):

MPI world size=1
After MPI_Init():
mkl_max_num_threads=4

While if I compile and run with oneAPI 2021 I get

MPI world size=1
After MPI_Init():
mkl_max_num_threads=1

If I uncomment the lines above the MPI_Init() call I get the same behavior with oneAPI 2021 as in earlier versions.

It's as if MPI_Init() doesn't care whether the program is actually run in parallel (more than one process): it always sets the maximum number of threads to 1. I don't know whether this is expected behavior or a bug, but it is not documented anywhere.

Hopefully Intel will return with an answer...?

John_Young
New Contributor I
3,494 Views

Hi fanselm,

 

I have concluded this is a bug.  I cannot think of any good reason for the default in an HPC program to be not using all the threads/cores available.  It worked properly in Intel MKL/MPI 2018, so something changed.

 

Thank you for the alternative workaround.  I find your method cleaner than mine since you don't have to determine the number of threads to set.  The only issue I would be concerned about is that the MPI documentation is a bit vague about what is allowed before the call to MPI_Init; calling mkl_get_max_threads before MPI_Init could possibly land you in undefined behavior.

 

John_Young
New Contributor I
3,492 Views

Sorry,

I misunderstood your program; it is not a workaround for the issue.  The difference in behavior it reveals is strange, though.  Calling mkl_get_max_threads before MPI_Init may be undefined behavior (I'm not sure), at which point anything can happen.

 

Here is what I see with your program with 4 mpi processes and 16 cores per process. In the A-row, I did not call mkl_get_max_threads before MPI_Init, and, in the B-row, I did call mkl_get_max_threads before MPI_Init.  I think I see slightly different behavior than you reported.

MKL/MPI version           2018   2019   2020   2021
maxReportedThreads A        16      1      1      1
maxReportedThreads B         1      1      1      1

fanselm
New Contributor I
3,464 Views

Hi John,

I agree that the documentation is not very clear on what the default is in certain situations. However, I think Intel changed the behavior from 2018 to 2019 so that if you run your program with multiple MPI processes, the default number of MKL threads is set to one to avoid over-subscription. This would make sense, as many older academic codes are not well parallelized with threads and run with #MPI processes = #cores, i.e. single-threaded. However, it obviously shouldn't do this when you run the program in serial, especially not when you don't even start the program through mpiexec. When you run in serial, the most natural thing is to use all available cores for threads.

In our case, when running MPI-parallel, we always set OMP_NUM_THREADS manually so that OMP_NUM_THREADS x MPI_PROCS_PER_NODE = CORES_PER_NODE. When running our program in serial (not on a cluster), we just want threading across all the cores available. I have therefore made the following workaround:

 

#include <mkl.h>
#include <mpi.h>

int main(int argc, char** argv)
{
    // Get the non-MPI default number of threads before MPI_Init() resets it.
    int default_num_threads = mkl_get_max_threads();
    MPI_Init(&argc, &argv);
    int world_size = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    if (world_size == 1) {
        // If we're running in serial, set the number of threads to the
        // non-MPI default.
        mkl_set_num_threads(default_num_threads);
    }
    // ... rest of the program, then MPI_Finalize() ...
    MPI_Finalize();
    return 0;
}

 

John_Young
New Contributor I
3,449 Views

fanselm,

 

I am not sure why over-subscription would be an issue.  The Intel MPI environment is able to detect the number of MPI processes per node and the number of cores per node, and set the number of OpenMP threads per MPI process such that over-subscription does not occur.  In fact, this is how we work around the issue in 2019-2021: we call omp_get_num_procs and use the result in the call to mkl_set_num_threads, as in the sketch below.
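A minimal sketch of that workaround (assuming Intel MPI's default pinning, so that omp_get_num_procs() reflects the cores assigned to the current rank; C++ here for brevity, our actual code is Fortran):

#include <mkl.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // omp_get_num_procs() reports the cores visible to this rank under the
    // MPI runtime's pinning, so using it as MKL's thread count avoids
    // over-subscription without setting OMP_NUM_THREADS by hand.
    mkl_set_num_threads(omp_get_num_procs());

    // ... BLAS/LAPACK work ...

    MPI_Finalize();
    return 0;
}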

 

On our cluster, below is the result of omp_get_num_procs on a single node with 16 cores for runs of 1 to 16 MPI processes.  It confirms that the OpenMP and MPI runtimes communicate well enough to prevent over-subscription (although not all cores may be used when the number of MPI processes does not divide the core count evenly), without OMP_NUM_THREADS being set in the environment.  So MKL should easily be able to set an appropriate number of threads per MPI process to avoid over-subscription, as it did in 2018.

                              

                                      omp_get_num_procs() for MKL/MPI version

NumMPIProc    2018   2019    2020    2021

1             16      16      16       16
2              8       8       8        8
3              5       5       5        5
4              4       4       4        4
5              3       3       3        3
6              2       2       2        2
7              2       2       2        2
8              2       2       2        2
9              1       1       1        1
10             1       1       1        1
11             1       1       1        1
12             1       1       1        1
13             1       1       1        1
14             1       1       1        1
15             1       1       1        1
16             1       1       1        1
MRajesh_intel
Moderator
3,360 Views

Hi,

 

Can you please provide the exact compiler, MKL, MPI, and OS versions with which you have tried running your sample (2018, 2019, 2021)?

 

Regards

Rajesh.

 

fanselm
New Contributor I
3,354 Views

Hi Rajesh,

I use MPI v. 2021.2.0 and MKL v. 2021.2.0 with GCC 9.2.0 using Intel OpenMP (iomp) instead of gomp on Ubuntu 16.04. I compile with:

mpicxx -std=c++14 -fopenmp -m64 -I"/opt/intel/oneapi/mkl/latest/include" stat_num_threads.cpp -o stat_num_threads \
-L/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64 -L/opt/intel/oneapi/mkl/latest/lib/intel64 \
-Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl

 and doing mpicxx -show I get

g++ -I"/opt/intel/oneapi/mpi/2021.2.0/include" -L"/opt/intel/oneapi/mpi/2021.2.0/lib/release" -L"/opt/intel/oneapi/mpi/2021.2.0/lib" -Xlinker --enable-new-dtags -Xlinker -rpath -Xlinker "/opt/intel/oneapi/mpi/2021.2.0/lib/release" -Xlinker -rpath -Xlinker "/opt/intel/oneapi/mpi/2021.2.0/lib" -lmpicxx -lmpifort -lmpi -lrt -lpthread -Wl,-z,now -Wl,-z,relro -Wl,-z,noexecstack -Xlinker --enable-new-dtags -ldl

 

John_Young
New Contributor I
3,339 Views

I have run the samples on 64-bit CentOS 7 (kernel 3.10.0).  The MKL/MPI/compiler versions are

 

MKL:    2018.0.3        2019.0.4        2020.0.4        2021.1    2021.3
IFORT:  18.0.3          19.0.4.243      19.1.3.304      2021.1    2021.3.0
MPI:    2018 Update 3   2019 Update 4   2019 Update 9   2021.1    2021.3

 

All versions exhibit the issue except for Intel 2018.

Gennady_F_Intel
Moderator
3,317 Views

That's a little strange; a similar problem was fixed in 2021.3 and many customers confirmed the fix. Could you give the execution times you see with MKL 2018 and 2021.3?

And please give us the reproducer you use.

 

fanselm
New Contributor I
3,305 Views

Hi Gennady,

The output of my little program with impi 2018.1.163 (but with iomp and mkl 2021.3.0 as I don't have those for v2018) when run in serial is:

 

$> ./stat_num_threads 
MPI world size=1
After MPI_Init():
omp_max_num_threads=8
mkl_max_num_threads=4

 

 and when run with 2 processes:

 

$> mpirun -n 2 ./stat_num_threads 
MPI world size=2
After MPI_Init():
omp_max_num_threads=4
mkl_max_num_threads=2

 

 

When linked and run with MPI 2021.3.0 the output in serial is:

 

$> ./stat_num_threads 
MPI world size=1
After MPI_Init():
omp_max_num_threads=8
mkl_max_num_threads=1

 

and when run with 2 processes:

 

mpirun -n 2 ./stat_num_threads 
MPI world size=2
After MPI_Init():
omp_max_num_threads=4
mkl_max_num_threads=1

 

 

Here's a table that summarizes:

Number of MPI processes            1    2    4
2018.1.163  omp_get_max_threads    8    4    2
2018.1.163  mkl_get_max_threads    4    2    1
2021.3.0    omp_get_max_threads    8    4    2
2021.3.0    mkl_get_max_threads    1    1    1

 

So as you can see, omp_get_max_threads() correctly takes the number of processes into account when determining the maximum number of threads (though it counts one thread per hyper-threaded logical core).  mkl_get_max_threads() used to do the same in 2018, but counting only physical cores; in 2021.3.0 it always sets the default number of threads to 1 in an MPI context, no matter whether the program is run in serial or in parallel.

Gennady_F_Intel
Moderator
3,298 Views

I would like to reproduce the performance regression between the 2018 and 2021.3 versions of MKL.

Could you give us the exact code you use, and show us how you call the executable?

 

fanselm
New Contributor I
3,290 Views

Sure, I posted the code above, but here it is again:

stat_num_threads.cpp:

#include <mkl.h>
#include <mpi.h>
#include <omp.h>
#include <iostream>
#include <cassert>

int main()
{
    // Calling mkl_get_max_threads() before initializing MPI leads to
    // max threads equal to number of physical cores. If not, max threads
    // will be 1 when we initialize MPI.
    //std::cout << "Before MPI_Init():\n";
    //std::cout << "omp_max_num_threads=" << omp_get_max_threads() << "\n";
    //std::cout << "mkl_max_num_threads=" << mkl_get_max_threads() << "\n";
    int mpi_argc = 0;
    char** mpi_argv = nullptr;
    assert(MPI_Init(&mpi_argc, &mpi_argv) == MPI_SUCCESS);
    int world_size = 0;
    assert(MPI_Comm_size(MPI_COMM_WORLD, &world_size) == MPI_SUCCESS);
    std::cout << "MPI world size=" << world_size << "\n";
    std::cout << "After MPI_Init():\n";
    std::cout << "omp_max_num_threads=" << omp_get_max_threads() << "\n";
    std::cout << "mkl_max_num_threads=" << mkl_get_max_threads() << "\n";

    assert(MPI_Finalize() == MPI_SUCCESS);
}

I have this Makefile:

all: stat_num_threads

IOMP_LIBS = /opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64

stat_num_threads: stat_num_threads.cpp
	mpicxx -std=c++14 -fopenmp -m64 -I"${MKLROOT}/include" stat_num_threads.cpp -o stat_num_threads \
	-L${IOMP_LIBS} -L${MKLROOT}/lib/intel64 \
	-Wl,--no-as-needed -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl

clean:
	rm -f ./stat_num_threads

.PHONY: all clean

 

I start a new terminal, source /opt/intel/oneapi/setvars.sh, and then run make. I run first in serial, $> ./stat_num_threads, then in parallel, $> mpirun -n NN ./stat_num_threads, or, to get separate output from each process, $> mpirun -n NN xterm -hold -e ./stat_num_threads. I have not set OMP_NUM_THREADS or MKL_NUM_THREADS. All other info about my system and compiler is given above. I can give you output with I_MPI_DEBUG=5 as well if that helps.
