Copied from this link.
I'm using Intel's MKL to perform large matrix-matrix multiplications (Y = A*X). I noticed a significant performance drop when I increased the dimension p from 4900 to 4950 while keeping the other dimensions fixed: the average runtime over 10 runs is around 0.6s for p = 4900 and around 8s for p = 4950. Here's the code:
#include <chrono>
#include <iostream>
#include <string>  // std::stoi
#include <mkl.h>

using namespace std::chrono;

int main(int argc, char** argv){
    int N = 240000;
    int p = std::stoi(argv[1]);  // inner dimension, taken from the command line
    int K = 20;
    double *A, *X, *Y;
    double alpha = 1.0;
    double beta = 0.0;

    // 64-byte aligned, column-major buffers: A is N x p, X is p x K, Y is N x K
    A = (double *)mkl_malloc(N * p * sizeof(double), 64);
    X = (double *)mkl_malloc(p * K * sizeof(double), 64);
    Y = (double *)mkl_malloc(N * K * sizeof(double), 64);

    for(int i = 0; i < (N*p); ++i){
        A[i] = 1.0;
    }
    for(int i = 0; i < (p*K); ++i){
        X[i] = 0.5;
    }
    for(int i = 0; i < (N*K); ++i){
        Y[i] = 0.0;
    }

    // Time 10 back-to-back calls of Y = alpha*A*X + beta*Y
    auto start = high_resolution_clock::now();
    for(int i = 0; i < 10; ++i){
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    N, K, p, alpha, A, N, X, p, beta, Y, N);
    }
    auto stop = high_resolution_clock::now();
    auto duration = duration_cast<microseconds>(stop - start);

    // Average seconds per call
    std::cout << (double)duration.count()/(1e6*10.0) << std::endl;

    mkl_free(X);
    mkl_free(A);
    mkl_free(Y);
    return 0;
}
Does anyone know the reason for that? This happens with both MKL 2020.1.217 and MKL 2019. I'm using CentOS 7 and compiled the above code with g++ 6.3.0:
g++ main.cpp -o main -DMKL_ILP64 -m64 -I${MKLROOT}/include -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl
Thanks!
Edit: This problem seems to be specific to the cluster node I was using, since it disappears when I switch to a different cluster. The node that has the problem uses an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz. The node that does not have the problem uses an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz.
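For scale, the problem size in the reproducer can be turned into an operation count and a memory footprint, which makes the reported timings easier to compare across machines. The following is a minimal sketch rather than part of the original program; the helper names `gemm_flops`, `matrix_bytes` and `gflops_per_second` are made up for illustration.

```cpp
#include <cstdint>

// Floating-point operations for C = alpha*A*B + beta*C with A (m x k) and
// B (k x n): one multiply and one add per inner-product term.
double gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Bytes occupied by a dense rows x cols matrix of doubles. The product is
// formed in 64-bit arithmetic so it cannot wrap around a 32-bit int.
std::int64_t matrix_bytes(std::int64_t rows, std::int64_t cols) {
    return rows * cols * static_cast<std::int64_t>(sizeof(double));
}

// Achieved rate in GFLOP/s for a call that took `seconds`.
double gflops_per_second(double flops, double seconds) {
    return flops / seconds / 1e9;
}
```

With N = 240000, K = 20 and p = 4950, `gemm_flops(N, K, p)` is 4.752e10 and `matrix_bytes(N, p)` is 9,504,000,000 bytes, so A alone takes about 9.5 GB. The reproducer computes `N * p` in `int`, which is still in range here but would overflow a 32-bit `int` for p above 8947.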
This issue is probably specific to this CPU and OS combination: I tried to reproduce the problem on RH7 with an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and do not see it.
I only added the -std=c++11 option to build this example and linked against MKL 2020:
$ ./a.out 4900
0.235571
$ ./a.out 4950
0.24464
export MKL_VERBOSE=1
$ ./a.out 4950
MKL_VERBOSE Intel(R) MKL 2020.0 Update 1 Product build 20200208 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.40GHz ilp64 gnu_thread
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 360.52ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 221.39ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 215.22ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 215.25ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 217.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 223.51ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 224.87ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 258.30ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 220.53ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 217.61ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:40
0.322713
Here is what I got:
./main 4900
MKL_VERBOSE Intel(R) MKL 2018.0 Product build 20170720 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.30GHz ilp64 gnu_thread NMICDev:0
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 2.98s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 477.67ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 404.95ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 386.72ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 427.28ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 527.10ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 504.34ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 401.23ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 369.35ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 421.55ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
0.778294
./main 4950
MKL_VERBOSE Intel(R) MKL 2018.0 Product build 20170720 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.30GHz ilp64 gnu_thread NMICDev:0
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.77s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.25s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.15s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.25s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.53s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.10s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.31s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.93s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.66s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.13s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:13 WDiv:HOST:+0.000
7.87838
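In the p = 4900 log above, the first DGEMM call takes 2.98s while the later ones take roughly 0.37s to 0.53s, so an average over all ten calls is pulled up by the warm-up call. A minimal sketch of timing each call separately and discarding the first one could look like this; `time_calls` and `median` are hypothetical helpers, and `work` stands in for the `cblas_dgemm` call of the reproducer.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Runs `work` once untimed as a warm-up, then times `repeats` further calls
// individually and returns the per-call durations in seconds.
template <typename Work>
std::vector<double> time_calls(Work&& work, std::size_t repeats) {
    using clock = std::chrono::steady_clock;
    work();  // warm-up call, excluded from the measurements
    std::vector<double> seconds;
    seconds.reserve(repeats);
    for (std::size_t i = 0; i < repeats; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return seconds;
}

// Median of the samples. The vector is taken by value so that sorting does
// not reorder the caller's data. Returns 0.0 for an empty input.
double median(std::vector<double> samples) {
    if (samples.empty()) return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1
               ? samples[mid]
               : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

Passing a lambda that wraps the DGEMM call and printing `median(time_calls(lambda, 10))` would give a figure that is less sensitive to one slow call than the plain average.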
I see that you are using MKL 2018 ("MKL_VERBOSE Intel(R) MKL 2018.0 ....").
Could you check the behavior with MKL 2020.1? I linked your example against the latest MKL 2020.1.
I was unable to get MKL 2020 to load: MKL_VERBOSE still reports MKL 2018, even when I explicitly set
MKLROOT=<directory where I installed MKL 2020>/compilers_and_libraries_2020.1.217/linux/mkl
in my Makefile.
Let me contact our cluster support to see if they know what might be going on.
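One possible explanation, stated here as an assumption rather than a diagnosis, is that MKLROOT in the Makefile only affects the compile and link steps, while the dynamic loader picks the shared libraries at run time from the rpath or LD_LIBRARY_PATH. On Linux, a process can check which files are actually mapped by reading /proc/self/maps. The sketch below parses that format from any input stream; `mapped_paths_containing` is a hypothetical helper name.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Collects the distinct file paths from a /proc/<pid>/maps style stream
// whose path contains `needle`. Each line has the form
//   address-range perms offset dev inode [path]
// and anonymous mappings have no path.
std::set<std::string> mapped_paths_containing(std::istream& maps,
                                              const std::string& needle) {
    std::set<std::string> paths;
    std::string line;
    while (std::getline(maps, line)) {
        std::istringstream fields(line);
        std::string range, perms, offset, dev, inode, path;
        if (!(fields >> range >> perms >> offset >> dev >> inode)) continue;
        std::getline(fields >> std::ws, path);  // remainder of the line
        if (!path.empty() && path.find(needle) != std::string::npos) {
            paths.insert(path);
        }
    }
    return paths;
}
```

Opening `std::ifstream maps("/proc/self/maps")` inside the benchmark and printing the result of `mapped_paths_containing(maps, "libmkl")` would show the directory the MKL libraries were loaded from. Doing this after the first DGEMM call is the safer choice, on the assumption that some MKL libraries are only loaded on first use.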
It seems to be a problem with the compiler I was using: gcc (crosstool-NG 1.23.0.449-a04d0) 7.3.0
Changing the compiler to either gcc 7.1.0 or clang 4.0.0 solves the problem.
Thanks!
Thanks for the update, though this solution looks very suspicious, as the performance of MKL doesn't depend on which compiler was used to build the application. We will try to experiment with your reproducer, different compiler versions, and this specific CPU type.
Gennady
