Beginner

dgemm large performance difference for matrices of similar size

Copied from this link.

I'm using Intel's MKL to perform large matrix-matrix multiplications (Y = A*X). I noticed a significant performance drop when I increased the dimension p from 4900 to 4950 while keeping the other dimensions fixed: the average runtime over 10 runs is around 0.6 s for p = 4900 and 8 s for p = 4950. Here's the code:

#include <iostream>
#include <chrono>
#include <string>  // std::stoi
#include <mkl.h>
using namespace std::chrono;


int main(int argc, char** argv){
  int N = 240000;
  int p = std::stoi(argv[1]);
  int K = 20;

  double *A, *X, *Y;
  double alpha = 1.0;
  double beta = 0.0;

  A = (double *)mkl_malloc(N * p *sizeof(double), 64);
  X = (double *)mkl_malloc(p * K *sizeof(double), 64);
  Y = (double *)mkl_malloc(N * K *sizeof(double), 64);

  // Fill A (N x p), X (p x K), and Y (N x K) element by element.
  for(int i = 0; i < (N*p); ++i){
    A[i] = 1.0;
  }

  for(int i = 0; i < (p*K); ++i){
    X[i] = 0.5;
  }

  for(int i = 0; i < (N*K); ++i){
    Y[i] = 0.0;
  }

  auto start = high_resolution_clock::now(); 
  for(int i = 0; i < 10; ++i){
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                N, K, p, alpha, A, N, X, p, beta, Y, N);
  }


  auto stop = high_resolution_clock::now();
  auto duration = duration_cast<microseconds>(stop - start); 
  std::cout << (double)duration.count()/(1e6*10.0) << std::endl; 

  mkl_free(X);
  mkl_free(A);
  mkl_free(Y);

  return 0;
}

Does anyone know the reason for that? This happens with both MKL 2020.1.217 and MKL 2019. I'm using CentOS 7 and compiled the above code with g++ 6.3.0:

g++ main.cpp -o main -DMKL_ILP64 -m64 -I${MKLROOT}/include  -L${MKLROOT}/lib/intel64 -Wl,--no-as-needed -lmkl_intel_ilp64 -lmkl_gnu_thread -lmkl_core -lgomp -lpthread -lm -ldl

Thanks! 

 

Edit: This problem seems to be specific to the cluster node I was using, since it disappears when I switch to a different cluster. The node that has the problem uses an Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz. The node that does not have the problem uses an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz.
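
A side note on the sizes in the program above: N * p is evaluated in int before it is widened for mkl_malloc, and the fill loops count with int as well. With N = 240000 the product fits for both p = 4900 and p = 4950 (about 1.19e9 elements), so this does not explain the slowdown, but with a 32-bit int it would overflow for p of 8948 or more. The following is only a sketch of how the sizes could be checked; element_count_fits_int and matrix_bytes are hypothetical helper names, not part of MKL.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Hypothetical helper: true if rows*cols can be represented in an int,
// which is the type used for N*p and the loop counters in the program above.
bool element_count_fits_int(long long rows, long long cols) {
  assert(rows >= 0 && cols >= 0);
  if (rows == 0 || cols == 0) return true;
  // rows*cols <= INT_MAX  <=>  rows <= floor(INT_MAX / cols)
  return rows <= std::numeric_limits<int>::max() / cols;
}

// Byte size of a rows-by-cols matrix of doubles, computed entirely in size_t
// so that no intermediate product is evaluated in int.
std::size_t matrix_bytes(std::size_t rows, std::size_t cols) {
  return rows * cols * sizeof(double);
}
```

For the matrix A in this thread, matrix_bytes(240000, 4950) is 9,504,000,000 bytes on a platform with 64-bit size_t, so A alone occupies roughly 9.5 GB.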

Moderator

This issue is probably specific to this CPU and OS combination, because I tried to reproduce the problem on RH7 with an Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz and don't see it.

I only added the -std=c++11 option to build this example and linked against MKL 2020:

$ ./a.out 4900
0.235571
$ ./a.out 4950
0.24464
 

Moderator

export MKL_VERBOSE=1

$ ./a.out 4950
MKL_VERBOSE Intel(R) MKL 2020.0 Update 1 Product build 20200208 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.40GHz ilp64 gnu_thread
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 360.52ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 221.39ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 215.22ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 215.25ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 217.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 223.51ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 224.87ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 258.30ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 220.53ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7fff67187de8,0x2ab00ab49080,240000,0x2ab003007080,4950,0x7fff67187df0,0x2ab241302080,240000) 217.61ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:40
0.322713
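
In the log above, the first DGEMM call takes 360.52 ms while the remaining nine take between 215 ms and 258 ms, so a mean over all ten calls mixes warm-up cost with steady-state cost. A minimal sketch of timing each call separately and reporting a median that skips the warm-up follows. The names time_each_call and median_after_warmup are hypothetical helpers, not MKL functions, and treating exactly one call as warm-up is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `work` once per repetition and returns the
// wall-clock time of each call in seconds, in call order.
template <typename Work>
std::vector<double> time_each_call(Work&& work, std::size_t repetitions) {
  using clock = std::chrono::steady_clock;
  std::vector<double> seconds;
  seconds.reserve(repetitions);
  for (std::size_t r = 0; r < repetitions; ++r) {
    const auto start = clock::now();
    work();
    const auto stop = clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return seconds;
}

// Median of the samples that remain after dropping the first `warmup` calls.
double median_after_warmup(std::vector<double> seconds, std::size_t warmup) {
  assert(seconds.size() > warmup);
  seconds.erase(seconds.begin(),
                seconds.begin() + static_cast<std::ptrdiff_t>(warmup));
  std::sort(seconds.begin(), seconds.end());
  const std::size_t n = seconds.size();
  return n % 2 == 1 ? seconds[n / 2]
                    : 0.5 * (seconds[n / 2 - 1] + seconds[n / 2]);
}
```

In the reproducer, the cblas_dgemm call would be wrapped in a lambda and passed as work, for example time_each_call([&]{ cblas_dgemm(...); }, 10) with the original arguments, followed by median_after_warmup(times, 1).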
 

Beginner

Here is what I got:

./main 4900

MKL_VERBOSE Intel(R) MKL 2018.0 Product build 20170720 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.30GHz ilp64 gnu_thread NMICDev:0

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 2.98s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 477.67ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 404.95ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 386.72ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 427.28ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 527.10ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 504.34ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 401.23ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 369.35ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4900,0x7fff55e8b598,0x7f92f1f9b080,240000,0x7f92f1edb080,4900,0x7fff55e8b5a0,0x7f92efa3b080,240000) 421.55ms CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

0.778294

 

./main 4950

MKL_VERBOSE Intel(R) MKL 2018.0 Product build 20170720 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.30GHz ilp64 gnu_thread NMICDev:0

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.77s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.25s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.15s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.25s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.53s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.10s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.31s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.93s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 7.66s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,240000,20,4950,0x7ffe9ba840c8,0x7efc503f4080,240000,0x7efc50332080,4950,0x7ffe9ba840d0,0x7efc4de92080,240000) 8.13s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:13 WDiv:HOST:+0.000

7.87838
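
To put the two logs on a common scale, the per-call times can be converted to throughput. Using the conventional 2*m*n*k operation count for GEMM, p = 4950 needs only about 1% more floating-point work than p = 4900, yet each call takes well over ten times longer. The sketch below shows the arithmetic; gemm_flops and gflops_per_second are hypothetical helper names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations for C = alpha*A*B + beta*C with an m-by-k A and
// a k-by-n B, using the conventional count of 2*m*n*k (one multiply and one
// add per inner-product term).
std::int64_t gemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
  return 2 * m * n * k;
}

// Throughput in GFLOP/s for a call that performed `flops` operations
// in `seconds` of wall-clock time.
double gflops_per_second(std::int64_t flops, double seconds) {
  assert(seconds > 0.0);
  return static_cast<double>(flops) / seconds / 1e9;
}
```

With m = 240000, n = 20, and k = 4950 the count is 47.52e9 operations, so a call of about 8 s corresponds to roughly 6 GFLOP/s, while a call of about 0.4 s at k = 4900 (47.04e9 operations) corresponds to well over 100 GFLOP/s.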

Moderator

I see you are using MKL 2018 ("MKL_VERBOSE Intel(R) MKL 2018.0 ....").

Could you check the behavior with MKL 2020.1? I linked your example against the latest MKL 2020.1.

Beginner

I was unable to get MKL 2020 to load: MKL_VERBOSE still reports MKL 2018, even when I explicitly set

MKLROOT=<directory where I installed MKL 2020>/compilers_and_libraries_2020.1.217/linux/mkl

in my Makefile.

Let me contact our cluster support to see if they know what might be going on.
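
One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that MKLROOT in the Makefile only affects where headers and libraries are found at build time, while the dynamic loader picks up whichever libmkl_*.so files it finds first at run time (for example through LD_LIBRARY_PATH). On Linux, a process can check which shared objects it actually mapped by reading /proc/self/maps. The sketch below parses text in that format; mapped_paths_containing is a hypothetical helper name.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Parses text in the format of Linux /proc/<pid>/maps and returns the
// distinct mapped file paths that contain `needle`.
std::set<std::string> mapped_paths_containing(std::istream& maps,
                                              const std::string& needle) {
  std::set<std::string> paths;
  std::string line;
  while (std::getline(maps, line)) {
    // Each line has the fields: address perms offset dev inode [pathname]
    std::istringstream fields(line);
    std::string address, perms, offset, dev, inode, path;
    fields >> address >> perms >> offset >> dev >> inode;
    std::getline(fields >> std::ws, path);  // the rest of the line, if any
    if (!path.empty() && path.find(needle) != std::string::npos) {
      paths.insert(path);
    }
  }
  return paths;
}
```

In the reproducer this could be called after the first cblas_dgemm call with a std::ifstream opened on /proc/self/maps and the needle "libmkl", printing each returned path. Calling it after the first MKL call matters under the further assumption that some MKL libraries are loaded lazily.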

Beginner

It seems to be a problem with the compiler I was using: gcc (crosstool-NG 1.23.0.449-a04d0) 7.3.0

Changing the compiler to either gcc 7.1.0 or clang 4.0.0 solves the problem.

Thanks!

Moderator

Thanks for the update, though such a solution looks very suspicious, as the performance of MKL doesn't depend on which compiler was used to build the application. We will try to experiment with your reproducer, different compiler versions, and this specific CPU type.

Gennady
