Beginner

mkl(dgemm) performance problems on "superlarge" processors

Hi,

I was running two consecutive dgemm operations: T=AB and C=A'T, with A=(56,000x400,000), B=(400,000x30), T=(56,000x30) and C of the same shape as B.

Depending on the CPU, I measured these wall-clock times (for the dgemm operations only):

Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz with 36 (real) cores, 46080 KB cache, 250GB of RAM

T=AB: 3.73 seconds,

C=A'T: 4.17 seconds

 

Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 56 (real) cores, 19712 KB cache, 2TB of RAM

T=AB: 91.47 seconds

C=A'T: 232.78 seconds

What was particularly striking was that T=AB used all 56 cores, whereas C=A'T used only half of them.

The KMP setting was: KMP_AFFINITY=compact,1,0,granularity=fine

 

I am wondering whether the poor performance of the Gold 5117 is solely attributable to its architecture and therefore set in stone, or whether I can tune the MKL/KMP environment variables to improve it.

Thanks

0 Kudos
11 Replies

Hi,

56k x 30 x 400k with 3.73 seconds corresponds to 3603GFlop/s. Since the expected maximum DGEMM performance would be around 1400 - 1500GFlop/s, that performance isn't realistic. Can you check the system condition and the computed results on the E5-2697 v4?

Xeon Gold performance is around 146GFlop/s, but that would be reasonable taking the problem shape into account.

Thanks,

 

0 Kudos
Moderator

Could you export MKL_VERBOSE=1 and post the output?

0 Kudos
Beginner

Hi Gennady

Thanks for the response.

Gennady F. (Intel) wrote:

Could you export MKL_VERBOSE=1 and post the output?

I'll post results once I have done it.

0 Kudos
Beginner

Hi Kazushige

Kazushige G. (Intel) wrote:

Hi,

  56k x 30 x 400k with 3.73 seconds corresponds to 3603GFlop/s. ..................

Just to get the numbers right: 56k x 30 x 400k * 2 / 10^9 = 1344 GFLOP. Divided by 3.73 seconds, that equals 360.3 GFLOP per second, which is about 24% of the peak performance of 1500 GFLOP per second.
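
The arithmetic above can be checked with a few lines of Python. This is only a sketch: it uses the usual 2*m*n*k flop count for a matrix product, the dimensions and timings quoted in this thread, and the 1500 GFLOP/s peak figure mentioned earlier, which is treated here as an assumption rather than a measured value.

```python
# Dimensions of T = A*B as given in the thread: A is m x k, B is k x n.
m, n, k = 56_000, 30, 400_000

# Each of the m*n output elements needs k multiplies and k adds.
flop = 2 * m * n * k
gflop = flop / 1e9


def gflops(seconds):
    """Throughput in GFLOP/s for one product that took `seconds`."""
    return gflop / seconds


ASSUMED_PEAK_GFLOPS = 1500  # figure quoted in the thread, not measured here

print(f"work per product:            {gflop:.0f} GFLOP")
print(f"E5-2697 v4, T=AB (3.73 s):   {gflops(3.73):.1f} GFLOP/s")
print(f"Gold 5117,  T=AB (91.47 s):  {gflops(91.47):.1f} GFLOP/s")
print(f"share of assumed peak:       {gflops(3.73) / ASSUMED_PEAK_GFLOPS:.0%}")
```

The same formula applies to C=A'T, since only the roles of the dimensions are swapped and the flop count stays the same.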

The computed results are correct.

What do you mean by system condition?

The operating system is Arch Linux, kernel version 4.20.6.

 

Thanks

0 Kudos
Beginner

MKL_VERBOSE output for E5-2697 v4:

MKL_VERBOSE Intel(R) MKL 2017.0 Update 4 Product build 20170811 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz ilp64 intel_thread NMICDev:0
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffd287b7be8,0x148e2a60a280,56000,0x1469e63ff300,400000,0x7ffd287b7bf0,0x148cae0fd340,56000) 4.19s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36 WDiv:HOST:+0.000
 1:    5.03400000000000     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffd287b7be8,0x148e2a60a280,56000,0x148cae0fd340,56000,0x7ffd287b7bf0,0x1469e07fe380,400000) 4.81s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36 WDiv:HOST:+0.000
 2:    4.81500000000000

 

0 Kudos
Moderator

MKL_VERBOSE Intel(R) MKL 2017.

Could you try evaluating the latest MKL 2019 u1?

0 Kudos
Beginner

Hi, Thanks for the response.

I could not try 2019 u1 because the Fortran compiler is (still) super buggy (the same goes for 2018). I am downloading 2019 update 2 right now and will try that. I'll get back to you then.

Cheers

0 Kudos
Moderator

OK, please try this. FYI, regarding the Fortran compiler: the latest 2019 u2 contains only some security updates. You can find more details in the release notes: https://software.intel.com/en-us/articles/intel-fortran-compiler-190-for-linux-release-notes-for-intel-parallel-studio-xe-2019

0 Kudos
Beginner

Hi,

So I made a stand-alone program that the ifort 19 compiler can cope with:

Program Test
  use blas95
  USE IFPORT
  implicit none
  integer(kind=8) :: nM,nG,nE
  integer(kind=4) :: istat, i
  character(len=200) :: msg
  Real(kind=8), allocatable :: A(:,:), B(:,:), C(:,:),T(:,:)
  real(kind=8) :: r1=0.0D0, r2=0.0D0
  outer:block
    nM=56000;nG=400000;nE=30;istat=0
    write(*,"(*(g0"",""))") nM,nG,nE
    r1=dclock()
    allocate(&
      &A(nM,nG),&
      &B(nG,nE),&
      &T(nM,nE),&
      &C(nG,nE),&
      &stat=istat,errmsg=msg)
    if(istat/=0) Then
      write(*,*) msg;exit outer
    end if
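    ! Zero-initialize all arrays in one parallel region. Under Linux's default
    ! first-touch policy, a page lands on the NUMA node of the thread that writes it first.
    !$OMP PARALLEL
    !$OMP DO
    Do i=1,nG
      A(:,i)=0.0D0
    end Do
    !$OMP END DO NOWAIT
    !$OMP DO
    Do i=1,nE
      B(:,i)=0.0D0
      T(:,i)=0.0D0
      C(:,i)=0.0D0
    end Do
    !$OMP END DO
    !$OMP END PARALLEL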
    r2=Dclock()
    write(*,*) "alloc: ", r2-r1
    ! Time both products five times to expose run-to-run variation.
    Do i=1,5
      r2=Dclock()
      call gemm(A=A,B=B,C=T)                 ! T = A*B
      r1=dclock()
      write(*,*) "1: ",r1-r2
      call gemm(A=A,B=T,C=C,transa="T")      ! C = A'*T
      r2=dclock()
      write(*,*) "2: ",r2-r1
    End Do
  End block outer
End Program Test

Further settings are:

MKL_VERBOSE=1

KMP_AFFINITY=compact,1,0,granularity=fine

 

The bash output from E5-2697 v4 is:

56000,400000,30,
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='-10'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='TRUE'
  [host] OMP_NUM_THREADS='72'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='2000M'
   OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


 alloc:    12.6054401397705     
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 2 (Intel(R) AVX2) enabled processors, Lnx 2.30GHz ilp64 intel_thread
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.12s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 1:    4.77215790748596     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.64s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 2:    4.63893508911133     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 1:    4.07306599617004     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 2:    4.63029694557190     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 1:    4.07025289535522     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 2:    4.63470506668091     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 1:    4.07217407226562     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 2:    4.63222002983093     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d61d6240,400000,0x7ffc8ef496d0,0x14c8d54d5280,56000) 4.07s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 1:    4.06569814682007     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffc8ef496c8,0x14c8dbdd7200,56000,0x14c8d54d5280,56000,0x7ffc8ef496d0,0x14c8cf8d42c0,400000) 4.63s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:36
 2:    4.63313007354736

 

The bash output from Gold 5117 was:

56000,400000,30,
OMP: Warning #181: OMP_PROC_BIND: ignored because KMP_AFFINITY has been defined
OMP: Warning #181: OMP_PLACES: ignored because KMP_AFFINITY has been defined

OPENMP DISPLAY ENVIRONMENT BEGIN
   _OPENMP='201611'
  [host] OMP_AFFINITY_FORMAT='OMP: pid %P tid %T thread %n bound to OS proc set {%a}'
  [host] OMP_ALLOCATOR='omp_default_mem_alloc'
  [host] OMP_CANCELLATION='FALSE'
  [host] OMP_DEBUG='disabled'
  [host] OMP_DEFAULT_DEVICE='-10'
  [host] OMP_DISPLAY_AFFINITY='FALSE'
  [host] OMP_DISPLAY_ENV='TRUE'
  [host] OMP_DYNAMIC='FALSE'
  [host] OMP_MAX_ACTIVE_LEVELS='2147483647'
  [host] OMP_MAX_TASK_PRIORITY='0'
  [host] OMP_NESTED='TRUE'
  [host] OMP_NUM_THREADS='112'
  [host] OMP_PLACES: value is not defined
  [host] OMP_PROC_BIND='intel'
  [host] OMP_SCHEDULE='static'
  [host] OMP_STACKSIZE='2000M'
   OMP_TARGET_OFFLOAD=DEFAULT
  [host] OMP_THREAD_LIMIT='2147483647'
  [host] OMP_TOOL='enabled'
  [host] OMP_TOOL_LIBRARIES: value is not defined
  [host] OMP_WAIT_POLICY='PASSIVE'
OPENMP DISPLAY ENVIRONMENT END


 alloc:    18.7253189086914     
MKL_VERBOSE Intel(R) MKL 2019.0 Update 1 Product build 20180928 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) enabled processors, Lnx 2.00GHz ilp64 intel_thread
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 98.58s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 1:    98.8509511947632     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 222.52s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 2:    222.525346994400     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 28.31s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 1:    28.3082640171051     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 73.63s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 2:    73.6357510089874     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 112.45s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 1:    112.451642990112     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 152.85s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 2:    152.847929000854     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 23.87s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 1:    23.8693110942841     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 53.01s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 2:    53.0122749805450     
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 43.80s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 1:    43.8020231723785     
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 97.87s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
 2:    97.8659319877625
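
Logs like the ones above are easier to compare once the call times are converted to throughput. The following Python sketch does that for DGEMM lines; the regular expression is written against the line layout seen in this thread (it is an assumption that other MKL versions print the same layout), and `summarize` is a hypothetical helper name, not part of MKL. The two sample lines are copied from the Gold 5117 output.

```python
import re

# Layout assumed from the logs in this thread:
# MKL_VERBOSE DGEMM(transa,transb,m,n,k,...) <time><unit> ... NThr:<threads>
LINE = re.compile(
    r"MKL_VERBOSE DGEMM\((?P<ta>\w),(?P<tb>\w),(?P<m>\d+),(?P<n>\d+),(?P<k>\d+),"
    r"[^)]*\)\s+(?P<t>[\d.]+)(?P<unit>us|ms|s)\b.*?NThr:(?P<nthr>\d+)"
)
UNIT = {"s": 1.0, "ms": 1e-3, "us": 1e-6}


def summarize(log):
    """Return one dict per DGEMM line with time, thread count and GFLOP/s."""
    rows = []
    for hit in LINE.finditer(log):
        m, n, k = (int(hit[name]) for name in ("m", "n", "k"))
        seconds = float(hit["t"]) * UNIT[hit["unit"]]
        rows.append({
            "transa": hit["ta"],
            "seconds": seconds,
            "threads": int(hit["nthr"]),
            "gflops": 2 * m * n * k / seconds / 1e9,
        })
    return rows


SAMPLE = """\
MKL_VERBOSE DGEMM(N,N,56000,30,400000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3190b43240,400000,0x7ffda0750ed0,0x2b3196744280,56000) 23.87s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
MKL_VERBOSE DGEMM(T,N,400000,30,56000,0x7ffda0750ec8,0x2b07d78c2200,56000,0x2b3196744280,56000,0x7ffda0750ed0,0x2b31974452c0,400000) 53.01s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:56
"""

for row in summarize(SAMPLE):
    print(f"transa={row['transa']}  {row['seconds']:7.2f} s  "
          f"{row['threads']} threads  {row['gflops']:6.1f} GFLOP/s")
```

Feeding the full log through `summarize` makes the spread between the five repetitions on the Gold 5117 directly visible, which the raw lines tend to hide.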

The compiler flags for the program were:

ifort -i8 -warn nounused -warn declarations -O3 -static -mkl=parallel -qopenmp -parallel -c -o OMP_MKLPARA_ifort_4.20.6-arch1-1-ARCH/Test.o Test.f90 -I /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/include/

ifort -i8 -warn nounused -warn declarations -O3 -static -mkl=parallel -qopenmp -parallel -o Test_OMP_MKLPARA_4.20.6-arch1-1-ARCH OMP_MKLPARA_ifort_4.20.6-arch1-1-ARCH/Test.o /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_blas95_ilp64.a /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_lapack95_ilp64.a -Wl,--start-group /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_intel_ilp64.a /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_core.a /opt/intel/compilers_and_libraries_2019.1.144/linux/mkl/lib/intel64/libmkl_intel_thread.a -Wl,--end-group -lpthread -lm -ldl

 

The Gold 5117 runs CentOS 7, kernel version 3.10.0-862.el7.x86_64.

The E5-2697 v4 runs Arch Linux, kernel version 4.20.6-arch1-1-ARCH.

If you need more information, let me know.

 

cheers

0 Kudos

> Just to get the numbers right: 56k x 30 x 400k * 2 / 10^9=1344 GFLOPS. Divided by 3.73 seconds equals 360.3 GFlops per

Sorry, I made a mistake. Yes, 360GFlop/s is the correct number. Then can you check the system condition of the Xeon Gold? This operation requires high bandwidth to main memory. The STREAM benchmark (https://www.cs.virginia.edu/stream/) is useful for checking this.

icc -qopenmp -o stream stream.c

export KMP_AFFINITY=compact,1,0,granularity=fine

numactl --cpunodebind=0 --membind=0 ./stream

numactl --cpunodebind=0 --membind=1 ./stream

numactl --cpunodebind=1 --membind=0 ./stream

numactl --cpunodebind=1 --membind=1 ./stream

Please check "Copy" performance. Expected performance is around 80 - 95GB/sec for local memory access, and 40 - 50GB/sec for remote access.
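
To see why memory bandwidth matters for this problem shape, a rough estimate helps. The Python sketch below assumes that A is streamed from main memory exactly once per product and that traffic for B, T and C is negligible (they are about 96 MB and 13 MB, against roughly 179 GB for A). Both are simplifying assumptions, and the timings are taken from the MKL_VERBOSE output earlier in the thread.

```python
BYTES_PER_DOUBLE = 8
m, k, n = 56_000, 400_000, 30            # A is m x k, B is k x n

bytes_A = m * k * BYTES_PER_DOUBLE       # dominant memory traffic
flop = 2 * m * n * k

# Floating-point operations available per byte of A read from memory.
intensity = flop / bytes_A


def effective_bandwidth_gbs(seconds):
    """Bandwidth implied if A is read exactly once in `seconds` (assumption)."""
    return bytes_A / seconds / 1e9


def min_time_s(bandwidth_gbs):
    """Lower bound on run time when limited only by the given bandwidth."""
    return bytes_A / (bandwidth_gbs * 1e9)


print(f"size of A:            {bytes_A / 1e9:.1f} GB")
print(f"flop per byte of A:   {intensity:.1f}")
print(f"E5-2697 v4 (4.07 s):  {effective_bandwidth_gbs(4.07):.1f} GB/s implied")
print(f"Gold 5117 (23.87 s):  {effective_bandwidth_gbs(23.87):.1f} GB/s implied")
print(f"time at 80 GB/s:      {min_time_s(80):.2f} s at best")
```

With only about 7.5 flop per byte, the product cannot run faster than the memory system delivers A, so a low STREAM "Copy" number would translate directly into slow DGEMM calls.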

 

 

0 Kudos
Beginner

> numactl --cpunodebind=0 --membind=0 ./stream

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           54365.6     0.002975     0.002943     0.002999

> numactl --cpunodebind=0 --membind=1 ./stream

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           23422.1     0.007045     0.006831     0.007223

> numactl --cpunodebind=1 --membind=0 ./stream

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           23025.1     0.007101     0.006949     0.007346

> numactl --cpunodebind=1 --membind=1 ./stream

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           49765.6     0.003226     0.003215     0.003239

> Please check "Copy" performance. Expected performance is around 80 - 95GB/sec for local memory access, and 40 - 50GB/sec for remote access.

About half of what you expect, possibly because not all RAM slots are populated. It runs 32 x 64 GB modules. dmidecode lists a repeating pattern of modules in set1(A1,A2,A4,A5), set2(A7,A8,A10,A11), set3(B1,B2,B4,B5), set4(B7,B8,B10,B11), set5(C1,C2,C4,C5), set6(C7,C8,C10,C11), set7(D1,D2,D4,D5), set8(D7,D8,D10,D11). As delivered from Dell.

Also interesting: numactl --cpunodebind=1 --membind=3 ./stream_c

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           17097.4     0.009394     0.009358     0.009473

How many UPI links does the 5117 really have? ark.intel.com (it's not on the Xeon Scalable processor list, but Google can find its page) says 2; wikichip.org says 3. In a quad-socket configuration, does it have to work like a ring?

0 Kudos