topic Re: Re:Threaded MKL's DGEMM performance does not improve with increasing threads in Intel® oneAPI Math Kernel Library

Threaded MKL's DGEMM performance does not improve with increasing threads

babreu — Thu, 09 Sep 2021 15:18:13 GMT

Hello,

I am trying to improve the performance of a Fortran code by making better use of MKL's DGEMM (DGEMV could be used as well). The code essentially performs matrix diagonalization, and profiling showed that ~70% of the time is spent in calls to DGEMM/DGEMV. However, the profiling results did not make clear whether these calls were benefiting from multithreading. I therefore started experimenting with an isolated DGEMM code taken from here. To my surprise, I don't seem to gain any performance: the total run time is always the same, regardless of how many threads are requested. I understand that MKL may be making all sorts of optimization choices internally, but it is quite hard to tell what they are. Would you have any suggestions or comments on this issue?

The code that I am running is:

    program mkl_dgemm
      use, intrinsic :: iso_fortran_env
      use :: mkl_service
      implicit none
      include "mkl_lapack.fi"
      integer, parameter :: dp = REAL64   ! double precision float
      integer, parameter :: i32 = INT32   ! 32-bit integer
      integer(i32), parameter :: ord1=40000_i32   ! leading dim of matrix
      integer(i32), parameter :: ord2=20000_i32   ! lower dim of matrix
      real(dp) :: startT, endT
      real(dp), dimension(:,:), allocatable :: m, v, p
      integer(i32) :: MAX_THREADS, l, i
      ! allocate
      allocate(m(ord1, ord2))
      allocate(v(ord2,1))
      allocate(p(ord1,1))
      ! fill in with random stuff
      call random_seed()
      call random_number(m)
      call random_number(v)
      p = 0.0_dp
      MAX_THREADS = MKL_GET_MAX_THREADS()
      PRINT 20," Running Intel(R) MKL from 1 to ",MAX_THREADS," threads"
    20 FORMAT(A,I2,A)
      PRINT *, ""
      do l = 1, MAX_THREADS
        PRINT 30, " Requesting Intel(R) MKL to use ", l," thread(s)"
    30  FORMAT(A,I2,A)
        CALL MKL_SET_NUM_THREADS(l)
        ! call MKL (syntax below))
        ! dgemm('N', 'N', M, N, K, ALPHA, A, M, B, K, BETA, C, M)
        call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
        startT = dsecnd()
        !startT = omp_get_wtime()
        !call cpu_time(startT)
        do i = 1, 1
          call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
        enddo
        !call cpu_time(endT)
        !endT = omp_get_wtime()
        endT = dsecnd()
        PRINT *, "== Matrix multiplication using Intel(R) MKL DGEMM =="
        PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
        PRINT 60, " == using ",l," thread(s) =="
    50  FORMAT(A,F12.5,A)
    60  FORMAT(A,I2,A)
        PRINT *, ""
      enddo
    end program mkl_dgemm

My compiling options were taken from Link Line Advisor, which are:

    FC=ifort
    MKLPATH=${MKLROOT}/lib/intel64
    MKLINCLUDE=${MKLROOT}/mkl/include
    LDFLAGS=-mkl=parallel -L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/intel64/lp64 -lmkl_lapack95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm
    all:
    	$(FC) mkl_dgemm.f90 $(LDFLAGS)

The MKL version that I have access to is: MKL 2020.4.304,

and the Intel Fortran compiler is: ifort (IFORT) 2021.3.0 20210609

Here's an example of output from ./a.out when I'm using 4 cores:

    [babreu@r002 intel]$ ./a.out
     Running Intel(R) MKL from 1 to 4 threads
     Requesting Intel(R) MKL to use 1 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == completed at 378.01160 milliseconds ==
     == using 1 thread(s) ==
     Requesting Intel(R) MKL to use 2 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == completed at 377.27408 milliseconds ==
     == using 2 thread(s) ==
     Requesting Intel(R) MKL to use 3 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == completed at 379.07949 milliseconds ==
     == using 3 thread(s) ==
     Requesting Intel(R) MKL to use 4 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == completed at 377.69205 milliseconds ==
     == using 4 thread(s) ==

The machine I am using has AMD EPYC 7742 CPUs. I am happy to provide any other information that you may find useful.

Thanks!

Re:Threaded MKL's DGEMM performance does not improve with increasing threads

Gennady_F_Intel — Mon, 13 Sep 2021 10:45:08 GMT

It seems the OpenMP runtime doesn’t support non-Intel architectures.

You can try the latest MKL 2021, set the environment variable with export OMP_NUM_THREADS=24 (the number of physical threads), and check the scalability once again. I have to note, however, that we don't validate this behaviour on non-Intel-based systems on our end.
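
For anyone repeating this test, here is a minimal shell sketch of the suggestion above. The helper name run_with_threads is hypothetical. OMP_NUM_THREADS comes from the reply; MKL_NUM_THREADS, MKL_DYNAMIC and MKL_VERBOSE are further MKL environment variables that are not mentioned in this thread. They are added on the assumption that MKL's verbose output, which reports the thread count used by each call, helps confirm what the library actually does. Note that the benchmark posted above calls MKL_SET_NUM_THREADS itself, so the environment values only determine the upper bound of its loop.

```shell
#!/bin/sh
# Hypothetical helper: run a command with an explicit thread count for
# both the OpenMP runtime and MKL, with MKL verbose logging switched on.
run_with_threads() {
    threads="$1"
    shift
    # The assignments apply only to the environment of the launched command.
    OMP_NUM_THREADS="$threads" \
    MKL_NUM_THREADS="$threads" \
    MKL_DYNAMIC=FALSE \
    MKL_VERBOSE=1 \
    "$@"
}

# Usage, assuming ./a.out is the benchmark binary built from the code above:
#   run_with_threads 24 ./a.out
```

With MKL_VERBOSE=1, each BLAS call is logged on its own line, so the thread count can be checked directly instead of being inferred from the timings.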

Re: Re:Threaded MKL's DGEMM performance does not improve with increasing threads

babreu — Mon, 13 Sep 2021 13:09:15 GMT

Dear Gennady,

Many thanks for bringing that to my attention. I had many discussions with several other people and a lot of possibilities were lined up. I tried exactly the same code as above on an Intel Xeon Phi 7250 ("Knights Landing") node and, sure enough, the scaling is there.

    c455-004[knl](172)$ OMP_NUM_THREADS=68 ./a.out
     Running Intel(R) MKL from 1 to 68 threads
     Requesting Intel(R) MKL to use 1 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 627.42681 milliseconds ==
     == using 1 thread(s) ==
     Requesting Intel(R) MKL to use 2 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 529.81592 milliseconds ==
     == using 2 thread(s) ==
     Requesting Intel(R) MKL to use 3 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 351.05031 milliseconds ==
     == using 3 thread(s) ==
     Requesting Intel(R) MKL to use 4 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 261.95568 milliseconds ==
     == using 4 thread(s) ==
     Requesting Intel(R) MKL to use 5 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 205.79281 milliseconds ==
     == using 5 thread(s) ==
     Requesting Intel(R) MKL to use 6 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 169.52665 milliseconds ==
     == using 6 thread(s) ==
     Requesting Intel(R) MKL to use 7 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 143.74205 milliseconds ==
     == using 7 thread(s) ==
     Requesting Intel(R) MKL to use 8 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 124.09850 milliseconds ==
     == using 8 thread(s) ==
     Requesting Intel(R) MKL to use 9 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 109.59659 milliseconds ==
     == using 9 thread(s) ==
     Requesting Intel(R) MKL to use 10 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 97.01271 milliseconds ==
     == using 10 thread(s) ==
     Requesting Intel(R) MKL to use 11 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 86.46407 milliseconds ==
     == using 11 thread(s) ==
     Requesting Intel(R) MKL to use 12 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 79.22576 milliseconds ==
     == using 12 thread(s) ==
     Requesting Intel(R) MKL to use 13 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 72.35889 milliseconds ==
     == using 13 thread(s) ==
     Requesting Intel(R) MKL to use 14 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 67.00823 milliseconds ==
     == using 14 thread(s) ==
     Requesting Intel(R) MKL to use 15 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 61.52943 milliseconds ==
     == using 15 thread(s) ==
     Requesting Intel(R) MKL to use 16 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 57.67981 milliseconds ==
     == using 16 thread(s) ==
     Requesting Intel(R) MKL to use 17 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 54.55822 milliseconds ==
     == using 17 thread(s) ==
     Requesting Intel(R) MKL to use 18 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 50.60534 milliseconds ==
     == using 18 thread(s) ==
     Requesting Intel(R) MKL to use 19 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 48.46043 milliseconds ==
     == using 19 thread(s) ==
     Requesting Intel(R) MKL to use 20 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 45.59280 milliseconds ==
     == using 20 thread(s) ==
     Requesting Intel(R) MKL to use 21 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 44.60451 milliseconds ==
     == using 21 thread(s) ==
     Requesting Intel(R) MKL to use 22 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 42.00900 milliseconds ==
     == using 22 thread(s) ==
     Requesting Intel(R) MKL to use 23 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 40.46292 milliseconds ==
     == using 23 thread(s) ==
     Requesting Intel(R) MKL to use 24 thread(s)
     == Matrix multiplication using Intel(R) MKL DGEMM ==
     == Timed 39.27597 milliseconds ==
     == using 24 thread(s) ==
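
To put a number on the scaling, speedup and parallel efficiency can be computed from the printed timings. The following is only a sketch: the helper name speedup is hypothetical, and the timings are copied from the two outputs posted in this thread (1 thread versus 24 threads on the Xeon Phi node, and 1 thread versus 4 threads on the AMD EPYC node).

```shell
#!/bin/sh
# Hypothetical helper: speedup T1 TN N
#   T1 = time with 1 thread, TN = time with N threads (same unit).
# Prints the speedup T1/TN and the parallel efficiency (speedup / N).
speedup() {
    awk -v t1="$1" -v tn="$2" -v n="$3" 'BEGIN {
        s = t1 / tn
        printf "speedup %.2f, efficiency %.1f%%\n", s, 100 * s / n
    }'
}

# Xeon Phi 7250 run: 1 thread vs 24 threads
speedup 627.42681 39.27597 24    # prints: speedup 15.97, efficiency 66.6%

# AMD EPYC 7742 run: 1 thread vs 4 threads
speedup 378.01160 377.69205 4    # prints: speedup 1.00, efficiency 25.0%
```

These figures describe only the two runs shown here. They do not by themselves explain why the AMD run shows no speedup.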

Re:Threaded MKL's DGEMM performance does not improve with increasing threads

Gennady_F_Intel — Wed, 15 Sep 2021 11:16:52 GMT

This query is being closed, and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.