Solved: Threaded MKL's DGEMM performance does not improve with increasing threads

babreu · 09-09-2021

Hello,

I am trying to improve the performance of a Fortran code by making better use of MKL's DGEMM (DGEMV could be used as well). The code basically performs matrix diagonalization, and profiling showed that ~70% of the time is spent in calls to DGEMM(V). However, it was not clear from the profiling results whether these calls were benefiting from multithreading. I therefore started experimenting with an isolated DGEMM code taken from here. To my surprise, I don't seem to be gaining any performance: the total run time is always the same, regardless of how many threads are requested. I understand that MKL may be making all sorts of optimizations and smart choices, but it is quite hard to tell what they are. Would you have any suggestions or comments on this issue?

The code that I am running is:

program mkl_dgemm
      use, intrinsic :: iso_fortran_env
      use :: mkl_service
      implicit none
      include "mkl_lapack.fi"
      integer, parameter :: dp = REAL64 ! double precision float
      integer, parameter :: i32 = INT32 ! 32-bit integer
      integer(i32), parameter :: ord1=40000_i32  ! leading dim of matrix
      integer(i32), parameter :: ord2=20000_i32   ! lower dim of matrix
      real(dp) :: startT, endT
      real(dp), dimension(:,:), allocatable :: m, v, p
      integer(i32) :: MAX_THREADS, l, i

      ! allocate
      allocate(m(ord1, ord2))
      allocate(v(ord2,1))
      allocate(p(ord1,1))

      ! fill in with random stuff
      call random_seed()
      call random_number(m)
      call random_number(v)
      p = 0.0_dp

      MAX_THREADS = MKL_GET_MAX_THREADS()
      PRINT 20," Running Intel(R) MKL from 1 to ",MAX_THREADS," threads"
 20   FORMAT(A,I2,A)
      PRINT *, ""

      do l = 1, MAX_THREADS
        PRINT 30, " Requesting Intel(R) MKL to use ", l," thread(s)"
 30     FORMAT(A,I2,A)
        CALL MKL_SET_NUM_THREADS(l)

        ! call MKL (syntax below)
        ! dgemm('N', 'N', M, N, K, ALPHA, A, M, B, K, BETA, C, M)
        call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

        startT = dsecnd()
        !startT = omp_get_wtime()
        !call cpu_time(startT)
        do i = 1, 1
                call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
        enddo
        !call cpu_time(endT)
        !endT = omp_get_wtime()
        endT = dsecnd()

        PRINT *, "== Matrix multiplication using Intel(R) MKL DGEMM =="
        PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
        PRINT 60, " == using ",l," thread(s) =="
 50     FORMAT(A,F12.5,A)
 60     FORMAT(A,I2,A)
        PRINT *, ""
      enddo

end program mkl_dgemm

My compile and link options, taken from the Link Line Advisor, are:

FC=ifort
MKLPATH=${MKLROOT}/lib/intel64
MKLINCLUDE=${MKLROOT}/mkl/include

LDFLAGS=-mkl=parallel -L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/intel64/lp64 -lmkl_lapack95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm

all:
	$(FC) mkl_dgemm.f90 $(LDFLAGS)

The MKL version I have access to is MKL 2020.4.304, and the Intel Fortran compiler is ifort (IFORT) 2021.3.0 20210609.

Here's an example of output from ./a.out when I'm using 4 cores:

[babreu@r002 intel]$ ./a.out 
 Running Intel(R) MKL from 1 to  4 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    378.01160 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    377.27408 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    379.07949 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    377.69205 milliseconds ==
 == using  4 thread(s) ==

The machine I am using has AMD EPYC 7742 CPUs. I am happy to provide any other information that you may find useful.

Thanks!

Gennady_F_Intel · 09-13-2021

It seems the OpenMP runtime doesn't support non-Intel architectures.

You can try the latest MKL 2021, set the environment variable with export OMP_NUM_THREADS=24 (the number of physical threads), and check the scalability once again. Note, however, that we don't validate this behaviour on our end on non-Intel-based systems.
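As a rough sketch of that suggestion, the run below sets the thread count through the environment and also turns on MKL's verbose mode, which prints one line per MKL call including the number of threads it used (NThr). The value 24 is the one suggested above, and the binary name a.out comes from the original post; adding MKL_NUM_THREADS, MKL_DYNAMIC and MKL_VERBOSE is an assumption of this sketch, not part of the suggestion. Note that the test program also calls MKL_SET_NUM_THREADS(l) itself, which overrides the environment for the calls that follow.

```shell
# Hypothetical run script; 24 is the thread count suggested above.
export OMP_NUM_THREADS=24   # thread count for the OpenMP runtime
export MKL_NUM_THREADS=24   # MKL-specific setting; inside MKL it takes precedence over OMP_NUM_THREADS
export MKL_DYNAMIC=FALSE    # ask MKL not to lower the thread count on its own
export MKL_VERBOSE=1        # log each MKL call, including the thread count (NThr)

# Run the benchmark only if the binary exists in the current directory.
if [ -x ./a.out ]; then
    ./a.out
else
    echo "a.out not found; build mkl_dgemm.f90 first" >&2
fi
```

If the NThr value in the verbose lines stays at 1 while more threads are requested, the calls are not being threaded. Verbose logging adds some overhead, so it is best switched off again for timing runs.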

babreu · 09-13-2021

Dear Gennady,

Many thanks for bringing that to my attention. I had discussed this with several other people, and many possible causes were raised. I tried exactly the same code as above on an Intel Xeon Phi 7250 ("Knights Landing") node and, sure enough, the scaling is there.

c455-004[knl](172)$ OMP_NUM_THREADS=68 ./a.out 
 Running Intel(R) MKL from 1 to 68 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    627.42681 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    529.81592 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    351.05031 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    261.95568 milliseconds ==
 == using  4 thread(s) ==
 
 Requesting Intel(R) MKL to use  5 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    205.79281 milliseconds ==
 == using  5 thread(s) ==
 
 Requesting Intel(R) MKL to use  6 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    169.52665 milliseconds ==
 == using  6 thread(s) ==
 
 Requesting Intel(R) MKL to use  7 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    143.74205 milliseconds ==
 == using  7 thread(s) ==
 
 Requesting Intel(R) MKL to use  8 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    124.09850 milliseconds ==
 == using  8 thread(s) ==
 
 Requesting Intel(R) MKL to use  9 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    109.59659 milliseconds ==
 == using  9 thread(s) ==
 
 Requesting Intel(R) MKL to use 10 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     97.01271 milliseconds ==
 == using 10 thread(s) ==
 
 Requesting Intel(R) MKL to use 11 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     86.46407 milliseconds ==
 == using 11 thread(s) ==
 
 Requesting Intel(R) MKL to use 12 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     79.22576 milliseconds ==
 == using 12 thread(s) ==
 
 Requesting Intel(R) MKL to use 13 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     72.35889 milliseconds ==
 == using 13 thread(s) ==
 
 Requesting Intel(R) MKL to use 14 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     67.00823 milliseconds ==
 == using 14 thread(s) ==
 
 Requesting Intel(R) MKL to use 15 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     61.52943 milliseconds ==
 == using 15 thread(s) ==
 
 Requesting Intel(R) MKL to use 16 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     57.67981 milliseconds ==
 == using 16 thread(s) ==
 
 Requesting Intel(R) MKL to use 17 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     54.55822 milliseconds ==
 == using 17 thread(s) ==
 
 Requesting Intel(R) MKL to use 18 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     50.60534 milliseconds ==
 == using 18 thread(s) ==
 
 Requesting Intel(R) MKL to use 19 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     48.46043 milliseconds ==
 == using 19 thread(s) ==
 
 Requesting Intel(R) MKL to use 20 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     45.59280 milliseconds ==
 == using 20 thread(s) ==
 
 Requesting Intel(R) MKL to use 21 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     44.60451 milliseconds ==
 == using 21 thread(s) ==
 
 Requesting Intel(R) MKL to use 22 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     42.00900 milliseconds ==
 == using 22 thread(s) ==
 
 Requesting Intel(R) MKL to use 23 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     40.46292 milliseconds ==
 == using 23 thread(s) ==
 
 Requesting Intel(R) MKL to use 24 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     39.27597 milliseconds ==
 == using 24 thread(s) ==
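
To put numbers on the scaling in the output above, here is a small sketch that turns "threads milliseconds" pairs into speedup and parallel efficiency relative to the first row. The helper name speedup_table is hypothetical, and the timings are copied from the Knights Landing run above for a few selected thread counts.

```shell
# speedup_table (hypothetical helper): read "threads milliseconds" pairs on
# stdin and print threads, speedup relative to the first row, and efficiency.
# Assumes the first row is the single-thread run.
speedup_table() {
    awk 'NR == 1 { base = $2 }
         { s = base / $2; printf "%2d %6.2f %5.1f%%\n", $1, s, 100 * s / $1 }'
}

# Timings copied from the Knights Landing output above.
speedup_table <<'EOF'
1 627.42681
2 529.81592
4 261.95568
8 124.09850
16 57.67981
24 39.27597
EOF
```

Running the same helper on the four AMD EPYC timings from the first post gives a speedup close to 1 at every thread count, which is the flat behaviour described there.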

Gennady_F_Intel · 09-15-2021

This thread is being closed, and we will no longer respond to it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.