Intel® oneAPI Math Kernel Library

Threaded MKL's DGEMM performance does not improve with increasing threads

babreu
Beginner

Hello,

 

I am trying to improve the performance of a Fortran code by making better use of MKL's DGEMM (DGEMV could be used as well). The code essentially performs matrix diagonalization, and profiling showed that ~70% of the time is spent in calls to DGEMM/DGEMV. However, it was not clear from the profiling results whether those calls were actually benefiting from multithreading. I therefore started experimenting with an isolated DGEMM test program, taken from here. To my surprise, I don't seem to gain any performance: the total run time is always the same, regardless of how many threads are requested. I understand that MKL may be making all sorts of optimizations and smart choices internally, but it is quite hard to tell what they are. Would you have any suggestions or comments on this issue?
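
(As an aside on checking whether the individual calls are threaded: MKL's verbose mode prints one line per BLAS call, including the number of threads it used in an NThr field. Below is only a minimal sketch, assuming the mkl_verbose support function interface is available through the mkl_service module; setting the environment variable MKL_VERBOSE=1 before running should give the same output without any code change.)

      integer :: old_verbose
      ! enable MKL verbose mode: each subsequent BLAS call prints its arguments,
      ! its elapsed time, and the thread count it used (NThr)
      old_verbose = mkl_verbose(1)   ! return value is the previous verbose setting
      call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)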

 

The code that I am running is:

 

program mkl_dgemm
      use, intrinsic :: iso_fortran_env
      use :: mkl_service
      implicit none
      include "mkl_lapack.fi"
      integer, parameter :: dp = REAL64 ! double precision float
      integer, parameter :: i32 = INT32 ! 32-bit integer
      integer(i32), parameter :: ord1=40000_i32  ! number of rows of the matrix (leading dimension)
      integer(i32), parameter :: ord2=20000_i32  ! number of columns of the matrix
      real(dp) :: startT, endT
      real(dp), dimension(:,:), allocatable :: m, v, p
      integer(i32) :: MAX_THREADS, l, i

      ! allocate
      allocate(m(ord1, ord2))
      allocate(v(ord2,1))
      allocate(p(ord1,1))

      ! fill in with random stuff
      call random_seed()
      call random_number(m)
      call random_number(v)
      p = 0.0_dp

      MAX_THREADS = MKL_GET_MAX_THREADS()
      PRINT 20," Running Intel(R) MKL from 1 to ",MAX_THREADS," threads"
 20   FORMAT(A,I2,A)
      PRINT *, ""

      do l = 1, MAX_THREADS
        PRINT 30, " Requesting Intel(R) MKL to use ", l," thread(s)"
 30     FORMAT(A,I2,A)
        CALL MKL_SET_NUM_THREADS(l)

        ! call MKL DGEMM: C := alpha*op(A)*op(B) + beta*C
        ! dgemm(transa, transb, M, N, K, alpha, A, lda, B, ldb, beta, C, ldc)
        ! untimed warm-up call (executed before the timer starts), so MKL
        ! initialization and thread creation are not included in the measurement
        call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

        startT = dsecnd()
        !startT = omp_get_wtime()
        !call cpu_time(startT)
        do i = 1, 1
                call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
        enddo
        !call cpu_time(endT)
        !endT = omp_get_wtime()
        endT = dsecnd()

        PRINT *, "== Matrix multiplication using Intel(R) MKL DGEMM =="
        PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
        PRINT 60, " == using ",l," thread(s) =="
 50     FORMAT(A,F12.5,A)
 60     FORMAT(A,I2,A)
        PRINT *, ""
      enddo

end program mkl_dgemm
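
Since the second operand has only one column (N = 1), the same product can also be expressed as a matrix-vector call. For reference, a minimal sketch of the equivalent DGEMV call, assuming the same m, v, p arrays and constants declared in the program above (not something benchmarked here):

      ! DGEMV computes y := alpha*A*x + beta*y
      ! dgemv(trans, M, N, alpha, A, lda, x, incx, beta, y, incy)
      call dgemv('N', ord1, ord2, 1.0_dp, m, ord1, v, 1, 0.0_dp, p, 1)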

 

 

 

My compile and link options were taken from the Intel Link Line Advisor:

 

FC=ifort
MKLPATH=${MKLROOT}/lib/intel64
MKLINCLUDE=${MKLROOT}/mkl/include

LDFLAGS=-mkl=parallel -L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/intel64/lp64 -lmkl_lapack95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm

all:
	$(FC) mkl_dgemm.f90 $(LDFLAGS)

 

 

The MKL version that I have access to is MKL 2020.4.304, and the Intel Fortran compiler is ifort (IFORT) 2021.3.0 20210609.

 

Here's an example of output from ./a.out when I'm using 4 cores:

 

[babreu@r002 intel]$ ./a.out 
 Running Intel(R) MKL from 1 to  4 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    378.01160 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    377.27408 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    379.07949 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    377.69205 milliseconds ==
 == using  4 thread(s) ==

 


The machine I am using has AMD EPYC 7742 CPUs. I am happy to provide any other information that you may find useful.

Thanks!

 

1 Solution
3 Replies
Gennady_F_Intel
Moderator

It seems the OpenMP runtime doesn't support non-Intel architectures.

You can try the latest MKL 2021, set the environment variable OMP_NUM_THREADS=24 (the number of physical threads), and check the scalability once again. I have to note, however, that we don't validate this behaviour on non-Intel-based systems on our end.
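
As an additional sanity check, the MKL service functions can be queried from inside the program to see how many threads MKL will actually use. This is only a minimal sketch (assuming the standard mkl_service interfaces, placed inside the existing thread loop of the test program above, where l is the requested thread count):

        ! disable MKL's dynamic thread adjustment so the requested count is honoured,
        ! then report what MKL will actually use for the next call
        call mkl_set_dynamic(0)
        call mkl_set_num_threads(l)
        print *, 'mkl_get_dynamic()     = ', mkl_get_dynamic()
        print *, 'mkl_get_max_threads() = ', mkl_get_max_threads()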



babreu
Beginner

Dear Gennady,

 

Many thanks for bringing that to my attention. I discussed this with several other people and a number of possibilities were lined up. I tried exactly the same code as above on an Intel Xeon Phi 7250 ("Knights Landing") node and, sure enough, the scaling is there.

 

 

c455-004[knl](172)$ OMP_NUM_THREADS=68 ./a.out 
 Running Intel(R) MKL from 1 to 68 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    627.42681 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    529.81592 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    351.05031 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    261.95568 milliseconds ==
 == using  4 thread(s) ==
 
 Requesting Intel(R) MKL to use  5 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    205.79281 milliseconds ==
 == using  5 thread(s) ==
 
 Requesting Intel(R) MKL to use  6 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    169.52665 milliseconds ==
 == using  6 thread(s) ==
 
 Requesting Intel(R) MKL to use  7 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    143.74205 milliseconds ==
 == using  7 thread(s) ==
 
 Requesting Intel(R) MKL to use  8 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    124.09850 milliseconds ==
 == using  8 thread(s) ==
 
 Requesting Intel(R) MKL to use  9 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    109.59659 milliseconds ==
 == using  9 thread(s) ==
 
 Requesting Intel(R) MKL to use 10 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     97.01271 milliseconds ==
 == using 10 thread(s) ==
 
 Requesting Intel(R) MKL to use 11 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     86.46407 milliseconds ==
 == using 11 thread(s) ==
 
 Requesting Intel(R) MKL to use 12 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     79.22576 milliseconds ==
 == using 12 thread(s) ==
 
 Requesting Intel(R) MKL to use 13 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     72.35889 milliseconds ==
 == using 13 thread(s) ==
 
 Requesting Intel(R) MKL to use 14 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     67.00823 milliseconds ==
 == using 14 thread(s) ==
 
 Requesting Intel(R) MKL to use 15 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     61.52943 milliseconds ==
 == using 15 thread(s) ==
 
 Requesting Intel(R) MKL to use 16 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     57.67981 milliseconds ==
 == using 16 thread(s) ==
 
 Requesting Intel(R) MKL to use 17 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     54.55822 milliseconds ==
 == using 17 thread(s) ==
 
 Requesting Intel(R) MKL to use 18 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     50.60534 milliseconds ==
 == using 18 thread(s) ==
 
 Requesting Intel(R) MKL to use 19 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     48.46043 milliseconds ==
 == using 19 thread(s) ==
 
 Requesting Intel(R) MKL to use 20 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     45.59280 milliseconds ==
 == using 20 thread(s) ==
 
 Requesting Intel(R) MKL to use 21 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     44.60451 milliseconds ==
 == using 21 thread(s) ==
 
 Requesting Intel(R) MKL to use 22 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     42.00900 milliseconds ==
 == using 22 thread(s) ==
 
 Requesting Intel(R) MKL to use 23 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     40.46292 milliseconds ==
 == using 23 thread(s) ==
 
 Requesting Intel(R) MKL to use 24 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     39.27597 milliseconds ==
 == using 24 thread(s) ==

 

 

Gennady_F_Intel
Moderator

This query is being closed and we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community-only.


