Hello,

I am trying to improve the performance of a Fortran code by making better use of MKL's DGEMM (DGEMV could be used as well). The code essentially performs matrix diagonalization, and profiling shows that ~70% of the time is spent in calls to DGEMM(V). However, the profiling results did not make clear whether these calls benefit from multithreading, so I started experimenting with an isolated DGEMM code taken from here. To my surprise, I don't seem to be gaining any performance: the total run time is always the same, regardless of how many threads are requested. I understand that MKL may be making all sorts of optimizations and smart choices, but it is quite hard to tell what they are. Would you have any suggestions or comments on this issue?

The code that I am running is:

```
program mkl_dgemm
  use, intrinsic :: iso_fortran_env
  use :: mkl_service
  implicit none
  include "mkl_lapack.fi"
  integer, parameter :: dp = REAL64 ! double precision float
  integer, parameter :: i32 = INT32 ! 32-bit integer
  integer(i32), parameter :: ord1=40000_i32 ! leading dim of matrix
  integer(i32), parameter :: ord2=20000_i32 ! lower dim of matrix
  real(dp) :: startT, endT
  real(dp), dimension(:,:), allocatable :: m, v, p
  integer(i32) :: MAX_THREADS, l, i
  ! allocate
  allocate(m(ord1, ord2))
  allocate(v(ord2,1))
  allocate(p(ord1,1))
  ! fill in with random stuff
  call random_seed()
  call random_number(m)
  call random_number(v)
  p = 0.0_dp
  MAX_THREADS = MKL_GET_MAX_THREADS()
  PRINT 20," Running Intel(R) MKL from 1 to ",MAX_THREADS," threads"
  20 FORMAT(A,I2,A)
  PRINT *, ""
  do l = 1, MAX_THREADS
    PRINT 30, " Requesting Intel(R) MKL to use ", l," thread(s)"
    30 FORMAT(A,I2,A)
    CALL MKL_SET_NUM_THREADS(l)
    ! call MKL (syntax below)
    ! dgemm('N', 'N', M, N, K, ALPHA, A, M, B, K, BETA, C, M)
    ! untimed warm-up call
    call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
    startT = dsecnd()
    !startT = omp_get_wtime()
    !call cpu_time(startT)
    do i = 1, 1
      call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
    enddo
    !call cpu_time(endT)
    !endT = omp_get_wtime()
    endT = dsecnd()
    PRINT *, "== Matrix multiplication using Intel(R) MKL DGEMM =="
    PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
    PRINT 60, " == using ",l," thread(s) =="
    50 FORMAT(A,F12.5,A)
    60 FORMAT(A,I2,A)
    PRINT *, ""
  enddo
end program mkl_dgemm
```

My compile options, taken from the Link Line Advisor, are:

```
FC=ifort
MKLPATH=${MKLROOT}/lib/intel64
MKLINCLUDE=${MKLROOT}/mkl/include
LDFLAGS=-mkl=parallel -L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/intel64/lp64 -lmkl_lapack95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm
all:
	$(FC) mkl_dgemm.f90 $(LDFLAGS)
```

The MKL version that I have access to is MKL 2020.4.304, and the Intel Fortran compiler is ifort (IFORT) 2021.3.0 20210609.

Here's an example of output from *./a.out* when I'm using 4 cores:

```
[babreu@r002 intel]$ ./a.out
Running Intel(R) MKL from 1 to 4 threads
Requesting Intel(R) MKL to use 1 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== completed at 378.01160 milliseconds ==
== using 1 thread(s) ==
Requesting Intel(R) MKL to use 2 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== completed at 377.27408 milliseconds ==
== using 2 thread(s) ==
Requesting Intel(R) MKL to use 3 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== completed at 379.07949 milliseconds ==
== using 3 thread(s) ==
Requesting Intel(R) MKL to use 4 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== completed at 377.69205 milliseconds ==
== using 4 thread(s) ==
```
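For reference, the timings above can be turned into throughput figures. This is only a rough sketch. It assumes the product reads each of the 40000 x 20000 double-precision elements of m exactly once (8 bytes and 2 floating-point operations per element) and ignores the traffic for v and p. The helper names gemv_gbps and gemv_gflops are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: convert a matrix-vector timing into throughput figures.
# Arguments: rows, columns, elapsed time in milliseconds.

# Effective bandwidth in GB/s, counting only the rows*cols doubles of the matrix
gemv_gbps() {
    awk -v m="$1" -v k="$2" -v ms="$3" \
        'BEGIN { printf "%.1f\n", (m * k * 8) / 1e9 / (ms / 1000) }'
}

# Floating-point rate in GFLOP/s, counting one multiply and one add per element
gemv_gflops() {
    awk -v m="$1" -v k="$2" -v ms="$3" \
        'BEGIN { printf "%.1f\n", (2 * m * k) / 1e9 / (ms / 1000) }'
}

# 1-thread run from the output above
gemv_gbps   40000 20000 378.01160
gemv_gflops 40000 20000 378.01160
```

For the 1-thread run this gives roughly 16.9 GB/s and 4.2 GFLOP/s. These numbers describe the measurement only; they do not by themselves explain why the time stays flat as threads are added.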

The machine I am using has AMD EPYC 7742 CPUs. I am happy to provide any other information that you may find useful.

Thanks!


It seems the OpenMP runtime doesn’t support non-Intel architectures.

You can try the latest MKL 2021, set the environment variable with export OMP_NUM_THREADS=24 (the number of physical threads), and check the scalability once again. I have to note, though, that we don't validate this behaviour on our end on non-Intel-based systems.
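A possible way to try this and to see how many threads MKL actually uses per call is sketched below. The value 24 is simply the one suggested above, and ./a.out is assumed to be the binary built from the code in the original post. MKL_NUM_THREADS, MKL_DYNAMIC and MKL_VERBOSE are standard MKL environment variables; whether they change the behaviour on an AMD system is not something this sketch can promise.

```shell
#!/bin/sh
# Thread count taken from the suggestion above; adjust to the machine.
export OMP_NUM_THREADS=24
# MKL-specific thread count; MKL gives it precedence over OMP_NUM_THREADS.
export MKL_NUM_THREADS=24
# Ask MKL not to lower the thread count on its own.
export MKL_DYNAMIC=FALSE
# Print one line per MKL call, including the number of threads used (NThr).
export MKL_VERBOSE=1

# ./a.out is assumed to be the binary from the original post; skip if absent.
if [ -x ./a.out ]; then
    ./a.out
fi
```

Note that the test program calls MKL_SET_NUM_THREADS inside its loop, and a thread count set through that call takes precedence over the environment variables. The variables mainly determine the upper bound reported by MKL_GET_MAX_THREADS, while the NThr field in each MKL_VERBOSE line shows what was really used for that DGEMM call.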

Dear Gennady,

Many thanks for bringing that to my attention. I had many discussions with several other people and a lot of possibilities were lined up. I tried exactly the same code as above on an Intel Xeon Phi 7250 ("Knights Landing") node and, sure enough, the scaling is there.

```
c455-004[knl](172)$ OMP_NUM_THREADS=68 ./a.out
Running Intel(R) MKL from 1 to 68 threads
Requesting Intel(R) MKL to use 1 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 627.42681 milliseconds ==
== using 1 thread(s) ==
Requesting Intel(R) MKL to use 2 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 529.81592 milliseconds ==
== using 2 thread(s) ==
Requesting Intel(R) MKL to use 3 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 351.05031 milliseconds ==
== using 3 thread(s) ==
Requesting Intel(R) MKL to use 4 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 261.95568 milliseconds ==
== using 4 thread(s) ==
Requesting Intel(R) MKL to use 5 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 205.79281 milliseconds ==
== using 5 thread(s) ==
Requesting Intel(R) MKL to use 6 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 169.52665 milliseconds ==
== using 6 thread(s) ==
Requesting Intel(R) MKL to use 7 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 143.74205 milliseconds ==
== using 7 thread(s) ==
Requesting Intel(R) MKL to use 8 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 124.09850 milliseconds ==
== using 8 thread(s) ==
Requesting Intel(R) MKL to use 9 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 109.59659 milliseconds ==
== using 9 thread(s) ==
Requesting Intel(R) MKL to use 10 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 97.01271 milliseconds ==
== using 10 thread(s) ==
Requesting Intel(R) MKL to use 11 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 86.46407 milliseconds ==
== using 11 thread(s) ==
Requesting Intel(R) MKL to use 12 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 79.22576 milliseconds ==
== using 12 thread(s) ==
Requesting Intel(R) MKL to use 13 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 72.35889 milliseconds ==
== using 13 thread(s) ==
Requesting Intel(R) MKL to use 14 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 67.00823 milliseconds ==
== using 14 thread(s) ==
Requesting Intel(R) MKL to use 15 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 61.52943 milliseconds ==
== using 15 thread(s) ==
Requesting Intel(R) MKL to use 16 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 57.67981 milliseconds ==
== using 16 thread(s) ==
Requesting Intel(R) MKL to use 17 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 54.55822 milliseconds ==
== using 17 thread(s) ==
Requesting Intel(R) MKL to use 18 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 50.60534 milliseconds ==
== using 18 thread(s) ==
Requesting Intel(R) MKL to use 19 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 48.46043 milliseconds ==
== using 19 thread(s) ==
Requesting Intel(R) MKL to use 20 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 45.59280 milliseconds ==
== using 20 thread(s) ==
Requesting Intel(R) MKL to use 21 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 44.60451 milliseconds ==
== using 21 thread(s) ==
Requesting Intel(R) MKL to use 22 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 42.00900 milliseconds ==
== using 22 thread(s) ==
Requesting Intel(R) MKL to use 23 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 40.46292 milliseconds ==
== using 23 thread(s) ==
Requesting Intel(R) MKL to use 24 thread(s)
== Matrix multiplication using Intel(R) MKL DGEMM ==
== Timed 39.27597 milliseconds ==
== using 24 thread(s) ==
```
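The scaling in this listing can be summarized as a speedup and a parallel efficiency relative to the 1-thread time. This is a small sketch using the first and last timings shown; the function names speedup and parallel_efficiency are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for summarizing a thread-scaling run.

# Speedup: 1-thread time divided by N-thread time (both in the same unit)
speedup() {
    awk -v t1="$1" -v tn="$2" 'BEGIN { printf "%.2f\n", t1 / tn }'
}

# Parallel efficiency in percent: speedup divided by the thread count
parallel_efficiency() {
    awk -v t1="$1" -v tn="$2" -v n="$3" \
        'BEGIN { printf "%.1f\n", 100 * (t1 / tn) / n }'
}

# 1-thread and 24-thread timings (milliseconds) from the listing above
speedup             627.42681 39.27597
parallel_efficiency 627.42681 39.27597 24
```

With these two timings the speedup at 24 threads is about 15.97, which corresponds to a parallel efficiency of about 66.6%.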


This query is closing; we will no longer respond to this thread. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.
