<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Re:Threaded MKL's DGEMM performance does not improve with increasing threads in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314145#M32039</link>
    <description>&lt;P&gt;Dear Gennady,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many thanks for bringing that to my attention. I discussed this with several other people, and a number of possible explanations came up. I ran exactly the same code as above on an Intel Xeon Phi 7250 ("Knights Landing") node and, sure enough, the scaling is there.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;c455-004[knl](172)$ OMP_NUM_THREADS=68 ./a.out 
 Running Intel(R) MKL from 1 to 68 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    627.42681 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    529.81592 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    351.05031 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    261.95568 milliseconds ==
 == using  4 thread(s) ==
 
 Requesting Intel(R) MKL to use  5 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    205.79281 milliseconds ==
 == using  5 thread(s) ==
 
 Requesting Intel(R) MKL to use  6 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    169.52665 milliseconds ==
 == using  6 thread(s) ==
 
 Requesting Intel(R) MKL to use  7 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    143.74205 milliseconds ==
 == using  7 thread(s) ==
 
 Requesting Intel(R) MKL to use  8 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    124.09850 milliseconds ==
 == using  8 thread(s) ==
 
 Requesting Intel(R) MKL to use  9 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    109.59659 milliseconds ==
 == using  9 thread(s) ==
 
 Requesting Intel(R) MKL to use 10 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     97.01271 milliseconds ==
 == using 10 thread(s) ==
 
 Requesting Intel(R) MKL to use 11 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     86.46407 milliseconds ==
 == using 11 thread(s) ==
 
 Requesting Intel(R) MKL to use 12 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     79.22576 milliseconds ==
 == using 12 thread(s) ==
 
 Requesting Intel(R) MKL to use 13 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     72.35889 milliseconds ==
 == using 13 thread(s) ==
 
 Requesting Intel(R) MKL to use 14 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     67.00823 milliseconds ==
 == using 14 thread(s) ==
 
 Requesting Intel(R) MKL to use 15 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     61.52943 milliseconds ==
 == using 15 thread(s) ==
 
 Requesting Intel(R) MKL to use 16 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     57.67981 milliseconds ==
 == using 16 thread(s) ==
 
 Requesting Intel(R) MKL to use 17 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     54.55822 milliseconds ==
 == using 17 thread(s) ==
 
 Requesting Intel(R) MKL to use 18 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     50.60534 milliseconds ==
 == using 18 thread(s) ==
 
 Requesting Intel(R) MKL to use 19 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     48.46043 milliseconds ==
 == using 19 thread(s) ==
 
 Requesting Intel(R) MKL to use 20 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     45.59280 milliseconds ==
 == using 20 thread(s) ==
 
 Requesting Intel(R) MKL to use 21 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     44.60451 milliseconds ==
 == using 21 thread(s) ==
 
 Requesting Intel(R) MKL to use 22 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     42.00900 milliseconds ==
 == using 22 thread(s) ==
 
 Requesting Intel(R) MKL to use 23 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     40.46292 milliseconds ==
 == using 23 thread(s) ==
 
 Requesting Intel(R) MKL to use 24 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     39.27597 milliseconds ==
 == using 24 thread(s) ==
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Mon, 13 Sep 2021 13:09:15 GMT</pubDate>
    <dc:creator>babreu</dc:creator>
    <dc:date>2021-09-13T13:09:15Z</dc:date>
    <item>
      <title>Threaded MKL's DGEMM performance does not improve with increasing threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1313330#M32023</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;I am trying to improve the performance of a Fortran code by making better use of MKL's DGEMM (DGEMV could be used as well). The code mainly performs matrix diagonalization, and profiling showed that ~70% of the time is spent in calls to DGEMM/DGEMV. However, it was not clear to me from the profiling results whether these calls benefit from multithreading. Therefore, I started experimenting with an isolated DGEMM code that is taken from &lt;A href="https://software.intel.com/content/www/us/en/develop/documentation/mkl-tutorial-fortran/top/measuring-performance-with-intel-mkl-support-functions.html" target="_self"&gt;here&lt;/A&gt;. To my surprise, I don't seem to be gaining any performance: the total run time is always the same, regardless of how many threads are requested. I understand that MKL may be making all sorts of optimization choices internally, but it is quite hard to tell what they are. Would you have any suggestions or comments on this issue?&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The code that I am running is:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="fortran"&gt;program mkl_dgemm
      use, intrinsic :: iso_fortran_env
      use :: mkl_service
      implicit none
      include "mkl_lapack.fi"
      integer, parameter :: dp = REAL64 ! double precision float
      integer, parameter :: i32 = INT32 ! 32-bit integer
      integer(i32), parameter :: ord1=40000_i32  ! leading dim of matrix
      integer(i32), parameter :: ord2=20000_i32   ! lower dim of matrix
      real(dp) :: startT, endT
      real(dp), dimension(:,:), allocatable :: m, v, p
      integer(i32) :: MAX_THREADS, l, i

      ! allocate
      allocate(m(ord1, ord2))
      allocate(v(ord2,1))
      allocate(p(ord1,1))

      ! fill in with random stuff
      call random_seed()
      call random_number(m)
      call random_number(v)
      p = 0.0_dp

      MAX_THREADS = MKL_GET_MAX_THREADS()
      PRINT 20," Running Intel(R) MKL from 1 to ",MAX_THREADS," threads"
 20   FORMAT(A,I2,A)
      PRINT *, ""

      do l = 1, MAX_THREADS
        PRINT 30, " Requesting Intel(R) MKL to use ", l," thread(s)"
 30     FORMAT(A,I2,A)
        CALL MKL_SET_NUM_THREADS(l)

        ! call MKL once as an untimed warm-up (syntax below)
        ! dgemm('N', 'N', M, N, K, ALPHA, A, M, B, K, BETA, C, M)
        call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)

        startT = dsecnd()
        !startT = omp_get_wtime()
        !call cpu_time(startT)
        do i = 1, 1
                call dgemm('N', 'N', ord1, 1, ord2, 1.0_dp, m, ord1, v, ord2, 0.0_dp, p, ord1)
        enddo
        !call cpu_time(endT)
        !endT = omp_get_wtime()
        endT = dsecnd()

        PRINT *, "== Matrix multiplication using Intel(R) MKL DGEMM =="
        PRINT 50, " == completed at ",(endT-startT)*1000," milliseconds =="
        PRINT 60, " == using ",l," thread(s) =="
 50     FORMAT(A,F12.5,A)
 60     FORMAT(A,I2,A)
        PRINT *, ""
      enddo

end program mkl_dgemm
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;My compiling options were taken from &lt;A href="https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/onemkl/link-line-advisor.html" target="_self"&gt;Link Line Advisor&lt;/A&gt;, which are:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;FC=ifort
MKLPATH=${MKLROOT}/lib/intel64
MKLINCLUDE=${MKLROOT}/mkl/include

LDFLAGS=-mkl=parallel -L${MKLPATH} -I${MKLINCLUDE} -I${MKLINCLUDE}/intel64/lp64 -lmkl_lapack95_lp64 -Wl,--start-group ${MKLPATH}/libmkl_intel_lp64.a ${MKLPATH}/libmkl_intel_thread.a ${MKLPATH}/libmkl_core.a -Wl,--end-group -liomp5 -lpthread -lm

all:
	$(FC) mkl_dgemm.f90 $(LDFLAGS)
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The MKL version that I have access to is: MKL 2020.4.304,&lt;/P&gt;
&lt;P&gt;and the Intel Fortran compiler is: ifort (IFORT) 2021.3.0 20210609&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Here's an example of output from &lt;EM&gt;./a.out&lt;/EM&gt; when I'm using 4 cores:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;[babreu@r002 intel]$ ./a.out 
 Running Intel(R) MKL from 1 to  4 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    378.01160 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    377.27408 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    379.07949 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == completed at    377.69205 milliseconds ==
 == using  4 thread(s) ==
&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The machine I am using has AMD EPYC 7742 CPUs. I am happy to provide any other information that you may find useful.&lt;/P&gt;
&lt;P&gt;Thanks!&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 09 Sep 2021 15:18:13 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1313330#M32023</guid>
      <dc:creator>babreu</dc:creator>
      <dc:date>2021-09-09T15:18:13Z</dc:date>
    </item>
    <item>
      <title>Re:Threaded MKL's DGEMM performance does not improve with increasing threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314115#M32034</link>
      <description>&lt;P&gt;It seems the OpenMP runtime doesn’t support non-Intel architectures.&lt;/P&gt;&lt;P&gt;You can try the latest MKL 2021, set the environment variable with export OMP_NUM_THREADS=24 (the number of physical threads), and check the scalability once again. Note, however, that we don't validate this behaviour on non-Intel systems on our end.&lt;/P&gt;</description>
      <pubDate>Mon, 13 Sep 2021 10:45:08 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314115#M32034</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2021-09-13T10:45:08Z</dc:date>
    </item>
    <item>
      <title>Re: Re:Threaded MKL's DGEMM performance does not improve with increasing threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314145#M32039</link>
      <description>&lt;P&gt;Dear Gennady,&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Many thanks for bringing that to my attention. I discussed this with several other people, and a number of possible explanations came up. I ran exactly the same code as above on an Intel Xeon Phi 7250 ("Knights Landing") node and, sure enough, the scaling is there.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="none"&gt;c455-004[knl](172)$ OMP_NUM_THREADS=68 ./a.out 
 Running Intel(R) MKL from 1 to 68 threads
 
 Requesting Intel(R) MKL to use  1 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    627.42681 milliseconds ==
 == using  1 thread(s) ==
 
 Requesting Intel(R) MKL to use  2 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    529.81592 milliseconds ==
 == using  2 thread(s) ==
 
 Requesting Intel(R) MKL to use  3 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    351.05031 milliseconds ==
 == using  3 thread(s) ==
 
 Requesting Intel(R) MKL to use  4 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    261.95568 milliseconds ==
 == using  4 thread(s) ==
 
 Requesting Intel(R) MKL to use  5 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    205.79281 milliseconds ==
 == using  5 thread(s) ==
 
 Requesting Intel(R) MKL to use  6 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    169.52665 milliseconds ==
 == using  6 thread(s) ==
 
 Requesting Intel(R) MKL to use  7 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    143.74205 milliseconds ==
 == using  7 thread(s) ==
 
 Requesting Intel(R) MKL to use  8 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    124.09850 milliseconds ==
 == using  8 thread(s) ==
 
 Requesting Intel(R) MKL to use  9 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed    109.59659 milliseconds ==
 == using  9 thread(s) ==
 
 Requesting Intel(R) MKL to use 10 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     97.01271 milliseconds ==
 == using 10 thread(s) ==
 
 Requesting Intel(R) MKL to use 11 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     86.46407 milliseconds ==
 == using 11 thread(s) ==
 
 Requesting Intel(R) MKL to use 12 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     79.22576 milliseconds ==
 == using 12 thread(s) ==
 
 Requesting Intel(R) MKL to use 13 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     72.35889 milliseconds ==
 == using 13 thread(s) ==
 
 Requesting Intel(R) MKL to use 14 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     67.00823 milliseconds ==
 == using 14 thread(s) ==
 
 Requesting Intel(R) MKL to use 15 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     61.52943 milliseconds ==
 == using 15 thread(s) ==
 
 Requesting Intel(R) MKL to use 16 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     57.67981 milliseconds ==
 == using 16 thread(s) ==
 
 Requesting Intel(R) MKL to use 17 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     54.55822 milliseconds ==
 == using 17 thread(s) ==
 
 Requesting Intel(R) MKL to use 18 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     50.60534 milliseconds ==
 == using 18 thread(s) ==
 
 Requesting Intel(R) MKL to use 19 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     48.46043 milliseconds ==
 == using 19 thread(s) ==
 
 Requesting Intel(R) MKL to use 20 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     45.59280 milliseconds ==
 == using 20 thread(s) ==
 
 Requesting Intel(R) MKL to use 21 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     44.60451 milliseconds ==
 == using 21 thread(s) ==
 
 Requesting Intel(R) MKL to use 22 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     42.00900 milliseconds ==
 == using 22 thread(s) ==
 
 Requesting Intel(R) MKL to use 23 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     40.46292 milliseconds ==
 == using 23 thread(s) ==
 
 Requesting Intel(R) MKL to use 24 thread(s)
 == Matrix multiplication using Intel(R) MKL DGEMM ==
 == Timed     39.27597 milliseconds ==
 == using 24 thread(s) ==
&lt;/LI-CODE&gt;
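&lt;P&gt;To put numbers on the scaling, the timings can be converted into speedup and parallel efficiency relative to the 1-thread run. The following is only a minimal sketch: the helper name scaling_table is hypothetical, the KNL values are a subset of the output above, and the AMD values are copied from the 4-core run in the first post.&lt;/P&gt;

```python
# Timings in milliseconds, copied by hand from the outputs in this thread.
# knl_ms is a subset of the Xeon Phi 7250 run; amd_ms is the EPYC 7742 run.
knl_ms = {1: 627.42681, 2: 529.81592, 4: 261.95568, 8: 124.09850,
          16: 57.67981, 24: 39.27597}
amd_ms = {1: 378.01160, 2: 377.27408, 3: 379.07949, 4: 377.69205}


def scaling_table(timings_ms):
    """Return (threads, speedup, efficiency) tuples relative to the 1-thread run.

    Hypothetical helper for illustration; it is not part of MKL.
    """
    base = timings_ms[1]
    rows = []
    for threads in sorted(timings_ms):
        speedup = base / timings_ms[threads]
        # Efficiency is the speedup divided by the number of threads used.
        rows.append((threads, speedup, speedup / threads))
    return rows


for label, timings in (("KNL", knl_ms), ("AMD", amd_ms)):
    print(label)
    for threads, speedup, efficiency in scaling_table(timings):
        print(f"  {threads:2d} threads: speedup {speedup:5.2f}x, "
              f"efficiency {efficiency:4.0%}")
```

&lt;P&gt;With these inputs, the KNL run reaches a speedup of about 16x at 24 threads (roughly 67% efficiency), while the AMD run stays at about 1x for every thread count.&lt;/P&gt;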
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Mon, 13 Sep 2021 13:09:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314145#M32039</guid>
      <dc:creator>babreu</dc:creator>
      <dc:date>2021-09-13T13:09:15Z</dc:date>
    </item>
    <item>
      <title>Re:Threaded MKL's DGEMM performance does not improve with increasing threads</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314774#M32062</link>
      <description>&lt;P&gt;This thread is being closed, and we will no longer respond to it. If you require additional assistance from Intel, please start a new thread. Any further interaction in this thread will be considered community only.&lt;/P&gt;</description>
      <pubDate>Wed, 15 Sep 2021 11:16:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Threaded-MKL-s-DGEMM-performance-does-not-improve-with/m-p/1314774#M32062</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2021-09-15T11:16:52Z</dc:date>
    </item>
  </channel>
</rss>

