<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Your poisson solver probably in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108790#M24237</link>
    <description>&lt;P&gt;Your poisson solver probably does something more than just calling DGEMM, so it is not clear what part of the 2 minutes was actually used up by DGEMM. If you are concerned about the time used up in DGEMM, you should take steps to meter the time spent in DGEMM separately from the time spent elsewhere.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Tue, 15 Dec 2015 01:15:54 GMT</pubDate>
    <dc:creator>mecej4</dc:creator>
    <dc:date>2015-12-15T01:15:54Z</dc:date>
    <item>
      <title>Question about DGEMM</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108789#M24236</link>
      <description>&lt;P&gt;I am using DGEMM from MKL to multiply matrices by vectors.&lt;/P&gt;

&lt;P&gt;In a simple test program that just calls DGEMM 200000 times to multiply a 256*256 matrix by a 256*1 vector, it takes only about 7 seconds (nthreads=8).&lt;/P&gt;

&lt;P&gt;In my real Poisson solver, which needs this multiplication 200000 times, still a 256*256 matrix times a 256*1 vector, it takes 2 minutes, which is much slower than in the simple test.&lt;/P&gt;

&lt;P&gt;Could anyone suggest a reason for this low performance? My Poisson solver is OpenMP code.&lt;/P&gt;

&lt;P&gt;Thanks in advance!&lt;/P&gt;

&lt;P&gt;Sincerely,&lt;/P&gt;

&lt;P&gt;Xuan&lt;/P&gt;</description>
      <pubDate>Mon, 14 Dec 2015 23:22:14 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108789#M24236</guid>
      <dc:creator>Xuan_Z_</dc:creator>
      <dc:date>2015-12-14T23:22:14Z</dc:date>
    </item>
    <item>
      <title>Your poisson solver probably</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108790#M24237</link>
      <description>&lt;P&gt;Your Poisson solver probably does something more than just calling DGEMM, so it is not clear what part of the 2 minutes was actually used up by DGEMM. If you are concerned about the time used up in DGEMM, you should take steps to meter the time spent in DGEMM separately from the time spent elsewhere.&lt;/P&gt;</description>
      <pubDate>Tue, 15 Dec 2015 01:15:54 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108790#M24237</guid>
      <dc:creator>mecej4</dc:creator>
      <dc:date>2015-12-15T01:15:54Z</dc:date>
    </item>
    <item>
      <title>Hi Xuan, </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108791#M24238</link>
      <description>&lt;P&gt;Hi Xuan,&lt;/P&gt;

&lt;P&gt;You could export or set MKL_VERBOSE=1 in both cases, as described in &lt;A href="https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112"&gt;https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112&lt;/A&gt;, and let us know the output.&lt;/P&gt;

&lt;P&gt;In addition, could you please give us some basic information, such as CPU/OS, MKL version, compiler version, C or Fortran, 32-bit or 64-bit, etc.?&lt;/P&gt;

&lt;P&gt;You mentioned that your Poisson solver is OpenMP code, so are you using GNU OpenMP or libiomp5.so (Intel OpenMP)? And are the 200000 multiplications of a 256*256 matrix by a 256*1 vector inside an OpenMP loop (so that the parallelism is nested)? It would be helpful if you could provide sample code that reproduces the problem.&lt;/P&gt;

&lt;P&gt;Best Regards,&lt;BR /&gt;
Ying H.&lt;BR /&gt;
Intel MKL Support&lt;/P&gt;</description>
      <pubDate>Tue, 15 Dec 2015 01:22:48 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108791#M24238</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2015-12-15T01:22:48Z</dc:date>
    </item>
    <item>
      <title> </title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108792#M24239</link>
      <description>&lt;P&gt;Hi mecej4,&lt;/P&gt;

&lt;P&gt;I profiled the solver and found the following:&lt;/P&gt;

&lt;P&gt;[Profile screenshot; image not available.]&lt;/P&gt;

&lt;P&gt;Most of the time is spent in '_kmp_launch_thread' and '_kmp_execute_tasks'. Do you know what these subroutines are used for?&lt;/P&gt;

&lt;P&gt;Thank you!&lt;/P&gt;

</description>
      <pubDate>Wed, 16 Dec 2015 17:15:59 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108792#M24239</guid>
      <dc:creator>Xuan_Z_</dc:creator>
      <dc:date>2015-12-16T17:15:59Z</dc:date>
    </item>
    <item>
      <title>Hi Ying,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108793#M24240</link>
      <description>&lt;P&gt;Hi Ying,&lt;/P&gt;

&lt;P&gt;Thank you for the comments!&lt;/P&gt;

&lt;P&gt;My compiler is ifort version 11.1, Fortran, 64-bit.&lt;/P&gt;

&lt;P&gt;I use libiomp5.so. The 200000 matrix-vector multiplications are not inside an OpenMP loop.&lt;/P&gt;

&lt;P&gt;Here is the simple test code for the multiplication:&lt;/P&gt;

&lt;PRE&gt;
      program matrixmul
      implicit none
      integer*4 :: i,j,k,n,nthreads
      real*8, allocatable, dimension(:,:) :: a
      real*8, allocatable, dimension(:) :: v,v2

      n=256
      nthreads=8

      allocate( a(n,n) )
      allocate( v(n) )
      allocate( v2(n) )

      call OMP_SET_NUM_THREADS(nthreads)

      ! fill the matrix and the input vector
      do j=1,n
        v(j)=j*2.0
        do i=1,n
          a(i,j) = j*1.0
        end do
      end do

      ! v2 = a*v, repeated 200000 times
      do i=1,200000
        call DGEMM("N","N",n,1,n,1.d0,a,n,v,n,0.d0,v2,n)
      end do

      print *,'done'
      end program matrixmul
&lt;/PRE&gt;

&lt;P&gt;It is faster than my OpenMP version of the multiplication:&lt;/P&gt;

&lt;PRE&gt;
      program matrixmul
      implicit none
      integer*4 :: i,j,k,n,nthreads,chunk,m
      real*8, pointer :: a(:,:)
      real*8, pointer :: v(:),v2(:)

      n=256
      nthreads=8
      chunk=10

      call OMP_SET_NUM_THREADS(nthreads)

      allocate( a(n,n) )
      allocate( v(n) )
      allocate( v2(n) )

      ! fill the matrix and the input vector
      do j=1,n
        v(j)=j*2.0
        do i=1,n
          a(i,j) = j*1.0
        end do
      end do

      do m=1,200000
!$OMP PARALLEL SHARED(A,v,v2,CHUNK) PRIVATE(I,K)
!$OMP DO SCHEDULE(STATIC, CHUNK)
        do i = 1, n
          ! reset, so the result matches DGEMM with beta = 0
          v2(i) = 0.d0
          do k = 1, n
            v2(i) = v2(i) + a(i,k) * v(k)
          end do
        end do
!$OMP END PARALLEL
      enddo

      print *,'done'
      end program matrixmul
&lt;/PRE&gt;

&lt;P&gt;Then I tested in the real solver.&lt;/P&gt;

&lt;P&gt;Here is the part of my solver in which the multiplications are done:&lt;/P&gt;

&lt;PRE&gt;
! initialize x
         x(0:,1,0:)=0.0

!         call mul2(dpinv1(0:,0:,m),cp(0:,1,m),n,x(0:,1,m),nthread)
         call DGEMM("N","N",n+1,1,n+1,1.d0,dpinv1(0:,0:,m),n+1,cp(0:,1,m),n+1,0.d0,x(0:,1,m),n+1)

! solve for x from the p
         do k = m-1, 0, -1

!         call mul2(p1(0:,0:,k),x(0:,1,k+1),n,tmp(0:,1,k),nthread)
         call DGEMM("N","N",n+1,1,n+1,1.d0,p1(0:,0:,k),n+1,x(0:,1,k+1),n+1,0.d0,tmp(0:,1,k),n+1)

!         call mul2(dpinv1(0:,0:,k),cp(0:,1,k),n,tmp2(0:,1,k),nthread)
         call DGEMM("N","N",n+1,1,n+1,1.d0,dpinv1(0:,0:,k),n+1,cp(0:,1,k),n+1,0.d0,tmp2(0:,1,k),n+1)

         x(0:,1,k) = tmp2(0:,1,k)-tmp(0:,1,k)

         tmp=0.0
         tmp2=0.0

         end do

        psol(0:n,0:m)= x(0:n,1,0:m)
&lt;/PRE&gt;

&lt;P&gt;The other version uses 'call mul2(...)' to do the multiplication, and I compared the performance of the two.&lt;/P&gt;

&lt;P&gt;In the real solver, the MKL version takes 1m53s, while the OpenMP version takes 51s. The performance profile shows that most of the time is spent in two subroutines: "_kmp_launch_thread" and "_kmp_execute_tasks". Do you know why these subroutines show up? Thank you!&lt;/P&gt;

&lt;P&gt;Sincerely,&lt;/P&gt;

&lt;P&gt;Xuan&lt;/P&gt;

</description>
      <pubDate>Wed, 16 Dec 2015 17:42:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108793#M24240</guid>
      <dc:creator>Xuan_Z_</dc:creator>
      <dc:date>2015-12-16T17:42:00Z</dc:date>
    </item>
    <item>
      <title>Hi Xuan,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108794#M24241</link>
      <description>&lt;P&gt;Hi Xuan,&lt;/P&gt;
&lt;P&gt;"_kmp_launch_thread" and "_kmp_execute_tasks" come from the Intel OpenMP threading library, which is used both by the OpenMP directives and by MKL internally.&lt;/P&gt;
&lt;P&gt;I can't see the profile image you attached (maybe because of an internal network issue). Do you know how many OpenMP threads the application uses? You can try the command:&lt;/P&gt;
&lt;P&gt;export KMP_AFFINITY=verbose&lt;/P&gt;
&lt;P&gt;to see whether there is oversubscription or nested threading.&lt;/P&gt;
&lt;P&gt;Multiplying a 256*256 matrix by a 256*1 vector should be a dgemv operation, right? I recall that the multi-threaded performance of such operations should have been improved in later versions, so I suggest trying the latest MKL version, MKL 11.3.1, which you can apply for and get for free from &lt;A href="http://software.intel.com/sites/campaigns/nest/"&gt;http://software.intel.com/sites/campaigns/nest/&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Ying&lt;/P&gt;</description>
      <pubDate>Wed, 23 Dec 2015 01:58:22 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108794#M24241</guid>
      <dc:creator>Ying_H_Intel</dc:creator>
      <dc:date>2015-12-23T01:58:22Z</dc:date>
    </item>
    <item>
      <title>It is always a good idea to</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108795#M24242</link>
      <description>&lt;P&gt;It is always a good idea to have some idea of how long a computation *should* take....&amp;nbsp; Part of that requires that we know what hardware is being used, which still has not been specified.&lt;/P&gt;

&lt;P&gt;As noted above, the 256x256 matrix multiplying a 256x1 vector is a DGEMV operation.&amp;nbsp;&amp;nbsp; Although it is a valid DGEMM operation, the DGEMM code is heavily optimized for the data re-use that is available with multiple columns in both arguments, so it is probably not the most efficient way of handling this case -- perhaps by a fairly large ratio.&amp;nbsp;&lt;/P&gt;
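
&lt;P&gt;As a minimal sketch of that substitution, assuming the same 256x256 test data as the simple test program earlier in the thread (the program name is illustrative, and this is not a tuned or measured implementation), the DGEMM call can be written as a DGEMV call:&lt;/P&gt;

&lt;PRE&gt;
program dgemv_sketch
  implicit none
  integer, parameter :: n = 256
  double precision, allocatable :: a(:,:), v(:), v2(:)
  integer :: i, j
  external :: dgemv

  allocate(a(n,n), v(n), v2(n))

  ! same fill pattern as the test program in the thread
  do j = 1, n
    v(j) = j*2.0d0
    do i = 1, n
      a(i,j) = j*1.0d0
    end do
  end do

  ! v2 := 1.0*a*v + 0.0*v2, the same product as
  ! DGEMM("N","N",n,1,n,1.d0,a,n,v,n,0.d0,v2,n)
  call dgemv('N', n, n, 1.0d0, a, n, v, 1, 0.0d0, v2, 1)

  print *, 'v2(1) =', v2(1)
end program dgemv_sketch
&lt;/PRE&gt;

&lt;P&gt;The last two increments (1 and 1) are the strides of the input and output vectors. The program has to be linked against a BLAS library such as MKL.&lt;/P&gt;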

&lt;P&gt;It is possible to estimate how long a DGEMV of this size should run....&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;The 256x256 array of "real*8" variables occupies 512 KiB, which means that it is larger than L2, but smaller than L3.&amp;nbsp;&amp;nbsp;&lt;/LI&gt;
	&lt;LI&gt;But in the OpenMP version above, each of the 8 threads will only access 1/8 of the array in order to update their portion of the output.&amp;nbsp;
		&lt;UL&gt;
			&lt;LI&gt;This means that each thread will only access 512/8=64 KiB of data from the "a" array, which should fit easily into the 256 KiB private L2 cache of most recent Intel processors.&amp;nbsp;&lt;/LI&gt;
		&lt;/UL&gt;
	&lt;/LI&gt;
	&lt;LI&gt;The "v" and "v2" arrays are only 2 KiB each (256 x 8 bytes), so they should stay in the L1 Data Cache.&lt;/LI&gt;
&lt;/UL&gt;

&lt;P&gt;On Sandy Bridge and Ivy Bridge cores, I typically measure L2 bandwidth at about 14 Bytes/cycle per core.&amp;nbsp; With 8 threads, each thread will need to read 64 KiB (1/8 of the rows of array "a") each iteration, so 200,000 iterations means that each thread must read 13.1e9 bytes.&amp;nbsp; At 14 Bytes/cycle, this is just under 1 billion cycles, which is less than 1/2 second on a system running at 2 GHz or faster.&amp;nbsp;&amp;nbsp;&amp;nbsp; Haswell cores have about twice the L2 bandwidth, so should take about half as long.&lt;/P&gt;
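
&lt;P&gt;Written out step by step, as a rough sketch that takes the 14 Bytes/cycle L2 bandwidth and the 2 GHz clock from the paragraph above as assumptions (actual hardware will differ):&lt;/P&gt;

&lt;P&gt;\[ \text{bytes per thread} = 200{,}000 \times 64 \times 1024\ \text{B} \approx 1.31 \times 10^{10}\ \text{B} \]&lt;/P&gt;

&lt;P&gt;\[ \text{cycles} \approx \frac{1.31 \times 10^{10}\ \text{B}}{14\ \text{B/cycle}} \approx 9.4 \times 10^{8} \]&lt;/P&gt;

&lt;P&gt;\[ t \approx \frac{9.4 \times 10^{8}\ \text{cycles}}{2 \times 10^{9}\ \text{cycles/s}} \approx 0.47\ \text{s} \]&lt;/P&gt;

&lt;P&gt;Both the roughly 7 seconds reported for the simple DGEMM test and the timings reported for the real solver are well above this estimate.&lt;/P&gt;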

</description>
      <pubDate>Mon, 28 Dec 2015 19:01:52 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108795#M24242</guid>
      <dc:creator>McCalpinJohn</dc:creator>
      <dc:date>2015-12-28T19:01:52Z</dc:date>
    </item>
    <item>
      <title>As you're using ifort, if you</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108796#M24243</link>
      <description>&lt;P&gt;As you're using ifort, if you don't care to think about which BLAS function fits your matrix multiplication requirements, you can use the Fortran intrinsic MATMUL with the compile option -opt-matmul (implied by -O3) to have the choice of library entry points made by the compiler.&lt;/P&gt;
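
&lt;P&gt;A minimal sketch of the MATMUL form, assuming the same 256x256 test data as the test program earlier in the thread (the program name is illustrative, and whether the compiler actually maps MATMUL to a library routine depends on the compiler version and options):&lt;/P&gt;

&lt;PRE&gt;
program matmul_sketch
  implicit none
  integer, parameter :: n = 256
  double precision, allocatable :: a(:,:), v(:), v2(:)
  integer :: j

  allocate(a(n,n), v(n), v2(n))

  ! same fill pattern as the test program in the thread
  do j = 1, n
    v(j) = j*2.0d0
    a(:,j) = j*1.0d0
  end do

  ! matrix-vector product via the Fortran intrinsic;
  ! the compiler decides how to implement it
  v2 = matmul(a, v)

  print *, 'v2(1) =', v2(1)
end program matmul_sketch
&lt;/PRE&gt;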

&lt;P&gt;I guess the corresponding option in gfortran will never call dgemv, so there may be a case for choosing it explicitly.&lt;/P&gt;

&lt;P&gt;When I saw this thread, I assumed a choice of dgemm on account of having multiple vectors (stored as a matrix) to process at one time.&lt;/P&gt;</description>
      <pubDate>Mon, 28 Dec 2015 21:17:24 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/Question-about-DGEMM/m-p/1108796#M24243</guid>
      <dc:creator>TimP</dc:creator>
      <dc:date>2015-12-28T21:17:24Z</dc:date>
    </item>
  </channel>
</rss>

