<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Hi Gennady, in Intel® oneAPI Math Kernel Library</title>
    <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132567#M25740</link>
    <description>&lt;P&gt;Hi Gennady,&lt;/P&gt;

&lt;P&gt;It's 64-bit.&lt;/P&gt;

&lt;P&gt;The issue is 100% reproducible when running our regression testing, but I expect it will likely be very difficult to reproduce in a test example. As I noted, changing the code to use zgemm makes the issue go away. Change it back to zgemm3m, and the issue returns.&lt;/P&gt;

&lt;P&gt;If I break while the problematic code is running, I see only one active thread, stopped in mkl_avx.dll. The other OpenMP threads are present but sleeping. I don't see any problems like this when calling other BLAS/LAPACK functions.&lt;/P&gt;

&lt;P&gt;Andrew&lt;/P&gt;</description>
    <pubDate>Wed, 27 Sep 2017 18:54:07 GMT</pubDate>
    <dc:creator>AndrewC</dc:creator>
    <dc:date>2017-09-27T18:54:07Z</dc:date>
    <item>
      <title>zgemm3m using 1 thread ( MKL 2017 and 2018)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132564#M25737</link>
      <description>&lt;P&gt;I am seeing a performance regression with zgemm3m in MKL 2017/2018.&lt;/P&gt;

&lt;P&gt;In some cases, zgemm3m appears to use only 1 thread (with a negative impact on elapsed time) despite the matrix being 'large'.&lt;/P&gt;

&lt;P&gt;This behaviour appeared in MKL 2017 and MKL 2018 but &lt;STRONG&gt;is not in &lt;/STRONG&gt;MKL 2015.&lt;/P&gt;

&lt;P&gt;The call to zgemm3m takes two 4122x4122 double complex matrices. The machine is a 4-core Xeon with HT running Windows 7.&lt;/P&gt;

&lt;P&gt;&lt;SPAN class="parmname"&gt;transa=transb='N', m=n=k=4122. lda=4122,ldb=4122,alpha=1,beta=0,ldc=4122&lt;/SPAN&gt;&lt;/P&gt;

&lt;P&gt;We are essentially looping and calling zgemm3m with the same dimensions and matrix structure each time through the loop.&lt;/P&gt;

&lt;P&gt;The loop is not OpenMP-parallelized; it runs in the "main" thread.&lt;/P&gt;

&lt;P&gt;The first time through the loop, zgemm3m uses all cores.&lt;/P&gt;

&lt;P&gt;The second time through the loop, zgemm3m uses only one core (and runs MUCH slower than the first call).&lt;/P&gt;
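
&lt;P&gt;One way to put numbers on that slowdown is to time each pass through the loop separately. The following is only a minimal sketch: time_each_call is a hypothetical helper, not part of MKL or of the original code, and the callable stands in for the real zgemm3m call with fixed m = n = k.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#include &amp;lt;chrono&amp;gt;
#include &amp;lt;functional&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical helper (not part of MKL): runs `call` `repeats` times and
// returns the wall-clock duration of each run in seconds.
std::vector&amp;lt;double&amp;gt; time_each_call(const std::function&amp;lt;void()&amp;gt;&amp;amp; call, int repeats) {
    std::vector&amp;lt;double&amp;gt; seconds;
    for (int i = 0; i &amp;lt; repeats; ++i) {
        const auto start = std::chrono::steady_clock::now();
        call();  // stand-in for the zgemm3m call under investigation
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration&amp;lt;double&amp;gt;(stop - start).count());
    }
    return seconds;
}&lt;/PRE&gt;

&lt;P&gt;Comparing the first entry of the returned vector with the later ones would show whether only the second and subsequent calls are slow.&lt;/P&gt;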

&lt;P&gt;It's very obvious in the debugger that zgemm3m is not using multiple threads the second time it is called. I tried to 'force' the correct number of threads before the call, with no change in behaviour.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;		// Read back the thread counts MKL and OpenMP report, then set them explicitly.
		int numThreads = MKL_Get_Max_Threads();
		cout &amp;lt;&amp;lt; "MKL Threads " &amp;lt;&amp;lt; numThreads &amp;lt;&amp;lt; endl;
		MKL_Set_Num_Threads(numThreads);
		int numOMPThreads = omp_get_max_threads();
		cout &amp;lt;&amp;lt; "OMP Threads " &amp;lt;&amp;lt; numOMPThreads &amp;lt;&amp;lt; endl;
		omp_set_num_threads(numOMPThreads);
		// Disable MKL's dynamic adjustment of the thread count.
		mkl_set_dynamic(false);
		zgemm3m(....)
&lt;/PRE&gt;

&lt;P&gt;The output of the code above, which tries to force the expected behaviour, is always:&lt;/P&gt;

&lt;P&gt;MKL Threads 4&lt;BR /&gt;
	OMP Threads 8&lt;/P&gt;

&lt;P&gt;What would cause zgemm3m to "turn off" threading?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;P&gt;Andrew&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 16:21:11 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132564#M25737</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-27T16:21:11Z</dc:date>
    </item>
    <item>
      <title>Interesting , if I switch to</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132565#M25738</link>
      <description>&lt;P&gt;Interesting: if I switch to &lt;STRONG&gt;zgemm&lt;/STRONG&gt;, the observed problem goes away. Also note that I do have MKL_DIRECT=1 set.&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 16:46:58 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132565#M25738</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-27T16:46:58Z</dc:date>
    </item>
    <item>
      <title>Andrew, we didn't change the</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132566#M25739</link>
      <description>&lt;P&gt;Andrew,&lt;/P&gt;

&lt;P&gt;We did not change the threading behavior of this routine. We need to check the problem on our side. Is this 64-bit code?&lt;/P&gt;

&lt;P&gt;--Gennady&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 18:18:00 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132566#M25739</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2017-09-27T18:18:00Z</dc:date>
    </item>
    <item>
      <title>Hi Gennady,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132567#M25740</link>
      <description>&lt;P&gt;Hi Gennady,&lt;/P&gt;

&lt;P&gt;It's 64-bit.&lt;/P&gt;

&lt;P&gt;The issue is 100% reproducible when running our regression testing, but I expect it will likely be very difficult to reproduce in a test example. As I noted, changing the code to use zgemm makes the issue go away. Change it back to zgemm3m, and the issue returns.&lt;/P&gt;

&lt;P&gt;If I break while the problematic code is running, I see only one active thread, stopped in mkl_avx.dll. The other OpenMP threads are present but sleeping. I don't see any problems like this when calling other BLAS/LAPACK functions.&lt;/P&gt;

&lt;P&gt;Andrew&lt;/P&gt;</description>
      <pubDate>Wed, 27 Sep 2017 18:54:07 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132567#M25740</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-27T18:54:07Z</dc:date>
    </item>
    <item>
      <title>Ok, Thanks Andrew.</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132568#M25741</link>
      <description>&lt;P&gt;Ok, Thanks Andrew.&lt;/P&gt;

&lt;P&gt;1. I am not sure I understand the reason for using the /DMKL_DIRECT_CALL option for such problem sizes. Could you try building without this option and then set MKL_VERBOSE to check how many threads zgemm3m uses?&lt;/P&gt;

&lt;P&gt;2. You said MKL 2015; it seems you mean MKL v11.3. Could you please look at the mkl_version.h file and let me know the exact version from there?&lt;/P&gt;

&lt;P&gt;regards, Gennady&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 07:30:35 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132568#M25741</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2017-09-28T07:30:35Z</dc:date>
    </item>
    <item>
      <title>We use MKL_DIRECT=1 in our</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132569#M25742</link>
      <description>&lt;P&gt;We use MKL_DIRECT=1 in our code because problem sizes vary from 4x4 matrices to 11,000x11,000 matrices. When I say MKL 2015, I mean the version shipped with Intel Parallel Studio 2015.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 14:37:09 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132569#M25742</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-28T14:37:09Z</dc:date>
    </item>
    <item>
      <title>Gennady</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132570#M25743</link>
      <description>&lt;P&gt;Gennady&lt;/P&gt;

&lt;P&gt;Here are some interesting results&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;With MKL_DIRECT=1 and MKL_VERBOSE=1 (MKL_DIRECT_CALL_SEQ is not defined):&lt;/STRONG&gt;&lt;/P&gt;

&lt;P&gt;I do not see any output from calls to 'zgemm3m' (though I do see output from some other MKL routines). I assume this means zgemm3m_direct does not print anything with MKL_VERBOSE=1.&lt;/P&gt;

&lt;P&gt;&lt;STRONG&gt;With MKL_DIRECT undefined, MKL_VERBOSE=1&lt;/STRONG&gt;&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;The issue I was seeing goes away (all threads are used in all calls to zgemm3m).&lt;/LI&gt;
	&lt;LI&gt;Sample output below&lt;/LI&gt;
&lt;/UL&gt;

&lt;PRE class="brush:;"&gt;MKL_VERBOSE ZGEMM3M(N,N,4122,4122,4122,000000000012E150,00000000A6740040,4122,0000000160700040,4122,000000000012E1B8,0000000140060040,4122) 5.08s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4 WDiv:HOST:+0.000&lt;/PRE&gt;

&lt;P&gt;So my conclusion would be that 'something' in zgemm3m_direct turns off threading even for large matrices, but not always?&lt;/P&gt;

&lt;P&gt;The obvious workaround is to turn off MKL_DIRECT; the small loss of performance in some cases is acceptable.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 16:20:45 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132570#M25743</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-28T16:20:45Z</dc:date>
    </item>
    <item>
      <title>You are right, MKL_VERBOSE</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132571#M25744</link>
      <description>&lt;P&gt;You are right, MKL_VERBOSE does not work when MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ is defined.&lt;/P&gt;

&lt;P&gt;MKL_DIRECT_CALL_SEQ tells MKL to run sequentially. If you have large matrices where threading can help, then you need to define MKL_DIRECT_CALL only. If MKL_DIRECT_CALL_SEQ is also defined, MKL will run all GEMMs in a single thread.&lt;/P&gt;

&lt;P&gt;Judging by the DLL name above, you saw this on a 4-core Windows AVX system. Is this correct?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 18:29:10 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132571#M25744</guid>
      <dc:creator>Murat_G_Intel</dc:creator>
      <dc:date>2017-09-28T18:29:10Z</dc:date>
    </item>
    <item>
      <title>Processor   Intel(R) Xeon(R)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132572#M25745</link>
      <description>&lt;P&gt;Processor&amp;nbsp;&amp;nbsp; Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz, 3701 Mhz, 4 Core(s), 8 Logical Processor(s)&lt;/P&gt;

&lt;P&gt;Just to be clear:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;I do not #define MKL_DIRECT_CALL_SEQ.&lt;/LI&gt;
	&lt;LI&gt;The issue is that the threading behaviour of zgemm3m(_direct) seems to change while the program is running, even though it is passed the same matrix structure (square, 4122x4122). Looking at mkl_direct_call.h, I understand that mkl_direct_call_flag is passed as either 1 or 0 to indicate sequential or parallel operation, but I can't see any issue there, as it's a local variable.&lt;/LI&gt;
&lt;/UL&gt;
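
&lt;P&gt;To illustrate why the flag itself looks innocent, here is a minimal sketch of a flag chosen purely by the preprocessor. The names kDirectCallFlag and direct_call_flag are hypothetical and this is not the actual content of mkl_direct_call.h; the mapping 1 = sequential, 0 = parallel is an assumption based on the ordering above.&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;// Hypothetical stand-in for the selection logic described above; the real
// definitions live in mkl_direct_call.h and may differ in detail.
// Assumption: 1 = sequential, 0 = parallel.
#if defined(MKL_DIRECT_CALL_SEQ)
constexpr int kDirectCallFlag = 1;
#else
constexpr int kDirectCallFlag = 0;
#endif

// Hypothetical wrapper: every call site sees the same compile-time constant,
// so the flag cannot differ between the first and second pass of a loop.
constexpr int direct_call_flag() { return kDirectCallFlag; }

static_assert(direct_call_flag() == 0 || direct_call_flag() == 1,
              "flag is a fixed compile-time value");&lt;/PRE&gt;

&lt;P&gt;If the real header behaves like this sketch, the change in threading between calls would have to come from somewhere other than the flag.&lt;/P&gt;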

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 19:00:16 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132572#M25745</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-28T19:00:16Z</dc:date>
    </item>
    <item>
      <title>I see, you observe this issue</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132573#M25746</link>
      <description>&lt;P&gt;I see, you observe this issue when you define MKL_DIRECT_CALL only? Yes, this is a local variable and its value only depends on whether MKL_DIRECT_CALL_SEQ or MKL_DIRECT_CALL is defined. The value should remain the same for each call.&lt;/P&gt;

&lt;P&gt;You only observe 1-thread execution when MKL_DIRECT_CALL is defined. If you undefine it, everything works as expected, is this correct? And, zgemm doesn't suffer from the same problem, right?&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 28 Sep 2017 19:46:15 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132573#M25746</guid>
      <dc:creator>Murat_G_Intel</dc:creator>
      <dc:date>2017-09-28T19:46:15Z</dc:date>
    </item>
    <item>
      <title>Quote:Murat Efe Guney (Intel)</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132574#M25747</link>
      <description>&lt;BLOCKQUOTE&gt;Murat Efe Guney (Intel) wrote:&lt;BR /&gt;

&lt;P&gt;I see, you observe this issue when you define MKL_DIRECT_CALL only? Yes, this is a local variable and its value only depends on whether MKL_DIRECT_CALL_SEQ or MKL_DIRECT_CALL is defined. The value should remain the same for each call.&lt;/P&gt;

&lt;P&gt;You only observe 1-thread execution when MKL_DIRECT_CALL is defined. If you undefine it, everything works as expected, is this correct? And, zgemm doesn't suffer from the same problem, right?&lt;/P&gt;

&lt;/BLOCKQUOTE&gt;

&lt;P&gt;Correct. There are two separate workarounds:&lt;/P&gt;

&lt;UL&gt;
	&lt;LI&gt;Undefine MKL_DIRECT&lt;/LI&gt;
	&lt;LI&gt;Replace zgemm3m with zgemm&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 28 Sep 2017 21:19:37 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132574#M25747</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-28T21:19:37Z</dc:date>
    </item>
    <item>
      <title>Andrew,</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132575#M25748</link>
      <description>&lt;P&gt;Andrew,&lt;/P&gt;

&lt;P&gt;Do you have enough free RAM available on the system when you run this case? We are asking because MKL allocates a memory pool whose size depends on the number of threads. For example, for your case (zgemm3m, 4122x4122), MKL 2018 allocates the following (this is easy to check with the mkl_mem_stat() routine):&lt;/P&gt;

&lt;P&gt;1 thr:&amp;nbsp; 883.356850 MB or 926266792 bytes in 7 buffers&lt;/P&gt;

&lt;P&gt;2 thr:&amp;nbsp; 894.398720 MB or 937845032 bytes in 11 buffers&lt;/P&gt;

&lt;P&gt;4 thr:&amp;nbsp; 916.482460 MB or 961001512 bytes in 19 buffers&lt;/P&gt;

&lt;P&gt;8 thr: &amp;nbsp;960.649940 MB or 1007314472 bytes in 35 buffers&lt;/P&gt;</description>
      <pubDate>Fri, 29 Sep 2017 07:42:03 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132575#M25748</guid>
      <dc:creator>Gennady_F_Intel</dc:creator>
      <dc:date>2017-09-29T07:42:03Z</dc:date>
    </item>
    <item>
      <title>Many gigabytes of free RAM</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132576#M25749</link>
      <description>&lt;P&gt;Many gigabytes of free RAM&lt;/P&gt;</description>
      <pubDate>Fri, 29 Sep 2017 22:23:06 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132576#M25749</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-09-29T22:23:06Z</dc:date>
    </item>
    <item>
      <title>The only way I found around</title>
      <link>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132577#M25750</link>
      <description>&lt;P&gt;The only way I found around this issue was to change my code that calls zgemm3m by "expanding" the zgemm3m macro myself and making sure that the 'real' zgemm3m is called, not zgemm3m_direct.&lt;/P&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;

&lt;PRE class="brush:cpp;"&gt;#ifdef MKL_DIRECT_CALL
#undef zgemm3m
		if (MKL_DC_GEMM3M_CHECKSIZE(&amp;amp;m, &amp;amp;n, &amp;amp;k)) {
			mkl_dc_zgemm((char *)&amp;amp;transa, (char *)&amp;amp;transb, (int *)&amp;amp;m, (int *)&amp;amp;n, (int *)&amp;amp;k, (MKL_Complex16 *)&amp;amp;alpha, (MKL_Complex16 *)a, (int *)&amp;amp;lda, (MKL_Complex16 *)b, (int *)&amp;amp;ldb, (MKL_Complex16 *)&amp;amp;beta, (MKL_Complex16 *)c, (int *)&amp;amp;ldc);
		}
		else {
			zgemm3m((char *)&amp;amp;transa, (char *)&amp;amp;transb, (int *)&amp;amp;m, (int *)&amp;amp;n, (int *)&amp;amp;k, (MKL_Complex16 *)&amp;amp;alpha, (MKL_Complex16 *)a, (int *)&amp;amp;lda, (MKL_Complex16 *)b, (int *)&amp;amp;ldb, (MKL_Complex16 *)&amp;amp;beta, (MKL_Complex16 *)c, (int *)&amp;amp;ldc);
		}
#else
		zgemm3m((char *)&amp;amp;transa, (char *)&amp;amp;transb, (int *)&amp;amp;m, (int *)&amp;amp;n, (int *)&amp;amp;k, (MKL_Complex16 *)&amp;amp;alpha, (MKL_Complex16 *)a, (int *)&amp;amp;lda, (MKL_Complex16 *)b, (int *)&amp;amp;ldb, (MKL_Complex16 *)&amp;amp;beta, (MKL_Complex16 *)c, (int *)&amp;amp;ldc);
#endif&lt;/PRE&gt;

&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 19 Oct 2017 17:54:40 GMT</pubDate>
      <guid>https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/zgemm3m-using-1-thread-MKL-2017-and-2018/m-p/1132577#M25750</guid>
      <dc:creator>AndrewC</dc:creator>
      <dc:date>2017-10-19T17:54:40Z</dc:date>
    </item>
  </channel>
</rss>

