topic Hi Gennady, in Intel® oneAPI Math Kernel Library

zgemm3m using 1 thread (MKL 2017 and 2018)

AndrewC — Wed, 27 Sep 2017 16:21:11 GMT

I am seeing a performance regression in zgemm3m with MKL 2017/2018.

In some cases, zgemm3m appears to use only 1 thread (with a negative impact on elapsed time) despite the matrix being 'large'.

This behaviour appeared in MKL 2017 and MKL 2018 but is not present in MKL 2015.

The call to zgemm3m takes two 4122x4122 double complex matrices. Windows 7 4 Core Xeon machine with HT.

transa=transb='N', m=n=k=4122. lda=4122,ldb=4122,alpha=1,beta=0,ldc=4122

We are essentially looping and calling zgemm3m with the same dimensions and matrix structure each time through the loop.

The loop is not OpenMP parallelized. Running in the "main" thread.

First time through the loop, zgemm3m uses all cores.

Second time through the loop, zgemm3m uses only one core (and runs MUCH slower than the first call).

It's very obvious in the debugger that zgemm3m is not using multiple threads the second time it is called. I tried to 'force' the correct # of threads before the call, with no change in behaviour.

		int numThreads = MKL_Get_Max_Threads();
		cout << "MKL Threads " << numThreads << endl;
		MKL_Set_Num_Threads(numThreads);
		int numOMPThreads = omp_get_max_threads();
		cout << "OMP Threads " << numOMPThreads << endl;
		omp_set_num_threads(numOMPThreads);
		mkl_set_dynamic(false);
		zgemm3m(....)

The output of above code trying to force the expected behaviour is always

MKL Threads 4
OMP Threads 8
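The slowdown between the first and later loop iterations could be made measurable with a small timing harness. The following is only a sketch: `time_call` and `slow_iterations` are hypothetical helpers (not MKL APIs), and the lambda passed to `time_call` would wrap the actual zgemm3m call.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Time a single call of `work` and return the elapsed wall-clock seconds.
template <typename F>
double time_call(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Return the indices of iterations that took more than `factor` times as
// long as the first iteration. Hypothetical helper for illustration only.
std::vector<std::size_t> slow_iterations(const std::vector<double>& seconds,
                                         double factor) {
    std::vector<std::size_t> slow;
    for (std::size_t i = 1; i < seconds.size(); ++i) {
        if (seconds[i] > factor * seconds[0]) {
            slow.push_back(i);
        }
    }
    return slow;
}
```

In the loop, each iteration would push `time_call([&] { /* zgemm3m call */ })` onto a vector, and `slow_iterations(times, 2.0)` would then list the iterations that look single-threaded.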

What would cause zgemm3m to "turn off" threading?

Andrew

AndrewC — Wed, 27 Sep 2017 16:46:58 GMT

Interestingly, if I switch to zgemm, the observed problem goes away. Also note that I do have MKL_DIRECT=1 set.

Gennady_F_Intel — Wed, 27 Sep 2017 18:18:00 GMT

Andrew,

We did not change the threading behavior of this routine. We need to check the problem on our side. Is this 64-bit code?

--Gennady

AndrewC — Wed, 27 Sep 2017 18:54:07 GMT

Hi Gennady,

It's 64-bit.

The issue is 100% reproducible when running our regression testing, but I expect, of course, that it will be very difficult to reproduce in a test example. As I noted, changing the code to use zgemm causes the issue to go away. Change it back to zgemm3m, and the issue returns.

If I break when the problematic code is running, I see only one active thread, stopped in mkl_avx.dll. The other OMP threads are present but sleeping. I don't see any other problems like this when calling other BLAS/LAPACK functions.

Andrew

Gennady_F_Intel — Thu, 28 Sep 2017 07:30:35 GMT

Ok, Thanks Andrew.

1. I am not sure I understand the reason for using the /DMKL_DIRECT_CALL option for such problem sizes. Could you try without this option and then set MKL_VERBOSE to check how many threads are used by zgemm3m?

2. You said MKL 2015; it seems you mean MKL v11.3. Could you please look at the mkl_version.h file and let me know the exact version from there?

regards, Gennady

AndrewC — Thu, 28 Sep 2017 14:37:09 GMT

We use MKL_DIRECT=1 in our code because problem sizes vary from 4x4 matrices to 11,000x11,000 matrices. When I say MKL 2015, I mean the version shipped with Intel Parallel Studio 2015.

AndrewC — Thu, 28 Sep 2017 16:20:45 GMT

Gennady

Here are some interesting results

With MKL_DIRECT=1 and MKL_VERBOSE=1 (MKL_DIRECT_CALL_SEQ is not defined):

I do not see any output from calls to 'zgemm3m' (though I do see output from some other MKL routines). I am assuming this means zgemm3m_direct does not print anything with MKL_VERBOSE=1.

With MKL_DIRECT undefined and MKL_VERBOSE=1:

The issue I was seeing goes away (all threads are used in all calls to zgemm3m).
Sample output below:

MKL_VERBOSE ZGEMM3M(N,N,4122,4122,4122,000000000012E150,00000000A6740040,4122,0000000160700040,4122,000000000012E1B8,0000000140060040,4122) 5.08s CNR:OFF Dyn:1 FastMM:1 TID:0  NThr:4 WDiv:HOST:+0.000
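When there are many such lines in a log, the thread count could be pulled out programmatically. The sketch below assumes only the `NThr:<number>` field layout seen in the sample output above; `parse_nthr` is a hypothetical helper, not part of MKL.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Extract the integer that follows "NThr:" in an MKL_VERBOSE output line.
// Returns -1 if the field is missing or is not followed by a number.
int parse_nthr(const std::string& line) {
    const std::string key = "NThr:";
    const std::size_t pos = line.find(key);
    if (pos == std::string::npos) {
        return -1;
    }
    const char* begin = line.c_str() + pos + key.size();
    char* end = nullptr;
    const long value = std::strtol(begin, &end, 10);
    if (end == begin) {
        return -1;  // no digits after the key
    }
    return static_cast<int>(value);
}
```

Any ZGEMM3M line for a large matrix where this returns 1 would point at the call that ran single-threaded.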

So my conclusion would be that there is 'something' in zgemm3m_direct that turns off threading even for large matrices, but not always?

Obviously the workaround is to turn off MKL_DIRECT. This is acceptable, at the cost of a small loss of performance in some cases.

Murat_G_Intel — Thu, 28 Sep 2017 18:29:10 GMT

You are right, MKL_VERBOSE does not work when MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ is defined.

MKL_DIRECT_CALL_SEQ tells MKL to run sequentially. If you have large matrices where threading can help, then you need to define MKL_DIRECT_CALL only. If MKL_DIRECT_CALL_SEQ is also defined, then MKL will run all GEMMs in a single thread.

Judging by the DLL name above, you saw this on a 4-core Windows AVX system, is this correct?

AndrewC — Thu, 28 Sep 2017 19:00:16 GMT

Processor Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz, 3701 Mhz, 4 Core(s), 8 Logical Processor(s)

Just to be clear

I do not #define MKL_DIRECT_CALL_SEQ
And the issue is that the threading behaviour of zgemm3m(_direct) seems to change during the execution of a running program when passed the same matrix structure (square 4122x4122). Looking at mkl_direct_call.h, I understand that mkl_direct_call_flag is passed as either 1 or 0 to indicate sequential or parallel operation, but I can't see any issue there as it's a local variable.

Murat_G_Intel — Thu, 28 Sep 2017 19:46:15 GMT

I see, you observe this issue when you define MKL_DIRECT_CALL only? Yes, this is a local variable and its value only depends on whether MKL_DIRECT_CALL_SEQ or MKL_DIRECT_CALL is defined. The value should remain the same for each call.

You only observe 1-thread execution when MKL_DIRECT_CALL is defined. If you undefine it, everything works as expected, is this correct? And, zgemm doesn't suffer from the same problem, right?

AndrewC — Thu, 28 Sep 2017 21:19:37 GMT

Murat Efe Guney (Intel) wrote:

I see, you observe this issue when you define MKL_DIRECT_CALL only? Yes, this is a local variable and its value only depends on whether MKL_DIRECT_CALL_SEQ or MKL_DIRECT_CALL is defined. The value should remain the same for each call.

You only observe 1-thread execution when MKL_DIRECT_CALL is defined. If you undefine it, everything works as expected, is this correct? And, zgemm doesn't suffer from the same problem, right?

Correct. There are two separate workarounds:

- #undef MKL_DIRECT_CALL

- Replace zgemm3m by zgemm

Gennady_F_Intel — Fri, 29 Sep 2017 07:42:03 GMT

Andrew,

Do you have enough free RAM available on the system when you execute this case? We are asking because MKL allocates a different memory pool depending on the number of threads. For example, for your specific case (zgemm3m, 4122x4122),

MKL 2018 allocates the following (this is easy to check using the mkl_mem_stat() routine):

1 thr: 883.356850 MB or 926266792 bytes in 7 buffers

2 thr: 894.398720 MB or 937845032 bytes in 11 buffers

4 thr: 916.482460 MB or 961001512 bytes in 19 buffers

8 thr: 960.649940 MB or 1007314472 bytes in 35 buffers
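As a quick check on how these figures scale, the arithmetic below works only from the four rows posted above. It is a sketch on the reported numbers, not a description of how MKL sizes its buffers; `MemStat`, `kStats`, and the two helper functions are names introduced here for illustration.

```cpp
#include <cstdint>

// One row of the posted mkl_mem_stat() figures.
struct MemStat {
    int threads;
    std::int64_t bytes;
    int buffers;
};

// Values copied from the post above (zgemm3m, 4122x4122, MKL 2018).
constexpr MemStat kStats[] = {
    {1, 926266792, 7},
    {2, 937845032, 11},
    {4, 961001512, 19},
    {8, 1007314472, 35},
};

// Additional bytes per extra thread between two rows.
constexpr std::int64_t bytes_per_thread(const MemStat& a, const MemStat& b) {
    return (b.bytes - a.bytes) / (b.threads - a.threads);
}

// Additional buffers per extra thread between two rows.
constexpr int buffers_per_thread(const MemStat& a, const MemStat& b) {
    return (b.buffers - a.buffers) / (b.threads - a.threads);
}
```

For every adjacent pair of rows this gives 11,578,240 bytes (roughly 11 MB) and 4 buffers per additional thread, so the reported footprint grows linearly and only by about 9% between 1 and 8 threads.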

AndrewC — Fri, 29 Sep 2017 22:23:06 GMT

Many gigabytes of free RAM

AndrewC — Thu, 19 Oct 2017 17:54:40 GMT

The only way I found around this issue was to change my code that calls zgemm3m by "expanding" the zgemm3m macro myself and making sure that the 'real' zgemm3m is called, not zgemm3m_direct:

#ifdef MKL_DIRECT_CALL
#undef zgemm3m
		if (MKL_DC_GEMM3M_CHECKSIZE(&m, &n, &k)) {
			mkl_dc_zgemm((char *)&transa, (char *)&transb, (int *)&m, (int *)&n, (int *)&k, (MKL_Complex16 *)&alpha, (MKL_Complex16 *)a, (int *)&lda, (MKL_Complex16 *)b, (int *)&ldb, (MKL_Complex16 *)&beta, (MKL_Complex16 *)c, (int *)&ldc);
		}
		else {
			zgemm3m((char *)&transa, (char *)&transb, (int *)&m, (int *)&n, (int *)&k, (MKL_Complex16 *)&alpha, (MKL_Complex16 *)a, (int *)&lda, (MKL_Complex16 *)b, (int *)&ldb, (MKL_Complex16 *)&beta, (MKL_Complex16 *)c, (int *)&ldc);
		}
#else
		zgemm3m((char *)&transa, (char *)&transb, (int *)&m, (int *)&n, (int *)&k, (MKL_Complex16 *)&alpha, (MKL_Complex16 *)a, (int *)&lda, (MKL_Complex16 *)b, (int *)&ldb, (MKL_Complex16 *)&beta, (MKL_Complex16 *)c, (int *)&ldc);
#endif
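The workaround above relies on ordinary preprocessor behaviour: once a macro that shadows a function name is removed with #undef, the same spelling resolves to the real function again. The toy sketch below illustrates only that mechanism. The names `gemm` and `gemm_direct` and the flag argument are hypothetical stand-ins, and nothing here reproduces the actual contents of mkl_direct_call.h.

```cpp
#include <string>

// Stand-in for the "real" library routine (hypothetical).
std::string gemm(int n) {
    return "real gemm, n=" + std::to_string(n);
}

// Stand-in for a direct-call wrapper that takes a sequential/parallel flag
// (hypothetical).
std::string gemm_direct(int n, int seq_flag) {
    return "direct gemm, n=" + std::to_string(n) +
           ", seq=" + std::to_string(seq_flag);
}

// A "direct call" style header redirects the public name through a macro.
#define gemm(n) gemm_direct((n), 0)

// While the macro is active, this call expands to gemm_direct(...).
std::string via_macro(int n) { return gemm(n); }

// Removing the macro makes the same spelling reach the real function.
#undef gemm

std::string via_function(int n) { return gemm(n); }
```

Here `via_macro(4)` goes through the wrapper while `via_function(4)` calls the real routine, which mirrors the effect of the `#undef zgemm3m` line in the workaround.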