I am seeing a performance regression with MKL 2017/2018 and zgemm3m.
In some cases zgemm3m appears to be using only one thread (with a negative impact on elapsed time), despite the matrix being 'large'.
This behaviour appeared in MKL 2017 and MKL 2018 but is not present in MKL 2015.
The call to zgemm3m takes two 4122x4122 double-complex matrices, on a Windows 7 4-core Xeon machine with HT.
transa=transb='N', m=n=k=4122, lda=ldb=ldc=4122, alpha=1, beta=0.
We are essentially looping and calling zgemm3m with the same dimensions and matrix structure each time through the loop.
The loop is not OpenMP parallelized and runs in the "main" thread.
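For reference, a minimal sketch of the calling pattern (the sizes and leading dimensions are the values listed above; the allocation and fill step are placeholders, not our actual code):

#include <mkl.h>
#include <vector>

// Sketch of the loop described above: one zgemm3m call per iteration,
// always 4122x4122, no OpenMP around the loop itself.
void run_loop(int iterations)
{
    MKL_INT n = 4122;
    std::vector<MKL_Complex16> a(n * n), b(n * n), c(n * n);
    MKL_Complex16 alpha = {1.0, 0.0}, beta = {0.0, 0.0};
    char trans = 'N';

    for (int i = 0; i < iterations; ++i)
    {
        // ... refill a and b here (placeholder for the real data) ...
        zgemm3m(&trans, &trans, &n, &n, &n,
                &alpha, a.data(), &n,
                b.data(), &n,
                &beta, c.data(), &n);
    }
}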
The first time through the loop, zgemm3m uses all cores.
The second time through the loop, zgemm3m uses only one core (and runs MUCH slower than the first call).
It is very obvious in the debugger that zgemm3m is not using multiple threads the second time it is called. I tried to 'force' the correct number of threads before the call, with no change in behaviour:
int numThreads = MKL_Get_Max_Threads();
cout << "MKL Threads " << numThreads << endl;
MKL_Set_Num_Threads(numThreads);
int numOMPThreads = omp_get_max_threads();
cout << "OMP Threads " << numOMPThreads << endl;
omp_set_num_threads(numOMPThreads);
mkl_set_dynamic(false);
zgemm3m(....)
The output of the above code trying to force the expected behaviour is always:
MKL Threads 4
OMP Threads 8
What would cause zgemm3m to "turn off" threading?
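For reference, a small diagnostic along these lines (a sketch; mkl_get_dynamic, mkl_get_max_threads and mkl_domain_get_max_threads are standard MKL service functions) would show MKL's view of the BLAS threading state right before the call:

#include <mkl.h>
#include <iostream>

// Sketch: dump MKL's idea of the threading state immediately before zgemm3m.
void dump_mkl_threading_state()
{
    std::cout << "MKL dynamic:         " << mkl_get_dynamic() << std::endl;
    std::cout << "MKL max threads:     " << mkl_get_max_threads() << std::endl;
    std::cout << "BLAS domain threads: "
              << mkl_domain_get_max_threads(MKL_DOMAIN_BLAS) << std::endl;
}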
Andrew
Interesting: if I switch to zgemm, the observed problem goes away. Also note that I do have MKL_DIRECT=1 set.
Andrew,
We did not change the behavior of this routine from a threading point of view. We need to check the problem on our side. Is this 64-bit code?
--Gennady
Hi Gennady,
It's 64-bit.
The issue is 100% reproducible when running our regression testing, but I expect, of course, that it will likely be very difficult to reproduce in a test example. As I noted, changing the code to use zgemm causes the issue to go away; change it back to zgemm3m, and the issue returns.
If I break while the problematic code is running, I see only one active thread, stopped in mkl_avx.dll. The other OMP threads are present but sleeping. I don't see any other problems like this when calling other BLAS/LAPACK functions.
Andrew
Ok, thanks Andrew.
1. I am not sure I understand the reason for the /DMKL_DIRECT_CALL option for such problem sizes. Could you try dropping this option and then set MKL_VERBOSE to check how many threads are used by zgemm3m? (A sketch of both checks follows below.)
2. You said MKL 2015; it seems you mean MKL v11.3. Could you please have a look at the mkl_version.h file and let me know the exact version from there?
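As a sketch of how both checks could be done from inside the application (illustrative only; setting the MKL_VERBOSE=1 environment variable before the run is the equivalent of the mkl_verbose(1) call):

#include <mkl.h>
#include <iostream>

// Sketch: enable MKL verbose output at runtime and print the exact MKL version.
void report_mkl_configuration()
{
    mkl_verbose(1);   // subsequent BLAS calls print size, time and NThr info

    char version[256];
    mkl_get_version_string(version, (int)sizeof(version));
    std::cout << version << std::endl;

    // mkl_version.h also defines __INTEL_MKL__, __INTEL_MKL_MINOR__ and
    // __INTEL_MKL_UPDATE__ if the header values are preferred.
}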
regards, Gennady
We use MKL_DIRECT=1 in our code because problem sizes vary from 4x4 matrices to 11,000x11,000 matrices. When I say MKL 2015, I mean the version shipped with Intel Parallel Studio 2015.
Gennady,
Here are some interesting results.
With MKL_DIRECT=1 and MKL_VERBOSE=1 (MKL_DIRECT_CALL_SEQ is not defined):
- I do not see any output from calls to zgemm3m (though I do see output from some other MKL routines). I am assuming this means zgemm3m_direct does not print anything with MKL_VERBOSE=1.
With MKL_DIRECT undefined and MKL_VERBOSE=1:
- The issue I was seeing goes away (all threads are used in all calls to zgemm3m).
- Sample output below:
MKL_VERBOSE ZGEMM3M(N,N,4122,4122,4122,000000000012E150,00000000A6740040,4122,0000000160700040,4122,000000000012E1B8,0000000140060040,4122) 5.08s CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:4 WDiv:HOST:+0.000
So my conclusion would be that 'something' in zgemm3m_direct turns off threading even for large matrices, but not always?
Obviously the workaround is to turn off MKL_DIRECT; the small loss of performance in some cases is acceptable.
You are right, MKL_VERBOSE does not work when MKL_DIRECT_CALL or MKL_DIRECT_CALL_SEQ is defined.
MKL_DIRECT_CALL_SEQ tells MKL to run sequentially. If you have large matrices where threading can help, then you should define MKL_DIRECT_CALL only. If you also define MKL_DIRECT_CALL_SEQ, then MKL will run all GEMMs in a single thread.
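In other words, the difference is only in which macro is defined before mkl.h is included (typically via the compiler command line); a minimal sketch:

// Threaded fallback for sizes above the small-size threshold:
#define MKL_DIRECT_CALL
// #define MKL_DIRECT_CALL_SEQ   // would force the sequential GEMM path instead
#include <mkl.h>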
Looking at the dll file above, you saw this on a 4-core Windows AVX system, is this correct?
Processor Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz, 3701 Mhz, 4 Core(s), 8 Logical Processor(s)
Just to be clear:
- I do not #define MKL_DIRECT_CALL_SEQ.
- The issue is that the threading behaviour of zgemm3m(_direct) seems to change during the execution of a running program when passed the same matrix structure (square 4122x4122). Looking at mkl_direct_call.h, I understand that mkl_direct_call_flag is passed as either 1 or 0 to indicate sequential or parallel operation, but I can't see any issue there, as it is a local variable.
I see, you observe this issue when you define MKL_DIRECT_CALL only? Yes, this is a local variable and its value only depends on whether MKL_DIRECT_CALL_SEQ or MKL_DIRECT_CALL is defined. The value should remain the same for each call.
You only observe 1-thread execution when MKL_DIRECT_CALL is defined. If you undefine it, everything works as expected, is this correct? And, zgemm doesn't suffer from the same problem, right?
Correct. There are two separate workarounds:
- undefine MKL_DIRECT
OR
- replace zgemm3m with zgemm (a sketch of the drop-in change is below)
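For the second workaround, zgemm takes the same argument list as zgemm3m, so only the routine name changes at the call site. A sketch (the wrapper function and its parameters are illustrative, not our actual code):

#include <mkl.h>

// Sketch of the second workaround: same arguments, different routine name.
void multiply(MKL_Complex16 *a, MKL_Complex16 *b, MKL_Complex16 *c, MKL_INT n)
{
    char trans = 'N';
    MKL_Complex16 alpha = {1.0, 0.0}, beta = {0.0, 0.0};
    // zgemm3m(&trans, &trans, &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
    zgemm(&trans, &trans, &n, &n, &n, &alpha, a, &n, b, &n, &beta, c, &n);
}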
Andrew,
Do you have enough free RAM available on the system when you execute this case? We are asking because MKL allocates a different memory pool depending on the number of threads. For example, specifically for your case (zgemm3m, 4122x4122),
MKL 2018 allocates the following (this is easy to check using the mkl_mem_stat() routine; a sketch follows the table):
1 thr: 883.356850 MB or 926266792 bytes in 7 buffers
2 thr: 894.398720 MB or 937845032 bytes in 11 buffers
4 thr: 916.482460 MB or 961001512 bytes in 19 buffers
8 thr: 960.649940 MB or 1007314472 bytes in 35 buffers
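For completeness, a sketch of how the pool can be queried from the application (mkl_mem_stat reports the bytes currently allocated by the MKL memory manager and the number of buffers):

#include <mkl.h>
#include <iostream>

// Sketch: report MKL's internal buffer usage, e.g. right after the zgemm3m call.
void report_mkl_memory()
{
    int buffers = 0;
    MKL_INT64 bytes = mkl_mem_stat(&buffers);
    std::cout << bytes / (1024.0 * 1024.0) << " MB in "
              << buffers << " buffers" << std::endl;
}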
Many gigabytes of free RAM.
The only way I found around this issue was to change my code that calls zgemm3m by "expanding" the zgemm3m macro myself and making sure that the 'real' zgemm3m is called, not zgemm3m_direct:
#ifdef MKL_DIRECT_CALL
#undef zgemm3m
    if (MKL_DC_GEMM3M_CHECKSIZE(&m, &n, &k)) {
        mkl_dc_zgemm((char *)&transa, (char *)&transb, (int *)&m, (int *)&n, (int *)&k,
                     (MKL_Complex16 *)&alpha, (MKL_Complex16 *)a, (int *)&lda,
                     (MKL_Complex16 *)b, (int *)&ldb,
                     (MKL_Complex16 *)&beta, (MKL_Complex16 *)c, (int *)&ldc);
    } else {
        zgemm3m((char *)&transa, (char *)&transb, (int *)&m, (int *)&n, (int *)&k,
                (MKL_Complex16 *)&alpha, (MKL_Complex16 *)a, (int *)&lda,
                (MKL_Complex16 *)b, (int *)&ldb,
                (MKL_Complex16 *)&beta, (MKL_Complex16 *)c, (int *)&ldc);
    }
#else
    zgemm3m((char *)&transa, (char *)&transb, (int *)&m, (int *)&n, (int *)&k,
            (MKL_Complex16 *)&alpha, (MKL_Complex16 *)a, (int *)&lda,
            (MKL_Complex16 *)b, (int *)&ldb,
            (MKL_Complex16 *)&beta, (MKL_Complex16 *)c, (int *)&ldc);
#endif