Intel® oneAPI Math Kernel Library
Ask questions and share information with other developers who use Intel® Math Kernel Library.

Parallel cblas_dgemm functions

Gheibi__Sanaz
Beginner
 Hi,
 We want to run two MKL cblas_dgemm calls in parallel on a KNL platform, on two disjoint sets of cores. Since our KNL has 64 threads in total, we would like the first call to run on 32 cores and the second call to run on another set of 32 cores, disjoint from the first, in parallel. Our current code looks like this:
 ...
 omp_set_num_threads( 64 );
 ....
 #pragma omp parallel num_threads(2)
 {
      if (omp_get_thread_num() == 0){
              omp_set_num_threads(32);
              cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,m, n, p, 1, A, p, B, n, 0, C1, n);
      }else{
              omp_set_num_threads(32);
              cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,m, n, p, 1, A, p, B, n, 0, C2, n);
      }
 }
 ....
 The problem is that running those two functions serially takes less time than running them in parallel. Could you please help us figure out what is wrong with this code section and how to get the two cblas_dgemm calls to run in parallel?
 
 Thank you very much
 
7 Replies
Ying_H_Intel
Employee

Hi 

What are the sizes of m, n, and p, and how do you set KMP_AFFINITY for this operation?

Could you please set MKL_VERBOSE=1 and KMP_AFFINITY=compact (for example, export MKL_VERBOSE=1 before launching your executable) and observe the result?
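
If it is more convenient, below is a rough sketch of the same check done from inside the program; it assumes the mkl_verbose() service routine from mkl.h (exporting MKL_VERBOSE=1 and KMP_AFFINITY=compact in the environment before launch is the usual, equivalent route), so please treat it as an illustration only.

/* Minimal sketch: turn on MKL verbose output programmatically and run one
 * dgemm so that its parameters, timing, and thread count (NThr) are printed.
 * Assumes the mkl_verbose() service function; exporting MKL_VERBOSE=1
 * before launch achieves the same thing without any code change. */
#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const int n = 100;
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 1.0; }

    mkl_verbose(1);   /* every subsequent MKL call logs an MKL_VERBOSE line */

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %f\n", C[0]);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}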

Also, please submit your question to our official support channel: Online Service Center - Intel Support.

Best regards,

Ying 

TimP
Honored Contributor III
You should be able to set both the number of threads and the core affinity with KMP_HW_SUBSET, choosing non-overlapping tile groups. As Ying suggested, adding MKL_VERBOSE output should help with diagnosis. You probably need to launch the two jobs from a script so they don't see each other's KMP_HW_SUBSET offset and the resulting hardware-thread assignment. The suggestion about KMP_AFFINITY=compact seems more applicable to KNC, where you might use 4 threads per core. Failing to pin the two calls to distinct tile sets can be expected to yield poor performance.
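For what it's worth, a minimal sketch of the "run the two jobs as separate processes, each with its own subset" idea might look like the following; the subset strings and the ./gemm_job binary name are hypothetical placeholders, so treat it only as an illustration of per-process KMP_HW_SUBSET settings.

/* Sketch: launch two copies of a GEMM program as separate processes, each
 * with its own non-overlapping KMP_HW_SUBSET, which is roughly what running
 * the jobs from a script achieves.  The subset strings and ./gemm_job are
 * placeholders, not tested values. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    const char *subsets[2] = { "32c@0,1t", "32c@32,1t" };  /* hypothetical */

    for (int i = 0; i < 2; ++i) {
        pid_t pid = fork();
        if (pid == 0) {                       /* child: set subset, exec job */
            setenv("KMP_HW_SUBSET", subsets[i], 1);
            execl("./gemm_job", "gemm_job", (char *)NULL);
            perror("execl");                  /* only reached if exec fails */
            _exit(1);
        }
    }
    while (wait(NULL) > 0)                    /* wait for both children */
        ;
    return 0;
}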
Gheibi__Sanaz
Beginner

Thank you very much Ying and Tim for your help. 

In our case, m, n, and p are all set to 100. We ran a test with MKL_VERBOSE=1 and KMP_AFFINITY=compact set, and here is the result we got:

 

For the parallel version: 

 

MKL_VERBOSE Intel(R) MKL 2018.0 Update 1 Product build 20171007 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors, Lnx 1.30GHz lp64 intel_thread NMICDev:0

MKL_VERBOSE DGEMM(N,N,100,100,100,0x7f43897fd890,0x1291a00,100,0x127e100,100,0x7f43897fd898,0x12b8b80,100) 81.96ms CNR:OFF Dyn:0 FastMM:1  TID:1   NThr:32 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,100,100,100,0x7f43897fd890,0x1291a00,100,0x127e100,100,0x7f43897fd898,0x12b8b80,100) 81.96ms CNR:OFF Dyn:0 FastMM:1  TID:1   NThr:32 WDiv:HOST:+0.000
time elapsed is 0.174498
 
For the serial version:
 
MKL_VERBOSE Intel(R) MKL 2018.0 Update 1 Product build 20171007 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors, Lnx 1.30GHz lp64 intel_thread NMICDev:0
MKL_VERBOSE DGEMM(N,N,100,100,100,0x7ffdb6480090,0x80ca00,100,0x7f9100,100,0x7ffdb6480098,0x8202c0,100) 132.72ms CNR:OFF Dyn:0 FastMM:1  TID:0  NThr:64 WDiv:HOST:+0.000
MKL_VERBOSE DGEMM(N,N,100,100,100,0x7ffdb6480090,0x80ca00,100,0x7f9100,100,0x7ffdb6480098,0x833b80,100) 140.48us CNR:OFF Dyn:0 FastMM:1  TID:0   NThr:64 WDiv:HOST:+0.000
time elapsed is 0.134626
 
As you can see, the elapsed time is larger for the parallel case than for the serial case.
Another confusing issue is that in the parallel case both TID values are 1, which is not what we wanted. As you can see from the code in our original post, we created two threads with "#pragma omp parallel num_threads(2)", each of which was meant to split further into 32 threads, so we would expect two different TIDs in the parallel case. We don't know what is going wrong here, and we would really appreciate your help.
 
Thank you very much
 
TimP
Honored Contributor III
If your intent is to use nested OpenMP parallelism, you must enable nesting (OMP_NESTED=TRUE) and set OMP_NUM_THREADS=2,32. With the diagnostics enabled, you may be able to see whether MKL's default affinity is spreading the threads correctly across the tiles.
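For illustration, a minimal sketch of that nested approach might look like the code below; it assumes MKL's OpenMP (intel_thread) threading layer, uses mkl_set_num_threads_local(), and expects OMP_NESTED=TRUE, OMP_NUM_THREADS=2,32 and MKL_DYNAMIC=FALSE in the environment, so it is a sketch rather than a drop-in fix.

/* Sketch: two concurrent dgemm calls, each on its own 32-thread team.
 * Assumes the Intel OpenMP threading layer of MKL; run with e.g.
 *   OMP_NESTED=TRUE OMP_NUM_THREADS=2,32 MKL_DYNAMIC=FALSE ./a.out  */
#include <stdio.h>
#include <omp.h>
#include <mkl.h>

int main(void)
{
    const int m = 100, n = 100, p = 100;
    double *A  = (double *)mkl_malloc((size_t)m * p * sizeof(double), 64);
    double *B  = (double *)mkl_malloc((size_t)p * n * sizeof(double), 64);
    double *C1 = (double *)mkl_malloc((size_t)m * n * sizeof(double), 64);
    double *C2 = (double *)mkl_malloc((size_t)m * n * sizeof(double), 64);
    for (int i = 0; i < m * p; ++i) A[i] = 1.0;
    for (int i = 0; i < p * n; ++i) B[i] = 1.0;

    mkl_set_dynamic(0);            /* don't let MKL shrink the thread count */
    omp_set_nested(1);             /* allow the inner (MKL) parallel region */
    omp_set_max_active_levels(2);

    #pragma omp parallel num_threads(2)
    {
        /* Each outer thread requests 32 MKL threads for its own GEMM. */
        mkl_set_num_threads_local(32);
        double *C = (omp_get_thread_num() == 0) ? C1 : C2;
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, p, 1.0, A, p, B, n, 0.0, C, n);
    }

    printf("C1[0] = %f, C2[0] = %f\n", C1[0], C2[0]);
    mkl_free(A); mkl_free(B); mkl_free(C1); mkl_free(C2);
    return 0;
}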
Gheibi__Sanaz
Beginner

Thank you very much, Tim. This will help us a lot. Thanks again!

Gheibi__Sanaz
Beginner

Hi again, 

We still have another question, and we would really appreciate your help:

How can we know which threads are executing a certain cblas_dgemm call? If we could know that, we would be able to place those threads close to each other using a proclist with KMP_AFFINITY.

Thank you very much

Ying_H_Intel
Employee

Hi Gheibi,

Manually, you would be able to place those threads close to each other using a proclist with KMP_AFFINITY, and to find out which threads are executing a certain cblas_dgemm call, but that opens up all kinds of technical discussion. So yes, you can bind cblas_dgemm's OpenMP threads to a proclist via KMP_AFFINITY.
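
As a rough, Linux-only illustration of how one might check where the OpenMP threads of a team actually land: report_placement below is a hypothetical helper, sched_getcpu() reports the calling thread's current CPU, and since the intel_thread layer draws MKL's GEMM threads from the same OpenMP runtime, this gives an approximate picture rather than an MKL-specific answer.

/* Sketch: print, for every OpenMP thread in the current team, its OpenMP id
 * and the CPU it is currently running on.  Call it right before a dgemm to
 * see how KMP_AFFINITY / proclist settings take effect. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>      /* sched_getcpu(), Linux-specific */
#include <omp.h>

static void report_placement(const char *label)
{
    #pragma omp parallel
    {
        #pragma omp critical
        printf("%s: OMP thread %2d of %2d is on CPU %3d\n",
               label, omp_get_thread_num(), omp_get_num_threads(),
               sched_getcpu());
    }
}

int main(void)
{
    omp_set_num_threads(4);            /* small team, just for illustration */
    report_placement("before dgemm");
    return 0;
}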

MKL threading is based on OpenMP, so you can control the threads as described in the MKL Developer Guide: https://software.intel.com/en-us/node/528550

or in the Intel compiler documentation: https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-thread-affinity-interface-linux-and-windows#LOW_LEVEL_AFFINITY_API

https://software.intel.com/en-us/node/528546#92D6DAD0-A858-4824-9A90-AC2AD2A9C2E1

and in other discussions:

https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application

https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/283564

In general, though, we don't recommend that approach.

Regarding performance: as your test shows, for the same GEMM calls, relying on MKL's internal multithreading may work better than your own thread-affinity design.

Best Regards,

Ying
