Parallel cblas_dgemm functions

Gheibi__Sanaz · ‎12-04-2017

Hi,

We want to run two MKL cblas_dgemm calls in parallel on a KNL platform, each on its own disjoint set of cores. Since our KNL has 64 threads in total, we would like the first call to run on 32 cores and the second to run, at the same time, on the other 32 cores. Our current code looks something like this:

...
omp_set_num_threads(64);
....
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        omp_set_num_threads(32);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, A, p, B, n, 0, C1, n);
    } else {
        omp_set_num_threads(32);
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, A, p, B, n, 0, C2, n);
    }
}
....

The problem is that running those two functions serially takes less time than running them in parallel. Could you please help us figure out what is wrong with this code section and how to have two cblas_dgemm calls run in parallel?

Thank you very much

Ying_H_Intel · ‎12-06-2017

Hi

What are the sizes of m, n, and p, and how do you set KMP_AFFINITY for the operation?

Could you please set MKL_VERBOSE=1 and KMP_AFFINITY=compact (or export MKL_VERBOSE=1 before running your.exe) and observe the result?

Please also submit your question to our official support channel: Online Service Center - Intel Support

Best regards,

Ying

TimP · ‎12-06-2017

You should be able to set both the number of threads and the core affinity by using KMP_HW_SUBSET to choose non-overlapping tile groups. As Ying suggested, adding MKL_VERBOSE should help with diagnosis. You probably need to run from a script so that the two jobs don't see each other's KMP_HW_SUBSET offset and the resulting hardware thread assignment. The suggestion about KMP_AFFINITY=compact seems more applicable to KNC, where you might use 4 threads per core. Failing to affinitize to distinct tile sets can be expected to yield poor performance.

Gheibi__Sanaz · ‎12-08-2017

Thank you very much Ying and Tim for your help.

In our case, m, n, and p are all set to 100. We ran a test with MKL_VERBOSE=1 and KMP_AFFINITY=compact, and here is the result we got:

For the parallel version:

MKL_VERBOSE Intel(R) MKL 2018.0 Update 1 Product build 20171007 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors, Lnx 1.30GHz lp64 intel_thread NMICDev:0

MKL_VERBOSE DGEMM(N,N,100,100,100,0x7f43897fd890,0x1291a00,100,0x127e100,100,0x7f43897fd898,0x12b8b80,100) 81.96ms CNR:OFF Dyn:0 FastMM:1 TID:1 NThr:32 WDiv:HOST:+0.000

time elapsed is 0.174498

For the serial version:

MKL_VERBOSE Intel(R) MKL 2018.0 Update 1 Product build 20171007 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) for Intel(R) Many Integrated Core Architecture (Intel(R) MIC Architecture) enabled processors, Lnx 1.30GHz lp64 intel_thread NMICDev:0

MKL_VERBOSE DGEMM(N,N,100,100,100,0x7ffdb6480090,0x80ca00,100,0x7f9100,100,0x7ffdb6480098,0x8202c0,100) 132.72ms CNR:OFF Dyn:0 FastMM:1 TID:0 NThr:64 WDiv:HOST:+0.000

MKL_VERBOSE DGEMM(N,N,100,100,100,0x7ffdb6480090,0x80ca00,100,0x7f9100,100,0x7ffdb6480098,0x833b80,100) 140.48us CNR:OFF Dyn:0 FastMM:1 TID:0 NThr:64 WDiv:HOST:+0.000

time elapsed is 0.134626

As you see, the elapsed time is larger for the parallel case than for the serial case.
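As a rough sanity check on the serial log above, the arithmetic can be worked out explicitly. A 100 x 100 x 100 DGEMM performs 2*m*n*p = 2,000,000 floating-point operations, so the two logged times (132.72 ms and 140.48 us) correspond to very different rates. The sketch below is only illustrative: the function names are made up for this example, and the reading that the first call is dominated by one-time start-up cost (library initialization, thread creation) is an assumption, not something the log states.

```c
#include <stdio.h>

/* Floating-point operations in C = A*B with A (m x p) and B (p x n):
   one multiply and one add per inner-product term. */
double gemm_flops(int m, int n, int p)
{
    return 2.0 * (double)m * (double)n * (double)p;
}

/* Achieved rate in GFLOP/s for a call that took `seconds`. */
double gemm_gflops(int m, int n, int p, double seconds)
{
    return gemm_flops(m, n, p) / seconds / 1e9;
}

/* Hypothetical helper: prints the rates implied by the two serial
   DGEMM timings copied from the MKL_VERBOSE output. */
void print_logged_rates(void)
{
    printf("first call : %.4f GFLOP/s\n",
           gemm_gflops(100, 100, 100, 132.72e-3));
    printf("second call: %.2f GFLOP/s\n",
           gemm_gflops(100, 100, 100, 140.48e-6));
}
```

With the logged times this gives roughly 0.015 GFLOP/s for the first call and roughly 14 GFLOP/s for the second. That suggests a single timed call at this size mostly measures start-up overhead; timing after a warm-up call, or with larger matrices, would probably give a fairer serial-versus-parallel comparison.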

Another confusing issue is that in the parallel case, both TID values are 1. This is not what we wanted. As you can see from the code in our original post, we created two threads using "#pragma omp parallel num_threads(2)", each of which was meant to fork a further 32 threads. We would expect two different TIDs in the parallel case. We don't know what is going wrong here, and we would really appreciate your help.

Thank you very much

TimP · ‎12-09-2017

If your intent is to use nested OpenMP parallelism, you must enable OMP_NESTED and set OMP_NUM_THREADS=2,32. With the diagnostics enabled, you may be able to see whether the MKL default affinity is spreading the threads correctly across tiles.
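To illustrate the nested setup described above, here is a minimal sketch in plain OpenMP, without MKL. The function name run_nested_teams is hypothetical, and the inner parallel region only stands in for the team that a threaded library call such as cblas_dgemm would open internally. It uses omp_set_max_active_levels(2), the API counterpart of enabling nesting. Older runtimes may additionally need OMP_NESTED=true or omp_set_nested(1), and MKL may need its own settings (for example MKL_DYNAMIC=false) before it uses more than one thread inside a parallel region. Treat those points as assumptions to check against your runtime.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Opens an outer team of `outer` threads; each outer thread opens an
   inner team of `inner` threads. Returns how many inner threads ran.
   Built without OpenMP, the pragmas are ignored and the result is 1. */
int run_nested_teams(int outer, int inner)
{
    int total = 0;
#ifdef _OPENMP
    omp_set_dynamic(0);            /* ask the runtime not to shrink teams */
    omp_set_max_active_levels(2);  /* permit two active parallel levels   */
#endif
    #pragma omp parallel num_threads(outer) reduction(+:total)
    {
        /* Stand-in for the team a threaded library call would create. */
        #pragma omp parallel num_threads(inner) reduction(+:total)
        {
            total += 1;
#ifdef _OPENMP
            #pragma omp critical
            printf("outer thread %d, inner thread %d of %d\n",
                   omp_get_ancestor_thread_num(1),
                   omp_get_thread_num(), omp_get_num_threads());
#endif
        }
    }
    return total;
}
```

Calling run_nested_teams(2, 32) on the KNL should, if nesting is really active and the runtime grants every requested thread, report 64 inner threads in total. A result of 2 would indicate that the inner regions were serialized, which matches the symptom of the inner thread count being ignored.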

Gheibi__Sanaz · ‎12-10-2017

Thank you very much, Tim. It will help us a lot. Thanks again!

Gheibi__Sanaz · ‎12-11-2017

Hi again,

We still have another question, and we will really appreciate your help:

How can we know which threads are executing a certain cblas_dgemm call? If we knew that, we could place those threads close to each other using proc_list with KMP_AFFINITY.

Thank you very much

Ying_H_Intel · ‎12-13-2017

Hi Gheibi,

Manually, you can put those threads close to each other using proc_list with KMP_AFFINITY, and you can find out which threads are executing a certain cblas_dgemm call, but that opens up all kinds of technical discussion. So you may bind cblas_dgemm's OpenMP threads to a proc_list via KMP_AFFINITY.
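One way to see where threads actually run is to have each thread of an OpenMP team report its current logical CPU. The sketch below assumes Linux with glibc (sched_getcpu is a GNU extension) and uses a hypothetical function name, record_thread_cpus. It observes the threads of its own parallel region, not MKL's internal ones. Since the log shows the intel_thread layer, the same OpenMP runtime is likely serving both, but that is an assumption. The reported CPU is also only a snapshot unless the threads are pinned, for example with KMP_AFFINITY (whose documented syntax for an explicit list is KMP_AFFINITY=explicit,proclist=[...]).

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Declared explicitly in case _GNU_SOURCE was not defined before the
   first system header was included. */
int sched_getcpu(void);

/* Stores in cpus[i] the logical CPU that thread i of the team is
   running on. cpus[] must hold at least `nthreads` entries.
   Returns the team size the runtime actually provided. */
int record_thread_cpus(int nthreads, int *cpus)
{
    int team = 1;
    #pragma omp parallel num_threads(nthreads)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        #pragma omp single
        team = omp_get_num_threads();
#endif
        cpus[tid] = sched_getcpu();  /* -1 if the query is unsupported */
    }
    return team;
}
```

A caller could allocate an array of 32 ints, call record_thread_cpus(32, cpus) right before or after the cblas_dgemm call, and print the entries to compare them against the intended core set. Adding the verbose modifier to KMP_AFFINITY should also make the Intel OpenMP runtime print its own thread-to-processor bindings at start-up.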

MKL threading is based on OpenMP. You can control the threads as described in the MKL developer guide: https://software.intel.com/en-us/node/528550

or in the Intel compiler documentation: https://software.intel.com/en-us/cpp-compiler-18.0-developer-guide-and-reference-thread-affinity-interface-linux-and-windows#LOW_LEVEL_AFFINITY_API

https://software.intel.com/en-us/node/528546#92D6DAD0-A858-4824-9A90-AC2AD2A9C2E1

See also these other discussions:

https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application

https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/283564

In principle, we don't recommend that.

As for performance, as your test showed, when the same sgemm function is called from multiple threads, relying on MKL's internal multithreading may work better than designing your own thread affinity.

Best Regards,

Ying