Hi,
We have a problem with MKL threads and would really appreciate your help. We are calling MKL functions in the nested parallel region below:
omp_set_num_threads( NUM_OF_THREADS );
omp_set_nested(1);
omp_set_max_active_levels(2);
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0){
        mkl_set_num_threads_local(16);
        printf("My ID is %d\n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, pA, p, pB, n, 0, pC1, n);
    }else{
        mkl_set_num_threads_local(16);
        printf("My ID is %d\n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, pD, p, pE, n, 0, pC2, n);
    }
}
Using VTune Amplifier, we can verify that the expected 32 threads are created. However, the output of the print statements is as follows:
My ID is 0
My ID is 1
It seems that we cannot see the MKL threads through omp_get_thread_num(). Is there a similar function for getting the thread IDs of MKL threads, or some other way to do that? (We need this information for affinity and thread placement decisions.)
Thank you very much,
Sanaz
Hi Sanaz,
As I understand it, the "My ID is 0" and "My ID is 1" lines come from the two threads created by #pragma omp parallel num_threads(2), and printf("My ID is %d\n", omp_get_thread_num()); reflects that.
But it should be fine to spawn 2 outer OpenMP threads and have each of them spawn 16 MKL threads to run the MKL function. For example, make sure the environment variables OMP_DYNAMIC=false and MKL_DYNAMIC=false are set, to allow MKL threading in nested parallel regions.
You may refer to the MKL user guide, which has some discussion about this, or to the article https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application
and to forum discussions such as: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/296195
Best Regards,
Ying
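To make the environment check suggested in the reply above concrete, here is a minimal sketch in plain C that warns when OMP_DYNAMIC or MKL_DYNAMIC is not set to false. The helper names env_is_false and check_nested_env are hypothetical, and the sketch assumes that "false" in any letter case is the spelling to look for; it does not query the OpenMP or MKL runtimes themselves.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 if the environment variable `name` is set
 * to "false" (any letter case), 0 otherwise, including when it is unset.
 * Assumption: "false" is the only spelling we treat as disabled. */
int env_is_false(const char *name)
{
    assert(name != NULL);
    const char *value = getenv(name);
    const char *expected = "false";
    if (value == NULL)
        return 0;
    while (*value != '\0' && *expected != '\0') {
        if (tolower((unsigned char)*value) != *expected)
            return 0;
        value++;
        expected++;
    }
    /* Both strings must end together for an exact match. */
    return *value == '\0' && *expected == '\0';
}

/* Hypothetical helper: prints a warning for each variable that is not set
 * to false and returns the number of warnings. */
int check_nested_env(void)
{
    static const char *names[] = { "OMP_DYNAMIC", "MKL_DYNAMIC" };
    int warnings = 0;
    for (size_t i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (!env_is_false(names[i])) {
            fprintf(stderr, "warning: %s is not set to false\n", names[i]);
            warnings++;
        }
    }
    return warnings;
}
```

Calling check_nested_env() at program start, before the first parallel region, gives an early hint that the runtimes may be allowed to shrink the nested teams.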
Thank you very much Ying,
The resources were very useful for setting the affinity of MKL threads. However, before trying to do the binding, we want to know which MKL threads execute each of the cblas_dgemm() calls. For example, with the KMP_AFFINITY=verbose environment variable we can observe that thread #5 is bound to proc set {15}. But that does not give us much insight, because we do not know what this thread #5 is doing (that is, which of the cblas_dgemm() calls it is executing). We would really appreciate your help with that.
Best Regards,
Sanaz
Hi Sanaz,
Right, you can't know exactly which thread is executing which cblas_dgemm() call, and you can't control every single MKL internal thread in a nested OpenMP environment. But let's come back to the original problem: you expect 2 tasks, each executing on half of your physical CPU cores, so as to get the best performance.
As the article mentions, you don't actually need to dive into every single MKL internal thread. The Linux OS and KMP_AFFINITY can do that for you.
I am not sure whether you already did this through the environment, but your code seems to be missing one key call: mkl_set_dynamic(0);
After adding that, you may see the expected performance and CPU usage.
NOTE
If your application uses OpenMP* threading, you may need to provide additional settings:
• Set the environment variable OMP_NESTED=TRUE, or alternatively call omp_set_nested(1), to
enable OpenMP nested parallelism.
• Set the environment variable MKL_DYNAMIC=FALSE, or alternatively call mkl_set_dynamic(0), to
prevent Intel MKL from dynamically reducing the number of OpenMP threads in nested parallel
regions.
I attached an example for your reference.
Best Regards,
Ying
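As an illustration of why the printf in the outer region only ever reports IDs 0 and 1: omp_get_thread_num() returns the thread number within the current team, and the outer team has just two members. The sketch below uses plain nested OpenMP regions as a stand-in for the teams MKL opens internally. That is an assumption made for illustration only, since user code cannot run inside MKL's own regions. The function name run_nested_teams is hypothetical; inside the inner region, omp_get_ancestor_thread_num(1) recovers the outer thread's ID.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still builds without OpenMP support. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_ancestor_thread_num(int level) { (void)level; return 0; }
static void omp_set_max_active_levels(int n) { (void)n; }
static void omp_set_dynamic(int n) { (void)n; }
#endif

/* Hypothetical helper: runs an outer team of `outer` threads, each opening
 * an inner team of `inner` threads, prints each thread's (outer, inner)
 * pair, and returns how many inner threads actually ran. */
int run_nested_teams(int outer, int inner)
{
    assert(outer > 0 && inner > 0);
    int ran = 0;

    omp_set_dynamic(0);           /* ask the runtime not to shrink teams */
    omp_set_max_active_levels(2); /* allow two active levels of nesting  */

    #pragma omp parallel num_threads(outer)
    {
        #pragma omp parallel num_threads(inner)
        {
            /* ID of the outer thread that opened this inner team. */
            int outer_id = omp_get_ancestor_thread_num(1);
            /* ID within the inner team, restarting at 0 per team. */
            int inner_id = omp_get_thread_num();
            printf("outer %d, inner %d\n", outer_id, inner_id);

            #pragma omp atomic
            ran++;
        }
    }
    return ran;
}
```

Built with an OpenMP flag (for example -fopenmp with GCC), a call such as run_nested_teams(2, 16) mirrors the 2 x 16 layout discussed in this thread; the runtime may still deliver fewer threads than requested, so the return value is worth checking.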
Attaching the file:
omp_set_nested(1);
omp_set_max_active_levels(2);
mkl_set_dynamic(0);
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0){
        mkl_set_num_threads(32);
        printf("My ID is %d \n", omp_get_thread_num());
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, A, p, B, n, 0, C1, n);
    }else{
Thanks
Ying
Thank you very much Ying.