
getting MKL thread IDs

Gheibi__Sanaz — Tue, 08 May 2018 20:22:14 GMT

Hi,

We have a problem regarding MKL threads and would really appreciate your help. We are calling MKL functions inside the nested parallel region below:

        omp_set_num_threads( NUM_OF_THREADS );
        omp_set_nested(1);
        omp_set_max_active_levels(2);

        #pragma omp parallel num_threads(2)
        {
                if (omp_get_thread_num() == 0) {
                        mkl_set_num_threads_local(16);

                        printf("My ID is %d\n", omp_get_thread_num());
                        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                                    m, n, p, 1, pA, p, pB, n, 0, pC1, n);
                } else {
                        mkl_set_num_threads_local(16);

                        printf("My ID is %d\n", omp_get_thread_num());
                        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                                    m, n, p, 1, pD, p, pE, n, 0, pC2, n);
                }
        }

Using VTune Amplifier, we can verify that the expected 32 threads are created. However, the output of the print statements is as follows:

My ID is 0
My ID is 1

It seems that we cannot access the MKL threads using omp_get_thread_num(). Is there a similar function for getting the thread IDs of MKL threads, or some other way to do that? (We need this information for affinity and thread placement decisions.)

Thank you very much,

Sanaz


Ying_H_Intel — Wed, 09 May 2018 01:58:45 GMT

Hi Sanaz,

As I understand it, the "My ID is 0" and "My ID is 1" lines come from the two threads created by #pragma omp parallel num_threads(2), and printf("My ID is %d\n", omp_get_thread_num()); reflects that.
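To illustrate why only 0 and 1 are printed, here is a minimal plain-OpenMP sketch. It is not MKL code: user code cannot run inside the parallel region that MKL opens internally, so the inner `#pragma omp parallel` below is only a stand-in for it, under the assumption that MKL is using its OpenMP threading layer. The names `record_nested_ids`, `OUTER` and `INNER` are hypothetical, and the inner team size of 4 is chosen just to keep the example small.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch still compiles without -fopenmp. */
static int omp_get_thread_num(void) { return 0; }
static int omp_get_ancestor_thread_num(int level) { (void)level; return 0; }
static void omp_set_max_active_levels(int levels) { (void)levels; }
#endif

#define OUTER 2   /* outer team size, as in the snippet above */
#define INNER 4   /* hypothetical inner team size (the thread uses 16) */

/* On return, owner[o][i] holds the outer thread ID reported by inner
 * thread i of outer thread o. Slots that no thread visited stay at -1. */
static void record_nested_ids(int owner[OUTER][INNER])
{
    for (int o = 0; o < OUTER; ++o)
        for (int i = 0; i < INNER; ++i)
            owner[o][i] = -1;

    omp_set_max_active_levels(2);   /* allow two active nesting levels */

    #pragma omp parallel num_threads(OUTER)
    {
        /* Here omp_get_thread_num() returns 0 or 1: this is the level
         * at which the original printf() calls run. */

        /* Stand-in for the region MKL would open internally. */
        #pragma omp parallel num_threads(INNER)
        {
            int inner = omp_get_thread_num();            /* ID within the inner team */
            int outer = omp_get_ancestor_thread_num(1);  /* ID of the level-1 ancestor */

            assert(outer >= 0 && outer < OUTER);
            assert(inner >= 0 && inner < INNER);
            owner[outer][inner] = outer;   /* each thread writes its own slot */
        }
    }
}
```

Calling `record_nested_ids(owner)` on an `int owner[OUTER][INNER]` array and printing the filled slots shows that omp_get_thread_num() is always relative to the innermost team, while omp_get_ancestor_thread_num(1) recovers the outer thread that owns it. The runtime may give a team fewer threads than requested, so some slots can remain -1.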

But it should be fine to spawn 2 outer OpenMP threads and have each of them spawn 16 MKL threads to run the MKL function. For example, make sure the environment variables OMP_DYNAMIC=false and MKL_DYNAMIC=false are set, to allow MKL threading in nested parallel regions.

You may refer to the MKL user guide, which has some discussion about this, or to
the article https://software.intel.com/en-us/articles/using-threaded-intel-mkl-in-multi-thread-application
and some discussions in the forum, such as: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/296195

Best Regards,
Ying


Gheibi__Sanaz — Wed, 09 May 2018 17:14:03 GMT

Thank you very much Ying,

The resources were very useful for setting the affinity of MKL threads. However, before trying to do the binding, we want to know which MKL threads execute each of the cblas_dgemm() calls. For example, using the KMP_AFFINITY=verbose environment variable, we can observe that thread #5 is bound to proc set {15}. But that doesn't give us much insight, because we don't know what exactly thread #5 is doing (that is, which of the cblas_dgemm() calls it is executing). We would really appreciate your help with this.

Best Regards,

Sanaz


Ying_H_Intel — Tue, 15 May 2018 03:00:00 GMT

Hi Sanaz,

Right, you can't tell which internal MKL thread is executing which cblas_dgemm() call, and you can't control each individual MKL internal thread in a nested OpenMP environment. But let's come back to the original problem: you expect two tasks, each running on half of your physical CPU cores, to get the best performance.

As the article mentions, you don't actually need to dive into every single MKL internal thread. The Linux OS and KMP_AFFINITY can handle that for you.

I'm not sure whether you have already set this through the environment, but your code seems to be missing one key call: mkl_set_dynamic(0);

After adding that, you may see the expected performance and CPU usage.

NOTE
If your application uses OpenMP* threading, you may need to provide additional settings:
• Set the environment variable OMP_NESTED=TRUE, or alternatively call omp_set_nested(1), to enable OpenMP nested parallelism.
• Set the environment variable MKL_DYNAMIC=FALSE, or alternatively call mkl_set_dynamic(0), to prevent Intel MKL from dynamically reducing the number of OpenMP threads in nested parallel regions.

I have attached an example for your reference.

Best Regards,

Ying


Ying_H_Intel — Tue, 15 May 2018 03:03:46 GMT

Here is the attached file:

omp_set_nested(1);
omp_set_max_active_levels(2);
mkl_set_dynamic(0);
#pragma omp parallel num_threads(2)
{
        if (omp_get_thread_num() == 0) {
                mkl_set_num_threads(32);
                printf("My ID is %d \n", omp_get_thread_num());
                cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, m, n, p, 1, A, p, B, n, 0, C1, n);
        } else {

Thanks

Ying


Gheibi__Sanaz — Thu, 07 Jun 2018 22:21:58 GMT

Thank you very much Ying.