Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Performance issue with MKL + TBB

Xiaohui_Z_Intel
Employee

Hi

We are working on RNN kernel optimization and are trying to run 2 SGEMM calls in parallel on a 2-socket SKX6148 server (20 cores per socket).

The SGEMM size is M = 20, N = 2400, K = 800.

We measured GFLOPS with this benchmark (https://github.com/xhzhao/GemmEfficiency/tree/tbb) and got the following results:

  • OMP 1 x 40            2261 GFLOPS
  • Pthread 2 x 20        3550 GFLOPS
  • OMP Nested 2 x 20     1068 GFLOPS
  • TBB Nested 2 x 20      752 GFLOPS
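
For readers comparing the figures above, here is a minimal sketch of how a GFLOPS number for SGEMM is usually derived. It assumes the common convention of 2·M·N·K floating-point operations per call (one multiply and one add per inner-product term); the linked benchmark may count differently. The helper names `sgemm_flops` and `gflops` are hypothetical and do not come from the benchmark.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one SGEMM call, assuming the usual
// convention of one multiply plus one add per (m, n, k) triple.
constexpr std::int64_t sgemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2 * m * n * k;
}

// Compile-time check for the sizes in the post: M = 20, N = 2400, K = 800.
static_assert(sgemm_flops(20, 2400, 800) == 76800000,
              "2 * 20 * 2400 * 800 floating-point operations per call");

// Aggregate GFLOPS for `calls` SGEMM calls (summed over all workers)
// that finished in `seconds` of wall-clock time.
inline double gflops(std::int64_t m, std::int64_t n, std::int64_t k,
                     std::int64_t calls, double seconds) {
    assert(calls >= 0 && seconds > 0.0);
    const double total_flops =
        static_cast<double>(sgemm_flops(m, n, k)) * static_cast<double>(calls);
    return total_flops / seconds / 1e9;
}
```

Under this convention one SGEMM of this size is 76,800,000 operations, so reporting 3550 GFLOPS means completing roughly 46,000 SGEMM calls per second across both sockets. Each call is short enough that per-call scheduling overhead can matter.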

The TBB performance is not as good as we expected, and I'm not sure whether I am missing something with TBB.

The line that launches the TBB parallel_for: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159

 

Alexei_K_Intel
Employee

Hi,

I do not know exactly what is going wrong; however, I would recommend considering the following ideas:

  1. Are you sure that MKL uses TBB? If it uses OMP, you are mixing two parallel runtimes, which is usually not a good idea. See https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application
  2. I believe "mkl_set_num_threads_local(20)" is not applicable if MKL uses TBB. See https://software.intel.com/en-us/mkl-developer-reference-c-2019-beta-threading-control. Usually, you do not need to control the number of threads with TBB.

Regards,
Alex

 

Xiaohui_Z_Intel
Employee

Hi Alex,

Thanks for your reply.

I double-checked my code against your two suggestions:

1. Dynamic linking between MKL and TBB

I used the MKL link advisor, and this is my CMake setting for the library link: https://github.com/xhzhao/GemmEfficiency/blob/tbb/CMakeLists.txt#L21

The actual linkage is as follows:

[zhaoxiao@mlt-skx084 GemmEfficiency]$ ldd build/test_tbb
        linux-vdso.so.1 =>  (0x00007ffef22ce000)
        libmkl_tbb_thread.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_tbb_thread.so (0x00007f978c91e000)
        libmkl_core.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007f978a78b000)
        libmkl_intel_ilp64.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.so (0x00007f9789d5c000)
        libtbb.so.2 => /opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64/gcc4.7/libtbb.so.2 (0x00007f9789b00000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f97898d2000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f97895d0000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f97893cb000)
        libstdc++.so.6 => /home/zhaoxiao/anaconda3-cpu/lib/libstdc++.so.6 (0x00007f9789091000)
        libgcc_s.so.1 => /home/zhaoxiao/anaconda3-cpu/lib/libgcc_s.so.1 (0x00007f9788e7f000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f9788abb000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f97888b3000)
        /lib64/ld-linux-x86-64.so.2 (0x0000561caa285000)

2. Threading control for MKL

You are right on this point. I tried replacing "mkl_set_num_threads_local(20)" with "tbb::task_scheduler_init init(20);", but the performance is the same as before. I don't know whether this is the right way to set the MKL thread count in a nested for loop.

BTW, I could not find any MKL+TBB code example at this link that I could use to reproduce the performance: https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

Do you know of any code example for MKL+TBB?
