Performance issue with MKL + TBB

Xiaohui_Z_Intel · ‎07-24-2018

Hi

We are working on RNN kernel optimization and are trying to run 2 SGEMMs in parallel on a 2-socket SKX6148 server (20 cores per socket).

The SGEMM size is M = 20, N = 2400, K = 800.

We measured GFLOPS with this benchmark (https://github.com/xhzhao/GemmEfficiency/tree/tbb) and got the following performance:

Configuration   Threads   GFLOPS
OMP             1 x 40    2261
Pthread         2 x 20    3550
OMP Nested      2 x 20    1068
TBB Nested      2 x 20    752

I found that the TBB performance is not as good as we expected, and I'm not sure whether I am missing something with TBB.

Line that launches the TBB parallel_for: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159

Alexei_K_Intel · ‎07-25-2018

Hi,

I do not know exactly what is going wrong; however, I would recommend considering the following ideas:

Are you sure that MKL uses TBB? If it uses OMP, you are mixing two parallel runtimes, which is usually not a good idea. See https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application
I believe "mkl_set_num_threads_local(20)" is not applicable if MKL uses TBB (see https://software.intel.com/en-us/mkl-developer-reference-c-2019-beta-threading-control). Usually, you do not need to control the number of threads with TBB.

Regards,
Alex

Xiaohui_Z_Intel · ‎07-25-2018

Hi Alex,

Thanks for your reply.

I double-checked my code against your two suggestions:

1. Dynamic linking between MKL and TBB

I used the MKL Link Line Advisor, and this is my CMake setting for linking the libraries: https://github.com/xhzhao/GemmEfficiency/blob/tbb/CMakeLists.txt#L21

The actual linkage is as follows:

[zhaoxiao@mlt-skx084 GemmEfficiency]$ ldd build/test_tbb
        linux-vdso.so.1 =>  (0x00007ffef22ce000)
        libmkl_tbb_thread.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_tbb_thread.so (0x00007f978c91e000)
        libmkl_core.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007f978a78b000)
        libmkl_intel_ilp64.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.so (0x00007f9789d5c000)
        libtbb.so.2 => /opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64/gcc4.7/libtbb.so.2 (0x00007f9789b00000)
        libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f97898d2000)
        libm.so.6 => /usr/lib64/libm.so.6 (0x00007f97895d0000)
        libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f97893cb000)
        libstdc++.so.6 => /home/zhaoxiao/anaconda3-cpu/lib/libstdc++.so.6 (0x00007f9789091000)
        libgcc_s.so.1 => /home/zhaoxiao/anaconda3-cpu/lib/libgcc_s.so.1 (0x00007f9788e7f000)
        libc.so.6 => /usr/lib64/libc.so.6 (0x00007f9788abb000)
        librt.so.1 => /usr/lib64/librt.so.1 (0x00007f97888b3000)
        /lib64/ld-linux-x86-64.so.2 (0x0000561caa285000)

2. Threading control for MKL

You are right on this point. I tried replacing "mkl_set_num_threads_local(20)" with "tbb::task_scheduler_init init(20);", but the performance is the same as before. I don't know whether this is the right way to set the MKL thread count in a nested for loop.

BTW, I could not find any MKL+TBB code example in this article that would let me reproduce the performance: https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

Do you know of any code example for MKL+TBB?