We are working on RNN kernel optimization and are trying to run two SGEMMs in parallel on a 2-socket SKX 6148 server (20 cores per socket).
The SGEMM size is M = 20, N = 2400, K = 800.
We measured the GFLOPS with this benchmark (https://github.com/xhzhao/GemmEfficiency/tree/tbb) and got the following performance:
- OMP 1 x 40: 2261 GFLOPS
- Pthread 2 x 20: 3550 GFLOPS
- OMP Nested 2 x 20: 1068 GFLOPS
- TBB Nested 2 x 20: 752 GFLOPS
I found that the TBB performance is not as good as we expected, and I'm not sure whether I am missing something with TBB.
The line that launches the TBB parallel_for: https://github.com/xhzhao/GemmEfficiency/blob/tbb/test_tbb.cpp#L159
I do not know exactly what is going wrong; however, I would recommend considering the following ideas:
- Are you sure that MKL uses TBB? If it uses OMP, you are mixing two parallel runtimes, which is usually not a good idea. See https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application
- I believe "()" is not applicable if MKL uses TBB. Consider: https://software.intel.com/en-us/mkl-developer-reference-c-2019-beta-threading-control. Usually, you do not need to control the number of threads with TBB.
Thanks for your reply.
I double-checked my code against your two points:
1. Dynamic linking between MKL and TBB
I used the MKL Link Line Advisor, and this is my CMake setting for the library link: https://github.com/xhzhao/GemmEfficiency/blob/tbb/CMakeLists.txt#L21
The actual linkage is as follows:
[zhaoxiao@mlt-skx084 GemmEfficiency]$ ldd build/test_tbb
    linux-vdso.so.1 => (0x00007ffef22ce000)
    libmkl_tbb_thread.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_tbb_thread.so (0x00007f978c91e000)
    libmkl_core.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_core.so (0x00007f978a78b000)
    libmkl_intel_ilp64.so => /opt/intel/compilers_and_libraries_2018.1.163/linux/mkl/lib/intel64_lin/libmkl_intel_ilp64.so (0x00007f9789d5c000)
    libtbb.so.2 => /opt/intel/compilers_and_libraries_2018.1.163/linux/tbb/lib/intel64/gcc4.7/libtbb.so.2 (0x00007f9789b00000)
    libpthread.so.0 => /usr/lib64/libpthread.so.0 (0x00007f97898d2000)
    libm.so.6 => /usr/lib64/libm.so.6 (0x00007f97895d0000)
    libdl.so.2 => /usr/lib64/libdl.so.2 (0x00007f97893cb000)
    libstdc++.so.6 => /home/zhaoxiao/anaconda3-cpu/lib/libstdc++.so.6 (0x00007f9789091000)
    libgcc_s.so.1 => /home/zhaoxiao/anaconda3-cpu/lib/libgcc_s.so.1 (0x00007f9788e7f000)
    libc.so.6 => /usr/lib64/libc.so.6 (0x00007f9788abb000)
    librt.so.1 => /usr/lib64/librt.so.1 (0x00007f97888b3000)
    /lib64/ld-linux-x86-64.so.2 (0x0000561caa285000)
2. Threading control for MKL
You are right on this point. I tried replacing "mkl_set_num_threads_local(20)" with "tbb::task_scheduler_init init(20);", but the performance is the same as before. I don't know whether this is the right way to set the MKL thread count inside a nested for loop.
BTW, I could not find any code example of MKL + TBB to reproduce the performance at this link: https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application
Do you know of any code example of MKL + TBB?