I am testing the performance of some code that calls MKL's cblas_cgemm() from within a parallel TBB section. I am using MKL 2017 Update 1, linking against MKL_intel_thread.dll on a Windows machine. My machine has 4 physical cores (8 logical threads).
I have about 10k matrices to multiply in this program, each of size roughly 500x500. tbb::parallel_for is used to parallelize the workload, with each thread taking a chunk of the matrices and doing the calculations via MKL cgemm().
To avoid oversubscription, I call mkl_set_num_threads(1).
Here are the timings I collected using 1, 2, 4, and 8 threads:
1 thread: 2350
2 threads: 1222
4 threads: 781
8 threads: 720
I was hoping to see close-to-linear speed-up up to at least 4 threads, since I have only 4 physical cores. However, as you can see, the speed-up at 4 threads is quite poor, only about 3x, and the speed-up at 8 threads is even worse (though I suppose that could be attributed to hyper-threading; I am not sure whether that is correct).
So my question is: is the 3x speed-up at 4 threads normal? Did I do something wrong? I can understand that the speed-up would saturate as the number of cores keeps increasing, but 4 seems way too early.
I tried some other matrix dimensions, but got largely the same pattern, or sometimes even worse (2.5x speed-up at 4 threads), depending on the matrix size.
Can anybody please shed some light on this? Thanks!
I recall that some cgemm performance problems have been discussed before, e.g. https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/541850.
But it seems those questions relate to the same pattern: TBB multi-threading on the outside, with the MKL computation restricted to a single thread via mkl_set_num_threads(1). You mentioned MKL_intel_thread.dll; with that library, MKL uses OpenMP threads internally by default. What is the performance if you switch off the TBB multi-threading and let MKL use its internal threads instead? Also, please try the latest MKL 2018 Update 1.
And please submit your question to the Online Service Center with your test case, so we can look into the problem in detail.
Thank you so much for your reply.
Yes, the problem is related to using TBB for the external multi-threading and single-threaded MKL internally for the gemm computation. Yes, I used MKL_intel_thread.dll. I tried MKL_tbb_thread.dll too, but the performance is largely the same: the speed-up at 4 and 8 threads is poor.
Per your suggestion, I experimented with switching off the TBB multi-threading and switching on MKL's internal threads. I got somewhat worse scalability: the speed-up at both 4 and 8 threads is about 2.8x (yes, there is no additional speed-up from 4 threads to 8 threads).
I found an article you posted, "Tips to Measure the Performance of Matrix Multiplication Using Intel® MKL" (https://software.intel.com/en-us/articles/a-simple-example-to-measure-the-performance-of-an-intel-mk...). I will try the tips you provided.
I do have one question that I hope you can confirm. MKL is optimized based on physical cores, not logical threads. So on a machine with 4 physical cores (8 logical threads), if I run the following two tests:
1) using only MKL's internal multi-threading, setting the MKL thread count to 1, 2, 4, and 8 respectively,
2) using TBB multi-threading externally with single-threaded MKL internally, setting the TBB thread count to 1, 2, 4, and 8 respectively,
is it normal to see little or no performance improvement from 4 threads to 8 threads?
Yes, it is normal to see little or no performance improvement from 4 threads to 8 threads. There is an explanation in the MKL User Guide: https://software.intel.com/en-us/mkl-linux-developer-guide-using-intel-hyper-threading-technology.
And it is normal to see only a 3x speed-up at 4 threads with this matrix size (500x500, medium size), unless the computation strikes an exactly ideal balance between memory traffic and computation. I would recommend trying some gemm variants, like cgemm_batch.
By the way, I noticed you are using MKL 2017 Update 1. You may already know the latest is MKL 2018 Update 1, so if possible, please upgrade.
Thanks for your reply.
Thanks for confirming that using hyper-threading with MKL would not help with the performance.
As to your comment that "it is normal to see the 3x speed-up at 4 threads ... unless the computation is an exactly ideal balance of memory and computation": do you mean this is due to a data-alignment issue, and that I should align the test arrays on 64-byte boundaries (using mkl_malloc)?
Thanks for pointing out that for small to medium matrices, batch cgemm might be more effective. I researched batch gemm and found the performance benchmarks published by Intel (https://software.intel.com/en-us/mkl/features/benchmarks). I will try batch gemm later.
However, I tried matrices of 1000x1000 and 2000x2000 with cgemm, thinking those would be in cgemm's sweet spot, but didn't see any obvious improvement at 4 threads either. Maybe this is again due to the data-alignment issue...?
Thanks for suggesting that I get MKL 2018 Update 1. Do you think the upgrade will help us get better performance when using TBB multi-threading in conjunction with MKL cgemm? Thanks.
Glad to know you have done more exploration of cgemm. The new MKL 2018 Update 1 has new performance optimizations and bug fixes (https://software.intel.com/en-us/articles/intel-math-kernel-library-intel-mkl-2018-bug-fixes-list), so we recommend using the latest version, although it may not directly improve one particular function's performance.
Regarding the performance at 4 threads, and measuring scalability on multiple cores in general: it can depend on many factors, such as the problem size, the parallel algorithm, the memory and computation load, and sometimes even the synchronization overhead between threads, which can outweigh the gains from parallelism. The key factor is the balance between memory I/O and the computation assigned to each thread. As you know, modern processors have a shared-memory architecture: multiple threads share the same memory, and computation is far faster than memory I/O. So less-than-linear scaling is usually expected.
P.S. TBB scalability:
Yes, we suspect that memory bandwidth might be one of the issues too. We will look into it further. Also, we will find machines with more cores to see how the performance scales.
Thank you for all the helpful comments!