TBB with MKL cblas and lapack

bohan_w_ · ‎11-22-2016

I'm an experienced user of Intel MKL and OpenMP. In my application, the parallel topology is simple, so although I have used OpenMP for a long time, I haven't used its more complex features. One typical case is a parallel for loop in which each iteration makes several cblas or lapack calls. With the OpenMP-threaded MKL, I got very good performance, so I didn't pay much attention to TBB. However, a benchmark on the official Intel MKL website changed my mind (https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application). It basically says that the TBB-threaded MKL is roughly 2x faster than the OpenMP-threaded MKL when multiple lapack functions are called in parallel. However, when I tried it, I didn't get the same result. What I did was change everything from OpenMP to TBB. What am I missing, or am I misunderstanding something?

Ying_H_Intel · ‎11-27-2016

Hi Bohan,

Thanks for the report. The data seems to be out of date; we should update it. The performance data is from MKL 11.3, and the lapack functions were improved in a later MKL release (MKL 2017 Update 1). We expect OpenMP performance to be almost the same as TBB, since both use the same algorithm. Which MKL version are you using, and what results did you get?

Thanks

Ying

bohan_w_ · ‎11-27-2016

Thanks for the reply! I'm currently using MKL 2017.1.143, which should be the latest. Honestly, the performance of MKL with TBB drops roughly 30% compared with MKL with OpenMP. Still, in my application the parallel structure is fixed, load-balanced and simple, so I think OpenMP may perform better in such a simple case.

Konstantin_A_Intel · ‎11-27-2016

Hi Bohan,

Thank you for the question. Let me add a bit more information.

First of all, the TBB version of MKL has never been faster than the OpenMP version in general. The example you're referring to is about a slightly different and quite specific scenario.

Please imagine that your application is based on TBB threading (or pthreads), and every thread can call OpenMP-parallelized MKL. MKL creates one thread per hardware core by default. An MKL routine is not aware of other MKL calls running in parallel from other threads of your application. As a result, too many OpenMP threads are created (oversubscription). TBB threading in MKL reduces the negative impact of oversubscription, because the TBB dynamic scheduler is aware of all TBB threads created by the application and can manage the worker threads more intelligently. So you can benefit from TBB in cases like the one I described, for example in a non-OpenMP application where you can't control how many MKL instances will be called in total or how many threads to give each new MKL routine.
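To put rough numbers on the oversubscription described above, here is a minimal sketch in plain C++. The helper names (worst_case_threads, oversubscription_ratio, report) are hypothetical and are not part of MKL, TBB or OpenMP; the sketch only does the arithmetic under the assumption that every application thread calls an MKL routine that starts its own full OpenMP team.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper, not an MKL or TBB function.
// Worst-case number of software threads when every application thread
// calls an MKL routine that starts its own OpenMP team.
long worst_case_threads(long app_threads, long mkl_threads_per_call) {
    return app_threads * mkl_threads_per_call;
}

// Hypothetical helper: how many software threads compete for each core.
double oversubscription_ratio(long total_threads, long cores) {
    assert(cores > 0);  // a machine with no cores makes no sense here
    return static_cast<double>(total_threads) / static_cast<double>(cores);
}

// Print the estimate for one configuration.
void report(long app_threads, long mkl_threads_per_call, long cores) {
    const long total = worst_case_threads(app_threads, mkl_threads_per_call);
    std::printf("%ld threads on %ld cores (%.1fx)\n",
                total, cores, oversubscription_ratio(total, cores));
}
```

For example, with an assumed 16-core machine, 8 application threads and 16 MKL threads per call, report(8, 16, 16) prints "128 threads on 16 cores (8.0x)". The real thread count depends on the runtime and may be lower; this is only the worst case.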

In your case, you are just calling MKL from an OpenMP region. By default, MKL runs sequentially if it is called inside an OpenMP parallel region (to avoid oversubscription). If you have distributed the computations well, performance will be good. If you still want to experiment and run threaded MKL, you will need to set OMP_NESTED=true, and set OMP_NUM_THREADS for your application and MKL_NUM_THREADS for MKL so that OMP_NUM_THREADS x MKL_NUM_THREADS == number of cores.
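The OMP_NUM_THREADS x MKL_NUM_THREADS rule can be sketched as a small calculation. This is only an illustration in plain C++: ThreadBudget, split_cores, fills_cores_exactly and print_settings are hypothetical names, not MKL or OpenMP API, and the sketch assumes the outer thread count divides the core count evenly.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical type, not part of MKL or OpenMP.
struct ThreadBudget {
    int omp_threads;  // value intended for OMP_NUM_THREADS (outer loop)
    int mkl_threads;  // value intended for MKL_NUM_THREADS (inside each call)
};

// True when outer x inner thread counts use each core exactly once.
bool fills_cores_exactly(const ThreadBudget &b, int cores) {
    return b.omp_threads * b.mkl_threads == cores;
}

// Give `outer` threads to the application loop and the rest to MKL.
// Assumption: `outer` is a positive divisor of `cores`.
ThreadBudget split_cores(int cores, int outer) {
    assert(cores > 0 && outer > 0 && cores % outer == 0);
    return ThreadBudget{outer, cores / outer};
}

// Print the environment settings that correspond to a budget.
void print_settings(const ThreadBudget &b) {
    std::printf("OMP_NESTED=true OMP_NUM_THREADS=%d MKL_NUM_THREADS=%d\n",
                b.omp_threads, b.mkl_threads);
}
```

For an assumed 16-core machine, split_cores(16, 4) gives 4 outer threads and 4 MKL threads per call, and print_settings shows the matching environment variables. Whether such a split beats sequential MKL inside the loop is something to measure, not assume.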

Final remark: I don't see any reason for you to use TBB, as your application is based on OpenMP.

Regards,

Konstantin

bohan_w_ · ‎11-27-2016

Thanks for the reply! I understand what you described, but I have more questions. Suppose I have three programs:

1) OpenMP for loop + OpenMP MKL

#pragma omp parallel for
for (...) {
   cblas call
}

2) TBB for loop + OpenMP MKL

tbb::parallel_for(tbb::blocked_range<int>(0, n, g), [&](const tbb::blocked_range<int> &r) {
   for (...) {
      cblas call
   }
});

3) TBB for loop + TBB MKL

tbb::parallel_for(tbb::blocked_range<int>(0, n, g), [&](const tbb::blocked_range<int> &r) {
   for (...) {
      cblas call
   }
});

If I understand correctly, you were saying that case 2) may have an oversubscription problem. On the other hand, 1) will not have an oversubscription problem, because all cblas calls are sequential by default. In my tests, I found that both 1) and 2) are much faster than 3). Is that a reasonable result?

Ying_H_Intel · ‎01-22-2017

Hi Bohan,

Sorry for missing your reply. Could you please tell us which compiler you use, how you link MKL (mkl_thread or mkl_sequential) in your test, and what the problem sizes of the loop and the cblas calls are?

As I understand it, you mentioned that in 1) and 2) all cblas calls are sequential, so only the outer loop runs in parallel, and 3) may oversubscribe CPU resources. How many threads are running when 3) runs? (You can use system tools or Intel VTune to get this information.)

Best Regards,

Ying

bohan_w_ · ‎01-25-2017

Thanks for the reply again! For case 1) and case 2), I linked the MKL library using $(MKL_LIB) (see the following). For case 3), I linked the MKL using $(MKL_TBB_LIB). Please note that the path prefix of static libraries is removed.

MKL_LIB=-Wl,--start-group libmkl_blas95_lp64.a libmkl_lapack95_lp64.a libmkl_intel_lp64.a libmkl_intel_thread.a libmkl_core.a libiomp5.a -Wl,--end-group -lpthread -lm -ldl
MKL_TBB_LIB=-Wl,--start-group libmkl_blas95_lp64.a libmkl_lapack95_lp64.a libmkl_intel_lp64.a libmkl_tbb_thread.a libmkl_core.a -Wl,--end-group -lpthread -lm -ldl

My OS is Ubuntu 16.04. I use the GNU compiler instead of the Intel compiler. My machine has two processors, both Xeon E5-2690. I did not explicitly set the number of threads; I have 32 OpenMP threads and 32 TBB threads. What result would you expect in this case?