Solved: TBB and MKL: is it worth the pain?

fbaralli · ‎11-04-2010

Hi,

I'm rewriting/re-designing a large scientific application that currently uses MKL. Parts of the processing chain can easily be parallelized, and I'm thinking of using TBB for that. However, I know that MKL and TBB do not co-exist very well, while OpenMP, even though less efficient, is probably more user friendly.

Do you have some experience with similar cases?

Thanks

F.

TimP · ‎11-04-2010

TBB seems a reasonable choice if you plan to do all your threading under C++. It covers functional along with distributed data threading in a way which some may find more versatile than OpenMP, particularly if you plan to combine with cilk+ or ArBB (formerly called Ct). It's difficult to predict what direction things will go under C++, including whether MKL would switch eventually from OpenMP to TBB compatible threading. It's still a question of each software vendor hoping to persuade people to go their way.
In the accelerator device field (CUDA, MIC, ....) more people are betting against OpenMP than with it, but there are working OpenMP and OpenMP-like implementations.
You haven't said enough about what you mean by "similar cases" for anyone to risk a judgment. OpenMP does an excellent job on many scientific applications, and is likely to continue getting support across a wider variety of platforms.

fbaralli · ‎11-04-2010

Maybe my concerns are due to a misunderstanding about the way MKL is internally parallelized:

If my application is parallelized with something other than OpenMP (e.g. TBB), I have to link it with the sequential version of the MKL libraries, meaning that I cannot take advantage of the internal parallelization within most of the MKL functions. On the other hand, if I use OpenMP for my application, I have less control over its parallelism and less efficiency, but I can use the multi-threaded MKL.

Is this correct?

F.

TimP · ‎11-04-2010

Up to now, and for the near future, MKL uses the same OpenMP support library as the corresponding Intel compilers do for OpenMP and auto-parallel. The C++ specific threading models have a different, incompatible, library, so, as you say, you would require mkl_sequential when using them. When you speak of less control, maybe you mean the auto-parallel option, which may be supported with more pragma assists in the next version.
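To make the mkl_sequential pattern concrete, here is a minimal sketch of what it implies for application structure: the application owns all the threads, and each thread calls a purely sequential kernel on its own chunk of data. It uses std::thread only so that it stays self-contained; in the case discussed here the outer loop would be a TBB construct and the kernel would be an MKL routine linked against the sequential layer. The names scale_block_sequential and scale_parallel_outer are hypothetical stand-ins, not MKL or TBB functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a sequential library kernel. In a real program
// this would be a call into MKL linked against its sequential layer.
void scale_block_sequential(double* data, std::size_t n, double factor) {
    for (std::size_t i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}

// Application-level parallelism: split the data into contiguous chunks and
// run the sequential kernel on each chunk in its own thread.
void scale_parallel_outer(std::vector<double>& data, double factor,
                          unsigned num_workers) {
    assert(num_workers > 0);
    const std::size_t n = data.size();
    const std::size_t chunk = (n + num_workers - 1) / num_workers;  // ceiling division

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(n, static_cast<std::size_t>(w) * chunk);
        const std::size_t end = std::min(n, begin + chunk);
        if (begin == end) {
            break;  // no work left for this or any later worker
        }
        workers.emplace_back(scale_block_sequential,
                             data.data() + begin, end - begin, factor);
    }
    for (auto& t : workers) {
        t.join();
    }
}
```

Because the chunks do not overlap, the workers need no locking. The trade-off is the one raised in the question: the kernel itself no longer threads internally, so any speed-up has to come from the outer decomposition.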

fbaralli · ‎11-04-2010

Tim,

thank you for your prompt reply.

To summarize, are you aware of any benchmarks of sequential vs. parallel VML and multi-dimensional FFT?

Back to "less control": I was mostly referring to the concurrent containers (e.g. hash maps) and pipelines that are available in TBB and not in OpenMP.

Chao_Y_Intel · ‎11-04-2010

Hello,

>To summarize, are you aware of any benchmarks of sequential vs. parallel VML and multi-dimensional FFT?

This paper includes performance data for the MKL DFT functions:
http://software.intel.com/sites/products/collateral/hpc/mkl/mkl_indepth.pdf

For VML performance, you may check the VML part of the MKL manual:
The actual performance depends on a number of factors, including vectorization and threading overhead. The recommended usage tips are as follows:
>On vector lengths less than 10-50, use math functions provided by Intel compilers rather than the VML functions
>On vector lengths between 10-50 to 1000-5000, use sequential VML
>On vector lengths larger than 1000-5000, use threaded VML.
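
The tips above could be turned into a small dispatch helper, sketched below. The manual gives ranges (10-50 and 1000-5000) rather than exact cutoffs, so the defaults of 50 and 5000 are assumptions meant to be tuned by measurement on the target machine. VmlPath and choose_vml_path are hypothetical names, not part of MKL.

```cpp
#include <cstddef>

// Which implementation to use for an elementwise math function.
enum class VmlPath {
    CompilerMath,   // math functions provided by the compiler
    SequentialVml,  // VML without internal threading
    ThreadedVml     // VML with internal threading
};

// Assumed cutoffs, taken from the upper ends of the quoted ranges.
// They are starting points for tuning, not documented constants.
const std::size_t kSmallCutoff = 50;
const std::size_t kLargeCutoff = 5000;

// Pick a path from the vector length n, following the three tips:
// below the small cutoff, between the cutoffs, above the large cutoff.
inline VmlPath choose_vml_path(std::size_t n,
                               std::size_t small_cutoff = kSmallCutoff,
                               std::size_t large_cutoff = kLargeCutoff) {
    if (n < small_cutoff) {
        return VmlPath::CompilerMath;
    }
    if (n <= large_cutoff) {
        return VmlPath::SequentialVml;
    }
    return VmlPath::ThreadedVml;
}
```

For example, choose_vml_path(8) selects the compiler math functions, choose_vml_path(1000) selects sequential VML, and choose_vml_path(100000) selects threaded VML.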

Thanks,
Chao