Intel® oneAPI Math Kernel Library

Threaded MKL: TBB vs OpenMP

rnickb
Beginner

Why is the performance for OpenMP so crummy in the comparison here? https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

I would have expected it to be about the same as TBB. In two of the cases it's even slower than the single-threaded version.

Alexander_K_Intel3

This comes from a difference in paradigms:

  • OpenMP (and the algorithms developed with it) was originally designed for a non-concurrent environment, where a single parallel region is assumed to have access to all machine resources; in that setting it can deliver top performance. But when concurrency such as nested threading occurs, the nested calls know almost nothing about each other. Without special care, this results in oversubscription (context switching, cache thrashing), which reduces the efficiency of each call and the overall performance as well. With OpenMP it is more efficient to run such calls one after another, or to partition the machine carefully with affinity settings and adjust the number of threads according to the expected duration of each call.
  • 10 parallel calls to sequential MKL do not suffer from oversubscription, but the total run time is that of the largest call. Threads that complete their calls earlier wait at the final barrier doing nothing.
  • TBB is designed to handle a concurrent environment and performs dynamic rebalancing through task stealing. It suits use cases where it is hard to predict the size of a call and when the call will occur. In this case it avoids oversubscription and can rebalance the workload, so that cores that have completed small tasks can help finish the large ones. All of this leads to superior performance in the concurrent case.

Best regards,
Alexander

TimP
Honored Contributor III

MKL's OpenMP threading isn't trivial to use effectively with nested parallelism, and it looks like no effort was made to deal with that in the comparison. Documented methods include MPI with thread-funneled support (MPI_THREAD_FUNNELED). TBB with appropriate affinity might well be an alternative for such problems, but the article didn't explain the details.

rnickb
Beginner

Is there any data for how TBB MKL compares to OpenMP MKL in a non-concurrent environment? Are they about the same in that case?

If so, is there any reason to use OpenMP MKL (if TBB MKL is always as good or better)?
