rnickb
Beginner
499 Views

threaded mkl: tbb vs openmp


Why is the performance for openmp so crummy in the comparison here? https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

I would have expected it to be about the same as tbb. In two of the cases it's even slower than the single-threaded version.

1 Solution

This comes from a difference in paradigms:

  • OpenMP (and algorithms developed with it) was initially designed for a non-concurrent environment, where a single parallel region is assumed to have access to all machine resources, and there it can show top performance. But when concurrency such as nested threading occurs, the nested calls know almost nothing about each other. Without special care, this results in oversubscription (context switching, cache thrashing), which reduces the efficiency of each call and the overall performance as well. With OpenMP it is more efficient to make such calls one after another, or to carefully partition the machine with affinity settings and adjust the number of threads according to the expected duration of each call.
  • 10 parallel calls to sequential MKL do not suffer from oversubscription, but the total run time is that of the largest call. Threads that complete their calls earlier wait at the final barrier doing nothing.
  • TBB is designed to handle a concurrent environment and performs dynamic rebalancing with task stealing. It suits use cases where it is hard to predict the size of a call and when it will occur. In this case it avoids oversubscription and can rebalance the workload, so that CPUs that have completed small tasks can help finish large ones. All this leads to superior performance in the concurrent case.
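
To make the three cases above concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function names, the example call costs, and the 10-core machine are all made up for illustration, and it assumes perfect scaling with zero scheduling overhead, which neither OpenMP nor TBB achieves in practice. It does not call MKL, OpenMP, or TBB.

```python
def nested_thread_demand(outer_calls, inner_threads):
    """Threads that want to run at once when every outer call
    opens its own inner thread team (the nested OpenMP case)."""
    return outer_calls * inner_threads


def oversubscription_factor(outer_calls, inner_threads, cores):
    """Runnable threads per core; values above 1.0 mean oversubscription."""
    return nested_thread_demand(outer_calls, inner_threads) / cores


def makespan_one_thread_per_call(costs):
    """Each call runs sequentially on its own thread, so the whole batch
    finishes only when the largest call does (assumes one core per call)."""
    return max(costs)


def makespan_ideal_rebalancing(costs, cores):
    """Idealized lower bound when idle cores can help finish large calls.
    Assumes work is perfectly divisible and stealing is free (hypothetical)."""
    return sum(costs) / cores


# Hypothetical workload: nine small calls and one large one, on 10 cores.
costs = [1, 1, 1, 1, 1, 1, 1, 1, 1, 11]
cores = 10

print(nested_thread_demand(10, cores))            # 100 threads on 10 cores
print(oversubscription_factor(10, cores, cores))  # 10.0
print(makespan_one_thread_per_call(costs))        # 11
print(makespan_ideal_rebalancing(costs, cores))   # 2.0
```

Under these assumptions, the sequential-MKL case is bounded by the largest call (11 time units), while a scheduler that can rebalance could in the best case approach 2 time units. Real numbers depend on the actual problem sizes and on scheduling overhead.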

Best regards,
Alexander


3 Replies
TimP
Black Belt
499 Views

MKL with OpenMP threading isn't trivial to use effectively with nested parallelism, and it looks like no effort was made to deal with it. Documented methods include MPI with MPI_THREAD_FUNNELED. TBB with appropriate affinity might well be an alternative for such problems, but they didn't explain the details.


rnickb
Beginner
499 Views

Is there any data for how TBB MKL compares to OpenMP MKL in a non-concurrent environment? Are they about the same in that case?

If so, is there any reason to use OpenMP MKL (if TBB MKL is always as good or better)?
