Solved: Threaded MKL: TBB vs OpenMP

rnickb · ‎07-04-2015

Why is the performance for OpenMP so crummy in the comparison here? https://software.intel.com/en-us/articles/using-intel-mkl-and-intel-tbb-in-the-same-application

I would have expected it to be about the same as TBB. In two of the cases it's even slower than the single-threaded version.

Alexander_K_Intel3 · ‎07-06-2015

This originates from a difference in paradigms:

OpenMP (and algorithms developed with it) was initially designed for a non-concurrent environment, where a single parallel region is assumed to have access to all machine resources, and there it is capable of showing top performance. But if concurrency such as nested threading occurs, the nested calls know almost nothing about each other. Without special care, this results in oversubscription (context switching, cache thrashing), which reduces the efficiency of each call and the overall performance as well. With OpenMP it is more efficient to make these calls one after another, or to carefully partition the machine with affinity settings and adjust the number of threads depending on the expected timing of each call.
10 parallel calls to sequential MKL do not suffer from oversubscription, but the whole region runs for as long as the largest call. Threads that complete their calls earlier wait at the final barrier doing nothing.
TBB is designed to handle a concurrent environment and performs dynamic rebalancing with task stealing. It suits use cases where it is hard to predict the size of a call and when such a call will occur. In this case, it avoids oversubscription and can rebalance the workload, so that physical CPUs that have completed small tasks can help finish large tasks. All this leads to superior performance for the concurrent case.
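
To make the last two points concrete, here is a toy cost model in C++. It is only a sketch under stated assumptions: the function names are hypothetical, the run times are invented inputs rather than measurements, and the rebalancing case assumes work is perfectly divisible with no stealing overhead, which real TBB will not achieve exactly.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Toy model (assumption, not a benchmark): each entry of `work` is the
// sequential run time of one MKL call, in arbitrary time units.

// One thread per call with sequential MKL and no rebalancing: threads that
// finish early wait at the final barrier, so the region lasts as long as
// the largest call.
double makespan_one_thread_per_call(const std::vector<double>& work) {
    return work.empty() ? 0.0 : *std::max_element(work.begin(), work.end());
}

// Idealized dynamic rebalancing: idle workers help with the remaining work.
// Assuming perfectly divisible work and zero overhead, the region time
// approaches total work divided by the number of workers (a lower bound).
double makespan_ideal_rebalancing(const std::vector<double>& work, int workers) {
    if (work.empty() || workers <= 0) return 0.0;
    const double total = std::accumulate(work.begin(), work.end(), 0.0);
    return total / workers;
}
```

For example, with nine calls of 1 time unit and one call of 11 units on 10 workers, the first function gives 11 units while the idealized rebalancing bound is 2 units. The gap shrinks as the calls become more uniform in size.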

Best regards,
Alexander

TimP · ‎07-05-2015

MKL's OpenMP threading isn't trivial to use effectively with nested parallelism, and it looks like no effort was made to deal with it. Documented methods include MPI thread funneled (MPI_THREAD_FUNNELED). TBB with appropriate affinity might well be an alternative for such problems, but they didn't explain the details.
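
As a rough illustration of why nested parallelism needs care, the sketch below counts threads for a nested setup and splits a core budget across concurrent calls. The helper names are hypothetical, and an even split is only one possible policy; it ignores affinity and differences in call size.

```cpp
#include <algorithm>

// Hypothetical helper: threads in flight when `outer_calls` run concurrently
// and each call starts its own team of `inner_threads`.
int total_threads(int outer_calls, int inner_threads) {
    return outer_calls * inner_threads;
}

// Oversubscription in this simple model: more runnable threads than cores.
bool is_oversubscribed(int cores, int outer_calls, int inner_threads) {
    return total_threads(outer_calls, inner_threads) > cores;
}

// Assumed policy: split the cores evenly across the concurrent calls,
// giving every call at least one thread.
int inner_threads_per_call(int cores, int outer_calls) {
    if (outer_calls <= 0) return std::max(cores, 1);
    return std::max(1, cores / outer_calls);
}
```

On a 10-core machine, 10 concurrent calls that each start 10 threads give 100 threads, which is oversubscribed in this model, while the even split yields 1 thread per call. The resulting count is the kind of value one might pass to `omp_set_num_threads` or `mkl_set_num_threads_local` before each call.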

rnickb · ‎08-08-2015

Is there any data for how TBB MKL compares to OpenMP MKL in a non-concurrent environment? Are they about the same in that case?

If so, is there any reason to use OpenMP MKL (if TBB MKL is always as good or better)?