I have discovered a problem with OpenMP and TBB, but none of the Russian Intel engineers at the Intel Software Conference 2011 could answer the question. In a really big project with lots of plugins, we have some plugins written using TBB and some plugins without TBB but with IPP or MKL. We can't redesign the plugins to avoid IPP/MKL because of the really great performance of the multithreaded versions of IPP/MKL, and sometimes we don't have the source code of the plugins.
So here is what we get inside a single process of the project: the TBB and OpenMP (from IPP/MKL) task schedulers run in parallel, and each creates N worker threads (N = the number of available hardware threads). Using 2*N threads leads to heavy oversubscription of processor resources, for example frequent thread context switches, lost cache locality, and contention for other limited hardware resources. Thus, in a call tree such as TBB -> IPP or TBB -> other OpenMP-multithreaded code, we always have heavy CPU oversubscription (the -> sign means "calls").
When will efficient coexistence and cooperation of the OpenMP and TBB task schedulers be available?
P.S. The /Qparallel compiler option generates code that uses the OpenMP task scheduler, while Cilk Plus uses the TBB task scheduler, so someone could hit the same problem there.
Do these OpenMP- and Intel TBB-powered plug-ins run simultaneously? In many plug-in environments I am familiar with, only one plug-in is active at a time. If this is the case here, there may be a workable solution. Having both pools present at the same time may cost a little space for the bookkeeping, but should not be a major problem as long as they are used disjointly. It's only when both types of plug-in are run simultaneously that the major resource conflicts occur.
At the other extreme, parallelized plug-ins that call other parallelized plug-ins of the other type should be avoided, as the two pools running simultaneously may interfere with each other. Particularly to be avoided are situations where, within a parallel section of one type of plug-in, there is a call to a component that spawns a thread pool in the other type of plug-in. The worst case here is that you'll get a multiplier effect between the thread pools and end up with many times the desired thread count (and probably very little progress).
One other concern should be mentioned: temporal adjacency. There are situations where OpenMP and Intel TBB parallelized code may be called alternately, which can be an issue because of one thing: KMP_BLOCKTIME. This is a delay during which the OpenMP pool hangs around after reaching the end of a parallel section, on the chance that another OpenMP parallel section immediately follows: spinning for a few msecs is usually faster than putting the pool to sleep and then waking it back up. However, if what immediately follows is Intel TBB code, its scheduler (and likewise the Intel Cilk runtime scheduler) will refrain from scheduling while the HW threads are busy with the OpenMP spin, resulting in a wasted gap. The simplest solution to this problem is to set KMP_BLOCKTIME to 0, but that may have a negative effect on closely following OpenMP sections--you don't want to do this if the next parallel section to follow is another OpenMP construct. If the code hierarchy has some foreknowledge of what is coming next, there are calls in the API to vary this delay dynamically, and active management of KMP_BLOCKTIME can make a difference if there are many such transitions between OpenMP and non-OpenMP thread pools.
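As a minimal sketch of this advice: KMP_BLOCKTIME and OMP_WAIT_POLICY are documented Intel/OpenMP runtime controls, but the host-application name in the comment is hypothetical, and whether 0 is the right value depends on how often OpenMP sections follow one another.

```shell
# Tell the Intel OpenMP runtime not to spin-wait at the end of a
# parallel region (the default block time is 200 ms), so hardware
# threads are released immediately for the TBB scheduler that runs next.
export KMP_BLOCKTIME=0

# A related portable knob: have idle OpenMP threads sleep ("passive")
# instead of busy-waiting ("active") between parallel regions.
export OMP_WAIT_POLICY=passive

# ./my_plugin_host   # then launch the host application (hypothetical name)
```

Setting these in the launch environment avoids touching plug-in source code, which matters here since some plug-ins ship without sources.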
I know that the problem exists only when plugins run simultaneously, and my question was exactly about that. I can't call the plugins serially because of the heavy computation load. The project often processes hundreds of gigabytes of information, so we really do need MKL and IPP and TBB at the same time and in the same place. CPU oversubscription decreases performance, and it is awful.
Depending on the circumstances, I can think of basically one way to reduce the oversubscription: limit the OpenMP thread team size--ultimately, force it single-threaded. Your description makes it sound like it's more likely an Intel TBB plug-in will be calling a possibly parallelized MKL or IPP function. If the TBB plug-in has the field, having already divided the work across the machine, limiting OpenMP thread team sizes and nesting should help curb the oversubscription and hopefully some of the thrashing.
Alternatively, if MKL and Intel TBB plug-ins are coequals, perhaps limiting the thread team at the master will leave enough headroom for the TBB scheduler to find some unencumbered threads. This is a trickier matter, though, since the relative amounts of work for OpenMP or Intel TBB plug-ins pose load balance issues, possibly dynamic ones.
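A concrete way to apply the team-size limiting described above, again via the launch environment: OMP_NUM_THREADS, MKL_NUM_THREADS, and OMP_NESTED are documented controls, while the host binary name is hypothetical.

```shell
# Force every OpenMP thread team -- including the ones MKL and IPP
# create internally -- down to a single thread, so only the outer
# TBB scheduler fans work out across the hardware threads.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1   # MKL-specific override; takes precedence over OMP_NUM_THREADS for MKL
export OMP_NESTED=FALSE    # keep nested parallel regions serial

# ./my_plugin_host         # hypothetical host application
```

If the MKL/TBB plug-ins really are coequals rather than nested, a value larger than 1 (leaving some headroom for TBB) may balance better; that trade-off is the dynamic load-balance issue mentioned above.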
I understood almost everything about why and how fast OpenMP is in a serial application from Robert Geva's talk. But in a real multithreaded application there is no guarantee that two OpenMP-based functions will not be called simultaneously. And as I expected, there is a forum thread with a similar problem (thanks to Jim).
Is it possible to update MKL and IPP (or design new versions) so that they use the TBB task scheduler? On the other hand, I could simply prohibit using OpenMP and /Qparallel.
In theory, it is possible. I understand this is not the answer you want; but since I do not work on MKL and IPP and do not know their business priorities, I can't say anything for certain.
Now my turn for a question: how much MKL/IPP performance would you be ready to lose in "I own the machine" mode in order to get better overall performance in "I am not the only one here" mode? I hope you understand what I mean :)