Efficient cooperation with TBB and OpenMP (MKL/IPP/Qparallel)

Roman · ‎10-02-2011

I have discovered a problem with OpenMP and TBB, but none of the Russian Intel engineers at the Intel Software Conference 2011 could answer my questions. In a really big project with lots of plugins, we have some plugins written using TBB and some plugins without TBB but with IPP or MKL. We can't redesign the plugins to drop IPP/MKL because of the really great performance of the multithreaded versions of IPP/MKL, and sometimes we don't have the source code of the plugins.

So, here is what we get inside the project's process: the TBB and OpenMP (from IPP/MKL) task schedulers run in parallel, and each creates N worker threads (N being the number of available hardware threads). Using N*2 threads leads to heavy overutilization of process resources: frequent thread context switches, loss of cache locality, and overutilization of other limited hardware resources. Thus, in a call tree TBB -> IPP or TBB -> other OpenMP multithreaded code (where -> means "calls"), we always have heavy CPU overutilization.

When will efficient coexistence and cooperation of the OpenMP and TBB task schedulers be available?

P.S. The /Qparallel compiler option creates code with the OpenMP task scheduler, but Cilk Plus with the TBB task scheduler, so someone can run into the same problem there.

robert-reed · ‎10-02-2011

Do these OpenMP- and Intel TBB-powered plug-ins run simultaneously? In many plug-in environments I am familiar with, only one plug-in is active at a time. If this is the case here, there may be a workable solution. Having both pools present at the same time may cost a little space for the bookkeeping, but should not be a major problem as long as they are used disjointly. It's only when both types of plug-in are run simultaneously that the major resource conflicts occur.

At the other extreme, parallelized plug-ins that call other parallelized plug-ins of the other type should be avoided, as the two pools running simultaneously may interfere with each other. Particularly to be avoided are situations where within a parallel section of one type of plug-in is a call to a component that spawns a thread pool in the other type of plug-in. The worst case here is that you'll get a multiplier effect between the thread pools and end up with many times the desired thread count (and probably very little progress).
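
To make the multiplier effect concrete, here is a minimal sketch of the arithmetic. It assumes both pools default to one thread per hardware thread and that every TBB thread entering OpenMP code gets its own full team; the struct and function names are made up for illustration, and real counts depend on the runtimes' settings.

```cpp
#include <cassert>

// Rough thread counts for the two scenarios, given n hardware threads.
// Assumption: each pool defaults to n threads.
struct ThreadCounts {
    unsigned side_by_side;  // one TBB pool running next to one OpenMP team
    unsigned nested;        // each of n TBB threads opens its own OpenMP team
};

ThreadCounts estimate_thread_counts(unsigned n) {
    assert(n > 0);
    return ThreadCounts{2 * n, n * n};
}
```

On an 8-thread machine, estimate_thread_counts(8) gives 16 threads for the side-by-side case and 64 for the nested case.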

One other concern to be mentioned: temporal adjacency. There are situations where OpenMP or Intel TBB parallelized code may be called alternately, which might be an issue because of one thing: KMP_BLOCKTIME. This is the length of time the OpenMP pool will hang around upon reaching the end of a parallel section on the chance that another OpenMP parallel section immediately follows: hanging around for a few msecs is usually faster than putting the pool to sleep and then waking it back up. However, if immediately following is some Intel TBB code, its scheduler (and likewise the Intel Cilk runtime scheduler) will refrain from scheduling while the HW threads are busy OpenMP spinning, resulting in a wasted gap. The simplest solution to this problem is to set KMP_BLOCKTIME to 0, but that may have a negative effect on closely following OpenMP sections--you don't want to do this if the next parallel section to follow is another OpenMP construct. If the code hierarchy has some foreknowledge of what is coming next, there are calls in the API to dynamically vary this delay factor, and active management of KMP_BLOCKTIME can make a difference if there are lots of such transitions between OpenMP and non-OpenMP thread pools.
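
A minimal sketch of the simplest variant, setting KMP_BLOCKTIME through the process environment. The helper name request_blocktime_ms is hypothetical, and the sketch assumes it runs before the OpenMP runtime initializes, since the runtime is expected to read the variable at startup. For changing the value later at run time, the Intel runtime offers kmp_set_blocktime(); it is left out here so the sketch builds without Intel's omp.h.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Put KMP_BLOCKTIME=<ms> into the process environment.
// Assumption: called early, before any OpenMP parallel region has run.
// Returns true if the environment was updated.
bool request_blocktime_ms(int ms) {
    assert(ms >= 0);
    const std::string value = std::to_string(ms);
#ifdef _WIN32
    return _putenv_s("KMP_BLOCKTIME", value.c_str()) == 0;
#else
    return setenv("KMP_BLOCKTIME", value.c_str(), 1) == 0;
#endif
}
```

Calling request_blocktime_ms(0) at the top of main() asks the OpenMP workers to go to sleep right after each parallel section instead of spinning.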

Roman · ‎10-02-2011

I know that the problem exists only when plugins run simultaneously, and my question was exactly about that. I can't call the plugins serially because of the heavy computation load. The project often processes hundreds of gigabytes of information, so we really do need MKL and IPP and TBB at the same time and in the same place. CPU overutilization decreases performance, and it is awful.

robert-reed · ‎10-02-2011

Depending on the circumstance, I can think of basically one way to reduce the oversubscription: limit the OpenMP thread team size--ultimately, force it single threaded. Your description makes it sound like it's more likely an Intel TBB plug-in will be calling a possibly parallelized MKL or IPP function. If the TBB plug-in has the field, having divided the work across the fabric, limiting OpenMP thread team sizes and nesting should help limit the oversubscription and hopefully some of the thrashing.
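
As a rough sketch of limiting the team size from code: the helper below (cap_openmp_team is a made-up name) uses the standard omp_set_num_threads() call and is guarded so that it also builds when OpenMP is not enabled. MKL and IPP have their own controls as well, such as mkl_set_num_threads() and ippSetNumThreads(), or the OMP_NUM_THREADS and MKL_NUM_THREADS environment variables; those are not called here so the sketch does not depend on the libraries.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Cap the team size for OpenMP parallel regions started by the calling thread.
// Returns the team size the next region may use (1 if OpenMP is not enabled).
int cap_openmp_team(int max_threads) {
    assert(max_threads >= 1);
#ifdef _OPENMP
    omp_set_num_threads(max_threads);
    return omp_get_max_threads();
#else
    (void)max_threads;  // serial build: every region runs on one thread
    return 1;
#endif
}
```

A TBB task would call cap_openmp_team(1) on its own thread before entering OpenMP-parallelized code, so that the inner region runs single threaded.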

Alternatively, if MKL and Intel TBB plug-ins are coequals, perhaps limiting the thread team at the master will leave enough headroom for the TBB scheduler to find some unencumbered threads. This is a trickier matter, though, since the relative amounts of work for OpenMP or Intel TBB plug-ins pose load balance issues, possibly dynamic ones.

Roman · ‎10-02-2011

I don't know at runtime which plugins will be dispatched to run and which of them will be using OpenMP/MKL/IPP/TBB... Thus, limiting the thread team for TBB or OpenMP will decrease overall performance (for example, when running only OpenMP plugins)...

I think the solution lies somewhere in software cooperation between the task schedulers, not in tricks.

robert-reed · ‎10-02-2011

Well, until this panacea arrives, where all task schedulers fully cooperate and give each other space, tricks may be all that's available. A fundamental choice that complicates the matter is the nature of scheduling in OpenMP: chunks are scheduled but never rescheduled--there's no notion of task stealing in OpenMP. The predetermined nature of scheduling in OpenMP means that it can schedule tasks with lower overhead than more active schedulers, but handles varying loads per chunk with less ability to load balance. Your objection to running OpenMP with less than a full thread team is a recognition of that lack of dynamism.

If there can be no a priori understanding of the balance between plug-ins, perhaps a more heuristic approach might be worth a try. It's true that with a reduced thread team, cycles may be wasted when the activities are purely OpenMP-based, but it may also reduce some of the oversubscription and inevitable thrashing during the mixed activities. The heuristic would be finding the point where these two slowing factors--undersubscription idleness and oversubscription resource thrashing--balance to achieve the best performance.
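
One way such a heuristic could be sketched is a plain sweep over candidate team sizes. Everything here is an assumption for illustration: pick_team_size is a made-up name, and measure_cost stands for a user-supplied callback that runs a representative mixed TBB/OpenMP workload with the given OpenMP team size and returns its cost (for example, elapsed seconds).

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Try every team size from 1 to max_threads and keep the cheapest one.
// measure_cost is hypothetical: it should run the mixed workload with the
// given OpenMP team size and return a cost such as elapsed seconds.
int pick_team_size(int max_threads,
                   const std::function<double(int)>& measure_cost) {
    assert(max_threads >= 1);
    int best_size = 1;
    double best_cost = std::numeric_limits<double>::infinity();
    for (int size = 1; size <= max_threads; ++size) {
        const double cost = measure_cost(size);
        if (cost < best_cost) {  // first minimum wins on ties
            best_cost = cost;
            best_size = size;
        }
    }
    return best_size;
}
```

The sweep only finds the balance point for the workload it was measured on, so it would need to be repeated when the mix of plug-ins changes noticeably.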

Alexey-Kukanov · ‎10-03-2011

Hi Roman,

As you may see, Robert tells you the same thing I told you at the conference: the semantics of the OpenMP specification do not allow good cooperation with other parallel solutions. As we sometimes say, OpenMP is not composable even with itself; it was really designed for use in serial applications owning the whole machine - the scenario where it shines.

The main problem is the OpenMP requirement of having a fixed set of threads (aka a team) in a parallel region, from the beginning to the end. Thus even if everything else is dynamic (work scheduling, the number of threads in the team), this requirement prevents the region from adjusting to CPU utilization changes. Consider a parallel region started when there was no other workload, so it took all CPUs; if another component starts more work later (no matter if OpenMP or TBB or something else is used), the parallel region cannot reduce its team size; the result is oversubscription. The opposite case is even worse: if a parallel region started when the machine was busy and only got e.g. half of all CPUs, it has to use only that many till the end; so if more CPUs become free over the course, the result is undersubscription.

No matter how hard you try to make the schedulers cooperative, you can't solve this high-level, semantic issue. As an extreme, one can try to serialize all OpenMP regions, giving the whole machine to each; but such an extreme cannot be made a default behavior, so it would require a specification extension.

And as I said, this is only one, though the biggest, issue. Another is that OpenMP regions are rarely written to have dynamic behavior; in many cases (including the default behavior) OpenMP uses static work scheduling and a pre-defined number of threads. The runtime can do nothing but comply with the requirements as set by the application (or by specification defaults). It means that not only do the schedulers need to cooperate, but the applications or libraries that use OpenMP need to make that use as cooperative as possible. Unfortunately, in many cases it might also lead to some loss of performance.
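
For illustration, a minimal sketch of an OpenMP loop written to leave the runtime as much freedom as the specification allows. The function name sum_of_squares and the chunk size of 1024 are arbitrary choices; omp_set_dynamic() and schedule(dynamic) are standard OpenMP. Even so, the team is still fixed for the duration of the region once it starts, so this only softens the problem described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Sum of squares with dynamic team sizing and dynamic work scheduling.
double sum_of_squares(const std::vector<double>& values) {
#ifdef _OPENMP
    omp_set_dynamic(1);  // allow the runtime to choose a smaller team
#endif
    double total = 0.0;
    const long n = static_cast<long>(values.size());
    // schedule(dynamic) hands out chunks on demand instead of pre-assigning
    // them; without OpenMP the pragma is ignored and the loop runs serially.
#pragma omp parallel for schedule(dynamic, 1024) reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        const double v = values[static_cast<std::size_t>(i)];
        total += v * v;
    }
    return total;
}
```

As noted above, the more dynamic settings may cost some performance when the region does have the machine to itself.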

I hope you now better understand all the complications of the issue :)

jimdempseyatthecove · ‎10-03-2011

Roman, perhaps an extension of:

http://software.intel.com/en-us/forums/showthread.php?t=86406&o=a&s=lr

might produce a workable solution.

The above link outlines a potential solution where a multi-pthread app calls a DLL that is threaded using OpenMP. When different pthreads call the DLL, multiple OpenMP thread pools are created. The technique outlines how you can attain cooperative use of the available hardware threads even with each OpenMP instance being fully subscribed.

In your case you have a slightly different situation: you have the TBB thread pool sharing the CPU resources with one MKL OpenMP thread pool. It is unclear to me whether each TBB thread entering MKL creates its own OpenMP thread pool, as it did for the user in the other forum thread. If it does, then additional expansion of the above technique may yield some improvement.

For your case (TBB pool with one or more MKL pools), try placing in your TBB code periodic calls to the function (in the above link) that determines whether the recommended team size has changed and, if so, issue a Sleep(0) or yield() or whatever works best. Also, on completion of a functional TBB task, when you note that one or more MKL sessions have started, spawn a worker task (without blocking the current TBB task) that loops in Sleep(0)/yield() until the MKL session(s) have completed.

I hope I have conveyed this concept correctly.

Jim Dempsey

Roman · ‎10-03-2011

Hi, Alexey,

I understood almost everything about why and how fast OpenMP is in a serial application from Robert Geva's speech. But in a real multithreaded application there is no guarantee that two OpenMP-based functions will not be called simultaneously. And as I expected, there is a forum thread with a similar problem (thanks to Jim).

Is it possible to update the MKL and IPP libraries (or design new ones) to use the TBB task scheduler? On my side, I can prohibit the use of OpenMP and /Qparallel.

Roman · ‎10-03-2011

Hi, Jim.

I think a solution that is transparent for developers can only come from Intel's side. I also dislike tricks with libraries. Now I know exactly why OpenMP creates lots of threads...

jimdempseyatthecove · ‎10-03-2011

Roman,

I too prefer a proper solution, but then again I do not like to wait.
This is more of a case of why your application is creating a lot of threads as opposed to a fault of OpenMP.

Jim

Alexey-Kukanov · ‎10-07-2011

Quoting Roman

Is it possible to update the MKL and IPP libraries (or design new ones) to use the TBB task scheduler?

In theory, it is possible. I understand this is not the answer you want; but since I do not work on MKL and IPP and do not know their business priorities, I can't say anything certain.

Now my turn for a question: how much MKL/IPP performance would you be ready to lose in "I own the machine" mode in order to get better overall performance in "I am not the only one here" mode? I hope you understand what I mean :)

Roman · ‎10-29-2011

I think a 2-5% loss is an acceptable price for adaptive thread planning.