Sharing a thread pool between TBB and OpenMP?

stefan_seefeld · ‎11-27-2011

Hello,

I'm working on middleware that uses OpenMP. I'd also like to use TBB for task-level parallelism. Is there a way to have the two share a single thread pool, or at least share enough information to avoid oversubscribing resources?

Thanks,
Stefan

RafSchietekat · ‎11-27-2011

Depending on your environment and/or development tools, TBB already does some of that.

See Andrey Marochko's blog "TBB initialization, termination, and resource management details, juicy and gory" for a sketch of where the Resource Management Layer (RML) is used.

For collaboration between TBB and OpenMP, you'll probably need Intel's compiler.

SergeyKostrov · ‎12-02-2011

>>...For collaboration between TBB and OpenMP, you'll probably need Intel's compiler...

Actually, probably not and this is why:

For example,

1. Somebody has downloaded the source code of TBB and has Microsoft's C/C++ compiler;

2. TBB could be easily compiled with Microsoft's C/C++ compiler;

3. Microsoft's C/C++ compiler has support for OpenMP;

4. Integration\Collaboration of TBB & OpenMP in one application could be done using Microsoft's C/C++ compiler ( of course, there are some limitations, but we're not speaking about them now! );

5. The same, I mean '2', '3' and '4', could be done with any C/C++ compiler supported by TBB if the compiler supports OpenMP.

PS:

- It is not hard to integrate Win32-based multi-threaded processing with OpenMP-based multi-threaded processing;

- It is not hard to integrate NVIDIA's GPU-based multi-threaded processing with OpenMP-based multi-threaded processing;

- Etc...

RafSchietekat · ‎12-02-2011

The question is not whether TBB and OpenMP can be used together at all, but whether such a hybrid program will be efficient in the sense that their thread models are composable, either by sharing or by another form of coordination. As far as I know, this has only been done with the Intel compiler, unless it comes indirectly with the system-wide coordination on Windows (but I don't know the details of what RML does on Windows).

(Edited) I have been revising this online (my mistake!), and apparently in #4 Sergey has replied to an intermediate version, so he is quoting from a version no longer here. Real-life equivalent of a data race. :-)

SergeyKostrov · ‎12-02-2011

In case of Windows platforms...

>>...use of RML on Windows (I don't know the details there), but does that apply to OpenMP and TBB in
>>the same application?..

I don't know.

>>...how would using TBB on an OpenMP-capable compiler imply thread coordination between
>>both models?..

Look at:
http://software.intel.com/en-us/articles/openmp-and-win32-threads-usage-example

>>...Is there a standard API that I've overlooked...

In case of thread synchronization look at MSDN or Microsoft's Platform SDK. Also, there are lots of reliable APIs developed by Microsoft. For example:

RPC ( Remote Procedure Call ) API
DDE ( Dynamic Data Exchange ) API. Note: You can't imagine how reliable it is! I love it!
OLE ( Object Linking and Embedding ) API
Win32 API functions for synchronization
MFC classes for synchronization
COM
Memory-Mapped files ( Win32 API & MFC)
WinSock API
Pipes API
RTC ( Real-Time Communications ) API

Is there anything that could prevent me from using all of these technologies in one multi-threaded application?

No.

I used many of them in different applications and I know that all these technologies could be integrated together. Does it make sense to integrate all of them in a real-life application? It depends on the project...

PS: In one of my projects DDE "solved" a problem of interaction between a Win32-based application and 16-bit and 32-bit DLLs. It was done using a classic client-server scheme.

SergeyKostrov · ‎12-02-2011

>>...The question is not whether TBB and OpenMP can be used together at all, but whether
>>such a hybrid program will be efficient...

It depends on a project and lots of different problems are possible.

>>...this has only been done with the Intel compiler...

Could you give me a simple example of such coordination, please? Let's speak about some practical things, not abstract...

RafSchietekat · ‎12-02-2011

I'm afraid that you are missing the point, unless you are asserting that the Resource Management Layer (RML) provides no added value, or unless I am mistaken about its role in keeping TBB and OpenMP threads out of each other's hair, which I think I read/heard is provided with the Intel compiler (presumably independent of O.S.) and on Windows might otherwise possibly also be an indirect effect of its systemwide role (if I'm not mistaken about that, and I don't know whether all compilers, including GNU, allow this).

Would you please first consult internally with somebody who worked on RML?

It would be very interesting to myself as well as to Stefan Seefeld to know what effort, if any, is needed in that area, in my case on GNU/Linux. From the information I have, optimal performance would require further development in this area, but it would be interesting to know what gains have been observed elsewhere.

jimdempseyatthecove · ‎12-03-2011

Stefan,

OpenMP and TBB will share the system's hardware thread "pool", but each keeps its own software thread pool.

OpenMP assumes, from the top level, that all threads in its thread pool are available (IOW nothing else is running). Using static scheduling would require the availability of each (all) of the threads of the OpenMP thread pool (or a subset during nesting of levels), whereas using dynamic scheduling will permit program progress even when one or more of the threads are unavailable. (There is also preemption between the OpenMP and TBB threads, but that is a different issue.)

TBB can be thought of more like OpenMP's dynamic scheduling (work is partitioned, then any available thread can take a chunk).

Both OpenMP (dynamic) and TBB share the characteristic that once a software thread acquires a chunk/task, that thread alone will run the chunk to completion (which may also include nested levels of parallelism). Should the acquiring thread get preempted, then progress (by this software thread) is suspended until the O/S scheduler gives it another timeslice.

OpenMP can tune the number of threads per parallel region on-the-fly. Once in the parallel region it cannot readjust the number of threads. However, you can partition the work into more pieces than you have threads (using chunk), so you still get progress on the work even with one or more (but not all) of the threads preempted (to run the TBB threads or other applications).
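To make the chunking point concrete, here is a minimal sketch, assuming a plain reduction loop as the workload. The names `pick_chunk`, `sum_of_squares` and `pieces_per_thread` are hypothetical and only serve the illustration; `schedule(dynamic, chunk)` is the standard OpenMP clause. Built without OpenMP support the pragma is ignored and the loop simply runs serially.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: choose a chunk size so that the loop is cut into
// roughly threads * pieces_per_thread pieces (never less than 1 iteration).
int pick_chunk(long long iterations, int threads, int pieces_per_thread) {
    assert(iterations >= 0 && threads >= 1 && pieces_per_thread >= 1);
    const long long pieces = static_cast<long long>(threads) * pieces_per_thread;
    return static_cast<int>(std::max<long long>(1, iterations / pieces));
}

// Sums the squares of `data` with a dynamically scheduled loop.
// With schedule(dynamic, chunk), any thread that is actually running grabs
// the next chunk, so a preempted thread holds up only the chunk it owns.
long long sum_of_squares(const std::vector<int>& data, int chunk) {
    assert(chunk >= 1);
    long long total = 0;
    const long long n = static_cast<long long>(data.size());
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+ : total)
    for (long long i = 0; i < n; ++i) {
        total += static_cast<long long>(data[i]) * data[i];
    }
    return total;
}
```

A call such as `sum_of_squares(v, pick_chunk(v.size(), omp_get_max_threads(), 8))` would then cut the loop into about eight pieces per thread. The factor 8 is an arbitrary starting point to benchmark, not a recommendation.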

Note now: Should OpenMP and TBB share a common thread pool, as you request, what do you suggest should happen with:

#pragma omp parallel for
for(...

How many threads should be assumed to be available? How many partitions should the loop be made into? Should the OpenMP partitioner/scheduler know in advance that TBB will require a thread just after OpenMP partitions and schedules the parallel for loop?

The availability of threads, either at the start of the loop or subsequent to loop start, is an unknown.

You have a couple of choices:

a) Code entirely in OpenMP or TBB (or Cilk+).

b) Undersubscribe each thread pool. Not necessarily using equal numbers of threads. With/without overlap (e.g. on an 8 hw thread system using 5 and 4 threads).

c) Assure that both systems set their respective spin waits (KMP_BLOCKTIME, ...) to 0. Note that the spin-wait timeout can be dynamically adjusted based on which pool is more active at a given point in time.

d) Have your TBB portion set estimated load weights for subsequent tasks, for use by your OpenMP portion in tuning the number of threads for the next parallel region as well as the chunking values, spin waits, dynamic/static scheduling and anything else that matters.
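Option (b) can be sketched as a small budgeting helper. This is only an illustration under stated assumptions: `ThreadBudget`, `hardware_threads` and `split_threads` are hypothetical names, and the split policy (a fixed share for OpenMP plus an allowed overlap) is one possible choice among many.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

// Hypothetical result type: how many threads each runtime may use.
struct ThreadBudget {
    int openmp_threads;
    int tbb_threads;
};

// std::thread::hardware_concurrency() may return 0 when the count is unknown,
// so fall back to 1 in that case.
int hardware_threads() {
    const unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : static_cast<int>(n);
}

// Splits hw + overlap threads between the two pools.
// openmp_share is the fraction (0..1) of that total given to OpenMP;
// overlap is how many threads the two pools may oversubscribe by.
// Each pool always gets at least one thread.
ThreadBudget split_threads(int hw, double openmp_share, int overlap) {
    assert(hw >= 1 && overlap >= 0);
    assert(openmp_share >= 0.0 && openmp_share <= 1.0);
    const int total = hw + overlap;
    int omp = static_cast<int>(total * openmp_share + 0.5);   // round to nearest
    omp = std::max(1, std::min(omp, std::max(1, total - 1))); // leave room for TBB
    const int tbb = std::max(1, total - omp);
    return ThreadBudget{omp, tbb};
}
```

With 8 hardware threads, `split_threads(8, 0.55, 1)` gives the 5 and 4 split mentioned in (b). The results would then be passed to `omp_set_num_threads()` on the OpenMP side and to the `tbb::task_scheduler_init` constructor on the TBB side.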

Jim Dempsey

SergeyKostrov · ‎12-05-2011

>>...or at least share enough information to avoid oversubscribing resources?..

I searched the TBB files and here are a couple of comments\notes regarding oversubscribing:

..\TBB40\Doc\Release_Notes.txt

...
- If you are using Intel Threading Building Blocks and OpenMP*
constructs mixed together in rapid succession in the same
program, and you are using Intel compilers for your OpenMP*
code, set KMP_BLOCKTIME to a small value (e.g., 20 milliseconds)
to improve performance. This setting can also be made within
your OpenMP* code via the kmp_set_blocktime() library call. See
the Intel compiler OpenMP* documentation for more details on
KMP_BLOCKTIME and kmp_set_blocktime().
...

..\TBB40\Src\Rml\Test\test_rml_tbb.cpp

...
void FireUpJobs( ... )
{
...
// Experiments indicate that when oversubscribing, the main thread should wait a little
// while for the RML worker threads to do some work.
if( checker )
{
// Give RML time to respond to change in number of threads.
MilliSleep(1);
...
}
...
}
...

..\TBB40\Src\Rml\Server\rml_server.cpp

...
#if !RML_USE_WCRM
/* to deal with cases where the machine is oversubscribed; we want each thread to trip to
try_process() at least once */
/* this should not involve computing the_balance */
...

..\TBB40\Src\Test\test_concurrent_vector.cpp

...
void TestFindPrimes()
{
// Time fully subscribed run.
double t2 = TimeFindPrimes( tbb::task_scheduler_init::automatic );

// Time parallel run that is very likely oversubscribed.
#if _XBOX
double t128 = TimeFindPrimes(32); //XBOX360 can't handle too many threads
#else
double t128 = TimeFindPrimes(128);
#endif
REMARK("TestFindPrimes: t2==%g t128=%g k=%g\n", t2, t128, t128/t2);

// We allow the 128-thread run a little extra time to allow for thread overhead.
// Theoretically, following test will fail on machine with >128 processors.
// But that situation is not going to come up in the near future,
// and the generalization to fix the issue is not worth the trouble.
if( t128 > 1.3*t2 )
{
...
}
}

RafSchietekat · ‎12-05-2011

Hmm, strange, where did I get the idea that Sergey is from Intel... and why did somebody give #2 a 5-star rating?

Anyway, in #8 I don't see the relevance of these particular quotes.

The 20 milliseconds is curious, though: what would be the normal value, and why not set to 0 as Jim advises?

jimdempseyatthecove · ‎12-05-2011

>>The 20 milliseconds is curious, though: what would be the normal value, and why not set to 0 as Jim advises?

If you want "simply" a set-and-forget number, you would run benchmarks of what you thought would be a typical load using 0, then 1, then... n ms block times (where block times may differ between OpenMP and TBB). I would suggest a heuristic approach in which one of the thread models (I think it would be easier to introduce into TBB) produces hints for both to use. E.g. (your version of) TBB could modify a variable shared with your OpenMP app, indicating a preferred number of threads to use and/or block time. Call this cooperative multi-threading. This is not unlike the thread scheduler in the O/S dynamically changing priorities and/or time slices. On the TBB side you have less flexibility, ex-post-facto, to change the number of threads for the next parallel_for, but you could change a tuning variable used by the iteration space partitioner. While this will not affect currently partitioned iterations, it would affect subsequent partitioning. Some cooperation is better than no cooperation.
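A minimal sketch of such a shared hint variable follows. It assumes both runtimes live in the same process and that the TBB side can estimate how many of its threads are busy. `SchedulingHints`, `publish_hints` and `threads_for_next_region` are hypothetical names, and the block-time values 0 and 20 ms are placeholders taken from this thread, not tuned numbers.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>

// Hypothetical hint block: written by the TBB side, read by the OpenMP side.
struct SchedulingHints {
    std::atomic<int> preferred_omp_threads{0};   // 0 means "no hint published yet"
    std::atomic<int> preferred_block_time_ms{0};
};

// Called from the TBB side whenever its estimated load changes.
void publish_hints(SchedulingHints& h, int hw_threads, int busy_tbb_threads) {
    assert(hw_threads >= 1 && busy_tbb_threads >= 0);
    // Leave OpenMP whatever TBB is not using, but never less than one thread.
    const int free_threads = std::max(1, hw_threads - busy_tbb_threads);
    h.preferred_omp_threads.store(free_threads, std::memory_order_relaxed);
    // Assumed heuristic: do not spin while the other pool is busy.
    h.preferred_block_time_ms.store(busy_tbb_threads > 0 ? 0 : 20,
                                    std::memory_order_relaxed);
}

// Called from the OpenMP side just before opening the next parallel region.
// Falls back to default_threads while no hint has been published.
int threads_for_next_region(const SchedulingHints& h, int default_threads) {
    const int hint = h.preferred_omp_threads.load(std::memory_order_relaxed);
    return hint > 0 ? std::min(hint, default_threads) : default_threads;
}
```

On the OpenMP side the result could feed the standard `num_threads` clause, e.g. `#pragma omp parallel for num_threads(threads_for_next_region(hints, omp_get_max_threads()))`. As noted above, this only influences the next region; a region already running keeps its thread count.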

Jim Dempsey
