I'm working on middleware software that uses OpenMP. I'd also like to use TBB for task-level parallelism. Is there a way to have the two share a single thread pool, or at least share enough information to avoid oversubscribing resources?
See Andrey Marochko's blog "TBB initialization, termination, and resource management details, juicy and gory" for a sketch of where the Resource Management Layer (RML) is used.
For collaboration between TBB and OpenMP, you'll probably need Intel's compiler.
Actually, probably not, and this is why:
1. Somebody has downloaded the source code of TBB and has Microsoft's C/C++ compiler;
2. TBB can easily be compiled with Microsoft's C/C++ compiler;
3. Microsoft's C/C++ compiler has support for OpenMP;
4. Integration/collaboration of TBB & OpenMP in one application could be done using Microsoft's C/C++ compiler ( of course, there are some limitations, but we're not speaking about that now! );
5. The same, I mean '2', '3' and '4', could be done with any C/C++ compiler supported by TBB if the compiler supports OpenMP.
- It is not hard to integrate Win32-based multi-threaded processing with OpenMP-based multi-threaded processing;
- It is not hard to integrate NVIDIA's GPU-based multi-threaded processing with OpenMP-based multi-threaded processing;
(Edited) I have been revising this online (my mistake!), and apparently in #4 Sergey has replied to an intermediate version, so he is quoting from a version no longer here. Real-life equivalent of a data race. :-)
>>...use of RML on Windows (I don't know the details there), but does that apply to OpenMP and TBB in
>>the same application?..
I don't know.
>>...how would using TBB on an OpenMP-capable compiler imply thread coordination between
>>...Is there a standard API that I've overlooked...
In case of thread synchronization, look at MSDN or Microsoft's Platform SDK. Also, there are lots of reliable APIs developed by Microsoft. For example:
RPC ( Remote Procedure Call ) API
DDE ( Dynamic Data Exchange ) API. Note: You can't imagine how reliable it is! I love it!
OLE ( Object Linking and Embedding ) API
Win32 API functions for synchronization
MFC classes for synchronization
Memory-Mapped files ( Win32 API & MFC )
RTC ( Real-Time Communications ) API
Is there anything that could prevent me from using all of these technologies in one multi-threaded application? I used many of them in different applications and I know that all these technologies can be integrated together. Does it make sense to integrate all of them in a real-life application? It depends on the project...
PS: In one of my projects, DDE "solved" the problem of interaction between a Win32-based application and 16-bit and 32-bit DLLs. It was done using a classic client-server scheme.
>>such a hybrid program will be efficient...
It depends on a project and lots of different problems are possible.
>>...this has only been done with the Intel compiler...
Could you give me a simple example of such coordination, please? Let's speak about some practical things, not abstract ones...
Would you please first consult internally with somebody who worked on RML?
It would be very interesting to myself as well as to Stefan Seefeld to know what effort, if any, is needed in that area, in my case on GNU/Linux. From the information I have, optimal performance would require further development in this area, but it would be interesting to know what gains have been observed elsewhere.
OpenMP and TBB will share the system's hardware thread "pool", but not each application's software thread pool.
OpenMP assumes, from the top level, that all threads in its thread pool are available (IOW nothing else is running). Using static scheduling requires the availability of all of the threads of the OpenMP thread pool (or a subset during nesting of levels), whereas dynamic scheduling permits program progress despite the unavailability of one or more of the threads. (There is also preemption between the OpenMP and TBB threads, but that is a different issue.)
TBB can be thought of more like OpenMP's dynamic scheduling (work is partitioned, then any available thread can take a chunk).
Both OpenMP (dynamic) and TBB share the characteristic that once a software thread acquires a chunk/task, that thread alone will run the chunk to completion (which may also include nested levels of parallelism). Should the acquiring thread get preempted, then progress (by this software thread) is suspended until the O/S scheduler gives it another timeslice.
OpenMP can tune the number of threads per parallel region on the fly; once in the parallel region it cannot readjust the number of threads. However, you can partition the work into more pieces than you have threads (using a chunk size), thus you can get progress of the work even with one or more (but not all) of the threads preempted (to run the TBB threads or other applications).
Note now: Should OpenMP and TBB share a common thread pool, as you request, what do you suggest should happen with:
#pragma omp parallel for
How many threads should be assumed to be available? How many partitions should the loop be made into? Should the OpenMP partitioner/scheduler know in advance that TBB will require a thread just after OpenMP partitions and schedules the parallel for loop?
The availability of threads, either at the start of the loop or subsequent to loop start, is an unknown.
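One knob the OpenMP runtime itself offers for that uncertainty, if I understand the standard correctly, is dynamic adjustment of team size, which permits the runtime to hand a parallel region fewer threads than requested when the machine is busy:

```shell
# Allow the OpenMP runtime to shrink the team for a parallel region
# when threads are unavailable (fewer than the requested number may run).
export OMP_DYNAMIC=true
# The equivalent from code is omp_set_dynamic(1) before the parallel region.
```

This does not coordinate with TBB, of course; it only makes the OpenMP side less rigid about the thread count it assumes.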
You have a couple of choices:
a) Code entirely in OpenMP or TBB (or Cilk+).
b) Undersubscribe each thread pool. Not necessarily using equal numbers of threads. With/without overlap (e.g. on 8 hw thread system using 5 and 4 threads).
c) Assure that both systems set their respective spin waits (KMP_BLOCKTIME, ...) to 0. Note the spin-wait timeout can be dynamically adjusted based on which pool is more active at a given point in time.
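For the Intel OpenMP runtime, option (c) is just an environment setting (or the kmp_set_blocktime() call mentioned in the TBB notes quoted below):

```shell
# Make Intel OpenMP worker threads yield their cores immediately after a
# parallel region instead of spin-waiting, so TBB workers can run between
# OpenMP regions. From code: kmp_set_blocktime(0);
export KMP_BLOCKTIME=0
```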
d) Have your TBB portion set estimated load weights for subsequent tasks, for use by your OpenMP portion in tuning the number of threads for the next parallel region, as well as tuning the chunking values, spin waits, dynamic/static scheduling and anything else that matters.
I searched the TBB files and here are a couple of comments/notes regarding oversubscribing:
- If you are using Intel Threading Building Blocks and OpenMP*
constructs mixed together in rapid succession in the same
program, and you are using Intel compilers for your OpenMP*
code, set KMP_BLOCKTIME to a small value (e.g., 20 milliseconds)
to improve performance. This setting can also be made within
your OpenMP* code via the kmp_set_blocktime() library call. See
the Intel compiler OpenMP* documentation for more details on
KMP_BLOCKTIME and kmp_set_blocktime().
void FireUpJobs( ... )
// Experiments indicate that when oversubscribing, the main thread should wait a little
// while for the RML worker threads to do some work.
if( checker )
// Give RML time to respond to change in number of threads.
/* to deal with cases where the machine is oversubscribed; we want each thread to trip to
try_process() at least once */
/* this should not involve computing the_balance */
// Time fully subscribed run.
double t2 = TimeFindPrimes( tbb::task_scheduler_init::automatic );
// Time parallel run that is very likely oversubscribed.
double t128 = TimeFindPrimes(32); //XBOX360 can't handle too many threads
double t128 = TimeFindPrimes(128);
REMARK("TestFindPrimes: t2==%g t128=%g k=%g\n", t2, t128, t128/t2);
// We allow the 128-thread run a little extra time to allow for thread overhead.
// Theoretically, following test will fail on machine with >128 processors.
// But that situation is not going to come up in the near future,
// and the generalization to fix the issue is not worth the trouble.
if( t128 > 1.3*t2 )
Anyway, in #8 I don't see the relevance of these particular quotes.
The 20 milliseconds is curious, though: what would be the normal value, and why not set it to 0 as Jim advises?
If you want "simply" a set-and-forget number, you would run benchmarks of what you thought would be a typical load using 0, then 1, then... n ms block times (and block times may differ between OpenMP and TBB).
I would suggest a heuristic approach where one of the threading models (I think it would be easier to introduce into TBB) produces hints for both to use. E.g. (your version of) TBB could modify a variable shared with your OpenMP app indicating a preferred number of threads to use and/or block time. Call this cooperative multi-threading. This is not unlike the thread scheduler in the O/S dynamically changing priorities and/or time slices.
On the TBB side you have less flexibility, ex post facto, to change the number of threads for the next parallel_for, but you could change a tuning variable used by the iteration space partitioner. While this will not affect currently partitioned iterations, it would affect subsequent partitioning. Some cooperation is better than no cooperation.