Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Adapting worker threads count during execution

Anastopoulos__Nikos

Hi, 

I have an application with multiple parallel regions (with serial portions in between), each having different characteristics in terms of scalability (i.e. some scale strongly, other weakly, etc.), and thus different requirements in terms of optimal number of cores to use. Is it possible in TBBs to dynamically change the number of worker threads for a specific parallel region? Would the creation of multiple task_scheduler_init objects in different scopes (i.e. "{ }" blocks) work? Ideally, I would like to avoid the continuous creation/destruction of worker threads (due to the relatively large overhead), and employ a less disruptive scheme to suspend/resume workers on demand. 

Thanks in advance, 

Nick

SergeyKostrov
Valued Contributor II
>>...Would the creation of multiple task_scheduler_init objects in different scopes (i.e. "{ }" blocks) work?..

I found a statement in the TBB documentation that says: "...A thread may construct multiple task_scheduler_inits...". However, I haven't tested this myself.

Best regards,
Sergey
RafSchietekat
Valued Contributor III
It is possible to let an application thread use different degrees of parallelism by way of scoped task_scheduler_init instances. Note that only the outermost instance on the stack counts (unless I missed something), and this may be an implicit instance arising from any use of certain TBB scheduler-related features (things like atomics or passive containers don't count, but try to avoid relying on implicit initialization anyway). That means you should have one application thread start the TBB work from blocks with their own specific task_scheduler_init instances, and another thread keep the scheduler features alive for reuse, including the TBB thread pool. It is probably easiest to do the latter in the main thread and the former in an explicitly created thread. Please let us know whether this works out well for you.
Anastopoulos__Nikos
Thank you both for your answers! It seems that Raf's solution (if I have understood it correctly) works. If I try to create multiple task_scheduler_init instances within the same (main) thread, the number of total worker threads created is determined by the argument passed to the first instance. E.g., in the following scenario:

[cpp]
tbb::task_scheduler_init init(nthreads);
{
    tbb::task_scheduler_init init1(nthreads1);
    //tbb::parallel_for
}
{
    tbb::task_scheduler_init init2(nthreads2);
    //tbb::parallel_for
}
{
    tbb::task_scheduler_init init3(nthreads3);
    //tbb::parallel_for
}
[/cpp]

each parallel region will always execute in time proportional to that of "nthreads" workers. However, by employing some kind of nested parallelism like this:

[cpp]
#pragma omp parallel
{
    #pragma omp sections
    {
        //dummy section -- hopefully sleeps politely until the other section finishes
        #pragma omp section
        {
            tbb::task_scheduler_init init(1);
        }

        //the "useful" section, corresponding to application code
        #pragma omp section
        {
            {
                tbb::task_scheduler_init init1(nthreads1);
                //tbb::parallel_for 1
            }
            {
                tbb::task_scheduler_init init2(nthreads2);
                //tbb::parallel_for 2
            }
            {
                tbb::task_scheduler_init init3(nthreads3);
                //tbb::parallel_for 3
            }
        }
    }
}
[/cpp]

each one of the three parallel_for's executes with the worker threads it requests within its block. Anyway, it would be a good feature if a future version of TBB could support a more elegant way to accomplish this kind of malleability, as OpenMP does with the omp_set_num_threads function.
RafSchietekat
Valued Contributor III
The "however" code does not meaningfully differ from the code above it: it's just four separate blocks, without any task_scheduler_init scope nesting. I'll leave the interpretation of the OpenMP pragmas and their effect on TBB to others, though. (Added 2012-11-19) Sorry, it seems I missed the point, so please ignore that... I'll use as an excuse that the provided code doesn't do what its author intended: when the first block exits, the task_scheduler_init instance immediately disappears, and after that it doesn't matter whether this section's thread lasts as long as that of the other section or not. I don't know enough about OpenMP to confidently suggest another way to emulate what I've suggested above (using explicit threads).
jimdempseyatthecove
Honored Contributor III
Why not manage the number of threads per region of your code by managing the number of tasks generated to execute those regions? Example: create a partitioner that places an upper limit on the number of partitions.

Jim Dempsey
Anastopoulos__Nikos
jimdempseyatthecove wrote:

Why not manage the number of threads per region of your code by managing the number of tasks generated to execute those regions? Example: create a partitioner that places an upper limit on the number of partitions.

Jim Dempsey

I would like to maintain the so-called "parallel slackness" property (tasks >> workers) that libraries such as Cilk or TBB implement, and which in turn guarantees proper load balancing. I am not sure that the solution you propose would maintain it. In any case, my applications do not only include parallel skeletons with partitionable iteration spaces (e.g. parallel_for), but also other constructs such as raw tasks.
jimdempseyatthecove
Honored Contributor III
A different option (hack) is to temporarily remove some threads from the idle thread pool by having those threads wait for an event or condition variable. Once the region that you want to run with a diminished thread count completes, you set the event/condition variable. This does result in those threads not being available for other tasks; there are ways to work around that too.

[cpp]
parallel_invoke(
    [](){ WaitForSingleEvent(...); },  // remove 1st thread
    [](){ WaitForSingleEvent(...); },  // remove 2nd thread
    [](){
        parallel_for(...);             // do work
        SetEvent(...);                 // release waiting threads
    }
); // parallel_invoke
[/cpp]

Jim Dempsey
RafSchietekat
Valued Contributor III
I'm afraid that blocking a thread may trap some work on the stack, with possibly "unwelcome" results.
Anastopoulos__Nikos
Raf Schietekat wrote:

I'm afraid that blocking a thread may trap some work on the stack, with possibly "unwelcome" results.

OK. Here is another question: I was browsing the TBB source code to find functionality related to adding/removing workers. I came across the following function in market.h:

[cpp]
//! Request that arena's need in workers should be adjusted.
/** Concurrent invocations are possible only on behalf of different arenas. **/
void adjust_demand ( arena&, int delta );
[/cpp]

which seems to end up waking (or launching) "delta" extra workers (unfortunately, as far as I understand, "delta" cannot be negative, which would be needed to implement parallelism shrinkage). For the case where more parallelism is needed, would it be a "proper" solution to call this library routine from user code, or would it lead to unexpected behaviour?
RafSchietekat
Valued Contributor III
Unless it is documented, it is liable to change in a new release, especially in areas that have been discussed recently. See my first reaction in this thread for my still-current best suggestion on how to handle this situation.