Intel® oneAPI Threading Building Blocks

Changing concurrency level at runtime

One of the main strengths of task-based programming models is their ability to efficiently handle fine-grain parallelism. This enables applications to get good performance on a variety of platforms, configurations, and even different runs, in a way that is transparent to the programmer. In other words, applications can see improvements from each additional core, without additional programming effort.

In the case of TBB, the concurrency level is controlled by means of task_scheduler_init objects. Once such an object is created, the concurrency level of parallel algorithms that run in the same context remains fixed until the object is destroyed. As a result, however, an application cannot benefit from cores that become available dynamically, e.g. after another application terminates or throttles its parallelism. Similarly, when an application is running with all the hardware threads at its disposal and another one comes in, the former cannot shrink its concurrency for better co-existence.
My question is: how easy would it be to extend the current TBB task scheduler to support dynamic deactivation/activation of worker threads in a specific arena? My guess is that (e.g. in the deactivation case) it would suffice to stop a specific worker from further executing local tasks or stealing remote ones, and to exclude its queue from accepting new tasks. Of course, the scheduler would have to be augmented with information about which workers are online/offline at any given time. Would that be enough? And do you think the extra bookkeeping would incur significant overhead?

3 Replies

The task_scheduler_init object controls the number of worker threads. Yes, this number does not change until the object is destroyed. But the actual concurrency level is not the total number of worker threads; it is the number of threads active at the moment.
E.g. on an 8-core machine, task_scheduler_init creates 7 worker threads (one core is left for the master thread). This is the default value and may be changed by the developer. Even if the CPU is busy with other threads and applications, 7 threads will still be created. Some of them will execute, if they have tasks and get CPU time from the OS. Others will wait inactively.
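As a rough illustration of the default described above, the arithmetic can be sketched with the standard library alone. The helper name workers_for is hypothetical and not part of TBB; it only mirrors the "hardware threads minus one for the master thread" rule stated in this reply, and treats an unknown core count as zero workers.

```cpp
#include <thread>

// Hypothetical helper, not a TBB function: number of worker threads
// under the "one core is left for the master thread" rule.
inline unsigned workers_for(unsigned hardware_threads) {
    // 0 means "unknown" for std::thread::hardware_concurrency(),
    // so fall back to no extra workers in that case.
    return hardware_threads > 1 ? hardware_threads - 1 : 0;
}

// Same rule applied to the machine the program is running on.
inline unsigned default_workers_here() {
    return workers_for(std::thread::hardware_concurrency());
}
```

On the 8-core machine from the example, workers_for(8) gives 7.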
The number of currently executing threads changes dynamically. It adapts to changes in CPU availability in much the same way you're describing. This is because of task-based parallelism: tasks can be load-balanced between workers. There is no need to create and remove workers each time; it's enough for some of them to become inactive when the CPU is busy with other work. Having some threads in a waiting state doesn't add much overhead.
So it works almost the same way you described.

I think the question was how things behave in the presence of threads other than TBB's own, in the same program or in others. There is no general solution for this yet (you'll have to be lucky enough to be in a situation that's already supported or dive deep into the code). Note that adding worker threads is (conceptually) easier than borrowing one for other purposes while none are currently truly idle (not waiting for anything).
Yes, what I have in mind is closer to what Raf said. I assume an environment where the number of processors available to a TBB application may change over time (basically due to co-execution of other applications, but also in other, more "extreme" scenarios such as processor faults, etc.). The application itself should be able to dynamically adjust its concurrency in the most efficient way, meaning that at any time it should use exactly as many workers as there are available processors, avoiding both over- and undersubscription. Generally speaking, the ideal case would be an intermediate layer between the TBB runtime and the OS to manage concurrency and hardware affinity (processor and memory) for all simultaneously running TBB apps. This would allow efficient space sharing between them (plus fairness, prioritization possibilities, etc.).

The naive solution that first came to my mind was to initialize each application's task scheduler with the maximum (meaningful) number of workers (i.e. hardware concurrency - 1), "quiescing" some of them whenever the application must shrink its concurrency, and recruiting them back when it should increase it. No worker movement across apps, no dynamic worker creation/destruction. Strictly speaking, this scheme involves worker oversubscription. But essentially, only those workers that have their local queues active are counted; the idle ones may be silently suspended in the kernel, without incurring overhead while waiting. Would the implementation of such a scheme be as straightforward as it seems?
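To make the scheme concrete, here is a minimal standard-library sketch of the quiesce/recruit idea. It is not TBB code and says nothing about how the TBB scheduler is actually organized: ResizablePool, set_active_limit and the single shared queue are all assumptions made for illustration (TBB uses per-worker queues with stealing). All workers are created up front; those whose id is at or above the current limit simply block on a condition variable in the kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration only; not part of TBB.
class ResizablePool {
public:
    using Task = std::function<void(std::size_t /*worker id*/)>;

    explicit ResizablePool(std::size_t max_workers) : active_limit_(max_workers) {
        assert(max_workers > 0);
        for (std::size_t id = 0; id < max_workers; ++id)
            workers_.emplace_back([this, id] { run(id); });
    }

    ~ResizablePool() {
        { std::lock_guard<std::mutex> lock(mutex_); stopping_ = true; }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }

    void submit(Task task) {
        { std::lock_guard<std::mutex> lock(mutex_); tasks_.push_back(std::move(task)); }
        wake_.notify_all();
    }

    // Workers with id >= limit are quiesced; the limit is clamped to
    // [1, pool size] so queued work can always drain.
    void set_active_limit(std::size_t limit) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            active_limit_ = std::max<std::size_t>(1, std::min(limit, workers_.size()));
        }
        wake_.notify_all();  // recruits workers that just came back online
    }

    std::size_t active_limit() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return active_limit_;
    }

private:
    void run(std::size_t id) {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            // An offline worker sleeps here even if tasks are pending.
            wake_.wait(lock, [&] {
                return stopping_ || (id < active_limit_ && !tasks_.empty());
            });
            if (id >= active_limit_ || tasks_.empty()) return;  // only when stopping
            Task task = std::move(tasks_.front());
            tasks_.pop_front();
            lock.unlock();
            task(id);
            lock.lock();
        }
    }

    mutable std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<Task> tasks_;
    std::size_t active_limit_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

In this sketch the bookkeeping is a single integer compared against the worker id under a lock that the worker already holds, so shrinking or growing costs one store and a wake-up. A worker that is in the middle of a task when the limit drops finishes that task first; whether that is acceptable, and how the idea maps onto per-worker queues and stealing, is exactly the open part of the question.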