Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Number of threads used by Intel TBB

Constantin_Christman

How does Intel TBB choose the number of threads to use for a parallel section?

Is there some kind of specification available?

All I found was:
"The scheduler tries to avoid oversubscription, by having one logical thread per physical thread, and mapping tasks to logical threads, in a way that tolerates interference by other threads from the same or other processes."

But how many physical threads are created by TBB? Always the number of available cores?
Alexey-Kukanov
Employee
If you use TBB with default settings, then yes, it's always the number of available cores (as reported by the underlying OS).
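
To make "the number of available cores as reported by the OS" concrete, here is a minimal sketch that asks for the hardware thread count using only the C++ standard library. It is a stand-in for illustration, not TBB's own mechanism, and `default_worker_count` is a hypothetical helper name.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: the worker count a scheduler might pick by default.
// std::thread::hardware_concurrency() reports the number of hardware threads
// the OS exposes; it may return 0 when the value cannot be determined.
unsigned default_worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1u : n;  // fall back to a single worker
}
```

Note that this value counts logical processors, so it can exceed the number of physical cores on machines with simultaneous multithreading.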

Constantin_Christman
OpenMP offers an OMP_DYNAMIC environment variable, which lets the number of threads be decided based on the load of the system.

How this is achieved is implementation dependent; for example, the gcc implementation uses getloadavg() as a heuristic to determine the optimal number of threads...

Just to clarify, TBB doesn't contain such a heuristic, load-dependent mechanism?
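
For readers unfamiliar with load-based thread counts, the idea can be sketched as below. This is a hypothetical heuristic written for illustration, not what gcc's runtime actually computes: `threads_for_load` and its rounding rule are assumptions, and the load value is passed in as a parameter rather than read from the system.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical heuristic: give the application the cores that the current
// load average leaves free, but never fewer than one thread.
// `load_average` could come from getloadavg() on systems that provide it.
int threads_for_load(int cores, double load_average) {
    assert(cores > 0 && load_average >= 0.0);
    long busy = std::lround(load_average);  // cores assumed busy elsewhere
    long free_cores = cores - busy;
    return static_cast<int>(std::max(1L, free_cores));
}
```

With 8 cores and a load average of 3.2, this sketch returns 5; on a heavily loaded 4-core machine (load 10.0) it bottoms out at 1.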




Alexey-Kukanov
Employee
For the moment, it does not.
And for TBB it is much less important, because it does not have OpenMP's "team and barrier" semantics, which require a certain number of threads to process a parallel region. Basically, OMP_DYNAMIC is a halfway escape from this prison: it allows the implementation to choose the number of threads in the team; however, if the system load changes during the course of an OpenMP parallel region, the team size cannot be adjusted.
In contrast, TBB parallel constructs do not depend on how many threads execute them, and the number of threads working on the region can change on the fly. Therefore, TBB is much more tolerant of oversubscription of the system, provided that the OS does its job of (somewhat) fair scheduling of threads.
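
The point that the result does not depend on how many threads take part can be sketched with the standard library alone. This is a deliberately simplified model (a shared counter from which idle workers claim chunks), not TBB's actual work-stealing scheduler; `sum_indices` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sums the indices 0..n-1. Work is split into chunks that idle workers claim
// from a shared counter, so no chunk is tied to a particular thread.
long long sum_indices(std::size_t n, unsigned workers, std::size_t chunk) {
    assert(workers > 0 && chunk > 0);
    std::atomic<std::size_t> next{0};
    std::atomic<long long> total{0};

    auto work = [&] {
        for (;;) {
            std::size_t begin = next.fetch_add(chunk);  // claim the next chunk
            if (begin >= n) return;                     // nothing left to do
            std::size_t end = std::min(begin + chunk, n);
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i)
                local += static_cast<long long>(i);
            total += local;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) pool.emplace_back(work);
    for (auto& t : pool) t.join();
    return total.load();
}
```

Calling `sum_indices(100, 1, 7)` and `sum_indices(100, 16, 3)` both give 4950. A worker that is preempted or delayed simply claims fewer chunks; no other worker has to wait for it at a barrier.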

Constantin_Christman

Hi Alexey,

Sorry if I misinterpret your post, but you seem to be saying opposite things :)

You said that TBB doesn't have a heuristic to determine the number of threads based on the system load, but then you said:

"the number of threads working on the region can change on the fly"

This would mean that TBB does indeed have a mechanism to prevent overloading the system.

How does TBB decide when to change the number of threads working on a region?

Cheers,

Constantin

ARCH_R_Intel
Employee
There are no regions in TBB. Hence no decision is necessary.

From an OpenMP programmer's viewpoint, TBB parallel constructs like parallel_for act somewhat like a region. But this is only a way of trying to force TBB into an OpenMP view of the world. Internally, TBB has no notion of parallel regions. There are just tasks that are ready to execute. As long as there are sufficient tasks, the machine stays busy.
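
A toy model of "just tasks that are ready to execute" might look like the following. It is a sketch under stated assumptions, not TBB's implementation: `TaskPool`, `sum_task`, and `sum_with_tasks` are hypothetical names, and a single mutex-protected queue stands in for TBB's per-thread work-stealing deques.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal task pool: no regions, only a queue of ready tasks.
class TaskPool {
public:
    using Task = std::function<void(TaskPool&)>;

    // Make a task ready to run; a running task may call this to spawn more.
    void spawn(Task task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_.push_back(std::move(task));
            ++unfinished_;
        }
        wake_.notify_one();
    }

    // Run until every task, including spawned ones, has finished.
    void run(unsigned workers) {
        std::vector<std::thread> threads;
        for (unsigned i = 0; i < workers; ++i)
            threads.emplace_back([this] { worker_loop(); });
        for (auto& t : threads) t.join();
    }

private:
    void worker_loop() {
        for (;;) {
            Task task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return !ready_.empty() || unfinished_ == 0; });
                if (ready_.empty()) return;  // every task has finished
                task = std::move(ready_.front());
                ready_.pop_front();
            }
            task(*this);  // run the task outside the lock
            bool all_done;
            {
                std::lock_guard<std::mutex> lock(mutex_);
                all_done = (--unfinished_ == 0);
            }
            if (all_done) wake_.notify_all();  // release idle workers
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<Task> ready_;
    std::size_t unfinished_ = 0;  // tasks that are queued or running
};

// Recursively split [lo, hi) into tasks; leaves add their part to `total`.
void sum_task(TaskPool& pool, long lo, long hi, std::atomic<long>& total) {
    if (hi - lo <= 16) {
        long local = 0;
        for (long i = lo; i < hi; ++i) local += i;
        total += local;
        return;
    }
    long mid = lo + (hi - lo) / 2;
    pool.spawn([=, &total](TaskPool& p) { sum_task(p, lo, mid, total); });
    pool.spawn([=, &total](TaskPool& p) { sum_task(p, mid, hi, total); });
}

long sum_with_tasks(long n, unsigned workers) {
    assert(workers > 0);
    TaskPool pool;
    std::atomic<long> total{0};
    pool.spawn([n, &total](TaskPool& p) { sum_task(p, 0, n, total); });
    pool.run(workers);
    return total.load();
}
```

`sum_with_tasks(1000, 1)` and `sum_with_tasks(1000, 4)` both return 499500. Nothing in the task code refers to a team size; workers only ever ask whether a task is ready.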

Constantin_Christman
"Internally, TBB has no notion of parallel regions. There are just tasks that are ready to execute. As long as there are sufficient tasks, the machine stays busy."

Yes, I understand that, but assume the following situation:
My application, parallelized with TBB, runs concurrently with another application on the system, and the other app is occupying half of the available cores.
Now if my app creates many TBB tasks, they are executed on logical TBB threads.
These logical threads are scheduled together with the threads of the other application onto the cores, which may overload the system, especially if TBB always uses N logical threads on a system with N cores.
My question is: does TBB vary the number of logical threads it uses to prevent such overloading?
This was the reason for my reference to OpenMP, as it does offer such a load-dependent heuristic.

ARCH_R_Intel
Employee
There's currently no such heuristic in the TBB implementation. Because TBB does not use a team/barrier model, we believe it is much less sensitive to oversubscription of the system.

Coming up with a good heuristic, and showing that it works better than doing nothing, would be an interesting avenue of research. These sorts of feedback loops are notoriously difficult to stabilize.

Constantin_Christman
Ok, I understand. Thanks for your replies!

jimdempseyatthecove
Honored Contributor III
Both OpenMP and TBB will "suffer" (if that is the right word for it) the consequences of system oversubscription. When OpenMP runs a parallel for with dynamic scheduling, it behaves similarly to the TBB task model: the iteration space is partitioned and then consumed by threads on a first-come, first-served basis. With OpenMP static scheduling, the partitions are assigned to the threads assumed to be available (the entire thread pool or the explicit number requested). TBB's partitioning defaults are similar. Both have a means to alter the partitioning.
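
To illustrate what assigning partitions to assumed-available threads means, here is a minimal sketch of static block partitioning. It is a generic illustration, not the code of any OpenMP runtime or of TBB's partitioners; `Range` and `static_partition` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open iteration range [begin, end).
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Split n iterations into one contiguous block per thread, decided up front.
// The first (n % threads) blocks receive one extra iteration.
std::vector<Range> static_partition(std::size_t n, std::size_t threads) {
    assert(threads > 0);
    std::vector<Range> parts;
    std::size_t base = n / threads;
    std::size_t extra = n % threads;
    std::size_t begin = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t len = base + (t < extra ? 1 : 0);
        parts.push_back({begin, begin + len});
        begin += len;
    }
    return parts;
}
```

For 10 iterations and 4 threads this yields the blocks [0,3), [3,6), [6,8), [8,10). Because each block is bound to one thread in advance, a single preempted thread delays completion of the whole loop, which is the cost of guessing wrong about available threads.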

Note that for both TBB and OpenMP using dynamic scheduling, once a partition is grabbed by a thread, the thread could get preempted due to system oversubscription (see the hack below).

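The OpenMP schedule(runtime) clause can help mitigate the oversubscription problem: it defers the choice of schedule and chunk size to run time (for example via the OMP_SCHEDULE environment variable), so the partitioning can be changed without recompiling.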

When system load varies, it becomes very difficult to predict what processing resources will be available in the near future.

Because the O/S may preempt your threads in favor of threads in other processes, the application programmer may need to test and tune under various synthetic load conditions. I wrote an article on this very issue titled "White Rabbits" in the ISN blogs section.

There is no magic cure for this. The consequence of guessing wrong is slower code.

*hack

A while back, while running some test code on Windows I noticed a curious behavior. A section of code issuing SwitchToThread() will get rescheduled _prior_ to a thread completing I/O as well as _prior_ to a thread being migrated from core to core. Therefore, if at some point in time you are the only thread "running" on a given core, then using SwitchToThread might reduce preemption (at the expense of the overhead of the function).

Of course now, when the author of the other application learns about your little trick, he will employ it too. And then you both suffer from slower code.
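
The shape of that trick, yielding voluntarily at a point of your own choosing, can be sketched portably as below. This uses `std::this_thread::yield()` as a stand-in for SwitchToThread(); whether it reproduces the Windows scheduling behavior described above is an assumption that would have to be measured. `sum_with_yield` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical work loop that voluntarily yields between chunks.
// std::this_thread::yield() is only a hint to the OS scheduler.
long long sum_with_yield(std::size_t n, std::size_t chunk) {
    assert(chunk > 0);
    long long total = 0;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        std::size_t end = begin + chunk < n ? begin + chunk : n;
        for (std::size_t i = begin; i < end; ++i)
            total += static_cast<long long>(i);
        std::this_thread::yield();  // offer the core before the next chunk
    }
    return total;
}
```

The chunk size sets how often the yield overhead is paid, so it is one more parameter to tune under synthetic load.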

Jim Dempsey