When we use TBB (e.g. tbb::parallel_for), the default number of worker threads is the number of CPU cores minus one.
Most of the time that is sufficient, but sometimes other threads are executing in the same process at the same time, such as ROS or Fast DDS threads.
When there are not enough cores, the kernel will preempt a TBB thread once its time slice is exhausted. This causes a 10-20 ms off-CPU gap, during which all other TBB threads wait for that thread to be scheduled back.
Is there any good way to solve this problem other than reducing the number of TBB threads?
Hi,
Thanks for posting on Intel communities.
To resolve your issue, we suggest using task_arena or global_control.
- task_arena: This feature lets you create a controlled execution environment for a group of tasks. By using a task_arena, you can limit the number of threads TBB uses for a specific set of tasks, isolating them from other tasks. This can help you avoid contention when other threads or processes are running simultaneously.
- global_control: This feature lets you control TBB's behavior at a global level. You can temporarily change the number of threads available to TBB or modify other runtime parameters, which is useful for making dynamic adjustments to TBB's behavior.
For more details refer to the below links,
If you still have any issues, please let us know.
Thanks & Regards,
Vankudothu Vaishnavi.
Thanks for your reply!
Do you mean I can put tbb::parallel_for and the other threads into different task_arenas? I'm afraid many threads in the code are outside our control, such as ROS: I only call its publish function, but internally it sets up four threads to run that part of the logic. We cannot put those in a task_arena.
Hi,
Could you please share the following details with us so that we can investigate your issue further:
1. A sample reproducer with complete steps.
2. How do you know that some threads are uncontrolled, and what are the expected results?
3. Could you please explain more about your use case?
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
We have not heard back from you. Could you please provide us with an update on your issue?
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
We are working on your query internally and will get back to you soon.
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
Thanks for your patience and understanding.
The parallel_for work needs to be granular enough for effective decomposition: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html
We would need more details (or a small reproducer) to better understand the type of work that is being done and how the parallel_for and the ROS I/O tasks interoperate.
If, during the execution of a parallel_for, the OS schedules some oneTBB worker threads off-CPU, the remaining ones continue the work, potentially "stealing" tasks from them. Hence, the behavior you describe ("Linux will not let TBB hold the time slice forever; it schedules a TBB thread off-CPU and runs ROS on-CPU") appears to be the desired one (https://oneapi-src.github.io/oneTBB/main/tbb_userguide/How_Task_Scheduler_Works.html). This ensures that on-CPU oneTBB threads do not wait for the others, optimizing overall performance.
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
Could you kindly share a sample reproducer with us? This will help us better understand your issue and assist you in resolving it.
Thanks & Regards,
Vankudothu Vaishnavi.
Sorry, we do not have a simple example to reproduce it and do not have time to build one. Perhaps you can try running tbb::parallel_for while ROS is receiving topics at the same time and observe it.
Meanwhile, we are sure the kernel decides that tbb::parallel_for must yield its time slices to other threads because the thread count is larger than the CPU count. It does not occur when the other threads are produced by std::async. I suspect ROS has some epoll processing that runs at high priority.
Setting the nice value of the TBB threads can fix it, but this is not recommended because other problems may arise.
I am not sure what TBB can do at the kernel level.
Meanwhile, we also wonder: if we have two threads and each runs a tbb::parallel_for at the same time, will they all be scheduled together by TBB? And if one of the parallel_for calls is inside a tbb::task_arena, will they still be scheduled together?
Hi,
Thanks for your patience and understanding.
We have provided information on
- how the scheduler dynamically redistributes work among on-CPU workers
- how to adjust work granularity for the decomposition and redistribution to be more effective
- how to isolate work (if truly needed) by using task_arena or global_control (even just locally before the relevant parallel_for)
Please try to apply the above to your code and see the effects.
>> Meanwhile, we also wonder: if we have two threads and each runs a tbb::parallel_for at the same time, will they all be scheduled together by TBB? And if one of the parallel_for calls is inside a tbb::task_arena, will they still be scheduled together?
Yes, in both cases.
Each user thread that invokes any parallel construction outside an explicit task_arena uses an implicit task arena representation object associated with the calling thread (https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html), so in either case there will be two task arenas (implicit in one case, explicit in the other) executing concurrently. The worker threads will be divided in proportion to the need of each task arena.
Thanks & Regards,
Vankudothu Vaishnavi.
Hi,
Thanks a lot for your help; I have written a demo to illustrate my earlier problem.
We used lttng to capture the problem, as you can see in the picture. We run on 11 CPU cores with the default thread count for tbb::parallel_for, while four AsyncThreads do busy work at the same time to stay on-CPU.
Worker_Run represents the TBB threads. Within a block, dark colors mean on-CPU and light colors mean off-CPU. You can see two tbb::parallel_for workers go off-CPU for about 16 ms.
This kind of off-CPU gap is the most troublesome thing for us, although it does not necessarily happen every time. When a thread goes off-CPU, can its current task be switched to another TBB thread, instead of all TBB threads waiting for that thread to finish? Pinning the different threads to different cores would certainly solve this, but if AsyncThread only runs 10% of the time, that wastes CPU.
And this is the demo code:
// Assumed includes (TRACE_EVENT_SCOPE is our own lttng tracing macro):
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <chrono>
#include <future>
#include <vector>
#include <unistd.h>

int main() {
  TRACE_EVENT_SCOPE(planner, MainThread);
  auto start_time = std::chrono::high_resolution_clock::now();
  std::vector<std::future<bool>> future_results;
  // Four competing threads that busy-wait for ~20 ms to stay on-CPU.
  for (int i = 0; i < 4; ++i) {
    future_results.push_back(std::async([]() {
      auto start_time = std::chrono::high_resolution_clock::now();
      usleep(3000);
      TRACE_EVENT_SCOPE(planner, AsyncThread);
      while (true) {
        auto current_time = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
        if (elapsed_time.count() >= 20) {
          break;
        }
      }
      return true;
    }));
  }
  // 49 iterations of ~1 ms busy work each, competing with the threads above.
  tbb::parallel_for(tbb::blocked_range<int>(0, 49), [](tbb::blocked_range<int> r) {
    for (int i = r.begin(); i != r.end(); ++i) {
      auto start_time = std::chrono::high_resolution_clock::now();
      while (true) {
        auto current_time = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
        if (elapsed_time.count() >= 1) {
          break;
        }
      }
    }
  });
}
Meanwhile, if we run one tbb::parallel_for in the main thread and another in an async thread, we also see the off-CPU gap, and the execution time of the async thread is 18 ms.
The demo code is like this:
// Assumed includes (TRACE_EVENT_SCOPE is our own lttng tracing macro,
// LOG comes from glog):
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <glog/logging.h>
#include <algorithm>
#include <chrono>
#include <future>
#include <vector>
#include <unistd.h>

int main() {
  TRACE_EVENT_SCOPE(planner, MainThread);
  auto start_time = std::chrono::high_resolution_clock::now();
  std::vector<std::future<bool>> future_results;
  // One async thread running its own parallel_for.
  for (int i = 0; i < 1; ++i) {
    future_results.push_back(std::async([]() {
      usleep(3000);
      TRACE_EVENT_SCOPE(planner, AsyncThread);
      tbb::parallel_for(tbb::blocked_range<int>(0, 19), [](tbb::blocked_range<int> r) {
        for (int i = r.begin(); i != r.end(); ++i) {
          auto start_time = std::chrono::high_resolution_clock::now();
          while (true) {
            auto current_time = std::chrono::high_resolution_clock::now();
            auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
            if (elapsed_time.count() >= 1) {
              break;
            }
          }
        }
      });
      return true;
    }));
  }
  // Main-thread parallel_for competing with the async one.
  tbb::parallel_for(tbb::blocked_range<int>(0, 199), [](tbb::blocked_range<int> r) {
    for (int i = r.begin(); i != r.end(); ++i) {
      auto start_time = std::chrono::high_resolution_clock::now();
      while (true) {
        auto current_time = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
        if (elapsed_time.count() >= 1) {
          break;
        }
      }
    }
  });
  std::for_each(future_results.begin(), future_results.end(), [](std::future<bool>& future) {
    bool result = future.get();
    (void)result;
  });
  auto end_time = std::chrono::high_resolution_clock::now();
  auto run_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);
  LOG(ERROR) << "run time is: " << run_time.count();
}
Hi @garfield,
Unfortunately, there is no way to steal or pass a task that has already started executing on a worker thread that was switched off by the OS.
I understand that in this situation the execution time of the parallel region is increased by the switch time. But all other threads inside the scheduler remain ready to execute other tasks, so technically this off-CPU switch blocks the completion of the parallel construct, not all the threads in the scheduler.
As you described, the probability of this situation is already fairly low. Unfortunately, we cannot avoid it fully without reducing the number of TBB workers (because of oversubscription effects), but you can reduce its impact further by decreasing the work granularity.
That results in more tasks per parallel construct, so the effect of one off-CPU thread should not be as noticeable.
Thanks a lot for the help.
I can understand it when the competing threads come from different open-source libraries, but what about off-CPU gaps caused by two tbb::parallel_for calls competing with each other? I would expect that part to be scheduled uniformly. Can that be avoided?
I believe the OS may switch threads off for many reasons, even without oversubscription, and we have no control over that (but we can reduce the effect of switches by adjusting the work granularity).
In your case you observe it because of oversubscription, so the only way to fully avoid it is to limit the total number of threads in the application to the hardware concurrency.
So in the case where you have 1 user thread + the TBB thread pool, the total thread count equals the hardware concurrency, and there should not be too many switches.
Hello @garfield,
We have not heard from you for a while. Could you comment if the issue is still relevant to you?
Hello @garfield,
Since we have not heard from you, this topic will no longer be monitored by Intel. Thank you for posting at oneTBB Community Forum.