Re: Problem of thread in tbb waiting for kernel scheduling caused by insufficient number of cpu-core

garfield · ‎10-30-2023

When we use TBB algorithms such as tbb::parallel_for, the default number of TBB worker threads is the number of CPU cores minus 1.

Most of the time the number of CPUs is sufficient, but sometimes other threads run in the same process at the same time, such as those from ROS or Fast DDS.

When there are not enough cores, the kernel preempts a TBB thread once its time slice is exhausted. This causes 10-20 ms of off-CPU time, during which all other TBB threads wait for this thread to be scheduled back.

Is there any good way to solve this problem other than reducing the number of threads in tbb?

VaishnaviV_Intel · ‎11-02-2023

Hi,

Thanks for posting on Intel communities.

To resolve your issue, we suggest using task_arena or global_control.

task_arena: This feature allows you to create a controlled execution environment for a group of tasks. By using a task_arena, you can limit the number of threads that TBB uses for a specific set of tasks, isolating them from other tasks. This can help you avoid contention issues when other threads or processes are running simultaneously.
global_control: This feature allows you to control the behavior of TBB at a global level. You can temporarily change the number of threads available to TBB or modify other runtime parameters. This can be useful for dynamic adjustments to TBB's behavior.

For more details, refer to the links below:

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html

If you still have any issues, please let us know.

Thanks & Regards,

Vankudothu Vaishnavi.

garfield · ‎11-05-2023

Thanks for your reply!

Do you mean that I can run tbb::parallel_for and the other threads in different task_arenas? I'm afraid many threads in the code are outside our control, such as those created by ROS. I can only call its publish function, but it actually starts 4 threads to run that part of the logic. We cannot put those in a task_arena.

VaishnaviV_Intel · ‎11-09-2023

Hi,

Could you please share the following details so that we can investigate your issue further:

1. A sample reproducer with complete steps

2. How did you determine that some threads are uncontrolled, and what are the expected results?

3. Could you please explain more about your use case?

Thanks & Regards,

Vankudothu Vaishnavi.

VaishnaviV_Intel · ‎11-14-2023

Hi,

We have not heard back from you. Could you please provide us with an update on your issue?

Thanks & Regards,

Vankudothu Vaishnavi.

garfield · ‎11-16-2023

Dear Vankudothu,

Thanks for your reply. Sorry for not checking my email for so long.

I have not written an example, but the situation is like this:

1. We have 14 CPU cores to run our process.

2. We call tbb::parallel_for, which by default uses 13 threads, to do a lot of computation in a for loop.

3. We have another 4 threads, such as those from ROS, running other computation at the same time.

4. Linux does not let the TBB threads hold their time slices indefinitely; it switches a TBB thread off-CPU and runs a ROS thread on-CPU.

5. If we set the number of TBB threads to 9, TBB and ROS no longer compete for CPUs. But in actual use, the 4 threads and the 13 TBB threads do not always compete for the CPU at the same time. So can TBB be made smart enough that its threads are not switched off the CPU when the CPUs are busy?
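
The arithmetic in the list above can be written down as a small sketch. The helper names are hypothetical, and it assumes the worst case in which all the other threads are busy at once, which the post notes is not always true.

```cpp
#include <algorithm>

// Hypothetical helper: number of TBB worker threads that fit on `cores` CPUs
// once the thread calling parallel_for and `other_busy_threads` non-TBB
// threads have each been given a core of their own. Never negative.
int tbb_worker_budget(int cores, int other_busy_threads) {
    return std::max(0, cores - 1 - other_busy_threads);
}

// The value to pass as max_allowed_parallelism (or as a task_arena size):
// those limits count the calling thread too, hence the +1.
int tbb_parallelism_budget(int cores, int other_busy_threads) {
    return tbb_worker_budget(cores, other_busy_threads) + 1;
}
```

With 14 cores and 4 other threads this gives 9 workers, matching the number in item 5. On a real machine the core count could come from std::thread::hardware_concurrency().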

VaishnaviV_Intel · ‎11-23-2023

Hi,

We are working on your query internally. We'll get back to you soon.

Thanks & Regards,

Vankudothu Vaishnavi.

VaishnaviV_Intel · ‎12-05-2023

Hi,

Thanks for your patience and understanding.

The parallel_for work needs to be granular enough for effective decomposition: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html

We would need more details (or a small reproducer) to better understand the type of work that is being done and how the parallel_for and the ROS I/O tasks interoperate.

If, during the execution of a parallel_for, the OS schedules some oneTBB worker threads off-CPU, the remaining ones continue the work, potentially "stealing" tasks from others. Hence, the behavior you describe ("linux will not allow tbb always hold the time slice, it will schedule tbb thread offcpu and do ros oncpu") appears to be the desired one (https://oneapi-src.github.io/oneTBB/main/tbb_userguide/How_Task_Scheduler_Works.html). This ensures that on-CPU oneTBB threads do not wait for others, optimizing overall performance.

Thanks & Regards,

Vankudothu Vaishnavi.

VaishnaviV_Intel · ‎12-11-2023

Hi,

Could you kindly share a sample reproducer with us? This will help us better understand your issue and assist you in resolving it.

Thanks & Regards,

Vankudothu Vaishnavi.

garfield · ‎12-12-2023

Sorry, we do not have a simple example that reproduces it, and we do not have time to write one. Maybe you can try running tbb::parallel_for while ROS is receiving a topic and see what happens.

Meanwhile, we are sure that the kernel makes the tbb::parallel_for threads give up their time slices to other threads because the number of threads is larger than the number of CPUs. It does not occur when the other threads are created by std::async. I think ROS may have some epoll processing that runs at high priority.

Setting the nice value of the TBB threads can fix it, but we do not recommend that because it may cause other problems.

We are not sure what TBB can do at the kernel level.

garfield · ‎12-12-2023

Meanwhile, we also wonder: if we have two threads and each runs tbb::parallel_for at the same time, will they all be scheduled together by TBB? What if one of the parallel_for calls is inside a tbb::task_arena, will they still be scheduled together?

VaishnaviV_Intel · ‎12-20-2023

Hi,

Thanks for your patience and understanding.

We have provided information on the following:

how the scheduler dynamically redistributes work among on-CPU workers
how to adjust work granularity for the decomposition and redistribution to be more effective
how to isolate work (if truly needed) by using task_arena or global_control (even just locally before the relevant parallel_for)

Please try to apply the above to your code and see the effects.

>>Meanwhile, we also wonder: if we have two threads and each runs tbb::parallel_for at the same time, will they all be scheduled together by TBB? What if one of the parallel_for calls is inside a tbb::task_arena, will they still be scheduled together?

Yes, in both cases.

Each user thread that invokes any parallel construction outside an explicit task_arena uses an implicit task arena representation object associated with the calling thread (https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html), so in either case there will be two task arenas (implicit in one case, explicit in the other) executing concurrently. The worker threads will be divided in proportion to the need of each task arena.

Thanks & Regards,

Vankudothu Vaishnavi.

garfield · ‎12-21-2023

Hi,

Thanks a lot for your help. I have tried to write a demo to explain my previous problem.

We used LTTng to capture the problem. As you can see in the picture, we run on 11 CPU cores and tbb::parallel_for uses the default number of threads. At the same time, four AsyncThread threads do some work that keeps them on-CPU.

Worker_Run represents the TBB threads. Within each block, dark colors mean on-CPU and light colors mean off-CPU. You can see that two tbb::parallel_for threads go off-CPU for about 16 ms.

This kind of off-CPU time is the most troublesome thing for us, although it does not happen every time. When a TBB thread is switched off-CPU, can its current task be moved to other TBB threads, instead of all TBB threads waiting for this thread to finish? Pinning the different threads to separate cores would definitely solve this problem, but if AsyncThread runs only 10% of the time, that wastes CPU.

And this is the demo code:

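#include <chrono>
#include <future>
#include <vector>
#include <unistd.h>  // usleep
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
// TRACE_EVENT_SCOPE is a tracing macro from our code base; its header is not shown.

int main() {
    TRACE_EVENT_SCOPE(planner, MainThread);

    std::vector<std::future<bool>> future_results;

    // Four non-TBB threads: each sleeps 3 ms, then spins until 20 ms have
    // passed since it started. std::launch::async forces a real thread.
    for (int i = 0; i < 4; ++i) {
        future_results.push_back(std::async(std::launch::async, []() {
            auto start_time = std::chrono::high_resolution_clock::now();
            usleep(3000);
            TRACE_EVENT_SCOPE(planner, AsyncThread);
            while (true) {
                auto current_time = std::chrono::high_resolution_clock::now();
                auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
                if (elapsed_time.count() >= 20) {
                    break;
                }
            }
            return true;
        }));
    }

    // 49 iterations of about 1 ms of busy work each, on the default number of TBB threads.
    tbb::parallel_for(tbb::blocked_range<int>(0, 49), [](const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            auto start_time = std::chrono::high_resolution_clock::now();
            while (true) {
                auto current_time = std::chrono::high_resolution_clock::now();
                auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
                if (elapsed_time.count() >= 1) {
                    break;
                }
            }
        }
    });

    // Wait for the async threads before exiting.
    for (auto& future : future_results) {
        future.get();
    }
}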

garfield · ‎12-21-2023

Meanwhile, if we run one tbb::parallel_for in the main thread and another tbb::parallel_for in the async thread, we also see off-CPU time, and the execution time of the async thread is 18 ms.

Demo code is like this:

#include <chrono>
#include <future>
#include <vector>
#include <unistd.h>  // usleep
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
// TRACE_EVENT_SCOPE and LOG are tracing and logging macros from our code base; their headers are not shown.

int main() {
    TRACE_EVENT_SCOPE(planner, MainThread);
    auto start_time = std::chrono::high_resolution_clock::now();

    std::vector<std::future<bool>> future_results;

    // One async thread that sleeps 3 ms and then runs its own parallel_for
    // (19 iterations of about 1 ms each). std::launch::async forces a real thread.
    for (int i = 0; i < 1; ++i) {
        future_results.push_back(std::async(std::launch::async, []() {
            usleep(3000);
            TRACE_EVENT_SCOPE(planner, AsyncThread);
            tbb::parallel_for(tbb::blocked_range<int>(0, 19), [](const tbb::blocked_range<int>& r) {
                for (int i = r.begin(); i != r.end(); ++i) {
                    auto start_time = std::chrono::high_resolution_clock::now();
                    while (true) {
                        auto current_time = std::chrono::high_resolution_clock::now();
                        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
                        if (elapsed_time.count() >= 1) {
                            break;
                        }
                    }
                }
            });
            return true;
        }));
    }

    // Main-thread parallel_for: 199 iterations of about 1 ms each.
    tbb::parallel_for(tbb::blocked_range<int>(0, 199), [](const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
            auto start_time = std::chrono::high_resolution_clock::now();
            while (true) {
                auto current_time = std::chrono::high_resolution_clock::now();
                auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
                if (elapsed_time.count() >= 1) {
                    break;
                }
            }
        }
    });

    // Wait for the async thread to finish.
    for (auto& future : future_results) {
        future.get();
    }

    auto end_time = std::chrono::high_resolution_clock::now();
    auto run_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

    LOG(ERROR) << "run time is: " << run_time.count();
}

Pavel_K_Intel1 · ‎01-03-2024

Hi @garfield,
Unfortunately, there is no way to steal or hand off a task that has already started executing on a worker thread that the OS has switched out.
I understand that in this situation the execution time of the parallel region increases by the time the thread spends switched out. But all other threads inside the scheduler remain ready to execute other tasks, so technically this off-CPU switch blocks completion of the parallel construct but not all the threads inside the scheduler.
As you described, the probability of this situation is already fairly low. Unfortunately, we cannot avoid it fully without reducing the number of TBB workers (because of oversubscription effects), but you can reduce its effect further by decreasing the work granularity.
That results in more tasks per parallel construct, so the effect of an off-CPU thread should be less noticeable.

garfield · ‎01-03-2024

Thanks a lot for help.

I can understand this when the threads come from different open source libraries. But when the off-CPU time is caused by two tbb::parallel_for calls competing with each other, my understanding is that this part should be scheduled uniformly by TBB. Can this be avoided?

Pavel_K_Intel1 · ‎01-04-2024

I believe the OS might switch threads out for many reasons, even without oversubscription, and we have no control over that (but we can try to reduce the effect of the switches by adjusting the work granularity).
In your case you observe it because of oversubscription, so the only way to fully avoid it is to limit the total number of threads in the application to the hardware concurrency.
With 1 user thread plus the TBB thread pool, the total number of threads equals the hardware concurrency, so there should not be too many switches.

Mark_L_Intel · ‎01-22-2024

Hello @garfield,

We have not heard from you for a while. Could you comment if the issue is still relevant to you?

Mark_L_Intel · ‎01-26-2024

Hello @garfield,

Since we have not heard from you, this topic will no longer be monitored by Intel. Thank you for posting at oneTBB Community Forum.
