Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

TBB threads waiting on kernel scheduling when there are not enough CPU cores

garfield
Beginner
3,647 Views

When we use TBB algorithms such as tbb::parallel_for, the default number of TBB worker threads is the number of CPU cores minus 1.

Most of the time the number of CPUs is sufficient, but sometimes other threads run in the same process at the same time, such as those created by ROS or Fast DDS.

When there are not enough cores, the kernel preempts a TBB thread once its time slice is exhausted. This causes 10-20 ms of off-CPU time, during which all other TBB threads wait for the preempted thread to be scheduled back.

Is there any good way to solve this problem other than reducing the number of threads in tbb?

0 Kudos
18 Replies
VaishnaviV_Intel
Employee
3,592 Views

Hi,

 

Thanks for posting on Intel communities.

To resolve your issue, we suggest using task_arena or global_control.

  • task_arena: This feature allows you to create a controlled execution environment for a group of tasks. By using a task_arena, you can limit the number of threads that TBB uses for a specific set of tasks, isolating them from other tasks. This can help you avoid contention issues when other threads or processes are running simultaneously.
  • global_control: This feature allows you to control the behavior of TBB at a global level. You can temporarily change the number of threads available to TBB or modify other runtime parameters. This can be useful for dynamic adjustments to TBB's behavior.

For more details, refer to the links below:

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html

https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/scheduling_controls/global_control_cls.html

 

If you still have any issues, please let us know.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
garfield
Beginner
3,554 Views

Thanks for your reply!

Do you mean that I can run tbb::parallel_for and the other threads in different task_arenas? I'm afraid there are many threads in the code that we do not control, such as those created by ROS. I can only call its publish function, but it actually sets up 4 threads to run this part of the logic. We cannot put them in a task_arena.

0 Kudos
VaishnaviV_Intel
Employee
3,488 Views

Hi,

 

Could you please share the following details so that we can investigate your issue further:

1. A sample reproducer with complete steps

2. How did you determine that some threads are uncontrolled, and what are the expected results?

3. Could you please explain more about your use case?

 

Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
VaishnaviV_Intel
Employee
3,434 Views

Hi,


We have not heard back from you. Could you please provide us with an update on your issue?


Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
garfield
Beginner
3,409 Views
Dear Vankudothu,
  Thanks for your reply. Sorry for taking so long; I had not seen the email.
  I have not written an example, but the situation is like this:
  1. We have 14 CPU cores to run our process.
  2. We call tbb::parallel_for, which by default uses 13 worker threads, to do heavy computation in a for loop.
  3. We have another 4 threads, such as ROS threads, doing other computation at the same time.
  4. Linux does not let TBB hold its time slices indefinitely; it schedules a TBB thread off-CPU and runs a ROS thread on-CPU.
  5. If we set the number of TBB threads to 9, TBB and ROS no longer compete for CPUs. In actual use, however, the 4 threads and the 13 TBB threads do not always compete for the CPU at the same time. So can TBB be made smart enough that it does not lose the CPU when the CPU is busy?
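The arithmetic in the list above can be sketched as a small helper. The names ThreadBudget and tbb_budget are hypothetical and only illustrate the calculation; max_parallelism is the kind of value one would pass to tbb::global_control::max_allowed_parallelism or to a tbb::task_arena constructor, both of which count the calling thread in addition to the workers.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical result type: total TBB parallelism and the worker share of it.
struct ThreadBudget {
    int max_parallelism;  // calling thread + TBB workers
    int workers;          // TBB workers only
};

// Hypothetical helper: how many threads TBB may use so that TBB plus
// `other_threads` busy non-TBB threads do not exceed `cores` CPUs.
ThreadBudget tbb_budget(int cores, int other_threads) {
    assert(cores >= 1 && other_threads >= 0);
    // The calling thread always runs, so never go below 1.
    int max_parallelism = std::max(1, cores - other_threads);
    return {max_parallelism, max_parallelism - 1};
}
```

With 14 cores and 4 other threads this gives a parallelism of 10, that is, the calling thread plus the 9 workers mentioned in item 5.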
0 Kudos
VaishnaviV_Intel
Employee
3,166 Views

Hi,


We are working on your query internally. We'll get back to you soon.


Thanks & Regards,

Vankudothu Vaishnavi.


0 Kudos
VaishnaviV_Intel
Employee
3,060 Views

Hi,

 

Thanks for your patience and understanding.

 

The parallel_for work needs to be granular enough for effective decomposition: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html

We would need more details (or a small reproducer) to better understand the type of work that is being done and how the parallel_for and the ROS I/O tasks interoperate.

If, during the execution of a parallel_for, the OS schedules some oneTBB worker threads off-CPU, the remaining ones continue the work, potentially "stealing" tasks from the others (https://oneapi-src.github.io/oneTBB/main/tbb_userguide/How_Task_Scheduler_Works.html). Hence, the behavior you describe ("linux will not allow tbb always hold the time slice, it will schedule tbb thread offcpu and do ros oncpu") appears to be the desired one. Work stealing ensures that on-CPU oneTBB threads do not wait for the others, which optimizes overall performance.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
VaishnaviV_Intel
Employee
2,998 Views

Hi,

 

Could you kindly share a sample reproducer with us? This will help us better understand your issue and assist you in resolving it.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
garfield
Beginner
2,979 Views

Sorry, we do not have a simple example that reproduces it, and we do not have time to write one. Maybe you can try running tbb::parallel_for while ROS is receiving a topic and see.

Meanwhile, it is certain that the kernel makes the tbb::parallel_for threads give up time slices to other threads because the number of threads is larger than the number of CPUs. It does not occur when the other threads are created by std::async. I think ROS may have some epoll-based processing that runs with high priority.

Setting the nice value of the TBB threads can fix it, but that is not recommended because it may cause other problems.

I am not sure what TBB can do at the kernel level.

0 Kudos
garfield
Beginner
2,967 Views

Meanwhile, we also wonder: if we have two threads and each of them runs tbb::parallel_for at the same time, will they all be scheduled together by TBB? What if one of the parallel_for calls is inside a tbb::task_arena; will they still be scheduled together?

0 Kudos
VaishnaviV_Intel
Employee
2,820 Views

Hi,

 

Thanks for your patience and understanding.

We have provided information on

  • how the scheduler dynamically redistributes work among on-CPU workers
  • how to adjust work granularity for the decomposition and redistribution to be more effective
  • how to isolate work (if truly needed) by using task_arena or global_control (even just locally before the relevant parallel_for)

Please try to apply the above to your code and see the effects.

>>Meanwhile, we also wonder: if we have two threads and each of them runs tbb::parallel_for at the same time, will they all be scheduled together by TBB? What if one of the parallel_for calls is inside a tbb::task_arena; will they still be scheduled together?

 

Yes, in both cases.

Each user thread that invokes any parallel construction outside an explicit task_arena uses an implicit task arena representation object associated with the calling thread (https://spec.oneapi.io/versions/latest/elements/oneTBB/source/task_scheduler/task_arena/task_arena_cls.html), so in either case there will be two task arenas (implicit in one case, explicit in the other) executing concurrently. The worker threads will be divided in proportion to the need of each task arena.

 

Thanks & Regards,

Vankudothu Vaishnavi.

 

0 Kudos
garfield
Beginner
2,792 Views

Hi,

Thanks a lot for your help. I have tried to write a demo to explain my previous problem.

We used LTTng to capture the problem, as shown in the picture. We run on 11 CPU cores, and tbb::parallel_for uses the default number of threads. At the same time, four AsyncThread threads do some work to stay on-CPU.

Worker_Run represents the TBB threads. Dark colors in a block represent on-CPU time, and light colors represent off-CPU time. You can see that two of the tbb::parallel_for threads are off-CPU for about 16 ms.

The most troublesome thing for us is this kind of off-CPU time, although it does not necessarily happen every time. If a TBB thread is switched off-CPU, can its current task be moved to other TBB threads, instead of all TBB threads waiting for this thread to finish executing? Pinning the different threads to separate cores can definitely solve this problem, but if AsyncThread runs only 10% of the time, this wastes CPU.

 

test.png

And this is the demo code:

#include <chrono>
#include <future>
#include <vector>
#include <unistd.h>  // usleep

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// TRACE_EVENT_SCOPE is our tracing macro (header not shown).

int main() {

  TRACE_EVENT_SCOPE(planner, MainThread);

  std::vector<std::future<bool>> future_results;

  // Four extra threads that compete with the TBB workers for CPU time.
  for (int i = 0; i < 4; ++i) {
    // std::launch::async forces a real thread instead of deferred execution.
    future_results.push_back(std::async(std::launch::async, []() {
      auto start_time = std::chrono::high_resolution_clock::now();
      usleep(3000);
      TRACE_EVENT_SCOPE(planner, AsyncThread);
      // Busy-wait until 20 ms have passed since start_time.
      while (true) {
        auto current_time = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
        if (elapsed_time.count() >= 20) {
          break;
        }
      }

      return true;
    }));
  }

  // 49 iterations, each busy-waiting for 1 ms.
  tbb::parallel_for(tbb::blocked_range<int>(0, 49), [](const tbb::blocked_range<int>& r) {
    for (int i = r.begin(); i != r.end(); ++i) {
      auto start_time = std::chrono::high_resolution_clock::now();
      while (true) {
        auto current_time = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
        if (elapsed_time.count() >= 1) {
          break;
        }
      }
    }
  });

  // The futures returned by std::async join their threads on destruction.
}

 

0 Kudos
garfield
Beginner
2,785 Views

Meanwhile, if we run one tbb::parallel_for in the main thread and another tbb::parallel_for in the async thread, we can also see off-CPU time, and the execution time of the async thread is 18 ms.

testtest.png

Demo code is like this:

#include <algorithm>
#include <chrono>
#include <future>
#include <vector>
#include <unistd.h>  // usleep

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// TRACE_EVENT_SCOPE and LOG(ERROR) come from our tracing and logging headers (not shown).

int main() {

  TRACE_EVENT_SCOPE(planner, MainThread);
  auto start_time = std::chrono::high_resolution_clock::now();

  std::vector<std::future<bool>> future_results;

  // One extra user thread that runs its own parallel_for.
  for (int i = 0; i < 1; ++i) {
    future_results.push_back(std::async(std::launch::async, []() {
      usleep(3000);
      TRACE_EVENT_SCOPE(planner, AsyncThread);
      // 19 iterations, each busy-waiting for 1 ms.
      tbb::parallel_for(tbb::blocked_range<int>(0, 19), [](const tbb::blocked_range<int>& r) {
        for (int i = r.begin(); i != r.end(); ++i) {
          auto start_time = std::chrono::high_resolution_clock::now();
          while (true) {
            auto current_time = std::chrono::high_resolution_clock::now();
            auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
            if (elapsed_time.count() >= 1) {
              break;
            }
          }
        }
      });

      return true;
    }));
  }

  // 199 iterations in the main thread, each busy-waiting for 1 ms.
  tbb::parallel_for(tbb::blocked_range<int>(0, 199), [](const tbb::blocked_range<int>& r) {
    for (int i = r.begin(); i != r.end(); ++i) {
      auto start_time = std::chrono::high_resolution_clock::now();
      while (true) {
        auto current_time = std::chrono::high_resolution_clock::now();
        auto elapsed_time = std::chrono::duration_cast<std::chrono::milliseconds>(current_time - start_time);
        if (elapsed_time.count() >= 1) {
          break;
        }
      }
    }
  });

  // Wait for the async thread to finish; the result itself is not used.
  std::for_each(future_results.begin(), future_results.end(), [](std::future<bool>& future) {
    future.get();
  });

  auto end_time = std::chrono::high_resolution_clock::now();
  auto run_time = std::chrono::duration_cast<std::chrono::milliseconds>(end_time - start_time);

  LOG(ERROR) << "run time is: " << run_time.count();

}

0 Kudos
Pavel_K_Intel1
Employee
2,654 Views

Hi @garfield,
Unfortunately, there is no way to steal or pass on a task that has already started executing on a worker thread that was switched off by the OS.
I understand that in this situation the execution time of the parallel region will be increased by the switch time. But all other threads inside the scheduler are ready to execute other tasks, so technically this off-CPU switch blocks completion of the parallel construct but not all the threads inside the scheduler.
As you described, the probability of this situation is already pretty low, and unfortunately we cannot avoid it fully without reducing the number of TBB workers (because of oversubscription effects), but you can reduce its effect further by decreasing the work granularity.
This results in more tasks per parallel construct, so the effect of an off-CPU thread should be less noticeable.

0 Kudos
garfield
Beginner
2,637 Views

Thanks a lot for help.

I can understand this when the threads come from different open source libraries. But what if the off-CPU time is caused by two tbb::parallel_for calls competing with each other? As I understand it, those should be scheduled uniformly. Can this be avoided?

0 Kudos
Pavel_K_Intel1
Employee
2,627 Views

I believe the OS might switch threads off for many reasons, even without oversubscription, and we have no control over it (but we can try to reduce the effect of switches by adjusting the work granularity).
In your case you observe it because of oversubscription, so the only way to fully avoid it is to limit the total number of threads in the application to the hardware concurrency.
So in the case where you have 1 user thread + the TBB thread pool, the total number of threads equals the hardware concurrency, so there should not be too many switches.

0 Kudos
Mark_L_Intel
Moderator
2,418 Views

Hello @garfield,

 

   We have not heard from you for a while. Could you comment if the issue is still relevant to you?  

0 Kudos
Mark_L_Intel
Moderator
2,314 Views

Hello @garfield,

  Since we have not heard from you, this topic will no longer be monitored by Intel.  Thank you for posting at oneTBB Community Forum.

0 Kudos