Intel® oneAPI Threading Building Blocks

Immediately scheduling task on a different thread


I need to launch tasks from a thread that won't participate in executing them (it runs a different event loop and will be blocking).  TBB seems to take about 50 usec to start running a task on a different thread.  Is there a way to get this work scheduled faster?

Below is some test code.

Results from calling from a non-arena thread:


[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:604] Time until first submit=30
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=56497
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=38618
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=40086
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=41376
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=44238
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=60127
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=60355
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=64096
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=63806
[2024-04-20 12:10:59.628] [info] [TBBTest.cpp:609] latency=64776


Results from calling from within another task:


[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:605] Time until first submit=39873
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=5609
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=990
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=870
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=780
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=690
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=600
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=520
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=420
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=360
[2024-04-20 12:04:18.297] [info] [TBBTest.cpp:610] latency=290
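
To put the two runs side by side, here is a small sketch that summarizes the logged values. It assumes the numbers are nanoseconds, as the nanosecond casts in the test code suggest. The names `kNonArenaNs`, `kInTaskNs`, `mean_ns` and `median_ns` are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Latencies copied from the two log excerpts above (assumed to be nanoseconds).
const std::vector<std::int64_t> kNonArenaNs = {
    56497, 38618, 40086, 41376, 44238, 60127, 60355, 64096, 63806, 64776};
const std::vector<std::int64_t> kInTaskNs = {
    5609, 990, 870, 780, 690, 600, 520, 420, 360, 290};

// Arithmetic mean of the samples, in nanoseconds.
double mean_ns(const std::vector<std::int64_t>& samples) {
  assert(!samples.empty());
  return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Median of the samples, in nanoseconds (takes a copy because it sorts).
double median_ns(std::vector<std::int64_t> samples) {
  assert(!samples.empty());
  std::sort(samples.begin(), samples.end());
  const std::size_t mid = samples.size() / 2;
  if (samples.size() % 2 == 1) return static_cast<double>(samples[mid]);
  return (samples[mid - 1] + samples[mid]) / 2.0;
}
```

With these inputs, `mean_ns(kNonArenaNs)` is 53397.5 (about 53 usec) and `mean_ns(kInTaskNs)` is 1112.9 (about 1.1 usec). The medians are 58312 and 645, so the in-task mean is dominated by its first sample.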



The test code:


// try_put to a multifunction_node in a graph takes about 50us for the first task to be scheduled
// on a different thread.  Same for task group.
TEST_F(TBBTest, testLatency) {

  // test the time it takes for another thread to work-steal a task from a blocked thread.
  class Msg {
   public:
    Msg() = default;
    Msg(const Msg&) = delete;
    Msg& operator=(const Msg&) = delete;

    std::chrono::time_point<std::chrono::high_resolution_clock> submitTime;
    std::chrono::time_point<std::chrono::high_resolution_clock> processTime;

    // Records the time at which the task is handed to TBB.
    void start() {
      submitTime = std::chrono::high_resolution_clock::now();
    }

    // Records the time at which the task starts executing.
    void submitted() {
      processTime = std::chrono::high_resolution_clock::now();
    }
  };

  int n = 10;
  std::vector<Msg> messages(n);

  oneapi::tbb::task_group tg;

  // Let's try warming up the arena / task group.  Nope... didn't help.
  std::atomic<int> res{0};  // atomic because the warmup tasks may run concurrently
  for (int i=0; i<10; i++) tg.run([&]{res++;});
  std::this_thread::sleep_for(std::chrono::milliseconds(1));  // give chance for warmup tasks to run

  auto start = std::chrono::high_resolution_clock::now();

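  for (auto& msg : messages) {
    auto* mptr = &msg;
    mptr->start();
    tg.run([mptr]{ mptr->submitted(); });
  }

  // submit tasks from within the task group.  Doesn't solve the problem of latency of *this* task though.
  // tg.run([&]{
  //   for (auto& msg : messages) {
  //     auto* mptr = &msg;
  //     mptr->start();
  //     tg.run([mptr]{ mptr->submitted(); });
  //   }
  // });

  // sleep for a bit to give time for tasks to execute *not* in this thread.
  // comment this out to see the difference of scheduling on *this* thread.
  std::this_thread::sleep_for(std::chrono::milliseconds(1));  // assumed duration; the value is missing from the post
  tg.wait();  // assumed; the statements after the sleep are missing from the post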

  // print out the time between our start and when the first task was submitted
  auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(messages[0].submitTime - start).count();
  LOG_INFO("Time until first submit={}", elapsed);

  for (auto &msg : messages) {
    auto submitTime = std::chrono::time_point_cast<std::chrono::nanoseconds>(msg.submitTime).time_since_epoch().count();
    auto processTime = std::chrono::time_point_cast<std::chrono::nanoseconds>(msg.processTime).time_since_epoch().count();
    LOG_INFO("latency={}", processTime - submitTime);
  }
}



4 Replies

@Yonik, it seems that your test case depends on some testing framework: where are TBBTest, testLatency, and LOG_INFO defined?


> I was made aware that you submitted the same question directly to our engineering team at GitHub


Yes, sorry.  My message here was originally marked as spam, so I then went to GitHub.


@Yonik, I was made aware that you submitted the same question directly to our engineering team at GitHub. Please continue there.


@Yonik, regarding "My message here was originally marked as spam, so I then went to GitHub," -- I'm not sure how this happened. I've not marked your topic as spam. Your topic is very interesting, and thank you so much for posting your question.

  Please consider posting a cleaned-up code sample for the benefit of other Forum participants.  Your sample above cannot be compiled. There are leftovers in the code (lines 1-5, 49-60), which make the problem difficult to understand.  There is no tg.wait() after the first (line 33). I'm not sure about the correctness of the sleep statements as they are currently used.

  I understand you are asking: "How to measure the task start-up latency?" If you agree and post a crisp sample showing your solution to this problem (including an important warmup piece), I hope it would interest Forum participants.  I'll try to comment on this, too. 
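
  As a starting point, here is a minimal, framework-free sketch of the measurement pattern: a warm-up step, a timestamp taken just before submission, a timestamp taken when the work starts, and an explicit wait instead of a sleep. It deliberately does not use TBB. The `Worker` class and `measure_handoff_latencies` are hypothetical stand-ins built on std::thread so that the sample compiles without TBB or a test framework, and `Worker::wait()` only plays the role that tg.wait() plays for a task_group. The numbers it produces say nothing about TBB itself.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// One worker thread with a blocking queue (a stand-in scheduler, not TBB).
class Worker {
 public:
  Worker() : thread_([this] { loop(); }) {}
  ~Worker() {
    { std::lock_guard<std::mutex> lk(m_); done_ = true; }
    cv_.notify_all();
    thread_.join();
  }
  void submit(std::function<void()> job) {
    { std::lock_guard<std::mutex> lk(m_); q_.push_back(std::move(job)); ++pending_; }
    cv_.notify_one();
  }
  // Blocks until every submitted job has finished.
  void wait() {
    std::unique_lock<std::mutex> lk(m_);
    idle_.wait(lk, [this] { return pending_ == 0; });
  }

 private:
  void loop() {
    std::unique_lock<std::mutex> lk(m_);
    for (;;) {
      cv_.wait(lk, [this] { return done_ || !q_.empty(); });
      if (q_.empty()) return;  // shutdown requested and nothing left to run
      auto job = std::move(q_.front());
      q_.pop_front();
      lk.unlock();
      job();
      lk.lock();
      if (--pending_ == 0) idle_.notify_all();
    }
  }
  std::mutex m_;
  std::condition_variable cv_, idle_;
  std::deque<std::function<void()>> q_;
  std::size_t pending_ = 0;
  bool done_ = false;
  std::thread thread_;  // declared last so the other members exist before loop() runs
};

// Returns n submit-to-start latencies in nanoseconds.
std::vector<std::int64_t> measure_handoff_latencies(int n) {
  assert(n >= 0);
  Worker worker;
  worker.submit([] {});  // warm-up job
  worker.wait();         // make sure the warm-up has finished before measuring
  std::vector<Clock::time_point> submitted(n), started(n);
  for (int i = 0; i < n; ++i) {
    submitted[i] = Clock::now();
    worker.submit([&started, i] { started[i] = Clock::now(); });
    worker.wait();  // one job at a time, so each sample includes waking an idle worker
  }
  std::vector<std::int64_t> ns(n);
  for (int i = 0; i < n; ++i) {
    ns[i] = std::chrono::duration_cast<std::chrono::nanoseconds>(
                started[i] - submitted[i]).count();
  }
  return ns;
}
```

  Calling `measure_handoff_latencies(10)` returns ten samples. Because the submitting thread waits on a condition variable after each submission, reading `started[i]` afterwards is properly synchronized, which a fixed sleep would not guarantee.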

  Meanwhile, if you are interested in task_group performance studies overall, here is a reference for your records:
