Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

task_arena and task_group slow down my tasks a lot

diedler_f_
Beginner

Hi everyone,

I need to run 2 heavy tasks in parallel using TBB on Windows 10 x64. I have 6 cores (12 threads with hyper-threading enabled). I want to use 50% of them for taskA and 50% for taskB. I saw that I can use task_arena to limit the number of threads used, like this:

tbb::task_arena taskA(6); // limited area with no more than 6 threads
tbb::task_arena taskB(6); // limited area with no more than 6 threads
tbb::task_group dummyGroup;

dummyGroup.run([&]{
   taskA.execute([&]{
       // long work A here
   });
});

dummyGroup.run([&]{
   taskB.execute([&]{
       // long work B here
   });
});

// The main thread waits for taskA and taskB to finish
taskA.execute([&]{
    dummyGroup.wait();
});

Is this the correct way to do it?

Do I need to use a task_group, or are 2 arenas enough?

The problem I encounter is that taskB slows down taskA a lot, and vice versa.

Maybe I'm not using task_arena and task_group correctly.

Thanks a lot,

Mark_L_Intel
Moderator

You don't need a task_group. I'd rename the arena variables as below, because they really are arenas and not tasks. I also added diagnostics printing each arena's max concurrency. Please let me know if it works for you.

#include <iostream>
#include <oneapi/tbb/task_arena.h>

int main() {
  tbb::task_arena arenaA(6); // Create the custom task_arena A with 6 threads
  tbb::task_arena arenaB(6); // Create the custom task_arena B with 6 threads

  arenaA.execute([&]{
    std::cout << "arenaA max concurrency = " << tbb::this_task_arena::max_concurrency() << std::endl;
    // long work A here
  });

  arenaB.execute([&]{
    std::cout << "arenaB max concurrency = " << tbb::this_task_arena::max_concurrency() << std::endl;
    // long work B here
  });
  return 0;
}

bash-4.4$ icpx forum.cpp -tbb
bash-4.4$ ./a.out 
arenaA max concurrency = 6
arenaB max concurrency = 6
bash-4.4$ 
diedler_f_
Beginner

No, that does not work, because I need "long work A" and "long work B" to run at the same time, not sequentially.

With your code, "long work A" must finish before "long work B" begins.

Mark_L_Intel
Moderator

The arenas simply allow you to isolate your work and assign a certain number of slots for threads. You still need to set up the parallel execution yourself. I'd use tbb::parallel_invoke, but other TBB algorithms can also be used. I will post a complete example of how to do it shortly.

diedler_f_
Beginner

Hi,

Yes, if you could provide an example (ideally with a dummy long task so the performance can really be tested), that would be nice.

What is the benefit of tbb::parallel_invoke over tbb::task_group? Is it better in terms of performance?

Thanks,

Mark_L_Intel
Moderator

The following code:

#include <atomic>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <thread>
#include <vector>
#include <tbb/tbb.h>

const int P = 12; //12 threads all together will be used
thread_local int my_tid = -1;
std::vector<std::set<std::string>> tid_regions(3*P);
//tbb::atomic<int> next_tid;
std::atomic<int> next_tid = {0};

void noteParticipation(const std::string& name) {
  if (my_tid == -1) {
    //my_tid = next_tid.fetch_and_increment();
    std::atomic_fetch_add(&next_tid, 1);
    my_tid = next_tid;
  }
  tid_regions[my_tid].insert(name);
}


void dump_participation() {
  int end = next_tid;
  std::map<std::string, int> m;
  for (int i = 0; i < end; ++i) {
    for (auto n : tid_regions[i]) {
      m[n] += 1;
    }
  }
  
  for (auto& kv : m) {
    std::cout << kv.second << " working threads participated in " << kv.first << std::endl;
  }
}


void doWork(const std::string& name, double seconds) {
  noteParticipation(name);
  tbb::tick_count t0 = tbb::tick_count::now();
  while ((tbb::tick_count::now() - t0).seconds() < seconds);
}

int main() {
  int N = 10*P;
  std::cout << "There are " << tbb::info::default_concurrency() << " logical cores." << std::endl;
  tbb::global_control gc(tbb::global_control::max_allowed_parallelism, P + 1); // one more thread
  tbb::task_arena arenaA(6); // Create the custom task_arena A with 6 threads
  tbb::task_arena arenaB(6); // Create the custom task_arena B with 6 threads

  tbb::parallel_invoke(
    [&]{
      arenaA.execute([&]{
        std::cout << "arenaA max concurrency = " << tbb::this_task_arena::max_concurrency() << std::endl;
        // long work A here
        tbb::parallel_for(0, N, [](int) { doWork("arenaA pfor", 0.01); });
      });
    },
    [&]{
      arenaB.execute([&]{
        std::cout << "arenaB max concurrency = " << tbb::this_task_arena::max_concurrency() << std::endl;
        // long work B here
        tbb::parallel_for(0, N, [](int) { doWork("arenaB pfor", 0.01); });
      });
    }
  );

  dump_participation();
  return 0;
}

can be compiled with

icpx -g -O2 invoke-forum.cpp -o invoke-forum.x -tbb

It produces this output

bash-4.4$ ./invoke-forum.x
There are 224 logical cores.
arenaB max concurrency = 6
arenaA max concurrency = 6
5 working threads participated in arenaA pfor
6 working threads participated in arenaB pfor
bash-4.4$

This specific system (Sapphire Rapids) has a lot of logical cores, but I used a subset of 12. One of the arenas also used the main thread in addition to the worker threads. After that, you could run Vtune to debug performance:

vtune -collect hotspots  -result-dir r001-forum-hs -- ./invoke-forum.x

Vtune should produce something similar to the following

Vtune Bottom-up hotspots for TBB parallel_invoke with 2 pfors

Also, if you'd like performance more like OpenMP's, you could try TBB's static_partitioner in the tbb::parallel_for calls of the sample above.

You could also use other TBB parallel algorithms, e.g., parallel_for_each, parallel_pipeline, the flow graph, and even lower-level task_groups to set up parallel execution. The advantage of parallel_invoke is that it is simple and expresses the logic of what you're trying to do (as I understand it).

BTW, even a simpler snippet with 2 arenas and 2 pfors, without parallel_invoke, would not run sequentially. I will expand on this (and give you some references) in my next post here, but this should help you get started.

 

Mark_L_Intel
Moderator

My last example above includes the logging functionality (noteParticipation, dump_participation), which is based on the example fig_11_10.cpp from the pro-TBB-book-samples repository. The proTBB book is freely available. I'd recommend looking at the section "Using Multiple Arenas with Different Numbers of Slots to Influence Where TBB Places Its Worker Threads" in Chapter 11, which explains fig_11_10.cpp. That example is still based on old TBB, so I had to update it to the revamped oneTBB: tbb::global_control replaces tbb::task_scheduler_init, and the deprecated tbb::atomic<int> is replaced by std::atomic<int>. Please see the Migration Guide.

As you can see from the section cited above, it illustrates the use of implicit and explicit arenas initialized with a certain number of threads, as well as a std::thread that can run its own tbb::parallel_for. If you run Vtune on this sample, you will see that all of these constructs take part in a rather complicated concurrent execution that is not sequential.

Mark_L_Intel
Moderator

@diedler_f_ , 

I realized that I misunderstood your first post. You can set up parallel execution with task groups too. In your initial example, I'd just drop the taskA.execute(...) wrapper around the wait and simply call dummyGroup.wait() from the main thread. With your example, where the task group is responsible for the parallel execution, Vtune shows more "spin and overhead" (compare the picture below with the Vtune results above for parallel_invoke). However, the parallel_for body I used is a rather contrived example, and the performance (e.g., Vtune) studies need further investigation. That said, parallel_invoke is a high-level function that executes the provided tasks in arenas in parallel, and it was designed for exactly that; task_group is a more flexible, general-purpose way of managing tasks. In the end, performance depends on your specific platform and workload. Could you share your Vtune data, or at least provide more specifics about your workload?

Vtune bottom-up hot spot analysis with task groups setting up parallel execution

diedler_f_
Beginner

@Mark_L_Intel 

Thanks for your answers. I read Chapter 11 of the book; it is quite difficult for me to understand (I am not good at threads and parallel programming).

I don't use Vtune and don't know how it works. I expected to see 12 threads in your screenshots, but I only count 10 threads created by your code snippet. Is that normal? What is the difference between the brown color (CPU time) and the green color (Running)? Is the brown color when the thread is idle and the green color when the thread works?


Just one more question about arenas: can I set the number of reserved master slots to 0, i.e., tbb::task_arena a(6, 0), to speed up the arena's concurrency? I don't understand the role of the slot reserved for the master thread.

My workload is very complicated. To simplify, let's say I use 2 path-finding algorithms such as A* (or a greedy search). The first algorithm is launched in forward mode (trying to find a path from the initial node to the solution node) and the second in backward mode (from the solution node back to the initial node). Each search is threaded with tbb::parallel_for to examine the nodes, like this:

std::priority_queue<Node*> priorityQueue;

while (!priorityQueue.empty())
{
    // get all nodes to analyze
    std::vector<Node*> nodesToExplore = priorityQueue.popNodes();
    tbb::concurrent_set<Node*> successorsWithNoDuplicates;

    if (isSolution(nodesToExplore))
    {
        // solution found
        break;
    }

    tbb::parallel_for(tbb::blocked_range<size_t>(0, nodesToExplore.size()),
        [&](const tbb::blocked_range<size_t>& r) {
            // safety guard just to be sure there is no data / memory corruption
            // (valid indices are 0 .. max_concurrency() - 1)
            const auto lThreadId = tbb::this_task_arena::current_thread_index();
            if (lThreadId >= tbb::this_task_arena::max_concurrency())
            {
                std::cout << "oops, error..." << std::endl;
                exit(-1);
            }

            // get successor nodes
            analyseNodes(bestNodes, r.begin(), r.end(), lThreadId, successorsWithNoDuplicates);
        }
    );

    // do some computation and maybe run nested tbb::parallel_for loops
    // maybe this is the problem? But even if I remove this section, the performance problem remains

    // insert all successors inside the priority queue
    for (auto s : successorsWithNoDuplicates)
    {
        priorityQueue.insert(s);
    }
}

Note that I use only one TBB container, a tbb::concurrent_set, to handle the successor nodes.
Let's say the solution is found by the forward search in 10 seconds.

1) If I run only one search, it works as expected with threads -> solution found in 10 seconds with 12 / 2 = 6 threads (arena with max concurrency = 6).
2) If I run the two searches in parallel, each search slows the other down a lot (one arena with 6 threads and the second arena with 6 threads). The solution is found at the same depth (no race conditions), but in 40 seconds or more...

I think there is a huge issue, but it is impossible for me to find it.

Thanks,

Mark_L_Intel
Moderator

@diedler_f_ ,

  • Have you tried parallel_invoke (as shown in my sample above)? Has it helped with performance?
  • Also, could you try the following parallel pattern? It can improve performance too.
    const int n_arenas = 2;
    
    std::vector<tbb::task_arena> arenas(n_arenas);
    std::vector<tbb::task_group> task_groups(n_arenas);
    
    // Submit the work: run() returns immediately, so capture i by value
    // to avoid reading a stale loop index from the deferred task.
    for (int i = 0; i < n_arenas; i++) {
      arenas[i].execute([&, i] {
        task_groups[i].run([&, i] { /* tbb::parallel_for(...) here */ });
      });
    }
    
    // Wait for each task group inside its own arena.
    for (int i = 0; i < n_arenas; i++) {
      arenas[i].execute([&, i] {
        task_groups[i].wait();
      });
    }

  • Regarding Vtune: you can download it from our website with the oneAPI Base Toolkit, and we have a lot of training material online. For starters, I used the Vtune command line in one of my posts above. Actually, the Vtune diagrams above show 11 threads, including the main thread. If you drill into the diagrams with the Vtune GUI for each arena, you will see that both arenas use 6 threads; one thread simply migrated from one arena to the other. I can provide pictures if you like. Regarding the brown and green colors, it is the opposite: brown is when the CPU executes user code.
  • I would not mess with the main-thread slot reservation; please use the default for now: tbb::task_arena a(6);

Also, there is a bug in the thread registration/printing code. Please correct noteParticipation:

void noteParticipation(const std::string& name) {
  if (my_tid == -1) {
    //my_tid = next_tid.fetch_and_increment();
    my_tid = next_tid++;
  }
  tid_regions[my_tid].insert(name);
}

Notice next_tid++, which returns the pre-increment value. The result-accumulation code expects thread ids from 0 to next_tid - 1, but with std::atomic_fetch_add(&next_tid, 1) followed by a separate read of next_tid, the thread ids started from 1 (and the read could even race with other increments).
After this fix:

6 working threads participated in arenaA pfor
6 working threads participated in arenaB pfor
  • Thank you for posting a snippet of your code. Do you have a prototype on GitHub by any chance? I could look into the performance if a prototype is available.

Mark_L_Intel
Moderator

Hello @diedler_f_ ,

  1. What platform are you running on? For example, recent Intel desktop CPUs have p-cores and e-cores with different performance profiles. We have oneTBB APIs to help differentiate between p-cores and e-cores using arenas, but that is a separate topic, and so far I have assumed you are not testing on one of these hybrid systems.
  2. What is the reasoning behind splitting the machine's cores between the 2 algorithms?
diedler_f_
Beginner

Hello @Mark_L_Intel 

1) I run on my laptop with an Intel Core i7-9750H; I don't know whether this CPU uses p-cores and e-cores.

2) Because this function that expands the nodes:

 analyseNodes(bestNodes, r.begin(), r.end(), lThreadId, successorsWithNoDuplicates);

is quite slow, and threads help to improve the speed of the solving process.

Maybe I need to switch to the OpenMP library, but I don't know how to use it. I suspect there is a problem with the TBB library on my computer under Windows, and I think it is because the number of cores is not a power of 2: I have 6 cores and 12 threads, but maybe I am wrong. I am not good enough at threading and TBB to understand the issue. The only thing I can say is that using arenas / parallel_invoke / task functions slows my program down a lot, while it has no issue without threads.

Another thing :

  • the forward search algorithm, launched in an arena with a max concurrency of 6 threads, calls some tbb::parallel_for and sometimes nested tbb::parallel_for loops
  • same for the backward search algorithm
  • the forward and backward searches are totally independent (no shared variables) -> that is why I don't understand the performance issue: if I run only the forward search with threads, everything works well, and the same holds for the backward search alone, but with both searches the program slows down a lot

Maybe the TBB library does not handle this case well in terms of performance?

Thanks,

Mark_L_Intel
Moderator

Hello @diedler_f_ ,

First, the Intel Core i7-9750H is not a hybrid (p- and e-core) system.

What you described, TBB should be able to parallelize. In fact, you could experiment with my listing above from Jan 30th; you should see a speed-up on your system with a number of threads. That code snippet simulates two independent functions that run in parallel (each with a parallel_for inside), which I think is a rough proxy for your application. Otherwise, to make progress, we would need a reproducer from you.

Regards,

Mark
