Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

task queue fence

bmintz
Beginner
I am trying to interface Intel TBB with a distributed parallel runtime library. As a first step, I want to do the simplest thing: bypass our current task queue and spawn all of the tasks using Intel TBB. The two task queues are similar, but I think we will see significantly better thread scaling with Intel TBB. The problem is that our current task queue supports a task queue fence, and I do not see equivalent support in Intel TBB. Our task queue fence keeps the main thread from continuing with the main program until the queued tasks are done, and lets it execute computational tasks in the meantime. Is there a way to implement something like this with Intel TBB? I am fairly new to Intel TBB, so I would appreciate any advice.
RafSchietekat
Valued Contributor III
If I understand the situation correctly, as a workaround you could make all those tasks children of a dummy root task and wait for the root task to finish (make sure you get the reference count right).
bmintz
Beginner
I understand the concept, but I do not think that this will work for me. The distributed parallel runtime that I am trying to interface with works in the following way. Each compute node has a single message passing interface (MPI) process. Within this MPI process, there is a main thread, a pool of worker threads, and a communicator thread for inter-node communication. It was designed to provide a general task-based distributed parallel runtime on which to build high-level mathematical routines and programs. There is no way for me to know how many tasks will be created over the life of every code that uses our library.
The task queue fence was intended to be used within a global fence that pauses the execution of the main program in all the MPI processes (i.e. a distributed barrier). The global fence causes every MPI process to call the task queue fence, which guarantees that all tasks within every MPI process complete before the main program continues. This barrier is currently needed in many different types of applications that are built with our software library.
The way that our task queue fence currently works is that the main thread is redirected to pull tasks from the task queue. In our library, we only have a single task queue, so it is relatively simple to know when it is empty. I am not sure if something like this could be implemented with Intel TBB. Basically, I need the main thread to stop and monitor all of the intel tbb task queues and steal tasks if the queues are not empty. If all of the Intel TBB task queues are empty, the main thread should continue executing the program.
Thanks,
Ben
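For readers following along, the fence described in this post (one shared queue, with the main thread pulling tasks until the queue is drained) could look roughly like the sketch below. It uses only the C++ standard library, not TBB and not the poster's runtime; the names `SimpleTaskQueue`, `fence`, and `run_two_phases` are hypothetical and chosen for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical single shared task queue with a "helping" fence.
// This is an illustration only, not the poster's library code.
class SimpleTaskQueue {
public:
    explicit SimpleTaskQueue(unsigned workers) {
        for (unsigned i = 0; i < workers; ++i)
            threads_.emplace_back([this] { worker_loop(); });
    }

    ~SimpleTaskQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    void add(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push_back(std::move(task));
            ++unfinished_;  // counts queued plus currently running tasks
        }
        cv_.notify_all();
    }

    // Fence: the caller executes queued tasks itself and returns only
    // once every task added so far has finished.
    void fence() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (unfinished_ > 0) {
            cv_.wait(lock, [this] { return unfinished_ == 0 || !queue_.empty(); });
            if (!queue_.empty()) run_one(lock);
        }
    }

private:
    // Pops one task, runs it with the lock released, then records completion.
    void run_one(std::unique_lock<std::mutex>& lock) {
        std::function<void()> task = std::move(queue_.front());
        queue_.pop_front();
        lock.unlock();
        task();
        lock.lock();
        assert(unfinished_ > 0);
        --unfinished_;
        cv_.notify_all();
    }

    void worker_loop() {
        std::unique_lock<std::mutex> lock(mutex_);
        for (;;) {
            cv_.wait(lock, [this] { return stopping_ || !queue_.empty(); });
            if (queue_.empty()) return;  // stopping and nothing left to run
            run_one(lock);
        }
    }

    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<std::function<void()>> queue_;
    int unfinished_ = 0;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};

// Usage: two batches of 20 tasks separated by a fence.
// Returns the order in which the task values were recorded.
std::vector<int> run_two_phases() {
    std::vector<int> results;
    std::mutex results_mutex;
    SimpleTaskQueue queue(3);
    for (int phase = 0; phase < 2; ++phase) {
        for (int i = phase * 20; i < phase * 20 + 20; ++i) {
            queue.add([&results, &results_mutex, i] {
                std::lock_guard<std::mutex> lock(results_mutex);
                results.push_back(i);
            });
        }
        queue.fence();  // main thread helps, then waits for stragglers
    }
    return results;
}
```

The `unfinished_` counter covers tasks that are running as well as tasks that are queued. That is why the fence cannot simply test for an empty queue: a worker may still be executing the last task after the queue itself has been drained.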
RafSchietekat
Valued Contributor III
"There is no way for me to know how many tasks will be created through the life of every code that utilizes our library."
You could use allocate_additional_child_of() to have the reference count maintained automatically (at a small cost).

"Basically, I need the main thread to stop and monitor all of the intel tbb task queues and steal tasks if the queues are not empty."
There's only one queue, I think, separate from the task pools (which are now deques, one per arena).

But, although waiting within a TBB thread for new messages would be bad, why not instead feed messages into a parallel_queue and then use parallel_do or a pipeline to process it?
bmintz
Beginner
I thought that there was a queue for each thread, but as I mentioned earlier, I am new to TBB, so I could be wrong. I will try the allocate_additional_child_of().
"But, although waiting within a TBB thread for new messages would be bad"
I am currently not worried about waiting within a dummy thread. Our runtime is still handling dependencies internally through Futures. I am only concerned with making sure that all tasks are complete before processing additional tasks. This is currently required to have synchronization points in a distributed environment (i.e. all tasks and communication complete before continuing). There will be a design choice in the future whether we want to continue supporting Futures, but for now, I am just trying to get something to work.
"why not instead feed messages into a parallel_queue and then use parallel_do or a pipeline to process it?"
I do not know how to create a parallel_queue and do a parallel_do or pipeline if I am writing a generic distributed task queue. This may be another design feature that can be added in the future, but for now I just need a way to push single tasks into TBB, and a way to guarantee that all tasks have completed before continuing.
bmintz
Beginner
I do not think that this is working. Here is what I did. First I create a dummy/empty task (tbb_parent_task), then tasks are allocated using allocate_additional_child_of(tbb_parent_task), then the tasks are spawned using tbb_parent_task.spawn(task).
I am trying to do the following. I create 20 tasks that print 0-19, and then an additional 20 tasks that print 20-39, and I put tbb_parent_task.wait_for_all() in between the two sets of tasks. The wait_for_all() does not stop the main thread from creating the second group of tasks. I need the first 20 tasks to complete before the second set of tasks (i.e. the main thread should stop and wait for the first 20 tasks to complete, and then the next 20 tasks should be allocated and spawned). Ideally, the main thread should execute tasks in the task queue while waiting, but I can live with the main thread just stopping for now.
This is just an extremely simple example that describes the functionality that we require. We have many different types of programs built upon our parallel runtime, and I do not know up front how many tasks will be created before the main thread should stop, nor how many of these synchronization points there will be in any given program. We would like to test TBB and maybe incorporate higher-level features in the future, but we cannot use TBB if our current functionality breaks. Is there a way to make the main thread stop in its tracks until tasks in the task queue are finished before additional tasks enter the task queue?
RafSchietekat
Valued Contributor III
"I do not think that this is working. Here is what I did. First I create a dummy/empty task (tbb_parent_task), then tasks are allocated using allocate_additional_child_of(tbb_parent_task), then the tasks are spawned using tbb_parent_task.spawn(task)."
You can enqueue those child tasks.

"I am trying to do the following. I create 20 tasks that print 0-19, and then an additional 20 tasks that print 20-39, and I put tbb_parent_task.wait_for_all() in between the two sets of tasks. The wait_for_all() does not stop the main thread from creating the second group of tasks."
The idea is that you stop adding tasks until the wait_for_all() finishes, of course. Or you could already make them children of a new root task and enqueue them later.

"I need the first 20 tasks to complete before the second set of tasks (i.e. the main thread should stop and wait for the first 20 tasks to complete, and then the next 20 tasks should be allocated and spawned). Ideally, the main thread should execute tasks in the task queue while waiting, but I can live with the main thread just stopping for now. "
The main thread probably shouldn't do anything other than add tasks, to avoid oversubscription (assuming that the task execution is a lot more CPU-intensive than the task administration).

"Is there a way to make the main thread stop in its tracks until tasks in the task queue are finished before additional tasks enter the task queue?"
That seems like something for a condition variable.
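As a rough illustration of the condition-variable suggestion, the sketch below counts outstanding tasks and blocks the waiting thread until the count drops to zero. It is standard C++ only; `CompletionFence`, `task_added`, `task_finished`, and `count_after_fence` are hypothetical names, and plain `std::thread` objects stand in for whatever actually executes the tasks. Unlike a fence in which the main thread helps run tasks, the waiting thread here simply sleeps.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: tracks outstanding tasks and lets a thread block
// until all of them have reported completion.
class CompletionFence {
public:
    // Call before the task is handed to the scheduler.
    void task_added() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++pending_;
    }

    // Call as the last action of each task.
    void task_finished() {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(pending_ > 0);
        if (--pending_ == 0) all_done_.notify_all();
    }

    // Blocks the caller until no registered task is outstanding.
    void wait() {
        std::unique_lock<std::mutex> lock(mutex_);
        all_done_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    std::mutex mutex_;
    std::condition_variable all_done_;
    int pending_ = 0;
};

// Usage: start n tasks, wait at the fence, and report how many tasks had
// completed at the moment the fence released the caller.
int count_after_fence(int n) {
    CompletionFence fence;
    std::atomic<int> completed{0};
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        fence.task_added();  // register before the task can possibly finish
        workers.emplace_back([&fence, &completed] {
            completed.fetch_add(1);
            fence.task_finished();
        });
    }
    fence.wait();
    int seen = completed.load();
    for (auto& w : workers) w.join();
    return seen;
}
```

Registering each task before it is launched matters: if the count were incremented inside the task, the waiting thread could observe zero and continue before any task had started.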
bmintz
Beginner
I think that the enqueue helped, but I am now running into another problem with the ref count. On occasion I get the error "attempt to enqueue task whose parent has a ref_count<0". I have shown some of my code below. I tried two things. First I tried to use allocate_additional_child_of(), which you said would keep track of the ref count automatically. Then I tried to use allocate_child() and increment the ref count manually just before a task is enqueued. I also found that I needed to increment the ref count just before the wait_for_all(), or the program would give me this error every time fence is called. Am I doing something wrong with the ref count?
class ThreadPool {
private:
    ...
    static tbb::empty_task* tbb_parent_task;
    ...
public:
    ...
    /// Add a new task to the pool
    static void add(PoolTaskInterface* task) {
        tbb_parent_task->increment_ref_count();
        tbb_parent_task->enqueue(*task);
    }
    ...
};

class WorldTaskQueue {
private:
    tbb::empty_task* tbb_parent_task; // set to ThreadPool::tbb_parent_task
    ...
public:
    Future add(function, attr) {
        //   add(new(tbb::task::allocate_additional_child_of(*tbb_parent_task))
        ThreadPool::add(new(tbb_parent_task->allocate_child()) TaskFunction(result, function, attr));
    }

    void fence() {
        tbb_parent_task->increment_ref_count();
        tbb_parent_task->wait_for_all();
    }
    ...
};
RafSchietekat
Valued Contributor III
How about setting the reference count to 1 at the beginning, instead of the increment just before wait_for_all()?
bmintz
Beginner
Thank you for your response. I fixed it yesterday. As you said, I needed to increment the ref count when the parent task was created. I also had to increment the ref count after the wait_for_all() to reset the parent task to its initialized state.
Ben