Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Reusing the tasks in task group

knmaheshy2k
Beginner
484 Views
I observed that when the tasks are small, the overhead of task creation becomes a major factor and limits performance improvement.

For example, I have code like the one shown below.
[cpp]void function_1()
{
    ...
}

void function_2()
{
    ...
}

...

void function_8()
{
    ...
}

int main()
{
    ...
    tbb::task_group g;
    ...
    int n = 0;
    do {
        g.run(function_1);
        g.run(function_2);
        ...
        g.run(function_8);
        g.wait();
        n++;
    } while (n < MAX_value);
    ...
}[/cpp]

In this code, if the functions function_1, function_2, ..., function_8 are not large, the overhead of creating the tasks is huge. It becomes a major factor, especially when MAX_value is large. So is there any way I can add the tasks to the group once and preserve them until the end?

I looked at the Fibonacci example but was not able to decipher much from it. It would be great if someone could point me in the proper direction and/or provide some pseudocode. Thanks in advance!
6 Replies
Dmitry_Vyukov
Valued Contributor I
Quoting - knmaheshy2k
I observed that when the tasks are small, the overhead of task creation becomes a major factor and limits performance improvement.

In this code, if the functions function_1, function_2, ..., function_8 are not large, the overhead of creating the tasks is huge. It becomes a major factor, especially when MAX_value is large. So is there any way I can add the tasks to the group once and preserve them until the end?

I looked at the Fibonacci example but was not able to decipher much from it. It would be great if someone could point me in the proper direction and/or provide some pseudocode. Thanks in advance!

Are you sure that it's exactly task creation that is problematic, and not work enqueueing/work distribution/completion detection/etc.?
In general, your tasks must be at least 10,000 cycles each to be worth parallelizing.

ARCH_R_Intel
Employee

For TBB, the tasks need to be about 10,000 clocks or more on average to get good speedup.

The template tbb::parallel_invoke is a slightly more efficient way to invoke a fixed number of functions, though I suspect the "slightly" is not going to be enough to do much good in this case.

Cilk has significantly lower task creation overheads, on the order of 4 subroutine calls if I remember correctly. See http://software.intel.com/en-us/articles/intel-cilk/ for the "what if" version. However, task stealing overheads are still high. So if the number of tasks is small, you probably won't see much speedup either.

Can you describe what you are trying to accomplish at a higher level? Often there are ways to restructure algorithms to improve chunk size.

knmaheshy2k
Beginner
Quoting - Dmitriy Vyukov

Are you sure that it's exactly task creation that is problematic? And not work enqueueing/work distribution/completion detection/etc?
In general your tasks must be at least 10'000 cycles to be worth parallelization.


Yes, I'm sure it's the task creation that is problematic. To confirm this, I made all the tasks empty and ran them just to measure the overhead of task creation, and it was significant. So I want to know if there is a way to add the tasks to the task_group object once and run them multiple times, while making sure the whole set of tasks executes once before any of them executes a second time.

I don't think my tasks are 10,000 cycles. But I have a lot of smaller tasks. So, any suggestions apart from combining tasks?
knmaheshy2k
Beginner
Quoting - ARCH_R_Intel
For TBB, the tasks need to be about 10,000 clocks or more on average to get good speedup.

The template tbb::parallel_invoke is a slightly more efficient way to invoke a fixed number of functions, though I suspect the "slightly" is not going to be enough to do much good in this case.

Cilk has significantly lower task creation overheads, on the order of 4 subroutine calls if I remember correctly. See http://software.intel.com/en-us/articles/intel-cilk/ for the "what if" version. However, task stealing overheads are still high. So if the number of tasks is small, you probably won't see much speedup either.

Can you describe what you are trying to accomplish at a higher level? Often there are ways to restructure algorithms to improve chunk size.



That is what I'm trying to accomplish. I did try parallel_for, but that also seems to be slow. So I wanted to know if I can use recycle_to_execute() or the other recycle_*() methods to reuse the tasks.

Thanks in advance!
Dmitry_Vyukov
Valued Contributor I
Quoting - knmaheshy2k
Yes, I'm sure it's the task creation that is problematic. To confirm this, I made all the tasks empty and ran them just to measure the overhead of task creation, and it was significant.

How do you avoid the work enqueueing/work distribution/completion detection/etc. overheads in your test?
Whatever task you run, it is still enqueued, dequeued, scheduled, and so on.

Dmitry_Vyukov
Valued Contributor I
Quoting - knmaheshy2k
I don't think my tasks are 10,000 cycles. But I have a lot of smaller tasks. So, any suggestions apart from combining tasks?

Nope.
AFAIR, the per-task overhead even for non-parallelized execution is some 600 cycles. So if your tasks are 10 cycles, well, you get a 60x slowdown.
