knmaheshy2k
Beginner

Reusing the tasks in task group

I observed that when the tasks are small, the overhead of task creation becomes a major factor limiting performance.

For example, I have code like the one shown below.
[cpp]void function_1()
{
  ...
}
void function_2()
{
  ...
}
.....
void function_8()
{
  ...
}

int main()
{
  .......
  task_group g;
  .......
  int n = 0;
  do {
    g.run(function_1);
    g.run(function_2);
    ..............
    g.run(function_8);
    g.wait();
    n++;
  } while (n < MAX_value);
  .......
}[/cpp]

In this code, if the functions function_1, function_2, ..., function_8 are not large, the overhead of creating the tasks is significant. It becomes a major factor especially when MAX_value is large. So is there any way I can add the tasks to the group once and preserve them until the end?

I looked at the Fibonacci example but was not able to decipher much from it. It would be great if someone could point me in the proper direction and/or provide some pseudocode. Thanks in advance!
6 Replies
Dmitry_Vyukov
Valued Contributor I

Quoting - knmaheshy2k
I observed that when the tasks are small, the overhead of task creation becomes a major factor limiting performance.

In this code, if the functions function_1, function_2, ..., function_8 are not large, the overhead of creating the tasks is significant. It becomes a major factor especially when MAX_value is large. So is there any way I can add the tasks to the group once and preserve them until the end?

I looked at the Fibonacci example but was not able to decipher much from it. It would be great if someone could point me in the proper direction and/or provide some pseudocode. Thanks in advance!

Are you sure that it's exactly task creation that is problematic, and not work enqueueing/work distribution/completion detection/etc.?
In general, your tasks must be at least 10,000 cycles each to be worth parallelizing.

ARCH_R_Intel
Employee

For TBB, the tasks need to be about 10,000 clocks or more on average to get good speedup.

The template tbb::parallel_invoke is a slightly more efficient way to invoke a fixed number of functions, though I suspect the "slightly" is not going to be enough to do much good in this case.

Cilk has significantly lower task creation overheads, on the order of 4 subroutine calls if I remember correctly. See http://software.intel.com/en-us/articles/intel-cilk/ for the "what if" version. However, task stealing overheads are still high. So if the number of tasks is small, you probably won't see much speedup either.

Can you describe what you are trying to accomplish at a higher level? Often there are ways to restructure algorithms to improve chunk size.

knmaheshy2k
Beginner

Quoting - Dmitriy Vyukov

Are you sure that it's exactly task creation that is problematic, and not work enqueueing/work distribution/completion detection/etc.?
In general, your tasks must be at least 10,000 cycles each to be worth parallelizing.


Yes, I'm sure that it's the task creation that is problematic. To confirm this, I made all the tasks empty and ran just the empty tasks to measure the overhead of task creation, and it was significant. So I wanted to know if there is a way to add the tasks once to a task_group object and run them multiple times, while making sure that the whole set of tasks executes once before any of them executes a second time.

I don't think my tasks are 10,000 cycles. But I have a lot of smaller tasks. So, any suggestions apart from combining tasks?
knmaheshy2k
Beginner

Quoting - ARCH_R_Intel

For TBB, the tasks need to be about 10,000 clocks or more on average to get good speedup.

The template tbb::parallel_invoke is a slightly more efficient way to invoke a fixed number of functions, though I suspect the "slightly" is not going to be enough to do much good in this case.

Cilk has significantly lower task creation overheads, on the order of 4 subroutine calls if I remember correctly. See http://software.intel.com/en-us/articles/intel-cilk/ for the "what if" version. However, task stealing overheads are still high. So if the number of tasks is small, you probably won't see much speedup either.

Can you describe what you are trying to accomplish at a higher level? Often there are ways to restructure algorithms to improve chunk size.



That is what I'm trying to accomplish. I did try parallel_for, but that seemed to be slow as well. So I wanted to know if I can use recycle_to_reexecute() or the other recycle_*() methods to reuse the tasks.

Thanks in advance!
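For context, the recycling methods mentioned here belong to the legacy low-level tbb::task API (deprecated and later removed in oneTBB). A rough, non-compilable pseudocode sketch of the pattern, with names to be checked against the old TBB Reference Manual:

```
// Pseudocode sketch of the legacy tbb::task recycling pattern.
// WorkTask re-runs itself until 'iterations' is exhausted, so the
// task object is allocated once rather than once per iteration.
class WorkTask : public tbb::task {
    int iterations;
    tbb::task* execute() override {
        do_small_work();
        if (--iterations > 0)
            recycle_to_reexecute();  // legacy API: rerun this same task object
        return nullptr;
    }
};
```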
Dmitry_Vyukov
Valued Contributor I

Quoting - knmaheshy2k
Yes, I'm sure that it's the task creation that is problematic. To confirm this, I made all the tasks empty and ran just the empty tasks to measure the overhead of task creation, and it was significant.

How do you avoid the work enqueueing/work distribution/completion detection overheads in your test?
Whatever task you run, it is still enqueued, dequeued, scheduled, etc.

Dmitry_Vyukov
Valued Contributor I

Quoting - knmaheshy2k
I don't think my tasks are 10,000 cycles. But I have a lot of smaller tasks. So, any suggestions apart from combining tasks?

Nope.
AFAIR, the per-task overhead even for non-parallelized execution is some 600 cycles. So if your tasks are 10 cycles each, you get roughly a (600 + 10) / 10 ≈ 60x slowdown.