[cpp]
void function_1() { ... }
void function_2() { ... }
.....
void function_8() { ... }

int main()
{
    .......
    task_group g;
    .......
    do {
        g.run(function_1);
        g.run(function_2);
        ..............
        g.run(function_8);
        g.wait();
        n++;
    } while (n < MAX_value);
    .......
}
[/cpp]
Are you sure that task creation itself is the problem, and not work enqueueing, work distribution, completion detection, etc.?
In general, each task must run for at least 10,000 cycles to be worth parallelizing.
For TBB, the tasks need to be about 10,000 clocks or more on average to get good speedup.
The template tbb::parallel_invoke is a slightly more efficient way to invoke a fixed number of functions, though I suspect the "slightly" is not going to be enough to do much good in this case.
Cilk has significantly lower task creation overheads, on the order of 4 subroutine calls if I remember correctly. See http://software.intel.com/en-us/articles/intel-cilk/ for the "what if" version. However, task stealing overheads are still high. So if the number of tasks is small, you probably won't see much speedup either.
Can you describe what you are trying to accomplish at a higher level? Often there are ways to restructure algorithms to improve chunk size.
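To make the chunk-size point concrete, here is a minimal sketch of coarsening, assuming the work can be expressed as independent items with a roughly known cost. The names items_per_task, make_chunks and Chunk are hypothetical helpers for illustration, not part of TBB, and the per-item cost is something you would have to measure yourself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: how many work items to fuse into one task so that the
// task reaches the target grain size (both values in cycles). Rounds up.
std::size_t items_per_task(std::size_t cycles_per_item, std::size_t target_cycles)
{
    assert(cycles_per_item > 0);
    return (target_cycles + cycles_per_item - 1) / cycles_per_item;
}

// Half-open index range [begin, end) that one task would process.
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Split n items into chunks of at most `grain` items each.
// Each chunk would then become ONE task, for example:
//   g.run([=] { for (std::size_t i = c.begin; i < c.end; ++i) work(i); });
// instead of one g.run() call per item.
std::vector<Chunk> make_chunks(std::size_t n, std::size_t grain)
{
    assert(grain > 0);
    std::vector<Chunk> chunks;
    for (std::size_t begin = 0; begin < n; begin += grain) {
        const std::size_t end = (n - begin < grain) ? n : begin + grain;
        chunks.push_back(Chunk{begin, end});
    }
    return chunks;
}
```

With items of about 10 cycles and a 10,000-cycle target, items_per_task(10, 10000) gives 1000 items per task, so 2500 items would be submitted as 3 tasks rather than 2500. Whether the iterations in the original loop are independent enough to be fused this way is exactly the higher-level question above.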
How do you avoid work enqueueing/work distribution/completion detection/etc overheads in your test?
Whatever task you run, it is still enqueued, dequeued, scheduled, etc.
Nope.
AFAIR, per-task overhead is some 600 cycles even for non-parallelized execution. So if your tasks are 10 cycles each, well, you get roughly a 60x slowdown.
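The arithmetic behind that estimate can be sketched as below. This is a simple cost model, not a measurement: the 600-cycle figure is the recalled number from this post, and task_slowdown is a hypothetical helper name.

```cpp
#include <cassert>

// Estimated slowdown from running a piece of work as a task instead of
// calling it directly: (work + per-task overhead) / work.
// Both arguments are in cycles; the overhead value is an assumption.
double task_slowdown(double work_cycles, double overhead_cycles)
{
    assert(work_cycles > 0.0);
    return (work_cycles + overhead_cycles) / work_cycles;
}
```

Under this model, task_slowdown(10, 600) is 61, i.e. the "60x" above once the 10 cycles of real work are counted too, while task_slowdown(10000, 600) is about 1.06, which is why tasks of around 10,000 cycles are where the overhead stops dominating.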
