I am a novice TBB programmer trying to evaluate TBB on an Intel 64 architecture. I have a few questions about TBB.

1. I would like to know whether TBB only achieves good performance (linear speedup) for divide-and-conquer problems, or whether that is also possible for data-parallel applications. I am asking because I got poor performance for data-parallel applications, where I used only parallel_for() to parallelize the application.

2. Another observation: the one-node execution times of the TBB applications are much lower than the others, but in the end the speedup achieved on 8 nodes is not linear.

3. I would also like to know whether there is an implicit synchronization barrier for the threads in parallel_for() tasks, or whether it is the programmer's responsibility to synchronize.

1 Solution

Quoting - chowdarys

*I have one more question. It is unclear to me how to use barriers in TBB. If I have to use explicit barriers for synchronization between parallel_for() calls, which routine should I use? Is the right way to use an empty task that does nothing and call it in the operator method of the parallel_for loop, as in an example in the documentation (though I think that example is only for divide-and-conquer problems)? Or is it enough to just call wait_for_all() in the operator method of the developer-defined parallel_for() task?*

9 Replies

Quoting - chowdarys

*1. I would like to know whether TBB only achieves good performance (linear speedup) for divide-and-conquer problems, or whether that is also possible for data-parallel applications. I am asking because I got poor performance for data-parallel applications, where I used only parallel_for() to parallelize the application.*

In general, yes: TBB works best if the work is structured into a reasonably balanced tree. Note, however, that you can structure parallel processing over an array or a matrix into a balanced tree: you split the array into 2 halves, then split those halves into halves, and so on recursively. That is effectively divide and conquer. You can do the same for a matrix (for example, during matrix multiplication). And as you may guess, that is exactly what tbb::parallel_for does. Other things aside, tbb::parallel_for scales linearly, so the problem must be somewhere else.
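
To make the recursive halving concrete, here is a minimal sketch in plain standard C++. It is only a conceptual stand-in and not TBB's implementation: the name `split_and_apply` and the `grain` cutoff are hypothetical, and it starts a new thread per split instead of using a task scheduler the way TBB does.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Conceptual stand-in only, NOT how TBB is implemented.
// Applies body(i) to every index in [begin, end) by recursively halving the
// range until a piece holds at most `grain` indices (hypothetical cutoff).
void split_and_apply(std::size_t begin, std::size_t end, std::size_t grain,
                     const std::function<void(std::size_t)>& body) {
    assert(grain > 0 && begin <= end);
    if (end - begin <= grain) {
        // Leaf of the tree: small enough, so process it serially.
        for (std::size_t i = begin; i < end; ++i) body(i);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    // Left half runs in another thread, right half in the current one.
    std::future<void> left = std::async(std::launch::async, [=, &body] {
        split_and_apply(begin, mid, grain, body);
    });
    split_and_apply(mid, end, grain, body);
    left.get();  // both halves are done before this call returns
}
```

For example, `split_and_apply(0, v.size(), 100, [&](std::size_t i) { v[i] *= 2; });` doubles every element of a `std::vector<int> v`. A very small `grain` creates many tiny pieces whose overhead can outweigh the work, which is one way a data-parallel loop ends up scaling poorly.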

Quoting - chowdarys

*2. Another observation: the one-node execution times of the TBB applications are much lower than the others, but in the end the speedup achieved on 8 nodes is not linear.*

What do you mean by nodes here?

Anyway, given a good algorithm, a good implementation, good task granularity, etc., TBB achieves linear speedup. So linear speedup is at least possible. Everything else depends on the details of the particular application.

Quoting - chowdarys

*3. I would also like to know whether there is an implicit synchronization barrier for the threads in parallel_for() tasks, or whether it is the programmer's responsibility to synchronize.*

When tbb::parallel_for() returns, all array elements are guaranteed to have been processed. So I guess the answer is no, you do not need an explicit barrier.
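
A rough sketch of why that is enough, again in plain standard C++ so it compiles without TBB. `parallel_for_sketch` and `two_phases` are hypothetical names, and the fixed thread count of 4 is an assumption for illustration. The stand-in joins all of its workers before returning, which mirrors the guarantee described above, so a second loop can safely read what the first one wrote.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a parallel loop; it is NOT tbb::parallel_for.
// It returns only after every index in [0, n) has been processed.
void parallel_for_sketch(std::size_t n, unsigned num_threads,
                         const std::function<void(std::size_t)>& body) {
    assert(num_threads > 0);
    std::vector<std::thread> workers;
    std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([begin, end, &body] {
            for (std::size_t i = begin; i < end; ++i) body(i);
        });
    }
    // Joining here plays the role of the implicit barrier.
    for (std::thread& w : workers) w.join();
}

// Two dependent phases with no explicit barrier between them.
std::vector<long> two_phases(std::size_t n) {
    std::vector<long> a(n), b(n);
    // Phase 1: every a[i] is written.
    parallel_for_sketch(n, 4, [&](std::size_t i) {
        long v = static_cast<long>(i);
        a[i] = v * v;
    });
    // Phase 2: reads a neighbour's phase-1 result. This is safe because the
    // first call has already returned, so all of `a` is complete.
    parallel_for_sketch(n, 4, [&](std::size_t i) {
        b[i] = a[i] + a[(i + 1) % n];
    });
    return b;
}
```

With real TBB the same shape would be two consecutive tbb::parallel_for calls over the index range, with nothing in between.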

Quoting - Dmitriy Vyukov

In general, yes: TBB works best if the work is structured into a reasonably balanced tree. Note, however, that you can structure parallel processing over an array or a matrix into a balanced tree: you split the array into 2 halves, then split those halves into halves, and so on recursively. That is effectively divide and conquer. You can do the same for a matrix (for example, during matrix multiplication). And as you may guess, that is exactly what tbb::parallel_for does. Other things aside, tbb::parallel_for scales linearly, so the problem must be somewhere else.

Thanks for the reply. I will check my algorithm again.

Quoting - Dmitriy Vyukov

What do you mean by nodes here?

Anyway, given a good algorithm, a good implementation, good task granularity, etc., TBB achieves linear speedup. So linear speedup is at least possible. Everything else depends on the details of the particular application.

By nodes I mean processors: by one-node execution time I mean the execution time on one processor. My question is answered. Thanks for the reply.

Quoting - Dmitriy Vyukov

When tbb::parallel_for() returns, all array elements are guaranteed to have been processed. So I guess the answer is no, you do not need an explicit barrier.

Hi,

I have one more question. It is unclear to me how to use barriers in TBB. If I have to use explicit barriers for synchronization between parallel_for() calls, which routine should I use? Is the right way to use an empty task that does nothing and call it in the operator method of the parallel_for loop, as in an example in the documentation (though I think that example is only for divide-and-conquer problems)? Or is it enough to just call wait_for_all() in the operator method of the developer-defined parallel_for() task?

Quoting - chowdarys

*I have one more question. It is unclear to me how to use barriers in TBB. If I have to use explicit barriers for synchronization between parallel_for() calls, which routine should I use? Is the right way to use an empty task that does nothing and call it in the operator method of the parallel_for loop, as in an example in the documentation (though I think that example is only for divide-and-conquer problems)? Or is it enough to just call wait_for_all() in the operator method of the developer-defined parallel_for() task?*

Quoting - Robert Reed (Intel)

*I am struggling to understand your question and have a question of my own: for what reason do you want to set up an explicit barrier? Certainly you could create a parallel_for with an operator() function containing no content, but since outside any TBB parallel construct only one thread is usually running anyway, what would be the point? It's not like you have some function that contains a rendezvous point using the parallel_for to collect random threads as they pass through. It's more like one thread comes into a parallel construct and kicks off some work that idle threads rush in to help finish like a bunch of worker bees, that hang around until the work is done and then flit off to find more work, leaving the master thread to continue on after the parallel construct. Unless you have need for some critical section or monitor within the body of the operator() (for which you could use a TBB scoped lock), I don't understand why you would need a barrier.*
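
As an illustration of the one case mentioned above where synchronization does belong inside the loop body, here is a small sketch of a critical section guarded by a scoped lock. It uses std::mutex and std::lock_guard from the standard library as stand-ins for a TBB mutex and its scoped lock, and `collect_even` is a hypothetical helper, so treat it as a pattern and not as TBB code.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical example: several threads scan `input` and append the even
// values to one shared vector. Only the append is a critical section.
std::vector<int> collect_even(const std::vector<int>& input,
                              unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<int> result;
    std::mutex result_mutex;  // protects `result`
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Thread t handles indices t, t + num_threads, t + 2*num_threads, ...
            for (std::size_t i = t; i < input.size(); i += num_threads) {
                if (input[i] % 2 == 0) {
                    // Scoped lock: acquired here, released at the closing brace.
                    std::lock_guard<std::mutex> lock(result_mutex);
                    result.push_back(input[i]);
                }
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return result;  // element order depends on thread timing
}
```

The lock protects a shared data structure while the loop is running. It is not a barrier: no thread waits for the others to reach a particular point.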
