Intel® oneAPI Threading Building Blocks

Performance of TBB for data-parallel applications

chowdarys
Beginner
Hi everyone,
I am new to TBB and am trying to evaluate it on an Intel 64 architecture. I have a few questions about TBB.
1. I would like to know whether TBB achieves good performance (linear speedup) only for divide-and-conquer problems, or whether that is also possible for data-parallel applications. I ask because I got poor performance for data-parallel applications, and I used only parallel_for() to parallelize them.
2. One more observation: the one-node execution times of the TBB applications are much lower than those of the others, but in the end the speedup achieved on 8 nodes is not linear.
3. I would also like to know whether there is an implicit synchronization barrier for the threads in parallel_for() tasks, or whether it is the programmer's responsibility to synchronize.
Dmitry_Vyukov
Valued Contributor I
Quoting - chowdarys
1. I would like to know whether TBB achieves good performance (linear speedup) only for divide-and-conquer problems, or whether that is also possible for data-parallel applications. I ask because I got poor performance for data-parallel applications, and I used only parallel_for() to parallelize them.

In general, yes, TBB works best if the work is structured into a reasonably balanced tree. However, note that you can structure parallel processing over an array or matrix into a balanced tree: you split the array into two halves, then split those halves into halves, and so on recursively. That is effectively divide and conquer. You can do the same for a matrix (for example, during matrix multiplication). And as you may guess, that is exactly what tbb::parallel_for does. Other things aside, tbb::parallel_for scales linearly, so the problem must be somewhere else.
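
To picture the recursive halving described above, here is a minimal sketch in plain standard C++ (no TBB, no threads). It only records the leaf subranges that a range ends up split into. The names `split_into_leaves`, `split_recursive`, and `grain` are made up for illustration; tbb::parallel_for uses its own range and partitioner classes and hands the pieces to a work-stealing scheduler, so this shows the shape of the tree, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Half-open index range [first, second).
using Range = std::pair<std::size_t, std::size_t>;

// Hypothetical helper: halve [lo, hi) until a piece holds at most
// 'grain' elements, then record that piece as a leaf.
void split_recursive(std::size_t lo, std::size_t hi, std::size_t grain,
                     std::vector<Range>& leaves) {
    if (hi - lo <= grain) {
        leaves.emplace_back(lo, hi);  // small enough: one unit of work
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    split_recursive(lo, mid, grain, leaves);  // left subtree
    split_recursive(mid, hi, grain, leaves);  // right subtree
}

// Returns the leaves of the balanced tree, in left-to-right order.
std::vector<Range> split_into_leaves(std::size_t lo, std::size_t hi,
                                     std::size_t grain) {
    assert(lo <= hi && grain > 0);  // grain == 0 would never terminate
    std::vector<Range> leaves;
    split_recursive(lo, hi, grain, leaves);
    return leaves;
}
```

For example, `split_into_leaves(0, 16, 4)` yields four leaves of four elements each. In a parallel runtime each leaf could be processed by whichever thread is free, which is why a data-parallel loop can be treated as a divide-and-conquer problem.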


Dmitry_Vyukov
Valued Contributor I
Quoting - chowdarys
2. One more observation: the one-node execution times of the TBB applications are much lower than those of the others, but in the end the speedup achieved on 8 nodes is not linear.

What do you mean by nodes here?
Anyway, given a good algorithm, a good implementation, good task granularity, etc., TBB can achieve linear speedup. So linear speedup is at least possible. Everything else depends on the details of the particular application.
Dmitry_Vyukov
Valued Contributor I
Quoting - chowdarys
3. I would also like to know whether there is an implicit synchronization barrier for the threads in parallel_for() tasks, or whether it is the programmer's responsibility to synchronize.

When tbb::parallel_for() returns, all array elements are guaranteed to have been processed. So I guess the answer is no, you do not need an explicit barrier.

chowdarys
Beginner
Quoting - Dmitriy Vyukov

In general, yes, TBB works best if the work is structured into a reasonably balanced tree. However, note that you can structure parallel processing over an array or matrix into a balanced tree: you split the array into two halves, then split those halves into halves, and so on recursively. That is effectively divide and conquer. You can do the same for a matrix (for example, during matrix multiplication). And as you may guess, that is exactly what tbb::parallel_for does. Other things aside, tbb::parallel_for scales linearly, so the problem must be somewhere else.

Hi,
Thanks for the reply. I will check my algorithm again.
chowdarys
Beginner
Quoting - Dmitriy Vyukov

What do you mean by nodes here?
Anyway, given a good algorithm, a good implementation, good task granularity, etc., TBB can achieve linear speedup. So linear speedup is at least possible. Everything else depends on the details of the particular application.

They are the processors. By one-node execution time I mean the execution time on one processor node. My question is answered. Thanks for the reply.
chowdarys
Beginner
Quoting - Dmitriy Vyukov

When tbb::parallel_for() returns, all array elements are guaranteed to have been processed. So I guess the answer is no, you do not need an explicit barrier.

Hi,
I have one more question. It is unclear to me how to use barriers in TBB. If I have to use explicit barriers for synchronization with parallel_for(), which routine should I use? Is the right way to use an empty task that does nothing and call it in the operator() method of the parallel_for loop, as in an example in the documentation (though I think that example is only for divide-and-conquer problems)? Or is it enough to just call wait_for_all() in the operator() method of the developer-defined parallel_for() task?
robert-reed
Valued Contributor II
Quoting - chowdarys
I have one more question. It is unclear to me how to use barriers in TBB. If I have to use explicit barriers for synchronization with parallel_for(), which routine should I use? Is the right way to use an empty task that does nothing and call it in the operator() method of the parallel_for loop, as in an example in the documentation (though I think that example is only for divide-and-conquer problems)? Or is it enough to just call wait_for_all() in the operator() method of the developer-defined parallel_for() task?
I am struggling to understand your question and have a question of my own: why do you want to set up an explicit barrier? Certainly you could create a parallel_for with an operator() function containing no content, but since outside any TBB parallel construct only one thread is usually running anyway, what would be the point? It's not like you have some function that contains a rendezvous point using the parallel_for to collect random threads as they pass through. It's more like one thread comes into a parallel construct and kicks off some work that idle threads rush in to help finish, like a bunch of worker bees that hang around until the work is done and then flit off to find more work, leaving the master thread to continue on after the parallel construct. Unless you need some critical section or monitor within the body of the operator() (for which you could use a TBB scoped lock), I don't understand why you would need a barrier.
chowdarys
Beginner
Hi,
Your answer solved my question. I wanted to make sure that after the parallel_for finishes, only the master thread (or only one thread) does the work outside the parallel construct. My intention was to set up an explicit barrier if that were not the case. However, it is clear from your answer that my intention was wrong. Thanks for the reply.