When we use a pipeline, do we need to specify tbb::task_scheduler_init? Sometimes, if I specify the number of threads explicitly, I get higher performance than with ::automatic. Why is that? Which is the better way: specifying the size, or using ::automatic?
I have a few more questions:
- Is it possible to use parallel_for inside a pipeline? Will it give high performance?
- In parallel_for, is it required that we declare operator() as const? Can't we declare it as a normal function, i.e. void operator()( const blocked_range& r ) {}? I am changing an instance variable, but if operator() is const we can't change the value.
- If I am using a Core 2 Duo or a quad core, do we have to assign work to every core, or will the jobs be assigned automatically?
12 Replies
> which is good way, either mentioning the size or saying ::automatic?
Since you are asking the question, the answer is ::automatic :)
> Is it possible to use parallel_for inside pipeline. will it give high performance?
Maybe. But most likely No.
> Because I am changing instant variable, but if we mention const we cant change the value.
You can. Just use const_cast/mutable/indirection/etc.
> do we want to assign job to every core? or it will automatically assign the job
It's one of the main points of TBB: automatic scheduling, work distribution, and load balancing. So the answer is no, you do not need to assign a job to every core manually; TBB will do that for you. Just create enough tasks.
> Is it possible to use parallel_for inside pipeline. will it give high performance?
I'd say that this is at least harmless (in terms of performance). And as pipeline-based solutions often have limited scalability, having a nested parallel_for will help to ensure full utilization of the available hardware.
Thus if the workload processed by your pipeline filter(s) is sufficiently large, you are likely to benefit from using parallel_for inside them.
Quoting Andrey Marochko (Intel)
> Is it possible to use parallel_for inside pipeline. will it give high performance?
I'd say that this is at least harmless (in terms of performance). And as pipeline-based solutions often have limited scalability, having a nested parallel_for will help to ensure full utilization of the available hardware.
Thus if the workload processed by your pipeline filter(s) is sufficiently large, you are likely to benefit from using parallel_for inside them.
Hi Andrey,
Can you go into more detail here?
A pipeline typically consists of a serial input IO stage + a serial output IO stage + a set of parallel computational stages. If the performance is limited by the IO stages, then parallel_for gains nothing because it's IO. And if the performance is limited by the parallel computational stages, then parallel_for gains nothing again because all cores are already loaded.
Do you have in mind a pipeline consisting only of serial stages?
Not necessarily only serial stages. In general, a pipeline's performance is limited by the throughput of its slowest stage. And with many real workloads, the throughput of individual stages may vary over time. Nested parallelism will increase the pipeline's resistance to such imbalances.
E.g., in your example (serial IO filter - parallel filter - serial IO filter), the input filter may produce an item containing a large amount of work, and then, after a pause, a bunch of smaller items. If the parallel stage does not have nested parallelism, then during that pause the hardware will be sorely underutilized.
The more complex the pipeline structure is, the more opportunities there are for nested parallelism to maintain high utilization.
Quoting Andrey Marochko (Intel)
E.g., in your example (serial IO filter - parallel filter - serial IO filter), the input filter may produce an item containing a large amount of work, and then, after a pause, a bunch of smaller items. If the parallel stage does not have nested parallelism, then during that pause the hardware will be sorely underutilized.
Ah, I see, so you are talking about the situation when the performance of the stages is roughly equal, but the workflow is bursty and irregular.
I think you're right. Nested parallelization can help here.
Btw, can't that be solved by restructuring the pipeline: introducing additional stages and splitting big work items... or, perhaps even simpler, the input IO stage could issue only small work items (ones that are not feasible to divide further), i.e. divide big items into several small pieces. I think that would do the trick in some cases.
Since you are asking the question, the answer is ::automatic :)
:)
I need to know how to do it manually, because I got better performance than with ::automatic. Can you give a small description of that?
> Is it possible to use parallel_for inside pipeline. will it give high performance?
Maybe. But most likely No.
But I got better performance by using both; I didn't try with only a parallel_for loop.
pipeline: 79 sec
pipeline + parallel_for: 55 sec
I got the above performance for background subtraction.
Video data
Number of frames: 906
fps: 25
I have one more question: how do I control threads in a pipeline? I have three stages in my pipeline; the first and final stages run serially, and the middle one runs in parallel. But the middle stage needs much more time to do its job than the others, so I want to give it more priority than the others. How can I do that? Is it possible?
Try instantiating the thread pool with 2 more threads than you have hardware threads. Keep the parallel_for in the middle.
This won't be a perfect solution, but I think you will see better performance. If/when your app exits the pipeline phase and enters another computational phase, then consider closing the TBB session (with the +2 threads) and starting a new TBB session with the default number of threads.
This should be an easy enough experiment for you to perform.
BTW
In QuickThread, for this type of problem, we configure the thread pool with the number of compute-class threads == the number of hardware threads, and the number of I/O-class threads == the number of I/O pipe ends. Our parallel pipelines scale quite well. Now, as to using parallel_for in the middle pipe (in QuickThread), this would depend on the I/O performance of the system. From your description it appears that the I/O is a very small portion of the process, therefore the output end will be starved for data, and the input end will starve once all the tokens (buffers) are stacked up behind the middle pipe. With this in mind, I think you will need +1 thread instead of +2. But try both levels of oversubscription.
Jim Dempsey
I believe forceful manual partitioning of large data blocks in the input filter will generally be less efficient than the adaptive partitioning done by parallel_for in the processing stage. Don't you think so :) ?
Hi Andrey,
Adaptive partitioning is not inherent to parallel_for. So what about pipeline adaptive partitioning vs. parallel_for adaptive partitioning? I believe that pipeline adaptive partitioning can be more efficient in general because it's "higher-level", so to say; it's able to make more knowledgeable decisions (parallel_for decides whether to split or not based only on the parallel_for's current splitting, which may be sub-optimal).
So I think that, taking into account the limitations of the current TBB version (no ability to do adaptive partitioning at the pipeline level), manual pipeline partitioning is indeed generally less efficient than parallel_for adaptive partitioning, but that's just a disappointing mistake ;)
The main conclusion for me is that in your original example ("input filter may produce an item containing a large amount of work, and then after a pause a bunch of smaller items") the root problem is an item containing a large amount of work (too coarse-grained tasks). And too coarse-grained tasks can always sacrifice parallelization; a parallel programming system must provide a means to fight too coarse-grained tasks one way or another.
Ok, the point of the previous post is just that it's not that easy to catch me on an incorrect statement :)
The granularity of pipeline items may be dictated by the problem (think of video conversion, where a logical item is a frame); of course, it's theoretically possible to use another decomposition, but that would raise complexity significantly. And the performance of the various pipeline stages depends highly on the environment (number of CPUs, CPU performance, number of disks, disk performance, whether file data is cached or not, activity of external processes), so in general it's impossible to predict the performance ratio between, and the evenness of, the various stages. So there are good reasons to do nested parallelization with parallel_for (tasks/whatever).
I have to agree that I was not right.
Good to see people coming around :) Sometimes I also tend to be stubborn for a while until the new vision sinks in :)
"in parallel for, is it must that we have to make operator() as const?"
Why would you want to change a throw-away copy?