Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Do we need to specify the number of threads?

hitach__0
Beginner
351 Views
When we use a pipeline, do we need to specify tbb::task_scheduler_init? Sometimes, if I specify the number of threads explicitly, I get higher performance than with ::automatic. Why is that? Which is the better way: specifying the size or using ::automatic?
I have some more questions:
  • Is it possible to use parallel_for inside a pipeline? Will it give high performance?
  • In parallel_for, is it mandatory to declare operator() as const? Can't we declare it as a normal member function, e.g. void operator()( const tbb::blocked_range<size_t>& r ) {}? I am changing an instance variable, but if operator() is const we cannot change the value.
  • If I am using a Core 2 Duo or a quad-core CPU, do we need to assign a job to every core, or will the jobs be assigned automatically?
12 Replies
Dmitry_Vyukov
Valued Contributor I
> which is good way, either mentioning the size or saying ::automatic?

Since you are asking the question, the answer is ::automatic :)

> Is it possible to use parallel_for inside pipeline. will it give high performance?

Maybe. But most likely No.

> Because I am changing instant variable, but if we mention const we cant change the value.

You can. Just use const_cast/mutable/indirection/etc.

> do we want to assign job to every core? or it will automatically assign the job

That's one of the main points of TBB: automatic scheduling, work distribution, and load balancing. So the answer is No, you do not need to assign a job to every core manually; TBB will do that for you, just create enough tasks.
Andrey_Marochko
New Contributor III
> Is it possible to use parallel_for inside pipeline. will it give high performance?

I'd say that this is at least harmless (in terms of performance). And as pipeline-based solutions often have limited scalability, having a nested parallel_for will help to ensure full utilization of the available hardware.

Thus if the workload processed by your pipeline filter(s) is sufficiently large, you are likely to benefit from using parallel_for inside them.

Dmitry_Vyukov
Valued Contributor I
> Is it possible to use parallel_for inside pipeline. will it give high performance?

> I'd say that this is at least harmless (in terms of performance). And as pipeline-based solutions often have limited scalability, having a nested parallel_for will help to ensure full utilization of the available hardware.
>
> Thus if the workload processed by your pipeline filter(s) is sufficiently large, you are likely to benefit from using parallel_for inside them.

Hi Andrey,

Can you go in more detail here?

A pipeline typically consists of a serial input IO stage + a serial output IO stage + a set of parallel computational stages. If the performance is limited by the IO stages, then parallel_for gains nothing because it's IO-bound. And if the performance is limited by the parallel computational stages, then parallel_for gains nothing again because all cores are already loaded.

Do you have in mind a pipeline consisting only of serial stages?

Andrey_Marochko
New Contributor III
Not necessarily only serial stages. In general, a pipeline's performance is limited by the throughput of its slowest stage. And with many real workloads the throughput of individual stages may vary over time. Nested parallelism will increase the pipeline's resistance to such imbalances.

E.g., in your example (serial IO filter - parallel filter - serial IO filter), the input filter may produce an item containing a large amount of work, and then after a pause a bunch of smaller items. If the parallel stage does not have nested parallelism, then during that pause the hardware will be sorely underutilized.

The more complex the pipeline structure is, the more opportunities exist for nested parallelism to maintain high utilization.
Dmitry_Vyukov
Valued Contributor I
> E.g., in your example (serial IO filter - parallel filter - serial IO filter), the input filter may produce an item containing a large amount of work, and then after a pause a bunch of smaller items. If the parallel stage does not have nested parallelism, then during that pause the hardware will be sorely underutilized.

Ah, I see, so you are talking about the situation when the performance of the stages is roughly equal, but the work flow is bursty and irregular.
I think you're right. Nested parallelization can help here.
Btw, couldn't that be solved by restructuring the pipeline, i.e. introducing additional stages and splitting big work items... or, perhaps, even simpler: the input IO stage could issue only small work items (that are not feasible to divide further), i.e. divide big items into several small pieces. I think that would do the trick in some cases.


hitach__0
Beginner
> Since you are asking the question, the answer is ::automatic :)

:)
I need to know how to do it manually, because I got better performance than with ::automatic. Can you give a small description of that?
> Is it possible to use parallel_for inside pipeline. will it give high performance?
> Maybe. But most likely No.

But I got better performance by using both; I didn't try with only the parallel_for loop.
pipeline: 79 sec
pipeline + parallel_for: 55 sec
I got the above performance for background subtraction.
Video data
Number of frames: 906
fps: 25
I have one more question: how do I control the threads in a pipeline? I have three stages; the first and final stages run serially, and the middle one runs in parallel, but the middle one needs a lot more time to do its job than the others. So I want to give the middle stage more priority than the others. How can I give more priority to the middle stage? Is it possible?
jimdempseyatthecove
Honored Contributor III
Try instantiating the thread pool with 2 more threads than you have hardware threads. Keep the parallel_for in the middle.

This won't be a perfect solution, but I think you will see better performance. If/when your app exits the pipeline phase and enters another computational phase, then consider closing the TBB session (with the +2 threads) and starting a new TBB session with the default number of threads.

This should be an easy enough experiment for you to perform.

BTW

In QuickThread, for this type of problem, we configure the thread pool with the number of compute-class threads equal to the number of hardware threads, and the number of I/O-class threads equal to the number of I/O pipe ends. Our parallel pipelines scale quite well. Now, as to whether to use parallel_for in the middle pipe (in QuickThread), this would depend on the I/O performance of the system. From your description it appears that the I/O is a very small portion of the process, therefore the output end will be starved for data, and the input end will starve once all the tokens (buffers) are stacked up behind the middle pipe. With this in mind, I think you will need +1 thread instead of +2. But try both levels of oversubscription.

Jim Dempsey
Andrey_Marochko
New Contributor III
I believe forceful manual partitioning of large data blocks in the input filter will be generally less efficient than adaptive partitioning done by parallel_for in the processing stage. Don't you think so :) ?
Dmitry_Vyukov
Valued Contributor I
Hi Andrey,

Adaptive partitioning is not inherent to parallel_for. So what about pipeline adaptive partitioning vs. parallel_for adaptive partitioning? I believe that pipeline adaptive partitioning can be more efficient in general because it's "higher-level", so to say; it's able to make more knowledgeable decisions (parallel_for decides whether or not to split based only on its own current splitting, which may be sub-optimal).
So I think, taking into account the limitations of the current TBB version (no ability to do adaptive partitioning at the pipeline level), manual pipeline partitioning is indeed less efficient than parallel_for adaptive partitioning in general, but that's just a disappointing mistake ;)
The main conclusion for me is that in your original example ("input filter may produce an item containing a large amount of work, and then after a pause a bunch of smaller items") the root problem is an item containing a large amount of work (too coarse-grained tasks). And too coarse-grained tasks can always undermine parallelization; a parallel programming system must provide a means to fight too coarse-grained tasks one way or another.

Dmitry_Vyukov
Valued Contributor I
Ok, the point of the previous post is just that it's not that easy to catch me on an incorrect statement :)
The granularity of pipeline items may be dictated by the problem (think of video conversion, where the logical item is a frame) (of course, it's theoretically possible to use another decomposition, but that would raise complexity significantly). And the performance of the various pipeline stages depends highly on the environment (number of CPUs, CPU performance, number of disks, disk performance, whether file data is cached or not, activity of external processes), so in general it's impossible to predict the performance ratio between, and the evenness of, the various stages. So there are good reasons to do nested parallelization with parallel_for (tasks/whatever).
I have to agree that I was not right.
Andrey_Marochko
New Contributor III
Good to see people coming around :) Sometimes I also tend to be stubborn for a while until the new vision sinks in :)
RafSchietekat
Valued Contributor III
"in parallel for, is it must that we have to make operator() as const?"
Why would you want to change a throw-away copy?