Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

pipeline vs task (vs raw thread?)

softarts
Beginner
After reading several articles about the pipeline, some questions came to me.
Is there any advantage to a pipeline over raw threads?
I.e., I have divided my job among several raw threads; they communicate with each other via queues (lock-free/concurrent, etc.).
Will there be any benefit from pipeline mode?

BTW, I think the tbb::task pattern is easy to understand, but can anyone tell me its advantages over raw threads?
Benchmark/test result comparisons are appreciated.
RafSchietekat
Valued Contributor III
"is there any advantage in pipeline vs raw thread?"
tbb::pipeline will mostly avoid dragging data across caches (cache locality is a big concern for performance): the filters will tend to make house calls instead (and since code doesn't change it can live in several caches at the same time).
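
To make that concrete, here is a minimal, hedged sketch of such a job written against the classic tbb::pipeline interface (the Item type and the squaring step are placeholders for real work; the filter names echo the text_filter example):

    #include <cstdio>
    #include "tbb/pipeline.h"
    #include "tbb/task_scheduler_init.h"

    struct Item { int value; };

    // Serial, in-order input stage: emits 100 items, then NULL to stop.
    class MyInputFilter : public tbb::filter {
        int next;
    public:
        MyInputFilter() : tbb::filter(tbb::filter::serial_in_order), next(0) {}
        void* operator()(void*) {
            if (next >= 100) return NULL;       // NULL ends the stream
            Item* it = new Item;
            it->value = next++;
            return it;
        }
    };

    // Parallel middle stage: may run on several items concurrently.
    class MyTransformFilter : public tbb::filter {
    public:
        MyTransformFilter() : tbb::filter(tbb::filter::parallel) {}
        void* operator()(void* p) {
            Item* it = static_cast<Item*>(p);
            it->value *= it->value;             // placeholder for real work
            return it;
        }
    };

    // Serial, in-order output stage: sees items in the input filter's order.
    class MyOutputFilter : public tbb::filter {
    public:
        MyOutputFilter() : tbb::filter(tbb::filter::serial_in_order) {}
        void* operator()(void* p) {
            Item* it = static_cast<Item*>(p);
            std::printf("%d\n", it->value);
            delete it;
            return NULL;
        }
    };

    int main() {
        tbb::task_scheduler_init init;          // start the worker threads
        MyInputFilter in;
        MyTransformFilter mid;
        MyOutputFilter out;
        tbb::pipeline p;
        p.add_filter(in);
        p.add_filter(mid);
        p.add_filter(out);
        p.run(8);                               // at most 8 items ("tokens") in flight
        p.clear();
        return 0;
    }

The queues, the hand-off between stages, and the token accounting all live inside tbb::pipeline; you only write the filter bodies.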

"BTW,I think TBB::task pattern is easy to understand,but anyone can tell its advantages vs raw thread?"
TBB will take care of several things related to getting the most out of the machine that would otherwise distract you, if you can accept that it sacrifices fairness and currently sometimes even global progress for throughput-oriented performance. Instead of wasting time reinventing the wheel (and getting the bumps out), you can let TBB's scheduler provide better cache locality, scalability, etc.
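
For a picture of the task pattern itself, this is essentially the blocking-style Fibonacci example from the TBB tutorial, reproduced as a sketch rather than tuned code (a real version would cut over to serial execution below some threshold):

    #include <cstdio>
    #include "tbb/task.h"
    #include "tbb/task_scheduler_init.h"

    class FibTask : public tbb::task {
        const long n;
        long* const sum;
    public:
        FibTask(long n_, long* sum_) : n(n_), sum(sum_) {}
        tbb::task* execute() {                  // may run on any worker thread
            if (n < 2) {
                *sum = n;
            } else {
                long x, y;
                FibTask& a = *new(allocate_child()) FibTask(n - 1, &x);
                FibTask& b = *new(allocate_child()) FibTask(n - 2, &y);
                set_ref_count(3);               // two children + one for the wait
                spawn(b);
                spawn_and_wait_for_all(a);      // blocking style: parent waits here
                *sum = x + y;
            }
            return NULL;
        }
    };

    int main() {
        tbb::task_scheduler_init init;          // scheduler sizes itself to the machine
        long sum;
        FibTask& root = *new(tbb::task::allocate_root()) FibTask(30, &sum);
        tbb::task::spawn_root_and_wait(root);
        std::printf("fib(30) = %ld\n", sum);
        return 0;
    }

Note that the code never mentions a thread count: the same binary uses 2, 4, or 8 workers as available.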

"benchmark/test result comparision are appreciated."
Anyone?
Charles_Tucker
Beginner
"BTW,I think TBB::task pattern is easy to understand,but anyone can tell its advantages vs raw thread?"

I would argue that the advantage of task-based parallelism over raw threads is twofold.

The first advantage is portability across different numbers of cores. Given 2-, 4-, and 8-core machines, you have to either over-provision the number of threads (i.e., run 8 threads on the 2-core machine) or write three different versions of the code to avoid running too many threads. Tasking handles this naturally, dividing the work among however many worker threads are in use.

The second advantage is automatic load balancing. If your parallel workload includes jobs whose amount of computation is dynamic, it's impossible to statically load-balance those jobs across threads. Task schedulers deal with this by asking the programmer to express all available parallelism in the program and by keeping unassigned tasks around for "bored" worker threads to grab and execute.
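
Both advantages show up in even a trivial loop; here is a hedged sketch (ApplyWork and the doubling step are invented stand-ins for variable-cost work):

    #include <cstddef>
    #include "tbb/parallel_for.h"
    #include "tbb/blocked_range.h"
    #include "tbb/task_scheduler_init.h"

    struct ApplyWork {
        double* const data;
        ApplyWork(double* d) : data(d) {}
        void operator()(const tbb::blocked_range<size_t>& r) const {
            for (size_t i = r.begin(); i != r.end(); ++i)
                data[i] *= 2.0;                 // stand-in for variable-cost work
        }
    };

    int main() {
        tbb::task_scheduler_init init;          // as many workers as the machine has
        static double data[10000];
        for (size_t i = 0; i < 10000; ++i) data[i] = (double)i;
        // The range is split recursively into chunks; idle workers steal
        // chunks, which is the automatic load balancing described above.
        tbb::parallel_for(tbb::blocked_range<size_t>(0, 10000), ApplyWork(data));
        return 0;
    }

Nothing here names a thread count, so the same code is "portable" across core counts in the sense above.
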
softarts
Beginner
Quoting - Charles Tucker
"BTW,I think TBB::task pattern is easy to understand,but anyone can tell its advantages vs raw thread?"

I would argue that the advantage of task-based parallelism over raw thread implementations are two fold.

The first advantage is portability to multiple numbers of cores. Given a 2-, 4-, and 8-threaded machine, you have to either over-provision the number of threads (i.e. 8 threads on the 2-core machine) or write three different versions of the code, to avoid running too many threads. Tasking does this naturally, dividing the work among how ever many worker threads are being used.

The second advantage is automatic load balancing. If your parallel workload includes jobs that are dynamic in the amount of computation they do, it's impossible to statically load-balance these jobs across threads. Task schedulers deal with this by asking the programmer to express all available parallelism in the program and keeping unassigned tasks around for "bored" worker threads to grab and do.

Can the pipeline guarantee that data goes through the pipeline in its original order?
That is not easy to do with raw threads.
RafSchietekat
Valued Contributor III
"can pipeline grauantee the data go through the pipeline as its original order?"
Please consult http://www.threadingbuildingblocks.org/, Documentation, Reference Manual (Open Source or Commercial), and look for "serial_in_order". Note that this is only guaranteed to be the order of the input filter if that input filter is also serial_in_order (which might be made more explicit in the documentation).

(Correction) Only "Reference (Open Source).pdf" currently mentions "serial_in_order", perhaps because it is revision 1.13 vs. 1.10 for the commercial documentation, but maybe that just lags the actual software (I have not checked).
Alexey-Kukanov
Employee
Quoting - softarts
Can the pipeline guarantee that data goes through the pipeline in its original order?
That is not easy to do with raw threads.

All ordered serial filters in the pipeline process data in the same order, as established by the first such filter.
So you can make your first filter ordered, as well as every other filter where the order is important.
Alexey-Kukanov
Employee
Quoting - Raf Schietekat
Note that this is only guaranteed to be the order of the input filter if that input filter is also serial_in_order (which might be made more explicit in the documentation).


Thanks for pointing out yet another place to improve our docs.
After internal discussion, we would like to add the following tip right after the introduction of stage types:

"Tip: Use a serial_in_order input filter if there are any subsequent serial_in_order stages that should process items in their input order."

The question to everyone: would such a tip be sufficient to make it more obvious how to process items in the order of input?

RafSchietekat
Valued Contributor III

I also thought that "in no particular order" looks rather scary (although I wouldn't promise FIFO either), but then I'm an excellent nitpicker if nothing else. :-)

softarts
Beginner


Quoting - Alexey Kukanov
"Tip: Use a serial_in_order input filter if there are any subsequent serial_in_order stages that should process items in their input order." [...]

How about a parallel filter? I.e., for class MyTransformFilter in the text_filter example, how does the pipeline guarantee that the 'MyBuffer' items will enter 'MyOutputFilter' in the correct order, given that 'MyTransformFilter' is not a serial filter?
RafSchietekat
Valued Contributor III
"how about the parallel filter? i.e. class MyTransformFilter in text_filter, how does pipeline grauantee the 'MyBuffer' will enter 'MyOutputFilter' in correct order? because 'MyTransformFilter is not serial filter"
Conceptually, the first serial_in_order filter gives each data item a ticket from a pipeline-global dispenser, which the data item has to present to each subsequent serial_in_order filter (other kinds of filters just ignore it). When I was playing around with the code (although I haven't completed yet what I set out to do), I thought it might be useful to make the metaphor more explicit by substituting names like ticket_dispenser, now_serving, next_please, although I didn't go as far as renaming input_buffer to waiting_room (yet). :-)
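
For illustration only (emphatically not TBB's actual code), here is a conceptual sketch of that metaphor: out-of-order arrivals are parked in the waiting room (a buffer) rather than blocking any thread, and do_work stands in for the filter body:

    #include <map>

    typedef unsigned long Ticket;

    // One serial_in_order stage, seen as a ticket window with a waiting room.
    // Assumes calls to arrive() are externally serialized, as they effectively
    // are for a serial filter.
    class InOrderStage {
        Ticket now_serving;                     // next ticket to be admitted
        std::map<Ticket, void*> waiting_room;   // items that arrived too early
    public:
        InOrderStage() : now_serving(0) {}

        void arrive(Ticket ticket, void* item) {
            if (ticket != now_serving) {
                waiting_room[ticket] = item;    // too early: wait in the buffer
                return;
            }
            for (;;) {                          // admit, then drain any successors
                do_work(item);
                ++now_serving;
                std::map<Ticket, void*>::iterator it = waiting_room.find(now_serving);
                if (it == waiting_room.end()) break;
                item = it->second;
                waiting_room.erase(it);
            }
        }
    private:
        void do_work(void*) {}                  // placeholder for the filter body
    };

    int main() {
        InOrderStage stage;
        int a = 0, b = 1, c = 2;
        stage.arrive(1, &b);                    // early: parked in the waiting room
        stage.arrive(0, &a);                    // 0 is served, then 1 is drained
        stage.arrive(2, &c);
        return 0;
    }
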
Alexey-Kukanov
Employee
Quoting - softarts
How about a parallel filter? I.e., for class MyTransformFilter in the text_filter example, how does the pipeline guarantee that the 'MyBuffer' items will enter 'MyOutputFilter' in the correct order, given that 'MyTransformFilter' is not a serial filter?

It does not matter in which order items are put into the buffer of an ordered filter. The proper ordering is established when the items are taken from the buffer to be processed by the filter. If the item that should be processed next has not yet arrived, all other items will wait in the buffer.
RafSchietekat
Valued Contributor III
Quoting - Raf Schietekat

I also thought that "in no particular order" looks rather scary (although I wouldn't promise FIFO either), but then I'm an excellent nitpicker if nothing else. :-)

(Corrected) I have removed what I wrote here before, with my apologies both for having written it in a distracted state of mind (even forgetting things I already knew) and for then cowardly hiding it. Here's a cleaned-up version:

I mean that there is no promise to prevent pipeline-local starvation of any particular data item, which, as far as I can tell, can in fact happen in a parallel filter (but see below), because, even though a pipeline's filters are given different priorities in an effort to promote pipeline-local progress, such a data item might be held captive in a thread that is now busy executing tasks not related to the pipeline.

The maximum number of tokens in flight will ensure that this will not cause an unbounded backlog in the second of two serial_in_order filters that occur somewhere before and after the parallel filter (and I have to go check what this means for my pipeline reimplementation), but this will instead cause the entire pipeline to stall.

Does this seem plausible?
Alexey-Kukanov
Employee
Quoting - Raf Schietekat
I mean that there is no promise to prevent pipeline-local starvation of any particular data item, which, as far as I can tell, can in fact happen in a parallel filter [...] Does this seem plausible?

I think you are correct, though I probably do not understand why you stressed the parallel filter.
However, I am not sure this level of detail makes sense in the documentation; it is too implementation-specific for my taste.
RafSchietekat
Valued Contributor III
"I think you are correct, though I probably do not understand why you stressed the parallel filter."
In a parallel filter, a data item can suffer an unbounded delay relative to other data items passing through the same pipeline. This does not happen in a serial filter, either incidentally, because of the current implementation (of either type), or inherently, because of the specification (for all serial_in_order filters but the first).

"However I am not sure if this level of details makes sense in the documentation; it's too implementation specific to my sense."
I would tend to agree if we were talking about a possible warning not to use a microwave oven to dry pets, but in this case I would probably be somewhat cross if my application malfunctioned because of something somebody knew but didn't care to tell me.
Alexey-Kukanov
Employee
Quoting - Raf Schietekat
"I think you are correct, though I probably do not understand why you stressed the parallel filter."
In a parallel filter, a data item can suffer an unbounded delay relative to other data items passing through the same pipeline. This does not happen in a serial filter, incidentally because of their current implementation (either type), or inherently because of their specification (all but the first serial_in_order filters).

"However I am not sure if this level of details makes sense in the documentation; it's too implementation specific to my sense."
I would tend to agree if we were talking about a possible warning not to use a microwave oven to dry pets, but in this case I would probably be somewhat cross if my application malfunctioned because of something somebody knew but didn't care to tell me.

Thanks for the clarification; now I think I really understand what you meant. So this is about the "no particular order" note for unordered filters, how much reordering there could be, and the progress of some items relative to their peers.

A caution could be added, though I am not sure what it should say beyond what is already said. Also, I do not think the "unbounded delay" problem will ever arise in practice. As far as I remember, there are just two places in the implementation where tasks get spawned: at the start of the pipeline, and in note_done after processing by a serial filter. The former creates input tasks, and those do not carry any item yet, so there is no "unbounded delay" problem there. The latter spawns a task to process the serial filter, and advances the current task to the next filter, which might be parallel. So in the current implementation there are no parallel-stage tasks sitting in the pools.
RafSchietekat
Valued Contributor III
In the current implementation, in a parallel filter an unbounded relative delay would indeed not occur before execution starts, but it could still occur during the task's execution (because of stealing, i.e., not considering inherent differences in task execution times or external blocking). In a serial_out_of_order filter or the first serial_in_order filter, it's the current FIFO input-buffer implementation that prevents unbounded delays (or, as you write, in the input buffer it wouldn't matter).

In both cases, a different implementation is possible: in the latter case, a conceivably more efficient input buffer organised as several affinity-based FIFO lanes would require special remedies to prevent unbounded delays; in the former case, a dynamic number of worker threads could be orchestrated to avoid a partially executed task's getting trapped under a load of other work while still keeping each core mostly busy. But I believe we've been there already.
Alexey-Kukanov
Employee
If we are not considering nested blocking-style parallelism, I do not see how a task could be "delayed" before or during execution, even by stealing. If we are considering nested blocking-style parallelism, then the executing task could be blocked for longer than necessary to complete the nested computation. And with the recent task-pool changes that also make stealing not limited by depth, I agree it could cause a relative delay, bounded by the allowed number of simultaneously active tokens, as you mentioned before.

I still have trouble devising a caution that is specific enough not to just scare the reader (such as "in the worst case, a token can be completed after N-1 tokens that it would precede in sequential execution"), and yet high-level enough not to bury the reader under the amount of implementation-specific detail we are discussing.
RafSchietekat
Valued Contributor III
"If not considering nested blocking style parallelism, I do not see how a task could be "delayed" before or during execution, even by stealing."
Stealing without nested blocking-style parallelism? Anyway, this is not the current situation.

"If considering nested blocking style parallelism, then the executing task could be blocked for longer time than necessary to complete the nested computation."
You and I know that (now), but we're not the intended audience of such a statement...

"And with the recent task pool changes that also make stealing not limited by depth, I agree it could cause relative delay, bounded by the allowed number of simultaneously active tokens as you mentioned before."
This is about unbounded relative delays, a.k.a. starvation.

"I still have troubles to devise a caution that would be specific enough to not just scary the reader (such as "in the worst case, a token can be completed after N-1 tokens that it would precede in sequential execution"), and yet high-level enough to not bury the reader under the amount of implementation specific details, like we are discussing."
I'm sure you'll find a way to sugarcoat what the user should know. :-)