Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Pipeline

RafSchietekat
Valued Contributor III

In tbb21_20081109oss, shouldn't tbb::pipeline::end_of_input be atomic (at least for pedantic reasons), and what's the deal with tbb::pipeline::token_counter (next_token_number() is executed with a lock on stage_task, not pipeline)? Is there or could somebody write a user guide that makes clear how the parts fit together?

RafSchietekat
Valued Contributor III

Here are the sources preprocessed for my walnut-sized brain to have any chance of understanding what's going on (hey, I was watching TV at the same time...). Perhaps they might inspire one of you to study and explain them (I'll need some more energy after I get to understand them myself)? There's a list of changes in pipeline.h.

Alexey-Kukanov
Employee
Quoting - Raf Schietekat

Here are the sources preprocessed for my walnut-sized brain to have any chance of understanding what's going on (hey, I was watching TV at the same time...). Perhaps they might inspire one of you to study and explain them (I'll need some more energy after I get to understand them myself)? There's a list of changes in pipeline.h.


After a cursory look, I can say that some of the changes would break binary compatibility with older TBB versions, and thus they would have to wait until the pipeline is reimplemented within a new set of classes (which will have to have a version suffix according to TBB versioning rules, e.g. pipeline_v4).

Meanwhile we continue improving the pipeline functionality, though the issue of simultaneous progress on multiple pipelines discussed in another thread is not yet targeted. The last changes introduced support for serial out-of-order filters, and more work is in progress.

Alexey-Kukanov
Employee
Quoting - Raf Schietekat

In tbb21_20081109oss, shouldn't tbb::pipeline::end_of_input be atomic (at least for pedantic reasons), and what's the deal with tbb::pipeline::token_counter (next_token_number() is executed with a lock on stage_task, not pipeline)? Is there or could somebody write a user guide that makes clear how the parts fit together?


end_of_input only changes once per pipeline run, from false to true, to indicate there is no more input. For serial input filters, there is no data race between this write and other reads, because another task to take the next input item will not be spawned after the end of input was reached. For parallel input filters, there is a data race but it is benign, because parallel input filters should be tolerant to latecomers anyway, i.e. the filter might be called a few times even after the end of input was reached.

The token counter is incremented in the very first serial ordered filter in the pipeline. Since the filter is serial, it holds a lock on its internal buffer of pending items. So it makes sense to increase the token number while holding this lock, rather than to introduce a special lock in the pipeline. In other words, the lock is there for more reasons than just counting tokens.
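A sketch of that idea (simplified, invented names; not the actual TBB source):

#include "tbb/spin_mutex.h"

typedef unsigned long Token;

struct ordered_buffer_sketch {
    tbb::spin_mutex array_mutex;   // protects the buffer of pending items
    Token* token_counter;          // points at the pipeline-wide counter

    Token put_item_and_assign_token(/* item */) {
        // The lock is taken anyway to insert the item into the buffer...
        tbb::spin_mutex::scoped_lock lock(array_mutex);
        // ...so the counter increment rides along for free. This is safe
        // because only one filter can be "the first serial ordered" one.
        return (*token_counter)++;
    }
};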

The idea of explaining more about how the parts of the pipeline implementation fit together definitely makes sense. As I said in another message, we are working on further improvements in the pipeline, which might involve some refactoring as well. During the course of this work, we will put more comments in place, and might write some formal design documents.

RafSchietekat
Valued Contributor III

"there is a data race but it is benign" I tend to side with Hans Boehm here ("no such beast"), no matter if the atomic may well make no difference at the machine code level.

"The token counter is incremented in the very first serial ordered filter in the pipeline. Since the filter is serial, it contains a lock on its internal buffer of pending items. So it makes sense to increase the token number when holding this lock, rather than introduce a special lock in the pipeline. In other words, the lock is there for more reasons than just counting tokens." If there's a unique ordered_buffer (not a "stage_task" like I wrote by mistake) whose mutex governs access to the pipeline's token_counter, that would indeed do the trick, but I probably need more input before I'll be able to see the light (it doesn't help that it's well into the night now).

"The idea of explaining more about how the parts of pipeline implementation fit together definitely makes sense. As I said in another message, we are working on further improvements in the pipeline, which might involve some refactoring as well. During the course of this work, we will put more comments in place, and might be write some formal design documents." That may be too late for me. The problem is that it takes too much time and energy to come up with various hypotheses and test them, compared to verifying documentation, before getting to the creative part.

Alexey-Kukanov
Employee
Quoting - Raf Schietekat
"there is a data race but it is benign" I tend to side with Hans Boehm here ("no such beast"), no matter if the atomic may well make no difference at the machine code level.

My argument was not about difference at machine code level, but about no harm due to this race. Its only consequence is that some threads may receive the end_of_input signal somewhat later in case of parallel input, and it's harmless because the parallel input filter must be tolerant to latecomers anyway.

Quoting - Raf Schietekat
If there's a unique ordered_buffer (not a "stage_task" like I wrote by mistake) whose mutex governs access to the pipeline's token_counter, that would indeed do the trick, but I probably need more input before I'll be able to see the light (it doesn't help that it's well into the night now).

The buffers are per serial filter, and the lock is per buffer. Every serial filter uses the lock to serialize buffer operations. The very first serial ordered filter also uses this lock to put tokens in order. Since only one filter can be "the first", there is a unique lock governing access to the pipeline's token counter.

Quoting - Raf Schietekat
That may be too late for me. The problem is that it takes too much time and energy to come up with various hypotheses and test them, compared to verifying documentation, before getting to the creative part.

I understand and agree. At this point, however, I see little sense in formally documenting the current design while it is being changed; I must admit, though, that the set of classes will likely remain even if their responsibilities change.

Let me suggest that you ask more questions here (including your hypotheses), and I will be glad to answer them if it saves your time.

To start, I will briefly describe the classes and their main intent (probably not much news there):

- tbb::pipeline is the main class for the algorithm; it contains a sequence of filters, provides the API, and encapsulates the mechanics of how to form and start the pipeline.

- tbb::filter is the base class for user-defined filters; it represents filter abstraction for the pipeline. Its pure virtual function call operator should be overridden by user-defined descendant classes to implement the logic of data processing on different pipeline stages.

- tbb::internal::ordered_buffer represents the temporary storage for items that cannot be processed immediately by a serial filter. Its name reflects the historical fact that serial filters were always ordered until very recently; now serial filters can be out-of-order, but the requirement of processing items one at a time still dictates the need for a buffer. The buffer is dynamically created for every serial filter and can grow in size during execution. Operations on the buffer are protected by the buffer-specific lock.

- tbb::internal::stage_task is a descendant of tbb::task that represents pipeline operations to the scheduler. Execution of a stage_task applies one filter to one data item, and spawns other stage_tasks for subsequent operations on the data, as well as for fetching new data (as defined by the very first, or input, filter).
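To make the picture concrete, here is a minimal usage sketch from the public side (assuming the API of tbb21_20081109oss, where a filter is constructed with a mode such as serial_in_order or parallel; this is my illustration, not code from the package): a serial input filter producing integers, followed by a parallel filter printing them.

#include <cstdio>
#include "tbb/pipeline.h"
#include "tbb/task_scheduler_init.h"

class InputFilter: public tbb::filter {
    int count;
public:
    InputFilter() : tbb::filter(serial_in_order), count(0) {}
    void* operator()(void*) {
        // Returning NULL tells the pipeline there is no more input.
        return count < 10 ? new int(count++) : NULL;
    }
};

class PrintFilter: public tbb::filter {
public:
    PrintFilter() : tbb::filter(parallel) {}
    void* operator()(void* item) {
        int* p = static_cast<int*>(item);
        std::printf("%d\n", *p);
        delete p;
        return NULL;   // last stage: nothing is passed further
    }
};

int main() {
    tbb::task_scheduler_init init;
    tbb::pipeline p;
    InputFilter in;
    PrintFilter out;
    p.add_filter(in);
    p.add_filter(out);
    p.run(4);          // max_number_of_live_tokens
    p.clear();
    return 0;
}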

RafSchietekat
Valued Contributor III

"My argument was not about difference at machine code level, but about no harm due to this race." And mine that this should still be formalised by using an atomic even if it can be fenceless (only relaxed operations).

"The very first serial ordered filter also uses this lock to put tokens in order. As only one filter might be "the first", then there is a unique lock governing access to the pipeline's token counter." That's an example of a high-level constraint that is tiresome to find just by induction.

"Let me suggest you to ask more questions here (including your hypotheses), and I will be glad to answer those if it saves your time." Hmm, OK, thanks, but feel free to volunteer some more must-know stuff about the current implementation (things that are not obvious from the code).

Alexey-Kukanov
Employee

> this should still be formalised by using an atomic even if it can be fenceless (only relaxed operations).

Oh, of course, since your reworked atomics have the means to express that.

> That's an example of a high-level constraint that is tiresome to find just by induction.

I agree.

> ... feel free to volunteer some more must-know stuff about the current implementation (things that are not obvious from the code).

Sure, when something related comes to my mind... Though the best way to make this happen is to ask questions :)

RafSchietekat
Valued Contributor III

Alexey:
"end_of_input only changes once per a pipeline run, from false to true to indicate there is no more input. For serial input filters, there is no data race between this write and other reads, because another task to take the next input item will not be spawn after the end of input was reached. For parallel input filters, there is a data race but it is benign, because parallel input filters should be tolerant to latecomers anyway, i.e. the filter might be called a few times even after the end of input was reached."
Actually, serial input filters in a non-trivial pipeline must also be prepared to be called after returning NULL, because of a race on end_of_input (I'm not really happy about C++0x's limited definition of a race). Maybe you should document this, because it does not seem to be clear to everybody... :-)

Below this paragraph,
[Tutorial] means "Tutorial (Open Source).pdf" revision 1.11, and
[Reference] means "Reference Manual (Open Source).pdf" revision 1.9.

[Tutorial] p. 26:
"// Must remove filters from pipeline before they are implicitly destroyed.
pipeline.clear();"
[Tutorial] p. 27:
"This top-level code also shows the method clear that removes all stages from the pipeline. This call is required if the filters have to be destroyed before the pipeline. The pipeline is a container that holds filters, and as with most containers in C++, it is illegal to destroy an item while it is in the container."
[Reference] p. 37:
"A filter must be removed from the pipeline before destroying it. You can accomplish this by destroying the pipeline first, or calling pipeline::clear()."
In tbb21_20081109oss filters remove themselves from their pipeline before they are destroyed, as specified in [Reference] p. 39, so clear() is merely a few cycles faster and the erroneous statements should be modified.
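To illustrate the lifetime rule (MyFilter being a hypothetical user-defined filter):

{
    tbb::pipeline p;
    MyFilter f;
    p.add_filter(f);
    p.run(4);
    p.clear();  // in tbb21_20081109oss merely a few cycles faster:
                // ~filter() would remove f from p anyway
}               // f is destroyed before p here, safely either way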

[Tutorial] mentions "pipeline template" twice, but this is merely confusing because pipeline is not a C++ class template and I don't see a useful other meaning (surely objects designed for use together are not a "template"?).

[Reference] p. 37:
"The parameter max_number_of_live_tokens puts an upper bound on the number of stages that will be run concurrently."
Well, sure, but that's not what it means.

Why does task depth (seem to?) increase (roughly or exactly?) linearly with the number of stages traversed?

In stage_task::execute(), is "recycle_as_continuation(); return this;" (or code to that effect) the same as "goto restart;" (with a restart label at the top)? If not, what's the difference?

Alexey-Kukanov
Employee

Quoting - Raf Schietekat
> Actually, serial input filters in a non-trivial pipeline must also be prepared to be called after returning NULL,
> because of a race on end_of_input (I'm not really happy about C++0x's limited definition of a race).
> Maybe you should document this, because it does not seem to be clear to everybody... :-)

I will try to prove that the data race does not exist for serial input.

In the code executed during the pipeline run, there are 2 places where end_of_input is read, and 2 places where it is written. One of each is for the parallel input filter, which I covered before. So one write (end_of_input is set to true after the input filter returned NULL) and one read (at the very end of stage_task::execute, after an item was processed by every stage, the flag is tested before spawning a new task to start processing another item) are left for consideration. I claim there is no data race between these two, because they are protected by the my_pipeline.input_tokens atomic variable acting as a semaphore:
- input_tokens starts at the user-specified maximum number of items allowed to be in processing at any given time;
- each time a new item is taken from the input filter, input_tokens is decremented;
- each time an item is processed by the last filter, input_tokens is incremented;
- it is used to decide whether a task should be spawned to take a new item from the input.

When the input filter is serial, a new task is spawned right after the input filter returned a valid item. For the parallel input filter, a new task is optimistically spawned before the input filter returned. In both cases, however, the semaphore value is first decreased, and tested; if it reached 0, the new task is not spawned. In this case, a new task is spawned the first time the semaphore is open (i.e. changes from 0 to 1), and this is the place where end_of_input is tested.

Let's now see how it works for serial input. At any time, only a single task runs the input filter (see the spawning rules above). Also, at any time the input filter only runs if the input_tokens semaphore is open (>0). When the input ends, the end_of_input flag is set, but the semaphore value is untouched - which means it is still open. Effectively this means that, after end_of_input was set, its value won't ever be read at the end of the execute method. Well, I must admit this conclusion surprised me :) So it seems that for the serial input filter the flag is redundant. Need to address it during the course of refactoring.
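Condensed into a sketch (simplified, invented names; the real code is structured differently):

#include "tbb/atomic.h"

struct pipeline_sketch {
    tbb::atomic<int> input_tokens; // starts at max_number_of_live_tokens
    bool end_of_input;             // set once the input filter returns NULL
};

// After fetching an item from the serial input filter:
// spawn the next fetch task only while tokens remain.
bool may_spawn_next_fetch(pipeline_sketch& p) {
    return --p.input_tokens > 0;
}

// At the very end of stage_task::execute: if the semaphore reopens
// (0 -> 1), and only then, end_of_input is read to decide on spawning.
bool should_spawn_fetch_on_release(pipeline_sketch& p) {
    return ++p.input_tokens == 1 && !p.end_of_input;
}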

By the way, a little unrelated detail came to my mind: the constructor of stage_task takes either one or two parameters. With one parameter only (the reference to the pipeline), the task will run the input filter. Constructed with the second parameter being a pointer to a filter, it will run that filter.

> In tbb21_20081109oss filters remove themselves from their pipeline before they are destroyed, as specified in
> [Reference] p. 39, so clear() is merely a few cycles faster and the erroneous statements should be modified.

The documents on the Web appear outdated; I will ask to update them. I will also check the most current docs for the issues you reported; thanks a lot!

> Why does task depth (seem to?) increase (roughly or exactly?) linearly with the number of stages traversed?

Depth is increased for every next filter in the chain, to prioritize processing of existing items over taking new ones. When a thread has executed a serial filter, it might then spawn a task to process the next awaiting item; this can happen a few times before the end of the pipeline. At the end, a task to take a new item might be spawned as well, as I described above. Increasing the depth makes items closer to the end of the pipeline be processed earlier. Also, as you remember, stealing is "shallow tasks first out" with regard to depth, and LIFO (bad bad bad :)) on the same level of depth; so the trick with increased depth also makes stealing de facto FIFO (except for input filter tasks, which we _want_ to be stolen earlier).

> In stage_task::execute(), is "recycle_as_continuation(); return this;" (or code to that effect)
> the same as "goto restart;" (with a restart label at the top)? If not, what's the difference?

A few additional things are done inside the scheduler, of which the most important one is the check for cancellation; also tracing, gathering statistics, etc. (currently only for developers' use, controlled by compile-time macro switches). There is very little additional overhead that could be scratched out with goto; we considered that and decided it is not worth even duplicating the cancellation checks inside stage_task itself.

RafSchietekat
Valued Contributor III

"I will try to prove that the data race does not exist for serial input." And so you did: I stand corrected.

"check for cancellation" Of course...

Thanks a lot, I'll be doing some more staring at the code now before my next question.

RafSchietekat
Valued Contributor III

I'm having a problem with my changes to pipeline: test_pipeline.exe seems to take forever to execute. I just do "make all", and much later, when I execute "top", I see over 86 minutes in the "TIME+" column. No, wait, it's the original version!

The curious thing is that when I do other things on the system, like switching to a different terminal to run "top", and typing this text, the test seems to get unstuck somehow. And it's not the first time that I've seen this with TBB, either.

P.S.: In the first paragraph, I was in fact verifying that I didn't introduce this particular problem, but I felt like introducing a bit of dramatic effect. Any idea what causes this?

RafSchietekat
Valued Contributor III

Some more revisited and new questions...

I suppose that ordered filters are only relevant relative to each other, and that if only one occurs in the pipeline it might as well be merely serial? If the first serial filter in the pipeline is not ordered, currently the sequence in which it processes tokens is normally not guaranteed to be reproduced in any ordered filters down the line, except if that first serial filter is also the input filter: intentional or an oversight?

So it appears that a serial input filter will not be invoked after it has returned NULL, but is this going to be documented as part of the contract, or is it just how it happens to work right now but without any guarantees for the future and therefore without a need to preserve this behaviour if the implementation is changed?

Don't overlook the matter of excessive execution times described in the previous posting #11, but I think I may have seen it happen also where a pipeline was not involved.

Why the fences on input_tokens? Doesn't spawn() do a release already (and can you put a timeline on the documentation of how fences are used with tasks, a question that has arisen earlier)? But again, regardless of what spawn() does, why the fences? Is it a mere statement of fact that the current implementation has no unfenced RMW?

At the end of the pipeline, what does set_depth() do that would affect the additional child of the counter task? It only seems meaningful with a recycle_as_continuation() instead.

Alexey-Kukanov
Employee

Quoting - Raf Schietekat

> I'm having a problem with my changes to pipeline: test_pipeline.exe seems to take forever to execute. I just do "make all", and much later, when I execute "top", I see over 86 minutes in the "TIME+" column. No, wait, it's the original version!

> The curious thing is that when I do other things on the system, like switching to a different terminal to run "top", and typing this text, the test seems to get unstuck somehow. And it's not the first time that I've seen this with TBB, either.

> P.S.: In the first paragraph, I was in fact verifying that I didn't introduce this particular problem, but I felt like introducing a bit of dramatic effect. Any idea what causes this?

So in your system, with the vanilla TBB (which package?), the test_pipeline gets stuck, unless you switch to some other application? What is your system? Does the effect take place if you work in text-only console mode (as opposed to some windowing GUI)?

So far, I have no idea what might cause this.

RafSchietekat
Valued Contributor III

Alexey: "So in your system, with the vanilla TBB (which package?), the test_pipeline gets stuck, unless you switch to some other application? What is your system? Does the effect take place if you work in text-only console mode (as opposed to some windowing GUI)?"

x86 (dual core)/Ubuntu 6.06 (which uses Gnome)/g++ 4.0.3/tbb21_20081109oss; it doesn't reliably get stuck, but this has happened a few times if I leave it alone, and starting another "make all" tends to get things unstuck. If you think you're onto something (what exactly?), I might look for a way to log in without the GUI, or open a remote shell, but it was over 86 minutes of "CPU Time" measured by "top", not a frozen display. It may have happened with earlier versions and other TBB programs (I've actually mentioned it before and Dmitriy wrote at the time that there might be variability in benchmarking), but maybe I just haven't left the program alone for so long yet. Then again, I have no idea why my system doesn't show the button to upgrade the O.S. without doing a fresh install, so maybe it's just earth rays...

jimdempseyatthecove
Honored Contributor III


>>
So in your system, with the vanilla TBB (which package?), the test_pipeline gets stuck, unless you switch to some other application? What is your system? Does the effect take place if you work in text-only console mode (as opposed to some windowing GUI)?

So far, I have no idea what might cause this.
<<

Alexey,

(hypothetical postulation here)

One potential cause for the peculiarity of the lock-up being "fixed" by the introduction of an additional load on the system could be a situation where multiple mutex or other synchronization objects are required. In this particular case, two threads compete for the multiple mutex/synch objects, each located on a separate hardware thread (dual-core system), and the retry code is optimized to use SwitchToThread(); since during the lock-up condition there are no other pending threads on each core, SwitchToThread() immediately returns without a context switch. Further, the particular sequencing of obtaining the multiple mutex/synch objects is such that the dependent task has faster access to them (butting in line, so to speak), so the other task is inhibited from advancing the state (the pipeline in this case). The "fix" is observed/occurs when an additional thread is injected into the scenario, causing SwitchToThread() to introduce a delay into the mix.

Jim Dempsey

RafSchietekat
Valued Contributor III
How can I spawn the current task again and return NULL? recycle_as_continuation() causes a complaint about ready, and recycle_to_reexecute() causes a complaint about NULL. Should I use a hack using recycle_to_reexecute() and return a dummy task that does recycle_as_continuation() for reuse as a dummy or something?

(Added) Or can the assert with recycle_to_reexecute() simply be relaxed to allow NULL?
Alexey-Kukanov
Employee

Quoting - Raf Schietekat

> How can I spawn the current task again and return NULL? recycle_as_continuation() causes a complaint about ready, and recycle_to_reexecute() causes a complaint about NULL. Should I use a hack using recycle_to_reexecute() and return a dummy task that does recycle_as_continuation() for reuse as a dummy or something?

> (Added) Or can the assert with recycle_to_reexecute() simply be relaxed to allow NULL?

Spawning the task that is currently being executed is dangerous, because of a possible execution race if it is stolen while the current execute() has not yet completed.
recycle_to_reexecute() complains about returning NULL because in this case it is highly likely that the same task will be taken for execution right away. Usually, a task is recycled to repeat execution after some other tasks got executed, and this means some other task could possibly be returned from execute() to immediately start with.

What do you want to achieve by spawning the currently executed task from itself? In particular, why won't recycle_to_reexecute() work for you? It causes the task to get spawned right after its execution completes.
After understanding your needs, I might be able to suggest the way to achieve it. If nothing else works, adding another recycling method is possible.

P.S. I also owe you an answer on your post #12; it requires more thinking and writing, while I am still on holidays and not yet back into a working mood :)

RafSchietekat
Valued Contributor III
"Spawning the task that is currently executed is dangerous, because of possible execution race if it is stolen while the current execute() not yet completed." Sorry for the inaccurate formulation, I did mean having it spawned again after execute() returns, as in recycle_to_reexecute().

[...]

It seems that relaxing the assert to allow NULL works fine, so "refurbish(); recycle_to_reexecute(); return NULL;" is a presumably significant improvement over the alternative.
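In sketch form (refurbish() and work_remains() are my own hypothetical helpers, and the NULL return relies on the relaxed assertion):

tbb::task* my_task::execute() {
    // ... process one piece of work ...
    if (work_remains()) {
        refurbish();             // reset per-run state so the task can be reused
        recycle_to_reexecute();  // respawned after execute() returns, which
                                 // lets other ready tasks run in between
        return NULL;             // allowed only with the relaxed assert
    }
    return NULL;
}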

"P.S. I also owe you an answer on your post #12; it requires more thinking and writing, while I am still on holidays and not yet back to working mood :)" OK!

Another question: Roughly how much do allocation and destruction of tasks contribute to their overhead? Am I chasing a red herring trying to recycle them?
ARCH_R_Intel
Employee
Quoting - Raf Schietekat
It seems that relaxing the assert to allow NULL works fine, so "refurbish(); recycle_to_reexecute(); return NULL;" is a presumably significant improvement over the alternative.

Another question: Roughly how much do allocation and destruction of tasks contribute to their overhead? Am I chasing a red herring trying to recycle them?

Tasks are allocated and deallocated off a thread's local free list, so they are fairly cheap. Nonetheless, sometimes a task has heavy user-defined state associated with it, and recycling will be worth the trouble.

It may be time to weaken the assertion to allow NULL. The assertion was motivated by two concerns:

  • We had a narrow use case of recycle_to_reexecute() in mind, specifically parallel_while (now parallel_do). When in doubt, we start out with conservative assertions, because it's always easier to weaken an assertion than to strengthen it and break existing code.
  • There was a nannyish intent to avoid inefficient code. If all the code is doing is re-executing the task without returning another task to execute, it might as well loop within the task, or use recycle_as_continuation as described below.
But I have found other nannies (notably Microsoft's /W4!) to be repulsive, so I think we should stop being nannyish here and weaken the assertion.

Note, however, that there is a way to accomplish reexecution of a task within the current TBB. If a task is to be re-executed immediately, it is essentially a continuation of itself with no children. The direct way to say this in code is for method execute() to call recycle_as_continuation() and return a pointer to the task. Below is an example.

#include <cstdio>
#include "tbb/task_scheduler_init.h"
#include "tbb/task.h"

class loop: public tbb::task {
    int i;
public:
    loop() { i = 0; }
    tbb::task* execute() {
        std::printf("i=%d\n", i);
        if (++i != 1000) {
            recycle_as_continuation();
            return this;
        }
        return NULL;
    }
};

int main() {
    tbb::task_scheduler_init init;
    tbb::task* t = new(tbb::task::allocate_root()) loop;
    tbb::task::spawn_root_and_wait(*t);
    return 0;
}
RafSchietekat
Valued Contributor III
"Tasks are allocated and deallocated off a thread's local free list, so they are fairly cheap." Roughly how cheap, as a fraction of total task overhead? It obviously can't all be local, e.g., there's a (potentially costly) little shake to the shared parent's reference count that refurbishing would avoid. Perhaps no urgent concern for normal applications, but still something a toolkit would do well to at least take into consideration.

"Nonetheless, sometimes a task has heavy user-defined state associated with it, and recycling will be worth the trouble." So it's all just to avoid using a pointer to external state? Really?

pipeline.cpp already uses "recycle_as_continuation(); return this;", but "recycle_to_reexecute(); return NULL;" will also yield to higher-priority tasks (at greater depths in the local array), as required for (local) progress.