Intel® oneAPI Threading Building Blocks
Ask questions and share information about adding parallelism to your applications when using this threading library.

Pipeline

RafSchietekat
Valued Contributor III
1,901 Views

In tbb21_20081109oss, shouldn't tbb::pipeline::end_of_input be atomic (at least for pedantic reasons), and what's the deal with tbb::pipeline::token_counter (next_token_number() is executed with a lock on stage_task, not pipeline)? Is there, or could somebody write, a user guide that makes clear how the parts fit together?

0 Kudos
62 Replies
ARCH_R_Intel
Employee
464 Views
Quoting - Raf Schietekat
"Tasks are allocated and deallocated off a thread's local free list, so they are fairly cheap." Roughly how cheap, as a fraction of total task overhead? It obviously can't all be local, e.g., there's a (potentially costly) little shake to the shared parent's reference count that refurbishing would avoid. Perhaps no urgent concern for normal applications, but still something a toolkit would do well to at least take into consideration.

"Nonetheless, sometimes a task has heavy user-defined state associated with it, and recycling will be worth the trouble." So it's all just to avoid using a pointer to external state? Really?

pipeline.cpp already uses "recycle_as_continuation(); return this;", but "recycle_to_reexecute(); return NULL;" will also yield to higher-priority tasks (at greater depths in the local array), as required for (local) progress.

Right. Recycling avoids the reference count modification, which could easily dominate the time for the rest of the task allocation/deallocation bookkeeping. I've never measured the times.

The priority issue is a good point, more so because we plan to deprecate the "depth()" attribute and use a deque like Cilk does. If so, what are the implications for the pipeline implementation? My impression was that there was not a problem as long as we chose the order of spawning (pushing on the deque) carefully, so that higher priority tasks were pushed last.
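
For concreteness, here is a minimal sketch of the two recycling idioms being discussed, in TBB 2.1-era task API terms; stage_task and the bypass_scheduler flag are hypothetical stand-ins for the real pipeline machinery:

```cpp
#include "tbb/task.h"

// Hypothetical stage task showing the two recycling idioms.
class stage_task : public tbb::task {
    bool bypass_scheduler;  // which idiom to demonstrate
public:
    stage_task( bool b ) : bypass_scheduler(b) {}
    /*override*/ tbb::task* execute() {
        // ... process one item here ...
        if( bypass_scheduler ) {
            // pipeline.cpp's idiom: become one's own continuation and
            // return this, so the current thread runs the task again
            // immediately, bypassing the ready pool.
            recycle_as_continuation();
            return this;
        }
        // Alternative idiom: the task is respawned after execute()
        // returns, which lets deeper (higher-priority) tasks in the
        // local pool run first, as noted above.
        recycle_to_reexecute();
        return NULL;
    }
};
```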

0 Kudos
RafSchietekat
Valued Contributor III
464 Views
It seems a radical change you're contemplating (not merely replacing the array of stacks with an array of deques but really only having one deque? what would happen to depth-selective stealing?)... one that could do with a bit of explanation (from Cilk to something entirely different and back again: the people, their motives and their adventures along the way).

Meditate on this, I will.

0 Kudos
RafSchietekat
Valued Contributor III
464 Views

Lesson learnt last week: RTFM (again and again). I made a change to have a task re-executed at the later of two events: return from execute() (obviously), and notification from another task. To simulate this as yet nonexistent functionality I used a dummy child empty_task just for reference count administration, but for the notification I just destroy()'ed this dummy task. Unfortunately, failure only happened after an annoyingly long time of running the test code, so it was not obvious that it should be something so simple. After instead spawn()'ing the dummy task (because destroy() doesn't take any action when the reference count reaches zero, for reasons that are not clear to me), things worked fine... except that that other problem mentioned earlier was still occurring, so I have to run, e.g., another TBB "make all" to get to the end (could it perhaps be something in the test code and its simulated waiting?).
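
For the record, a minimal sketch of the pattern (and the pitfall) just described, in TBB 2.1-era terms; waiting_task and notify() are hypothetical names:

```cpp
#include "tbb/task.h"

// A task that should re-execute only after both (a) its own execute()
// returns and (b) another task notifies it; the dummy empty_task child
// exists purely for reference count administration.
class waiting_task : public tbb::task {
public:
    tbb::empty_task* dummy;
    /*override*/ tbb::task* execute() {
        set_ref_count(2);  // one for the dummy child, one for self
        dummy = new( allocate_child() ) tbb::empty_task;
        recycle_as_safe_continuation();  // re-run when the count hits 0
        return NULL;
    }
    // Called by the other task to deliver the notification.
    void notify() {
        spawn(*dummy);       // works: completion is noticed at count 0
        // destroy(*dummy);  // fails: only decrements, takes no action
    }
};
```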

So now I have a pipeline that (conceptually) only recycles tasks instead of spawning new ones and throwing old ones away, which should not make it any more sluggish, I suppose. I would now recommend some tweaks to the scheduler to make this official and optimised, but I'll have to clean up the code first. It also seemed nice to let pipelines with parallel input filters start up in O(log n) time instead of O(n), so to speak.

There's one more change (that I remember without having the code at hand) that seems appropriate before embarking on the more interesting stuff of achieving global progress (this was just to get acquainted): a serial_out_of_order filter somewhere between two serial_in_order filters should pick the oldest buffered token, because otherwise the token that it chooses based on local "first come, first served" is likely to have to wait in the downstream serial_in_order filter's input buffer until the older token finally gets processed. Well, it may not make much of a difference most of the time, but it's too obvious to leave as-is.

0 Kudos
RafSchietekat
Valued Contributor III
464 Views
I don't have any other changes in the pipeline (pun intended), just a remark at this time.

In all but the first filter, the object value is ignored, but the source code has a comment "/** Returns NULL if filter is a sink. */", and [Reference] p. 40 says "The first filter in a pipeline should return NULL if there are no more items to process. The result of the last filter in a pipeline is ignored.", which doesn't seem quite accurate. I suppose that the source code comment can be removed and that the second sentence in the reference should read "The result of other filters is ignored."?
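
For reference, the convention as corrected, in a minimal sketch (the filter names are hypothetical):

```cpp
#include "tbb/pipeline.h"

// First filter: its return value is significant; NULL ends the input.
class input_filter : public tbb::filter {
    int remaining;
public:
    input_filter( int n ) : tbb::filter(serial_in_order), remaining(n) {}
    /*override*/ void* operator()( void* ) {
        if( remaining==0 ) return NULL;   // no more items to process
        return new int(remaining--);      // item passed downstream
    }
};

// Any later filter: its return value is ignored, sink or not.
class sink_filter : public tbb::filter {
public:
    sink_filter() : tbb::filter(serial_in_order) {}
    /*override*/ void* operator()( void* item ) {
        delete static_cast<int*>(item);
        return NULL;  // ignored by the pipeline
    }
};
```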

I still have to add task::release_ref() and do some cleanup before I can show the code.

0 Kudos
Dmitry_Vyukov
Valued Contributor I
464 Views
Quoting - Raf Schietekat
It seems a radical change you're contemplating (not merely replacing the array of stacks with an array of deques but really only having one deque? what would happen to depth-selective stealing?)


I've recently finally got the idea of depth-based stealing (thanks to Alexey Kukanov), and I think that the absence of depth-based stealing is not as bad as one might think. Java Fork/Join uses a plain deque, AFAIK .NET TPL uses a plain deque, Cilk uses a plain deque. And I'm going to investigate the Cilk++ source today; probably it also uses a plain deque (yes, it's already available for downloading for free: http://www.cilk.com/home/get-cilk).
If one's application runs into stack overflow w/o depth-based stealing, then probably there is something utterly wrong with it... And all one has to do is rewrite the application w/o that already doomed synchronous waiting for children. Btw, what about deprecating synchronous waiting at once? ;)

0 Kudos
Dmitry_Vyukov
Valued Contributor I
464 Views

> this should still be formalised by using an atomic even if it can be fenceless (only relaxed operations).

Oh, of course, since your reworked atomics have the means to express that.



Relaxed atomics are also required for formal verification tools, which may otherwise bark at such places, and they are also very useful documentation: otherwise it would be completely obscured that the variable is used by several threads concurrently. So I personally consider them a very important thing. Actually I want as much syntactic noise around accesses to shared variables as possible :)
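
For illustration (in C++11 std::atomic terms, since this thread predates it), the kind of fenceless but explicitly atomic access being advocated, using the end_of_input flag from the top of the thread:

```cpp
#include <atomic>

// Relaxed operations compile to plain loads and stores on most
// hardware, but the atomic type documents the concurrent access and
// keeps verification tools quiet.
std::atomic<bool> end_of_input(false);

void mark_input_exhausted() {
    end_of_input.store(true, std::memory_order_relaxed);
}

bool input_exhausted() {
    return end_of_input.load(std::memory_order_relaxed);
}
```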

0 Kudos
RafSchietekat
Valued Contributor III
464 Views

"If one's application runs into stack overflow w/o depth-based stealing, then probably there is something utterly wrong with it..." Why?

0 Kudos
Dmitry_Vyukov
Valued Contributor I
464 Views
Quoting - Raf Schietekat

"If one's application runs into stack overflow w/o depth-based stealing, then probably there is something utterly wrong with it..." Why?


(1) First of all, it uses blocking wait for children instead of continuations; that alone can already be considered a crime :)
(2) Then, other people's programs run on flat deques, so why would his program be unable to run?
Hmmm... I think it's enough to justify the statement starting with "probably" :)
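
For context, the blocking style at issue, in a minimal TBB 2.1-era sketch (blocking_fib is hypothetical); the continuation style would replace the wait with allocate_continuation() so that no stack frame stays pinned:

```cpp
#include "tbb/task.h"

// Blocking style: the parent's stack frame stays alive while its
// children run, which is what ties stack depth to task-tree depth.
class blocking_fib : public tbb::task {
    long n; long* sum;
public:
    blocking_fib( long n_, long* sum_ ) : n(n_), sum(sum_) {}
    /*override*/ tbb::task* execute() {
        if( n<2 ) { *sum = n; return NULL; }
        long x, y;
        blocking_fib& a = *new( allocate_child() ) blocking_fib(n-1,&x);
        blocking_fib& b = *new( allocate_child() ) blocking_fib(n-2,&y);
        set_ref_count(3);            // two children + one for the wait
        spawn(b);
        spawn_and_wait_for_all(a);   // the synchronous wait in question
        *sum = x+y;
        return NULL;
    }
};
```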

0 Kudos
RafSchietekat
Valued Contributor III
464 Views
Hmm, I'm one of those guys who think that a toolkit should work for him instead of the other way around (even though I have a disturbing weakness for tinkering with toolkits). I'm not very happy with continuations as a requirement.

Maybe it's a bit like distributed programming... the idea that you can hide object locations and pretend that everything is local turned out to be a doomed approach, however appealing initially.

Still, why be so quick to throw in the towel this time? And specifically for TBB, why not have an array of deques?
0 Kudos
Dmitry_Vyukov
Valued Contributor I
464 Views
Quoting - Raf Schietekat
Hmm, I'm one of those guys who think that a toolkit should work for him instead of the other way around (even though I have a disturbing weakness for tinkering with toolkits). I'm not very happy with continuations as a requirement.

Maybe it's a bit like distributed programming... the idea that you can hide object locations and pretend that everything is local turned out to be a doomed approach, however appealing initially.

Still, why be so quick to throw in the towel this time? And specifically for TBB, why not have an array of deques?

Ok, I flew into a passion. Nothing wrong with one's application. Especially taking into account that, with a flat deque, stack overflow can occur even in an application that structures its work DAG into an exemplary balanced binary tree.

Just some other means for preventing stack overflow must be employed.

Why not have an array of deques? I think that Arch must have some good answer ;) IMVHO a single plain deque is simpler and potentially faster.


0 Kudos
RafSchietekat
Valued Contributor III
464 Views
"Why not have an array of deques? I think that Arch must have some good answer ;)" I'm not so optimistic.

"IMVHO Single plain deque is simpler and potentially faster." That may be the case, but simplicity in the application should not be traded away for simplicity in the toolkit (unless you plan to only have a few applications).

I have attached a snapshot of some changes (to be patched into tbb21_20081109oss). Anyone for benchmarking them?
0 Kudos
ARCH_R_Intel
Employee
464 Views
Quoting - Raf Schietekat
"Why not have an array of deques? I think that Arch must have some good answer ;)" I'm not so optimistic.

"IMVHO Single plain deque is simpler and potentially faster." That may be the case, but simplicity in the application should not be traded away for simplicity in the toolkit (unless you plan to only have a few applications).


None of the other algorithm templates require multiple deques. Having multiple deques would tax users of all other algorithms for the sake of a particular pipeline implementation. We think that any of the significant benefits of the depth parameter can be similarly achieved by carefully choosing the order in which items are pushed onto the deque.
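
For illustration, a sketch of that ordering argument under a single LIFO deque per thread (the task names are hypothetical): the most recently spawned task is the next one its own thread pops, while thieves take from the opposite end, so spawning the more urgent task last approximates what the depth attribute provided:

```cpp
#include "tbb/task.h"

class example_task : public tbb::task {
    /*override*/ tbb::task* execute() {
        tbb::task& background = *new( allocate_child() ) tbb::empty_task;
        tbb::task& urgent     = *new( allocate_child() ) tbb::empty_task;
        set_ref_count(3);    // two children + one for the wait
        spawn(background);   // exposed to thieves first
        spawn(urgent);       // popped next by this thread
        wait_for_all();
        return NULL;
    }
};
```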

Of course, in some cases we do tax everyone for the sake of a few. A notable example is the affinity support, which is currently exploited by only a few of the algorithm templates. However, it inflicts only a few branches on everyone, and gives a major scalability boost for some common idioms. The benefit has been measured at >5x for some cases. It's inevitably a judgement call which taxes are worth making everyone pay.

0 Kudos
RafSchietekat
Valued Contributor III
464 Views

"None of the other algorithm templates require multiple deques."

That's beside the point: a pipeline would be only too happy with a deque to provide global progress, so much so that it probably would not mind if later data units bump into earlier ones that have stalled because they don't have a higher priority anymore (the only reason I see at this time for having more than two depths in pipeline).

The point is what a worker will do while waiting for a task's children to complete. Are you going to require the use of continuations, as Dmitriy indicates? That would be like throwing in the towel just because the opponent isn't showing you a sunny smile. Or is there another solution to this (I think I have an idea, but I'd like to hear yours first)? An inquiring mind wants to know...

0 Kudos
RafSchietekat
Valued Contributor III
464 Views
So who would like to benchmark the changes I made against the original? I think there are problems with the test code that make it unsuitable for benchmarking, but you might have a real example that you can use. I just want to know that it's not a totally insignificant change or worse.
0 Kudos
Alexey-Kukanov
Employee
464 Views
Due to some forum issues, I can't reply to your last post and had to reply to the thread.

The test_pipeline is definitely not good to serve as a benchmark. But have you tried the "text_filter", or even better the "square", examples shipped with TBB? Those could at least give some numbers.

Also, regarding your earlier questions that I promised to answer: pardon me that I did not answer at the time. The discussion has moved quite far since then. Are there still questions you'd like to know the answer to, and if there are, could you please repeat them, or point me to them, so that I can help you the best way?
0 Kudos
RafSchietekat
Valued Contributor III
464 Views
"Are there still questions you'd like to know the answer for" Thanks, but I think I've come to know pipeline quite well now. I would still want to know about this one deque per thread, but otherwise you might just comb the posts for feedback information and ignore the questions. I think I might try to partially restore order in a sandwiched non-ordered filter (except that I'm wondering whether and how much cache locality plays a role here), and look into the possibility of making the input buffers lock-free, and then submit the changes (or should I do that already?).
0 Kudos
Alexey-Kukanov
Employee
464 Views
I am quite sure that if an unordered serial filter directly follows an ordered serial filter, its execution will likely be ordered: the previous filter outputs tokens in order, and the unordered serial filter processes tokens in FIFO order as far as I can tell, thus it will also be ordered. Consequently, the next serial filter will receive tokens in order as well. It looks like ordering will only change after a parallel filter is met. So I wouldn't bother with any optimizations for the "sandwich" case.

Making the buffers lock-free sounds interesting, though off the top of my head I do not know how to combine it with on-demand dynamic growth of the buffers.
0 Kudos
RafSchietekat
Valued Contributor III
464 Views

Regarding #37 "So I wouldn't bother with any optimizations for the "sandwich" case.": Yes, you would need an ordered filter, a parallel filter, a non-ordered filter, an ordered filter, in that order, between any others, so unless anybody has a compelling use case...

"combine it with on-demand dynamic growth of the buffers" It's a hybrid approach. Most often, completely lock-free algorithms are probably too risky and too cumbersome for what they deliver (if anything), but if there's low-hanging fruit to be picked by using a lock-free "pre-input buffer", then why not give it a try... I'm adding tasks to a multi-producer singly linked list (compare_and_swap()), which at certain points is cleared out (fetch_and_store(NULL)), to be fed into the existing code; one of those points is if the code finds out that before the addition the token was not yet equal to the one being served (otherwise it would be returned from execute()), but after the addition it is, because in that case the addition and the notification may have crossed each other.

One thing occurred to me after my posting yesterday: I should give wrap-around tasks that won't replicate priority over tasks that will (based on their m_spawn value), otherwise I'm creating overpopulation that might undo any benefits of recycling and not touching a shared reference count, or worse. Isn't the current pipeline supposed to work well with an unlimited value for tokens in flight (disregarding external considerations like use of a circular buffer)?

0 Kudos
RafSchietekat
Valued Contributor III
464 Views
Riddle me this: how do you determine whether a task was stolen if all tasks are children of a single abstract parent and get recycled over and over, making is_stolen_task() as currently specified meaningless?

Answer: hack the scheduler some more...

Rationale: have your pipeline cake and eat it too. The only way I see to let pipeline start up in logarithmic time and not cause a task replication runaway chain reaction even with an excessive or unlimited maximum number of tasks in flight is to keep replication-capable tasks at a separate low priority/depth. When such a task is stolen, it performs meiosis, with each task getting half the replication potential. When not stolen, it performs mitosis, mobilising a task for pipeline duty and then going to sleep again (tasks on pipeline duty continuously traverse through the pipeline and wrap around to the beginning until pipeline shutdown). To complete the picture, meiosis would imply mitosis as an obvious optimisation, and the original zygote would be twofold because it would be silly for the first thread to have to go steal back another replication task after a single theft instead of two.

Any obvious holes in the argument (and I don't mean the imagery)? Any pointers on the hacking?
0 Kudos
RafSchietekat
Valued Contributor III
469 Views
Call me obsessed, but I keep having doubts about memory semantics guarantees in TBB. Specifically, I cannot make out whether a serial filter is supposed to have state that is automatically released and acquired from one data unit to the next, even though it seems to me that it should. Am I mistaken? It's probably too much work to plug into some nifty validation tool, though... but I'm still hoping for an official statement.
0 Kudos
Alexey-Kukanov
Employee
469 Views
Quoting - Raf Schietekat

Call me obsessed, but I keep having doubts about memory semantics guarantees in TBB. Specifically, I cannot make out whether a serial filter is supposed to have state that is automatically released and acquired from one data unit to the next, even though it seems to me that it should. Am I mistaken? It's probably too much work to plug into some nifty validation tool, though... but I'm still hoping for an official statement.


Isn't ordered_buffer::low_token the state you are looking for? More precisely, with regard to the memory semantics it's actually ordered_buffer::array_mutex.

Added: but if you think of replacing the aforementioned mutex with something else, then yes, there should be some special memory fences built into a serial stage, at least if the current design of the pipeline is generally kept. Consider a parallel stage preceding a serial one. A few workers can process the parallel stage tasks simultaneously; after completion, each one would try to advance its current item to the serial stage. The serial stage tasks would be either scheduled to execute right away, or spawned to local pools; in both cases, they do not have any synchronization that would ensure the "one item at a time" constraint required for a serial filter.
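
In C++11 terms (purely illustrative; this thread predates it), the guarantee the mutex provides from one data unit to the next looks like this:

```cpp
#include <mutex>

std::mutex stage_mutex;  // stands in for ordered_buffer::array_mutex
int filter_state = 0;    // serial-filter state carried across items

void process_item( int item ) {
    std::lock_guard<std::mutex> lock(stage_mutex);
    // The unlock by the previous item's thread releases this state and
    // the lock above acquires it, so each data unit sees the state left
    // by its predecessor even when processed on a different thread.
    filter_state += item;
}
```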

Of course this reply does not pretend to document memory barriers inside the TBB scheduler.
0 Kudos
Reply